Abstract
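Network traffic encryption strengthens communication security and privacy protection, but it also creates new challenges for malicious traffic detection. Following its success in other fields, machine learning has in recent years been applied to encrypted traffic classification. Traditional feature extraction methods, however, can lose important information in the traffic or retain redundant, uninformative data, which hinders further gains in classification accuracy and efficiency. This paper proposes LDSM, an encrypted traffic classification method based on a low-dimensional second-order Markov matrix, which selects traffic features with strong representational ability in order to improve classification performance. First, the payload of the encrypted traffic is extracted, and a second-order Markov matrix is built from the spatial distribution of its hexadecimal characters. Second, the Gini gain of each feature in the state transition probability matrix is computed, the feature that contributes least to model training is deleted iteratively, and the feature set with the highest classification accuracy is taken as the low-dimensional second-order Markov matrix feature. Finally, experiments verify how well this feature supports model training. The experiments use a Scikit-learn environment and two public datasets, CTU-13 and CIC-IDS2017, to classify encrypted traffic. The dimensionality reduction experiments show that LDSM classifies best when the second-order Markov matrix features are reduced to 256 features, only 6.25% of the original feature count, which preserves classification accuracy while improving training efficiency. The comparative experiments show that LDSM reaches an average classification accuracy of 98.51%, more than 3% higher than the other methods, so LDSM is a feasible and effective method for encrypted traffic classification.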
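The first step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not define the Markov states, so the sketch assumes a state is two consecutive hexadecimal characters of the payload and the transition target is the character that follows. That assumption gives 16 × 16 × 16 = 4096 transition probabilities, which is consistent with the reported 256 features being 6.25% of the original count. The names `second_order_markov_features` and `TRANSITIONS` are hypothetical.

```python
from collections import Counter
from itertools import product

HEX = "0123456789abcdef"
# All 16**3 = 4096 (two-character state, next character) transitions, in a fixed order.
TRANSITIONS = ["".join(t) for t in product(HEX, repeat=3)]


def second_order_markov_features(payload: bytes) -> list:
    """Return the flattened 256 x 16 transition-probability matrix of a payload.

    Assumption: the state is the previous two hex characters (overlapping
    windows), and the matrix entry is P(next character | state).
    """
    hex_text = payload.hex()  # lowercase hex string, two characters per byte
    # Count every overlapping triple: first two characters = state, third = next.
    triple_counts = Counter(hex_text[i:i + 3] for i in range(len(hex_text) - 2))
    # Row totals: how often each two-character state is followed by any character.
    state_counts = Counter()
    for triple, n in triple_counts.items():
        state_counts[triple[:2]] += n
    # P(next | state) = count(state + next) / count(state); unseen states stay 0.
    return [
        triple_counts[t] / state_counts[t[:2]] if state_counts[t[:2]] else 0.0
        for t in TRANSITIONS
    ]


if __name__ == "__main__":
    features = second_order_markov_features(bytes.fromhex("abababab"))
    print(len(features))  # number of raw features before dimensionality reduction
```

Each payload then becomes one fixed-length row of a feature table, whatever its original length, which is what makes the later per-feature selection step possible.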
Network traffic encryption enhances communication security and privacy protection, but it also poses new challenges for malicious traffic detection. Machine learning has been applied successfully in many fields, including encrypted traffic classification. However, traditional feature extraction methods may lose important information in the traffic or retain redundant, uninformative data, which hinders further improvement in classification accuracy and efficiency. This paper proposes LDSM, an encrypted traffic classification method based on a Low-Dimensional Second-order Markov matrix, which selects traffic features with strong representational ability to improve classification performance. First, the payload of the encrypted traffic is extracted, and a second-order Markov matrix is constructed from the spatial distribution of its hexadecimal characters. Second, the Gini gain of each feature in the state transition probability matrix is computed, the feature that contributes least to model training is iteratively deleted, and the feature set with the highest classification accuracy is selected as the low-dimensional second-order Markov matrix feature. Finally, experiments verify the effectiveness of the low-dimensional second-order Markov matrix features for model training. The experiments use a Scikit-learn environment and three public datasets (CTU-13, CIC-IDS2017, and CIC IoT Dataset 2023), along with self-collected real network traffic, to perform encrypted traffic classification. The dimensionality reduction experiments show that LDSM performs best when the second-order Markov matrix features are reduced to 256 dimensions. The reduced feature set is only 6.25% of the original feature count, which preserves classification accuracy while improving training efficiency. The comparative experiments show that LDSM reaches an average classification accuracy of 98.52%, more than 3% higher than the other methods. Thus, LDSM is a feasible and effective method for encrypted traffic classification.
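The selection step (score each feature by Gini gain, repeatedly drop the weakest one, keep the subset with the highest accuracy) can be illustrated with a small pure-Python sketch. Everything here is an assumption made for illustration: the abstract does not say how the Gini gain is computed or which classifier scores each subset, so the sketch uses the best single-threshold impurity decrease per feature and a nearest-centroid classifier scored on its own training data as stand-ins. All function names are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity 1 - sum(p_k ** 2) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def gini_gain(values, labels):
    """Largest impurity decrease over all threshold splits of one feature."""
    n, parent, best = len(labels), gini(labels), 0.0
    for threshold in sorted(set(values))[:-1]:  # the largest value cannot split
        left = [y for v, y in zip(values, labels) if v <= threshold]
        right = [y for v, y in zip(values, labels) if v > threshold]
        children = (len(left) * gini(left) + len(right) * gini(right)) / n
        best = max(best, parent - children)
    return best


def centroid_accuracy(X, y, columns):
    """Stand-in scorer: training accuracy of a nearest-centroid classifier."""
    sums, counts = {}, Counter(y)
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(columns))
        for k, j in enumerate(columns):
            acc[k] += row[j]
    centroids = {c: [s / counts[c] for s in sums[c]] for c in sums}

    def predict(row):
        return min(centroids, key=lambda c: sum(
            (row[j] - centroids[c][k]) ** 2 for k, j in enumerate(columns)))

    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)


def select_features(X, y, score=centroid_accuracy):
    """Drop the lowest-gain feature one at a time; keep the best-scoring subset."""
    remaining = list(range(len(X[0])))
    # Single-split gains do not depend on the other features, so they are
    # computed once here; a tree-based model would re-estimate them per round.
    gains = {j: gini_gain([row[j] for row in X], y) for j in remaining}
    best_subset, best_score = list(remaining), score(X, y, remaining)
    while len(remaining) > 1:
        remaining.remove(min(remaining, key=gains.get))
        current = score(X, y, remaining)
        if current >= best_score:  # ties favour the smaller feature set
            best_subset, best_score = list(remaining), current
    return best_subset, best_score


if __name__ == "__main__":
    X = [[0.0, 0.3, 0.9], [0.1, 0.8, 0.1], [0.9, 0.2, 0.5], [1.0, 0.7, 0.4]]
    y = [0, 0, 1, 1]
    print(select_features(X, y))
```

In a real experiment the scorer would be evaluated on held-out data rather than on the training rows, and rows of `X` would be the 4096-dimensional transition-probability vectors rather than the toy three-column table used here.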
Authors
郭昊
陈周国
刘智
冷涛
郭先超
张岩峰
GUO Hao; CHEN Zhou-Guo; LIU Zhi; LENG Tao; GUO Xian-Chao; ZHANG Yan-Feng (School of Computer Science, Southwest Petroleum University, Chengdu 610500, China; China Electronics Technology Corporation 30th Research Institute, Chengdu 610041, China; Sichuan Police College, Intelligent Policing Key Laboratory of Sichuan Province, Luzhou 646000, China)
Source
《四川大学学报(自然科学版)》
CAS
CSCD
Peking University Core Journals
2024, Issue 3, pp. 30-37 (8 pages)
Journal of Sichuan University(Natural Science Edition)
Funding
Supported by the Intelligent Policing Key Laboratory of Sichuan Province (ZNJW2022KFQN003).
Keywords
Encrypted traffic
Machine learning
Markov
Gini gain
Feature dimensionality reduction