Abstract
Large-scale NetFlow training data sets are essential for building high-quality, highly stable network traffic classifiers. However, as the feature dimensionality of network flows rises and data sets grow, neither the analysis of the flows nor the training of a classifier model based on a Support Vector Machine (SVM) can produce useful results within a practical time. Building on the Hadoop cloud computing platform, this paper uses MapReduce to learn and train an SVM network traffic classifier in a distributed manner, yielding the CloudSVM network traffic classifier. Nearly 2 TB of large-scale network traffic trace files, mirrored from a campus network egress, are stored and processed in a distributed manner, and the extracted sample data sets are classified. The experiments verify the high efficiency of distributed storage and parallel processing of large-scale network data sets on the Hadoop platform. They also verify that the CloudSVM classifier converges quickly to the optimum without reducing classification accuracy, and that the SVM training time levels off as the number of large-scale network flow samples increases.
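The abstract does not spell out how the MapReduce training works, so the following is only a minimal sketch of one common way to distribute SVM training, not the authors' implementation. It assumes an iterative exchange of candidate support vectors: each map task trains a local linear SVM on its partition plus the shared candidates, and the reduce step merges the local candidates until the set stops changing. Everything here is hypothetical: the function names (`train_linear_svm`, `near_margin`, `map_partition`, `cloud_svm`, `make_flows`), the Pegasos-style solver, the `slack` threshold, and the synthetic two-feature "flows" that stand in for real NetFlow features. Plain Python in a single process stands in for Hadoop.

```python
import random

def decision(w, x):
    """Linear decision value w.x + b; the bias b is stored as w[-1]."""
    return sum(wi * xi for wi, xi in zip(w, x)) + w[-1]

def train_linear_svm(samples, epochs=100, lam=0.01, seed=0):
    """Pegasos-style stochastic sub-gradient descent on the hinge loss."""
    rng = random.Random(seed)
    w = [0.0] * (len(samples[0][0]) + 1)
    t = 0
    for _ in range(epochs):
        order = list(samples)
        rng.shuffle(order)
        for x, y in order:
            t += 1
            eta = 1.0 / (lam * t)
            violated = y * decision(w, x) < 1.0
            w = [wi * (1.0 - eta * lam) for wi in w]  # regularization shrink
            if violated:  # hinge-loss step for a margin violation
                xb = list(x) + [1.0]
                w = [wi + eta * y * xi for wi, xi in zip(w, xb)]
    return w

def near_margin(data, w, slack=0.25):
    """Approximate support vectors: samples on or close to the margin.
    Assumes both classes (-1 and +1) are present in data."""
    scored = [(y * decision(w, x), (x, y)) for x, y in data]
    keep = {s for m, s in scored if m <= 1.0 + slack}
    for label in (-1, 1):  # always keep the closest sample of each class
        keep.add(min((m, s) for m, s in scored if s[1] == label)[1])
    return keep

def map_partition(partition, global_svs):
    """Map step: train on one partition plus the shared candidates."""
    data = sorted(set(partition) | global_svs)
    return near_margin(data, train_linear_svm(data))

def cloud_svm(partitions, max_rounds=5):
    """Repeat map and reduce until the candidate set stops changing."""
    global_svs = set()
    for _ in range(max_rounds):
        local = [map_partition(p, global_svs) for p in partitions]
        merged = set().union(*local)  # reduce step: merge local candidates
        if merged == global_svs:
            break
        global_svs = merged
    return train_linear_svm(sorted(global_svs)), global_svs

def make_flows(n, seed=1):
    """Synthetic, well-separated two-class samples (not real NetFlow data)."""
    rng = random.Random(seed)
    flows = []
    for i in range(n):
        y = 1 if i % 2 == 0 else -1
        flows.append(((rng.gauss(3 * y, 0.5), rng.gauss(3 * y, 0.5)), y))
    return flows

if __name__ == "__main__":
    flows = make_flows(80)
    parts = [flows[i:i + 20] for i in range(0, 80, 20)]
    w, svs = cloud_svm(parts)
    hits = sum(1 for x, y in flows if y * decision(w, x) > 0)
    print("candidates kept:", len(svs), "accuracy:", hits / len(flows))
```

In this sketch the final model is trained only on the merged candidates, so the last training step depends on the size of that set rather than on the full data set. This is one plausible reading of why training time could level off as samples are added, but the abstract itself does not confirm the mechanism.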
Authors
DENG He (邓河)
TANG Yitao (唐一韬)
HE Zongmei (贺宗梅)
YUAN Aiping (袁爱平)
School of Software, Changsha Social Work College, Changsha, Hunan 410000, China
Source
Journal of Terahertz Science and Electronic Information Technology (《太赫兹科学与电子信息学报》)
Indexed in the Peking University Core Journals list (北大核心)
2020, Issue 5, pp. 918-923 (6 pages)
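Funding
Scientific Research Project of the Hunan Provincial Department of Education (15C0081)
Scientific Research Project of the Hunan Provincial Department of Education (14C0064)
Scientific Research Project of the Hunan Provincial Department of Education (19C0103)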