Abstract
As a representative soft clustering algorithm, fuzzy c-means (FCM) can objectively handle clustering problems that involve fuzziness. To meet the need for real-time, accurate clustering of big data and to improve the efficiency of FCM on big data, we design a parallel implementation of FCM on Spark, a big data computing platform. The scheme uses HDFS for distributed storage of the underlying data, the RDD mechanism for data transformations during computation, and persistence to reuse intermediate results. To verify the effectiveness of the parallel FCM algorithm, we apply it to an intrusion detection system: the KDD CUP 99 data set is first preprocessed, and the algorithm is then used to cluster the data set, both before and after preprocessing, on a single machine and on a Spark cluster in order to detect intrusions. The accuracy and timeliness of detection are then compared. The results show that the Spark-based parallel FCM algorithm achieves good clustering robustness, convergence speed, and accuracy, and that its advantages are more pronounced on large-scale sample data.
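The abstract does not give the update equations or the job structure, so the following is only a minimal sketch of how one FCM iteration can be split into a per-partition step and a combining step, which is the general shape a Spark implementation would likely take. It is not the authors' code. Plain Python lists stand in for RDD partitions, the standard FCM membership and center updates are assumed, and all function names (`memberships`, `partial_sums`, `combine`, `fcm`) are hypothetical.

```python
from functools import reduce


def memberships(x, centers, m=2.0, eps=1e-12):
    """Standard FCM membership of point x in each cluster (fuzzifier m > 1)."""
    # Squared distances, clamped so a point lying on a center does not divide by zero.
    d2 = [max(sum((a - b) ** 2 for a, b in zip(x, c)), eps) for c in centers]
    # u_j = 1 / sum_k (d_j / d_k)^(2/(m-1)); with squared distances the exponent is 1/(m-1).
    p = 1.0 / (m - 1.0)
    inv = [(1.0 / d) ** p for d in d2]
    total = sum(inv)
    return [v / total for v in inv]


def partial_sums(partition, centers, m=2.0):
    """Per-partition step: weighted sums needed to update every center."""
    dim = len(centers[0])
    num = [[0.0] * dim for _ in centers]  # sum_i u_ij^m * x_i
    den = [0.0] * len(centers)            # sum_i u_ij^m
    for x in partition:
        for j, u in enumerate(memberships(x, centers, m)):
            w = u ** m
            den[j] += w
            for k in range(dim):
                num[j][k] += w * x[k]
    return num, den


def combine(a, b):
    """Combining step: add two partial results element-wise."""
    num = [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a[0], b[0])]
    den = [p + q for p, q in zip(a[1], b[1])]
    return num, den


def fcm(partitions, centers, m=2.0, tol=1e-6, max_iter=100):
    """Iterate until no center coordinate moves by more than tol."""
    for _ in range(max_iter):
        num, den = reduce(combine, (partial_sums(p, centers, m) for p in partitions))
        new_centers = [[v / d for v in row] for row, d in zip(num, den)]
        shift = max(abs(a - b)
                    for ca, cb in zip(centers, new_centers)
                    for a, b in zip(ca, cb))
        centers = new_centers
        if shift < tol:
            break
    return centers


# Two toy "partitions" standing in for distributed data blocks.
partitions = [
    [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2)],
    [(5.0, 5.1), (5.2, 4.9), (4.9, 5.0)],
]
centers = fcm(partitions, [[1.0, 1.0], [4.0, 4.0]])
```

The input data is read again on every iteration and only the small center list changes. This is why keeping the loaded data in memory between iterations, as the persistence step in the abstract suggests, is a natural fit for this algorithm.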
Authors
WU Yun-long (吴云龙)
LI Ling-juan (李玲娟)
School of Computer Science, Nanjing University of Posts and Telecommunications, Nanjing 210023, China
Source
Computer Technology and Development (《计算机技术与发展》)
2019, No. 1, pp. 130-134 (5 pages)
Funding
National Natural Science Foundation of China (61302158, 61571238)