Implementation and Application of Fuzzy Clustering Algorithm Based on Spark (cited by: 2)
Abstract: As a representative soft clustering algorithm, fuzzy c-means (FCM) can objectively handle clustering problems that involve fuzziness. To meet the need for real-time, accurate clustering of big data and to improve the efficiency of FCM on such data, we design a parallel implementation of FCM on Spark, a big data computing platform. The scheme uses HDFS for distributed storage of the underlying data, the RDD mechanism for data transformations during computation, and persistence to reuse intermediate results. To test the effectiveness of the parallel FCM algorithm, it is applied to an intrusion detection system. The KDD CUP 99 data set is first preprocessed; the algorithm is then used to cluster the data set, both before and after preprocessing, on a single machine and on a Spark cluster in order to detect intrusions, and the accuracy and timeliness of detection are compared. The results show that the Spark-based parallel FCM algorithm has good clustering robustness, convergence speed, and accuracy, and that its advantage is more pronounced on large-scale sample data.
Authors: WU Yun-long (吴云龙), LI Ling-juan (李玲娟) (School of Computer Science, Nanjing University of Posts and Telecommunications, Nanjing 210023, China)
Source: Computer Technology and Development (《计算机技术与发展》), 2019, No. 1, pp. 130-134 (5 pages)
Funding: National Natural Science Foundation of China (61302158, 61571238)
Keywords: cluster analysis; fuzzy c-means; Spark; intrusion detection
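
The abstract describes the parallel design only at a high level, so the following is a minimal sketch of how one FCM iteration can be split into per-partition work plus a global merge, not the authors' implementation. It uses plain Python lists to stand in for data partitions. The function names (`memberships`, `partial_sums`, `merge`, `fcm`), the fuzzifier `m = 2.0`, the tolerance, and the toy data are all assumptions made for illustration. On Spark, one plausible mapping would be `partial_sums` inside `RDD.mapPartitions`, `merge` inside `RDD.reduce`, and a persisted input RDD so that each iteration reuses the same cached data.

```python
import math
from functools import reduce


def memberships(point, centers, m=2.0):
    """Standard FCM membership of one point in every cluster (sums to 1)."""
    dists = [math.dist(point, c) for c in centers]
    # A point that coincides with a center belongs entirely to that center.
    zero = [i for i, d in enumerate(dists) if d == 0.0]
    if zero:
        return [1.0 if i == zero[0] else 0.0 for i in range(len(centers))]
    power = 2.0 / (m - 1.0)
    inv = [d ** -power for d in dists]
    total = sum(inv)
    return [v / total for v in inv]


def partial_sums(partition, centers, m=2.0):
    """Per-partition step: weighted coordinate sums and weight sums per cluster."""
    k, dim = len(centers), len(centers[0])
    num = [[0.0] * dim for _ in range(k)]
    den = [0.0] * k
    for x in partition:
        u = memberships(x, centers, m)
        for j in range(k):
            w = u[j] ** m
            den[j] += w
            for d in range(dim):
                num[j][d] += w * x[d]
    return num, den


def merge(a, b):
    """Global step: element-wise sum of two partial results."""
    num = [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a[0], b[0])]
    den = [p + q for p, q in zip(a[1], b[1])]
    return num, den


def fcm(partitions, centers, m=2.0, tol=1e-6, max_iter=100):
    """Iterate center updates until the largest center shift falls below tol."""
    centers = [list(c) for c in centers]
    for _ in range(max_iter):
        num, den = reduce(merge, (partial_sums(p, centers, m) for p in partitions))
        # Keep the old center if a cluster received no weight at all.
        new_centers = [
            [v / den[j] for v in num[j]] if den[j] > 0.0 else centers[j]
            for j in range(len(centers))
        ]
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers


# Toy data: two well-separated groups spread over two hypothetical partitions.
partitions = [
    [(0.0, 0.0), (0.2, 0.1), (10.0, 10.0)],
    [(0.1, 0.3), (9.8, 10.1), (10.2, 9.9)],
]
centers = fcm(partitions, centers=[(1.0, 1.0), (8.0, 8.0)])
print(centers)
```

Only the small `(num, den)` pairs cross partition boundaries in each iteration, while the data points themselves stay where they are. That is the property that makes caching the input between iterations worthwhile.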