Abstract
To improve the efficiency of big data clustering, a K-means clustering algorithm based on the Hadoop framework is proposed. Following the MapReduce model used by Hadoop, the data set is divided into many data blocks. In the Map phase, a weighted K-means clustering algorithm clusters each data block independently, yielding cluster centers and their weights. In the Reduce phase, a weighted-fusion K-means clustering algorithm merges the centers and weights produced in the Map phase to obtain the final clustering result. Clustering experiments on the HIGGS dataset show that the proposed algorithm substantially improves the running efficiency of K-means on big data while maintaining clustering accuracy.
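The two-phase scheme described in the abstract can be sketched in a single process as follows. This is only an illustrative sketch, not the authors' implementation: it does not use Hadoop, the function names (`weighted_kmeans`, `map_phase`, `reduce_phase`, `cluster`) are hypothetical, and it assumes that the "weight" of a cluster center is the number of points assigned to it, which the abstract does not specify.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def weighted_kmeans(points, weights, k, iters=20, seed=0):
    """Lloyd-style K-means in which each point counts `weight` times.

    Returns (centers, mass), where mass[j] is the total weight assigned
    to center j during the last assignment pass.
    """
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    mass = [0.0] * k
    for _ in range(iters):
        sums = [[0.0] * len(points[0]) for _ in range(k)]
        mass = [0.0] * k
        for p, w in zip(points, weights):
            j = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            mass[j] += w
            for d, x in enumerate(p):
                sums[j][d] += w * x
        for j in range(k):
            if mass[j] > 0:  # an empty cluster keeps its previous center
                centers[j] = [s / mass[j] for s in sums[j]]
    return centers, mass

def map_phase(block, k, seed=0):
    """Cluster one data block; every raw point has weight 1 (assumption)."""
    return weighted_kmeans(block, [1.0] * len(block), k, seed=seed)

def reduce_phase(partials, k, seed=0):
    """Fuse per-block centers by clustering them with their weights."""
    centers, weights = [], []
    for block_centers, block_weights in partials:
        for c, w in zip(block_centers, block_weights):
            if w > 0:  # drop centers that ended up with no points
                centers.append(c)
                weights.append(w)
    return weighted_kmeans(centers, weights, k, seed=seed)

def cluster(data, k, block_size):
    """Split data into blocks, run the Map step on each, then Reduce."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    partials = [map_phase(b, k) for b in blocks]
    return reduce_phase(partials, k)
```

Calling `cluster(data, k=2, block_size=50)` on a list of coordinate tuples returns the fused centers and their accumulated weights. Each block must contain at least `k` points. On a real cluster the per-block calls would run as independent Map tasks, which is where the efficiency gain claimed in the abstract would come from.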
Authors
李爽
陈瑞瑞
林楠
LI Shuang; CHEN Rui-rui; LIN Nan (School of Information Engineering, Zhengzhou University of Industrial Technology, Zhengzhou 451199, China; College of Software and Application of Science and Technology, Zhengzhou University, Zhengzhou 451199, China)
Source
《计算机工程与设计》
Peking University Core Journal (北大核心)
2018, No. 12, pp. 3734-3738 (5 pages)
Computer Engineering and Design
Funding
National Natural Science Foundation of China (61502204)