
Research on Optimization and Improvement of the Canopy-Kmeans Clustering Algorithm Based on the Hadoop Platform (cited by: 2)
Abstract: Building on an analysis of the Hadoop platform architecture and the Canopy-Kmeans clustering algorithm, this paper optimizes Canopy-Kmeans for parallel execution. Following a statistical approach, the data are grouped and sampled before clustering, which eases parallelization and reduces time complexity. The min-max principle is used to optimize the selection of the initial Canopy centers, a data dissimilarity-mean sampling method ensures that samples are drawn uniformly from the original data, and the Kmeans iteration process is also optimized. The improved algorithm is then designed and implemented in parallel on the MapReduce framework of the Hadoop platform. Experiments show that, when clustering massive numerical data, the improved parallel Canopy-Kmeans algorithm is effective and convergent, and that it improves both clustering accuracy and running time to some degree.
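The abstract does not spell out the seeding rule or the MapReduce job layout, so the following is only a minimal sketch under stated assumptions. It reads the "min-max principle" as the common max-min distance (farthest-first) rule, where each new center is the point whose distance to its nearest already-chosen center is largest. It also mimics one Kmeans iteration as a map step (assign each point to its nearest center) and a reduce step (average each group). The function names `max_min_seeds`, `map_assign`, `reduce_mean`, and `kmeans` are hypothetical, the Canopy thresholds and the sampling stage are omitted, and the code is plain Python rather than the Hadoop API.

```python
import math
from collections import defaultdict


def max_min_seeds(points, k):
    """Pick k seeds; each new seed is the point farthest from its nearest seed.

    Assumption: this farthest-first rule stands in for the paper's
    "min-max principle", whose exact definition the abstract does not give.
    """
    seeds = [points[0]]  # assumption: start from the first point
    # nearest[i] holds the distance from points[i] to its closest seed so far
    nearest = [math.dist(p, seeds[0]) for p in points]
    while len(seeds) < k:
        idx = max(range(len(points)), key=nearest.__getitem__)
        seeds.append(points[idx])
        nearest = [min(d, math.dist(p, points[idx]))
                   for d, p in zip(nearest, points)]
    return seeds


def map_assign(point, centers):
    """Map step: emit (index of the nearest center, point)."""
    idx = min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))
    return idx, point


def reduce_mean(members):
    """Reduce step: average all points that were assigned to one center."""
    n = len(members)
    return tuple(sum(coord) / n for coord in zip(*members))


def kmeans(points, k, max_iter=50, tol=1e-9):
    """Run Kmeans from max-min seeds until the centers stop moving."""
    centers = max_min_seeds(points, k)
    for _ in range(max_iter):
        groups = defaultdict(list)
        for p in points:
            idx, value = map_assign(p, centers)
            groups[idx].append(value)
        # Keep the old center when a cluster received no points.
        new_centers = [reduce_mean(groups[i]) if groups[i] else centers[i]
                       for i in range(k)]
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    print(kmeans(data, 2))
```

In a real Hadoop job the map and reduce steps would run on separate nodes, with the current centers distributed to every mapper before each iteration. Well-separated seeds tend to reduce the number of iterations needed, which is presumably part of the motivation for optimizing the initial center selection.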
Author: ZHOU Gongjian (周功建), Xiamen University Tan Kah Kee College, Zhangzhou, Fujian 363105, China
Source: Journal of Anhui Radio & TV University (《安徽广播电视大学学报》), 2018, No. 4, pp. 117-122, 128 (7 pages in total)
Funding: Key Project of the Fujian Province Education Science "13th Five-Year Plan" (Project No. FJJKCGZ16-165)
Keywords: Hadoop; MapReduce; cluster analysis; Kmeans algorithm; Canopy-Kmeans algorithm; speedup