Abstract
Building on an analysis of the Hadoop platform architecture and the Canopy-Kmeans clustering algorithm, this paper presents a parallelized and optimized version of Canopy-Kmeans. Following a statistical approach, the data are grouped and sampled before clustering, which eases parallelization and reduces time complexity. The min-max principle is used to optimize the selection of the initial Canopy center points, a data-dissimilarity mean sampling method ensures that samples are drawn uniformly from the original data, and the Kmeans iterative computation is also optimized. The improved algorithm is designed and implemented in parallel with the MapReduce framework on the Hadoop platform. Experiments show that, when clustering massive numerical data, the improved parallel Canopy-Kmeans algorithm is effective and convergent, with some improvement in both clustering accuracy and running time.
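The abstract does not give the details of the center-selection rule or of the MapReduce jobs, so the following is only a minimal sketch of the general idea under stated assumptions. It assumes the min-max principle means choosing each new initial center as the point farthest from its nearest already-chosen center, and it imitates the map and reduce steps of one Kmeans iteration in a single process rather than on Hadoop. All function names (max_min_centers, kmeans_map, kmeans_reduce, kmeans) are hypothetical and are not taken from the paper.

```python
import math
from collections import defaultdict


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def max_min_centers(points, k):
    """Assumed min-max seeding: start from the first point, then repeatedly
    add the point whose distance to its nearest chosen center is largest."""
    centers = [points[0]]
    # nearest[i] holds the distance from points[i] to its closest center so far
    nearest = [euclidean(p, centers[0]) for p in points]
    while len(centers) < k:
        idx = max(range(len(points)), key=nearest.__getitem__)
        centers.append(points[idx])
        nearest = [min(d, euclidean(p, points[idx]))
                   for p, d in zip(points, nearest)]
    return centers


def kmeans_map(point, centers):
    """Map step: emit (index of the nearest center, point)."""
    idx = min(range(len(centers)), key=lambda i: euclidean(point, centers[i]))
    return idx, point


def kmeans_reduce(assigned):
    """Reduce step: the new center is the mean of the points sharing one key."""
    n = len(assigned)
    return tuple(sum(col) / n for col in zip(*assigned))


def kmeans(points, k, max_iter=100, tol=1e-9):
    """Run map/reduce-shaped Kmeans iterations until the centers stop moving."""
    centers = max_min_centers(points, k)
    for _ in range(max_iter):
        groups = defaultdict(list)
        for p in points:
            idx, value = kmeans_map(p, centers)
            groups[idx].append(value)
        # Keep the old center if no point was assigned to it
        new_centers = [kmeans_reduce(groups[i]) if groups[i] else centers[i]
                       for i in range(k)]
        shift = max(euclidean(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers
```

Calling kmeans(points, k) on a list of numeric tuples returns k centers. In a real MapReduce job the map and reduce functions would run on separate data splits, with the current centers distributed to every mapper; that part is outside this sketch.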
Author
ZHOU Gongjian (周功建), Xiamen University Tan Kah Kee College, Zhangzhou, Fujian 363105, China
Source
Journal of Anhui Radio & TV University (《安徽广播电视大学学报》)
2018, No. 4, pp. 117-122, 128 (7 pages)
Funding
Key Project of the Fujian Province Education Science "13th Five-Year Plan" (Project No. FJJKCGZ16-165)