Abstract
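To address the inability of traditional k-means clustering to handle massive datasets effectively, this paper proposes an accelerated k-means clustering method based on parallel computing (K-means clustering based on parallel computing, Pk-means). The method first randomly partitions the massive sample set into several independent and identically distributed clustering working sets and runs traditional k-means clustering on each working set in parallel, obtaining the corresponding cluster centers and radii. It then merges the sub-clusters obtained in each working set by measuring the relationships between the clustering results of the different subsets, and performs a secondary merge on special data to correct the clustering result, so that massive-data clustering problems can be handled effectively. Experimental results show that, on large-scale datasets, the Pk-means method substantially improves clustering efficiency while maintaining clustering quality.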
To address the inability of the traditional k-means clustering algorithm to handle large-scale datasets efficiently, this paper presents an accelerated k-means clustering method based on parallel computing, called the Pk-means clustering algorithm. The large-scale sample set is divided into several independent and identically distributed clustering working sets, and the traditional k-means clustering method is executed on every working set. The center and radius of every cluster are then computed, and the clustering results of all working sets are combined according to the relationships between the different working sets. Finally, the small number of remaining special samples are clustered using the earlier results. Because parallel computing is used throughout this process, clustering efficiency is greatly improved and the method can be applied to large-scale clustering problems. Simulation results demonstrate that the parallel accelerated k-means method achieves excellent clustering efficiency.
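The pipeline described in the abstract (random partition into working sets, k-means on each set in parallel, merging of sub-clusters by center and radius, then a final assignment pass) can be illustrated with a minimal Python sketch. The abstract does not give the merge criterion, the seeding scheme, or the rule for "special" samples, so everything specific here is an assumption for illustration only: sub-clusters are merged when the distance between their centers is at most the sum of their radii, k-means is seeded by farthest-point selection to keep the demo deterministic, and the secondary merge is approximated by assigning every sample to its nearest merged center. The names `pk_means`, `merge_subclusters`, and `make_blobs` are hypothetical, and a thread pool stands in for whatever parallel platform the paper actually used.

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor


def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(points):
    return tuple(sum(c) / len(points) for c in zip(*points))


def nearest(p, centers):
    return min(range(len(centers)), key=lambda i: dist(p, centers[i]))


def assign(points, centers):
    groups = [[] for _ in centers]
    for p in points:
        groups[nearest(p, centers)].append(p)
    return groups


def kmeans(points, k, iters=10):
    """Lloyd's k-means on one working set; returns (center, radius, size) triples."""
    # Farthest-point seeding: an assumption, chosen to keep the demo deterministic.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    for _ in range(iters):
        groups = assign(points, centers)
        # An empty cluster keeps its previous center.
        centers = [mean(g) if g else c for g, c in zip(groups, centers)]
    groups = assign(points, centers)
    return [(c, max(dist(p, c) for p in g), len(g))
            for c, g in zip(centers, groups) if g]


def merge_subclusters(subs):
    """Greedy merge. Assumed rule: merge when center distance <= sum of radii."""
    merged = []  # each entry is [center, radius, size]
    for c, r, n in subs:
        for m in merged:
            if dist(c, m[0]) <= r + m[1]:
                total = m[2] + n
                new_c = tuple((a * m[2] + b * n) / total for a, b in zip(m[0], c))
                # Conservative radius that still covers both original balls.
                m[1] = max(dist(new_c, m[0]) + m[1], dist(new_c, c) + r)
                m[0], m[2] = new_c, total
                break
        else:
            merged.append([c, r, n])
    return merged


def pk_means(points, k, n_sets=4, seed=0):
    # 1. Random partition into working sets of roughly equal size.
    shuffled = points[:]
    random.Random(seed).shuffle(shuffled)
    working_sets = [shuffled[i::n_sets] for i in range(n_sets)]
    # 2. Traditional k-means on every working set, run concurrently.
    with ThreadPoolExecutor(max_workers=n_sets) as pool:
        results = list(pool.map(lambda ws: kmeans(ws, k), working_sets))
    # 3. Merge sub-clusters across working sets.
    merged = merge_subclusters([s for result in results for s in result])
    centers = [m[0] for m in merged]
    # 4. Stand-in for the secondary merge: nearest merged center for every sample.
    labels = [nearest(p, centers) for p in points]
    return centers, labels, [m[2] for m in merged]


def make_blobs(seed=1, n=100):
    """Two well-separated synthetic blobs (illustrative data, not from the paper)."""
    rng = random.Random(seed)
    def blob(cx, cy):
        return [(cx + rng.uniform(-1, 1), cy + rng.uniform(-1, 1)) for _ in range(n)]
    return blob(0, 0) + blob(100, 100)
```

Calling `pk_means(make_blobs(), k=2)` returns the merged centers, a label per sample, and the merged cluster sizes. Note that the number of merged clusters is determined by the assumed overlap rule and is not forced to equal k, and that in CPython a thread pool mainly illustrates the structure; a process pool or a distributed framework would be needed for a real speedup on CPU-bound clustering.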
Source
《电脑知识与技术》
2013, Issue 6X, pp. 4299-4302 (4 pages)
Computer Knowledge and Technology
Keywords
k-means clustering
parallel computing
parallel k-means clustering
working set
efficiency