摘要
针对大数据背景下基于划分的聚类算法中存在初始中心敏感,节点间通信开销大以及集群效率低下等问题,提出了基于网格密度和局部敏感哈希函数的PBGDLSH-MR并行化聚类算法。首先,对初始数据集提出网格密度策略(GDS)获取初始中心点,有效避免了随机选取引起的初始中心敏感的问题;其次,提出基于局部敏感哈希函数的数据分区(DP-LSH)用于投射关联性较大的数据对象到同一子数据集中,得到map上的数据分区,并设计相似性度量公式(SI)对数据分区结果进行评价,从而降低了节点间的通信开销;接着设计自适应分组策略(AGS)处理数据分区中数据倾斜的问题,进而有效地提高了集群效率;最后,结合MapReduce计算模型并行挖掘簇中心,生成最终聚类结果。实验结果表明,PBGDLSH-MR算法的聚类效果更佳,同时在大数据环境下能有效地提高并行计算的效率。
Aiming at the problems of sensitivity of initial center,high communication overhead of nodes and low efficiency of cluster in big data clustering algorithm based on partitioning,this paper proposed a partitioning-based clustering algorithm using grid density and locality sensitive hash function based on MapReduce,named PBGDLSH-MR.Firstly,based on the initial dataset,it proposed the GDS(grid density strategy)to get the initial clustering center,which avoided the sensitivity of initial center caused by random selection of initial cluster center.Secondly,it proposed the DP-LSH(data partitioning based on locality sensitive hash functions)to map more closely related data objects into the same subdataset and get data partitions on the map.Meanwhile,it designed a formula SI(similarity improvement)to evaluate the data partitioning results,reduced the communication overhead between nodes.In addition,this paper designed an AGS(adaptive grouping strategy)to handle data skew in data partitions,which improved the cluster efficiency.Finally,based on MapReduce,it mined the cluster centers in parallel to gene-rate the final clustering results.The experimental results show that the PBGDLSH-MR has better clustering results and performs better parallelization in big data.
作者
毛伊敏
陶涛
曹文梁
Mao Yimin;Tao Tao;Cao Wenliang(School of Information Engineering,Jiangxi University of Science&Technology,Ganzhou Jiangxi 341000,China;Dept.of Computer Engineering,Dongguan Polytechnic,Dongguan Guangdong 518172,China)
出处
《计算机应用研究》
CSCD
北大核心
2021年第5期1422-1427,共6页
Application Research of Computers
基金
国家重点研发计划资助项目(2018YFC1504705)
国家自然科学基金资助项目(41562019)
广东省普通高校特色创新(自然科学)资助项目(2019GKTSCX142,2017GKTSCX101)。