
Parallel clustering algorithm based on grid density and locality-sensitive hash functions (cited by: 6)

Partitioning-based clustering algorithm using grid density and locality sensitive hash function based on MapReduce
Abstract: Partitioning-based clustering algorithms in big-data settings suffer from sensitivity to the initial centers, high communication overhead between nodes, and low cluster efficiency. To address these problems, this paper proposes PBGDLSH-MR, a parallel partitioning-based clustering algorithm that uses grid density and locality-sensitive hash functions on MapReduce. First, a grid density strategy (GDS) is applied to the initial dataset to obtain the initial cluster centers, which avoids the sensitivity caused by selecting the initial centers at random. Second, a data partitioning method based on locality-sensitive hash functions (DP-LSH) maps closely related data objects into the same sub-dataset to produce the data partitions on the map side, and a similarity measure, SI (similarity improvement), is designed to evaluate the partitioning results, which reduces the communication overhead between nodes. Third, an adaptive grouping strategy (AGS) handles data skew in the data partitions, which improves cluster efficiency. Finally, the cluster centers are mined in parallel with the MapReduce computing model to generate the final clustering result. Experimental results show that PBGDLSH-MR achieves better clustering quality and improves the efficiency of parallel computation in big-data environments.
Authors: Mao Yimin; Tao Tao; Cao Wenliang (School of Information Engineering, Jiangxi University of Science & Technology, Ganzhou, Jiangxi 341000, China; Dept. of Computer Engineering, Dongguan Polytechnic, Dongguan, Guangdong 518172, China)
Source: Application Research of Computers (《计算机应用研究》), CSCD, Peking University Core Journal, 2021, No. 5, pp. 1422-1427 (6 pages)
Funding: National Key R&D Program of China (2018YFC1504705); National Natural Science Foundation of China (41562019); Guangdong Province Universities Characteristic Innovation (Natural Science) Project (2019GKTSCX142, 2017GKTSCX101).
Keywords: big data; parallel clustering; grid density; hash functions; MapReduce
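
The abstract names two ideas that are easy to illustrate: choosing initial centers from dense grid cells, and using locality-sensitive hashing to place nearby points in the same partition. The following is a minimal single-machine sketch of those two ideas only. It is not the paper's GDS or DP-LSH: the abstract does not give their formulas, so the function names `grid_density_seeds` and `lsh_partition`, the parameters `cell_size`, `num_hashes`, and `bucket_width`, and the choice of Gaussian random-projection hashes are all assumptions made for illustration. The SI measure, AGS, and the MapReduce stage are not modeled.

```python
import random
from collections import defaultdict


def grid_density_seeds(points, k, cell_size):
    """Return centroids of the k most populated grid cells (illustrative only)."""
    cells = defaultdict(list)
    for p in points:
        # Assign each point to the axis-aligned cell that contains it.
        cells[tuple(int(x // cell_size) for x in p)].append(p)
    # Density here is simply the number of points per cell (an assumption).
    densest = sorted(cells.values(), key=len, reverse=True)[:k]
    return [tuple(sum(axis) / len(cell) for axis in zip(*cell)) for cell in densest]


def lsh_partition(points, num_hashes, bucket_width, seed=0):
    """Group points by a random-projection LSH key (illustrative only)."""
    rng = random.Random(seed)
    dim = len(points[0])
    # Each hash is h(v) = floor((a . v + b) / w) with Gaussian a and uniform b.
    hashes = [
        ([rng.gauss(0.0, 1.0) for _ in range(dim)], rng.uniform(0.0, bucket_width))
        for _ in range(num_hashes)
    ]
    partitions = defaultdict(list)
    for p in points:
        key = tuple(
            int((sum(a_i * x_i for a_i, x_i in zip(a, p)) + b) // bucket_width)
            for a, b in hashes
        )
        # Points that are close together tend to share a key, hence a partition.
        partitions[key].append(p)
    return dict(partitions)
```

In this sketch, each key of the dictionary returned by `lsh_partition` would stand in for one sub-dataset handed to a map task, and the seeds from `grid_density_seeds` would replace randomly chosen initial centers. Note that adjacent dense cells can belong to the same cluster, so a practical version would need some way to merge or separate neighboring cells.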
