
A DBSCAN Text Clustering Algorithm Based on the Spark Framework (cited by: 2)

Parallel DBSCAN Algorithm Based on Spark Framework in Text Classification
Abstract: The Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm has a performance bottleneck and consumes large amounts of memory and disk I/O, and its performance degrades markedly as the data size grows. To address these problems, a parallel clustering scheme built on a big-data computing framework is proposed. The Spark computing framework is used to parallelize DBSCAN. Because density is hard to define for high-dimensional data, a shared nearest neighbor (SNN) similarity graph is used to measure the similarity between two high-dimensional data points. Running DBSCAN on the Spark platform also alleviates the memory shortage. Experimental results show that, compared with single-machine DBSCAN, the parallel algorithm does not lose clustering accuracy; adding nodes horizontally increases the available runtime memory, which eases memory pressure and reduces the running time. Compared with the Hadoop-based parallel DBSCAN algorithm, it also achieves a better speedup ratio.
Author: NING Jianfei (宁建飞) (Electronic Information Department, Luoding Polytechnic, Luoding 527200, Guangdong, China)
Source: Journal of Shantou University (Natural Science Edition), 2018, No. 2, pp. 73-80 (8 pages)
Funding: Guangdong Vocational Education Informatization Research Association fund project (YZJY161724)
Keywords: DBSCAN clustering; big data; parallel algorithm; SNN similarity; Spark computing platform
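
The abstract's idea of replacing a distance-based density with SNN similarity can be illustrated with a small single-machine sketch. This is not the paper's implementation, and it does not cover the Spark parallelization: the function names (`knn_lists`, `snn_similarity`, `snn_dbscan`) and the parameters `k`, `min_shared`, and `min_pts` are hypothetical, chosen here for illustration. It assumes the common SNN definition in which two points are similar only if each is in the other's k-nearest-neighbor list, and their similarity is the number of neighbors they share.

```python
from collections import deque


def knn_lists(points, k):
    """For each point, return the set of indices of its k nearest neighbours (Euclidean)."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    neighbours = []
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        others.sort(key=lambda j: dist2(p, points[j]))
        neighbours.append(set(others[:k]))
    return neighbours


def snn_similarity(nn, i, j):
    """Shared-nearest-neighbour similarity between points i and j.

    Assumption: the similarity is zero unless i and j appear in each
    other's k-nearest-neighbour lists.
    """
    if i not in nn[j] or j not in nn[i]:
        return 0
    return len(nn[i] & nn[j])


def snn_dbscan(points, k, min_shared, min_pts):
    """DBSCAN on the SNN graph; returns one label per point, -1 meaning noise.

    Two points are linked when they share at least `min_shared` neighbours.
    A point is a core point when it has at least `min_pts` links.
    """
    n = len(points)
    nn = knn_lists(points, k)
    adj = [
        [j for j in range(n) if j != i and snn_similarity(nn, i, j) >= min_shared]
        for i in range(n)
    ]
    core = [len(adj[i]) >= min_pts for i in range(n)]

    labels = [-1] * n
    cluster = 0
    for i in range(n):
        if not core[i] or labels[i] != -1:
            continue
        # Grow a new cluster outward from this unlabelled core point.
        labels[i] = cluster
        queue = deque([i])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if labels[v] == -1:
                    labels[v] = cluster
                    if core[v]:  # only core points keep expanding the cluster
                        queue.append(v)
        cluster += 1
    return labels


# Two tight groups of four points plus one far-away outlier.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
print(snn_dbscan(points, k=3, min_shared=2, min_pts=2))
# prints [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

Note that `min_shared` is a similarity threshold, so larger values are stricter, the opposite of DBSCAN's usual distance radius. The mutual-neighbor condition is what keeps the outlier out of the second group: its nearest neighbors all lie in that group, but none of them lists it in return.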
