Abstract
The Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm suffers from a performance bottleneck and from heavy memory and disk I/O consumption, and its performance declines markedly as the data size grows. To address these problems, a parallel clustering scheme built on a big-data computing framework is proposed. The Spark framework is used to parallelize DBSCAN. Because density is hard to define for high-dimensional data, an SNN (shared nearest neighbor) similarity graph is used to measure the similarity between data points. Running DBSCAN on the Spark platform also alleviates the memory shortage. Experimental results show that, compared with single-machine DBSCAN, the proposed scheme does not lose clustering accuracy, and that adding nodes horizontally increases the available runtime memory, which reduces the running time while relieving memory pressure. Compared with a Hadoop-based DBSCAN algorithm, it also achieves a better speedup ratio.
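The idea of replacing a distance-based density with SNN similarity can be illustrated with a minimal single-machine sketch. This is not the paper's Spark implementation; it only assumes the common definition of SNN similarity as the number of neighbors shared by the k-nearest-neighbor lists of two points. The function names and the parameters `k`, `eps`, and `min_pts` are hypothetical and chosen for illustration.

```python
import math


def knn_sets(points, k):
    """For each point, return the set of indices of its k nearest neighbours."""
    neighbours = []
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        others.sort(key=lambda j: math.dist(p, points[j]))
        neighbours.append(set(others[:k]))
    return neighbours


def snn_similarity(neighbours, i, j):
    """SNN similarity: how many k-nearest neighbours points i and j share."""
    return len(neighbours[i] & neighbours[j])


def snn_dbscan(points, k, eps, min_pts):
    """DBSCAN-style clustering where 'closeness' is SNN similarity >= eps.

    Returns one label per point; -1 marks noise.
    """
    n = len(points)
    nn = knn_sets(points, k)
    # Region of i: all points that are similar enough to i under SNN.
    region = [
        [j for j in range(n) if j != i and snn_similarity(nn, i, j) >= eps]
        for i in range(n)
    ]
    core = [len(r) >= min_pts for r in region]

    labels = [-1] * n
    cluster = 0
    for i in range(n):
        if not core[i] or labels[i] != -1:
            continue
        # Grow a new cluster outward from an unlabelled core point.
        labels[i] = cluster
        stack = [i]
        while stack:
            p = stack.pop()
            for q in region[p]:
                if labels[q] == -1:
                    labels[q] = cluster
                    if core[q]:  # only core points keep expanding
                        stack.append(q)
        cluster += 1
    return labels


if __name__ == "__main__":
    # Two well-separated groups of four points each.
    pts = [(0, 0), (0, 1), (1, 0), (1, 1),
           (10, 10), (10, 11), (11, 10), (11, 11)]
    print(snn_dbscan(pts, k=3, eps=2, min_pts=2))
```

The brute-force neighbor search here is quadratic in the number of points. A distributed version would have to partition the neighbor computation and merge clusters across partitions, and this sketch does not attempt either step.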
Author
NING Jianfei (宁建飞) (Electronic Information Department, Luoding Polytechnic, Luoding 527200, Guangdong, China)
Source
Journal of Shantou University: Natural Science Edition (《汕头大学学报(自然科学版)》)
2018, Issue 2, pp. 73-80 (8 pages)
Funding
Funded project of the Guangdong Vocational Education Informatization Research Association (广东职业教育信息化研究会), grant YZJY161724