Parallel Implementation of the DBSCAN Clustering Algorithm Based on MapReduce

Published English title: The Realization of MapReduce-based DBSCAN Density-based Clustering Method

Abstract: DBSCAN is a simple, effective density-based clustering algorithm that finds high-density regions separated by low-density regions. It is one of the most widely used clustering algorithms and among the most cited in the scientific literature. For high-dimensional data, the time complexity of DBSCAN is O(n²), which becomes a challenge now that real-world datasets have grown to very large scale. This paper proposes an efficient parallel DBSCAN algorithm and implements it on the MapReduce platform. First, the preprocessed data is partitioned. Next, a local DBSCAN run clusters each partitioned subspace. Finally, a merge algorithm combines the clusters produced in the previous phase. Experimental results verify the effectiveness of the parallel algorithm.
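The three phases named in the abstract (partition, local DBSCAN, merge) can be illustrated with a small single-process sketch. This is not the authors' implementation and does not use Hadoop. The function names, the x-axis strip partitioning with an eps-wide overlap, the requirement that `width >= eps`, and the union-find merge rule are all assumptions chosen for illustration. The local step uses a brute-force O(n²) neighbor search, which is the cost that partitioning is meant to reduce.

```python
from collections import defaultdict
from math import dist


def local_dbscan(points, eps, min_pts):
    """Brute-force DBSCAN. Returns (labels, is_core); label -1 means noise."""
    n = len(points)
    # O(n^2) neighbor search over the points of one partition.
    nbrs = [[j for j in range(n) if dist(points[i], points[j]) <= eps]
            for i in range(n)]
    is_core = [len(nb) >= min_pts for nb in nbrs]
    labels = [-1] * n
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or not is_core[i]:
            continue
        labels[i] = cluster
        stack = [i]  # holds core points whose neighbors still need visiting
        while stack:
            p = stack.pop()
            for q in nbrs[p]:
                if labels[q] == -1:
                    labels[q] = cluster
                    if is_core[q]:
                        stack.append(q)
        cluster += 1
    return labels, is_core


def partition(points, eps, width):
    """Assign point ids to x-axis strips, each widened by eps on both sides.

    Assumes width >= eps, so a point can only fall into its home strip
    or the two adjacent ones.
    """
    parts = defaultdict(list)
    for pid, (x, _) in enumerate(points):
        home = int(x // width)
        for k in (home - 1, home, home + 1):
            if k * width - eps <= x < (k + 1) * width + eps:
                parts[k].append(pid)
    return parts


def parallel_dbscan(points, eps, min_pts, width):
    """Partition, cluster each strip independently, then merge local clusters."""
    parent = {}

    def find(c):
        parent.setdefault(c, c)
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    memberships = defaultdict(list)  # point id -> [(local cluster key, core flag)]
    # "Map" phase: every strip could be processed by a separate worker.
    for k, ids in partition(points, eps, width).items():
        labels, is_core = local_dbscan([points[i] for i in ids], eps, min_pts)
        for pid, lab, core in zip(ids, labels, is_core):
            if lab != -1:
                memberships[pid].append(((k, lab), core))
    # "Reduce" phase: local clusters sharing a point that is core in at
    # least one strip belong to the same global cluster.
    for entries in memberships.values():
        anchors = [c for c, core in entries if core]
        if anchors:
            for c, _ in entries:
                parent[find(c)] = find(anchors[0])
    result = [-1] * len(points)
    for pid, entries in memberships.items():
        result[pid] = find(entries[0][0])
    return result
```

With `eps=0.35`, `min_pts=3`, and `width=1.0`, a chain of points that crosses the strip boundary at x = 1.0 is first clustered separately in two strips and then joined by the merge step, because the duplicated boundary points are core in at least one strip. Global labels here are `(strip, local_label)` tuples; a real job would map them to integers.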

Authors: LIN A-di (林阿弟), CHEN Xiao-feng (陈晓锋) (Department of Computer Science, Xiamen University, Xiamen 361005, China)

Affiliation: Department of Computer Science, Xiamen University

Source: Computer Knowledge and Technology (《电脑知识与技术》), 2015, Issue 4, pp. 161-164 (4 pages)

Keywords: DBSCAN; MapReduce; clustering algorithms; parallel algorithms; data mining

Classification number: TP391 [Automation and Computer Technology: Computer Application Technology]
