Abstract
Density peak clustering (DPC) is a recently proposed, efficient density-based clustering algorithm. It can cluster data with non-spherical distributions, has few parameters to tune, and runs quickly. However, computing each data object's density and its distance to the nearest point of higher density requires pairwise distance measurements, giving a time complexity of O(n²). In the big data era, and especially on massive high-dimensional data, this severely limits the algorithm's efficiency. To improve its efficiency and scalability, we exploit Spark's strengths in in-memory and iterative computation and propose ELSDPC (an efficient distributed density peak clustering algorithm based on E2LSH partition with Spark). ELSDPC exploits the local nature of DPC and uses locality-sensitive hashing (LSH) to assign neighboring points to the same partition. Experimental analysis shows that the algorithm effectively improves the scalability and time efficiency of clustering while maintaining relatively high accuracy.
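The two ingredients named in the abstract can be illustrated with a minimal single-machine sketch. It is not the authors' Spark implementation. It assumes the cutoff-kernel form of DPC density (the number of neighbors within a cutoff distance `dc`) and the standard E2LSH hash h(v) = floor((a·v + b) / w) with Gaussian `a` and uniform `b`. The function names, `dc`, `w`, and `seed` are hypothetical and chosen only for illustration.

```python
import math
import random


def dpc_rho_delta(points, dc):
    """Return (rho, delta) for each point using all pairwise distances.

    rho[i]   : number of other points closer than the cutoff dc (assumed kernel).
    delta[i] : distance to the nearest point with strictly higher rho;
               for points with no denser neighbor, the largest distance from i.
    The full distance matrix is what makes this step O(n^2).
    """
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < dc)
           for i in range(n)]
    delta = []
    for i in range(n):
        higher = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        delta.append(min(higher) if higher else max(dist[i]))
    return rho, delta


def make_e2lsh_hash(dim, w, seed=0):
    """Build one E2LSH hash function h(v) = floor((a . v + b) / w).

    a is drawn from a standard Gaussian (2-stable), b uniformly from [0, w).
    Nearby points tend to fall into the same bucket, which is the property
    a partitioner could use to keep neighbors together.
    """
    rng = random.Random(seed)
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    b = rng.uniform(0.0, w)

    def h(v):
        projection = sum(ai * vi for ai, vi in zip(a, v))
        return math.floor((projection + b) / w)

    return h


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
rho, delta = dpc_rho_delta(points, dc=1.5)
h = make_e2lsh_hash(dim=2, w=4.0, seed=1)
buckets = [h(p) for p in points]
```

In a distributed setting, one plausible use of `buckets` is as a partition key, so that `dpc_rho_delta` runs only within each bucket rather than over the whole data set; how ELSDPC combines and corrects the per-partition results is not described in the abstract.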
Authors
何仝
徐蔚鸿
马红华
曾水玲
HE Tong; XU Wei-hong; MA Hong-hua; ZENG Shui-ling (School of Computer and Communication Engineering, Changsha University of Science and Technology, Changsha, Hunan 440114, China; Hunan Provincial Key Laboratory of Intelligent Processing of Big Data on Transportation, Changsha University of Science and Technology, Changsha, Hunan 440114, China; Zixing City Science Bureau, Chenzhou, Hunan 23400, China)
Source
《计算技术与自动化》
2019, No. 2, pp. 64-71 (8 pages)
Computing Technology and Automation
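Funding
National Natural Science Foundation of China (61363033)
Hunan Provincial Science and Technology Service Platform Special Project (2012TP1001)
Key Project of the Hunan Provincial Department of Education (17A007)
Hunan Provincial Key Laboratory of Intelligent Processing of Big Data on Transportation Project (2015TP1005)
Changsha Science and Technology Plan Project (KQ1703018, KQ1706064)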