Abstract
To address the excessive running time of current clustering algorithms on large data sets, this paper proposes a data set compression algorithm based on nearest-neighbor similarity. The algorithm groups several mutually nearest data points into a data cluster and randomly selects a cluster head from each cluster to form a new data set, which greatly reduces the size of the data. The compressed data set is then clustered with the K-means algorithm and the AP algorithm, respectively. Experimental results show that, compared with clustering the original data set, clustering the compressed data set keeps the clustering accuracy essentially unchanged while effectively reducing the clustering time and improving clustering performance, which demonstrates the effectiveness and reliability of the compression algorithm in cluster analysis.
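The compression step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's exact grouping rule, so everything here is an assumption made for illustration: the function name `compress_dataset`, the fixed `group_size` parameter, the greedy scan order, and the use of Euclidean distance as the similarity measure.

```python
import math
import random


def compress_dataset(points, group_size=3, seed=0):
    """Greedy nearest-neighbor grouping (illustrative, not the paper's exact rule).

    Returns (heads, groups): one randomly chosen head per group, and the
    index lists of the groups themselves.
    """
    rng = random.Random(seed)
    unassigned = set(range(len(points)))
    heads, groups = [], []
    for i in range(len(points)):
        if i not in unassigned:
            continue  # already absorbed into an earlier group
        unassigned.remove(i)
        # Rank the remaining points by Euclidean distance to point i;
        # the index breaks ties so the ordering is deterministic.
        nearest = sorted(
            unassigned, key=lambda j: (math.dist(points[i], points[j]), j)
        )
        members = [i] + nearest[: group_size - 1]
        unassigned.difference_update(members)
        groups.append(members)
        # Randomly select the cluster head that represents this group.
        heads.append(points[rng.choice(members)])
    return heads, groups


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
heads, groups = compress_dataset(points, group_size=3)
print(len(points), "points compressed to", len(heads), "heads")
```

Under these assumptions, the returned `heads` would be the input handed to a K-means or AP implementation in place of the full data set, so the clustering cost scales with the number of groups rather than the number of original points.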
Authors
Zhao Yanlong (赵延龙)
Hua Nan (滑楠)
College of Information & Navigation, Air Force Engineering University, Xi'an 710077, China
Source
Application Research of Computers (《计算机应用研究》)
CSCD
Peking University Core Journals (北大核心)
2018, No. 5, pp. 1450-1453 (4 pages)
Keywords
clustering
data compression
clustering performance