Abstract
Clustering is an important research topic in data mining. Clustering in high-dimensional space is especially difficult because the data are sparsely distributed, noise points are numerous, and the gap between a data point's distances to its nearest and farthest neighbors tends to zero. After analyzing the limitations of existing clustering algorithms, the paper introduces concepts such as the k-neighborhood point set and the k-neighborhood radius, and proposes k-PCLDHD, a single-parameter k-neighborhood local-density clustering algorithm for high-dimensional space. To improve efficiency, it further defines concepts such as the reference distance and preprocesses the data objects with "double reference points", which reduces the cost of scanning the data set and greatly improves efficiency; the resulting optimized algorithm is k-LDCHD. Theoretical analysis and experimental results indicate that the algorithm can effectively solve the clustering problem in high-dimensional space and is both effective and feasible.
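The abstract names the k-neighborhood radius and a preprocessing step based on two reference points, but does not give their formal definitions. The following is only a minimal sketch of how such a step could work, not the paper's k-PCLDHD or k-LDCHD. It assumes that the k-neighborhood radius of a point is the distance to its k-th nearest other point, that the two reference points are the coordinate-wise minimum and maximum corners of the data, and that the precomputed reference distances are used through the triangle inequality to skip distance computations. All function names are hypothetical.

```python
import heapq
import math


def reference_distances(points):
    """Preprocessing: distance from two reference points to every data point.

    Assumption: the reference points are the coordinate-wise min and max
    corners of the data set; the abstract does not say how they are chosen.
    """
    dims = range(len(points[0]))
    lo = [min(p[d] for p in points) for d in dims]
    hi = [max(p[d] for p in points) for d in dims]
    return [[math.dist(ref, p) for p in points] for ref in (lo, hi)]


def k_radius_brute(points, i, k):
    """Distance from points[i] to its k-th nearest other point (full scan)."""
    dists = sorted(math.dist(points[i], q)
                   for j, q in enumerate(points) if j != i)
    return dists[k - 1]


def k_radius_pruned(points, i, k, ref_dists):
    """Same value as k_radius_brute, computing fewer exact distances.

    For any reference point r, |d(r, p) - d(r, q)| <= d(p, q), so the
    largest such difference is a lower bound on the true distance.
    Requires k < len(points).
    """
    candidates = sorted(
        (max(abs(rd[i] - rd[j]) for rd in ref_dists), j)
        for j in range(len(points)) if j != i
    )
    best = []      # negated distances: a max-heap of the k smallest so far
    computed = 0   # number of exact distance evaluations
    for lower_bound, j in candidates:
        if len(best) == k and lower_bound >= -best[0]:
            break  # no remaining candidate can beat the current k-th distance
        d = math.dist(points[i], points[j])
        computed += 1
        if len(best) < k:
            heapq.heappush(best, -d)
        elif d < -best[0]:
            heapq.heapreplace(best, -d)
    return -best[0], computed


if __name__ == "__main__":
    data = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.2], [5.0, 5.0], [5.1, 5.0]]
    refs = reference_distances(data)
    radius, evaluated = k_radius_pruned(data, 0, 2, refs)
    print(radius, evaluated)
```

A small k-neighborhood radius indicates a locally dense region, so a density-based method could compare these radii to separate cluster cores from noise. How the paper actually does this is not stated in the abstract.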
Source
《计算机研究与发展》
EI
CSCD
Peking University Core Journals (北大核心)
2005, No. 5, pp. 784-791 (8 pages)
Journal of Computer Research and Development
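Funding
National Natural Science Foundation of China (70371015)
Specialized Research Fund for the Doctoral Program of Higher Education, Ministry of Education (20040286009)
Keywords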
k-neighbor radius
double reference point
reference radius
high dimensional space