Abstract
Hierarchical clustering is a widely used unsupervised learning technique in data mining and machine learning. However, hierarchical clustering algorithms rely mainly on a manually set similarity threshold to decide when clusters are merged or split, and this threshold is difficult to choose without prior knowledge. This paper proposes a hierarchical clustering algorithm for categorical data based on the mean of similarity, combined with an allocation strategy for boundary data objects. The algorithm uses the mean of similarity, which characterizes the central tendency of the distribution of data objects in the data set, together with a steady similarity measure, as the main basis for merging or splitting clusters. A formula for computing the mean of similarity is given, so that the similarity threshold is determined automatically and no longer has to be set by hand. Using the mean of similarity, an allocation strategy for boundary data objects is also given, which effectively improves the accuracy of boundary object assignment and the clustering quality. Experiments on UCI and synthetic data sets show that the algorithm achieves good clustering performance and noise resistance, and confirm the stability and effectiveness of the mean of similarity.
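The idea of deriving the merge threshold from the data can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the abstract does not state the formula for the mean of similarity, the steady similarity measure, or the boundary allocation rule. The sketch assumes simple matching similarity between categorical records, takes the plain average of all pairwise similarities as the threshold, and uses average linkage for merging. All function names and the toy data are hypothetical.

```python
from itertools import combinations


def simple_matching(a, b):
    """Fraction of attributes on which two categorical records agree (assumed measure)."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def mean_similarity(data):
    """Plain average of all pairwise similarities (assumed stand-in for the paper's formula)."""
    pairs = list(combinations(range(len(data)), 2))
    return sum(simple_matching(data[i], data[j]) for i, j in pairs) / len(pairs)


def average_link(c1, c2, data):
    """Average similarity between the members of two clusters (lists of row indices)."""
    total = sum(simple_matching(data[i], data[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def cluster(data):
    """Agglomerative merging that stops once no pair of clusters exceeds the data-derived threshold."""
    threshold = mean_similarity(data)
    clusters = [[i] for i in range(len(data))]
    while len(clusters) > 1:
        # Pick the most similar pair of clusters; ties are broken by index.
        best, bi, bj = max(
            (average_link(clusters[i], clusters[j], data), i, j)
            for i, j in combinations(range(len(clusters)), 2)
        )
        if best <= threshold:
            break  # no remaining pair is more similar than the data set average
        clusters[bi] = clusters[bi] + clusters[bj]
        del clusters[bj]  # bj > bi, so index bi is unaffected
    return threshold, clusters


# Hypothetical toy data: six records with three categorical attributes each.
toy = [
    ("red", "round", "small"),
    ("red", "round", "large"),
    ("red", "oval", "small"),
    ("blue", "square", "large"),
    ("blue", "square", "small"),
    ("blue", "flat", "large"),
]

threshold, clusters = cluster(toy)
print(threshold, clusters)
```

In this sketch no threshold parameter is passed in. The stopping point depends only on the data, which is the property the abstract emphasizes.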
Authors
CHU Ke-xin (褚轲欣)
XUN Ya-ling (荀亚玲)
School of Computer Science and Technology, Taiyuan University of Science and Technology, Taiyuan 030024, China
Source
Computer Technology and Development (《计算机技术与发展》)
2022, Issue 11, pp. 154-163 (10 pages)
Funding
National Natural Science Foundation of China (61602335)
Natural Science Foundation of Shanxi Province (201901D211302)
Keywords
hierarchical clustering
categorical data
mean of similarity
steady similarity measure
allocation strategy