Abstract
Mixed-attribute data are difficult to cluster well because numerical attributes and categorical attributes call for different similarity measures. To address this, the similarity-measurement problem in mixed-attribute clustering is analyzed and an entropy-based clustering algorithm for mixed-attribute data is proposed. Entropy-based discretization is introduced to discretize the numerical attributes, so that the similarity between mixed-attribute objects can be measured using binary distance alone. During clustering, k initial cluster centers are selected at random, and each remaining object is assigned to the cluster whose center is nearest. The most frequent value of each attribute within each cluster is then selected to form a new cluster center, and the objects are reassigned. This step is iterated until the target condition is met, which yields the final clustering. Experimental results on UCI datasets verify the effectiveness of the algorithm.
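The clustering loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes the numerical attributes have already been discretized (the abstract does not spell out the entropy-based discretization procedure, so it is not shown), it reads "binary distance" as a per-attribute 0/1 mismatch count, and it assumes the stopping condition is "assignments no longer change" with an iteration cap. The names `binary_distance`, `mode_center`, `cluster`, `max_iter`, and `seed` are hypothetical.

```python
import random
from collections import Counter


def binary_distance(a, b):
    """Number of attributes on which two objects differ (0/1 per attribute)."""
    return sum(x != y for x, y in zip(a, b))


def mode_center(members):
    """Form a center from the most frequent value of each attribute."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*members))


def cluster(objects, k, max_iter=100, seed=0):
    """Cluster discretized objects (tuples of attribute values) into k groups.

    Assumption: stop when assignments are unchanged or after max_iter passes.
    """
    rng = random.Random(seed)
    centers = rng.sample(objects, k)  # k randomly chosen initial centers
    labels = None
    for _ in range(max_iter):
        # Assign every object to its nearest center (ties go to the lower index).
        new_labels = [
            min(range(k), key=lambda j: binary_distance(obj, centers[j]))
            for obj in objects
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Rebuild each center from per-attribute modes of its members.
        for j in range(k):
            members = [o for o, lab in zip(objects, labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = mode_center(members)
    return labels, centers


if __name__ == "__main__":
    # Toy data: the third attribute stands in for a discretized numeric value.
    data = [
        ("a", "x", "low"), ("a", "x", "low"), ("a", "y", "low"),
        ("b", "z", "high"), ("b", "z", "high"), ("b", "w", "high"),
    ]
    labels, centers = cluster(data, k=2)
    print(labels, centers)
```

Because the initial centers are drawn at random, the result depends on `seed`; an unlucky draw can leave a cluster empty, which this sketch handles by keeping the previous center.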
Authors
QIU Bao-zhi (邱保志)
WANG Zhi-lin (王志林)
School of Information Engineering, Zhengzhou University, Zhengzhou 450001, China
Source
Computer Engineering and Design (《计算机工程与设计》)
PKU Core Journal (北大核心)
2021, Issue 4, pp. 957-962 (6 pages)
Funding
National Natural Science Foundation of China (61602154).
Keywords
clustering
mixed attribute
entropy
discretization
only