Design of mixed data clustering algorithm based on density peak
Cited by: 6
Abstract: The k-prototypes algorithm cannot identify the number of clusters automatically and cannot discover clusters of arbitrary shape. To address both problems, a clustering algorithm for mixed data based on searching for density peaks is proposed. First, the CFSFDP (Clustering by Fast Search and Find of Density Peaks) algorithm is extended to mixed datasets: once a distance between mixed data objects is defined, CFSFDP is used to determine the cluster centers, which also fixes the number of clusters automatically; the remaining points are then assigned to clusters in order of decreasing density. Second, the selection of the threshold (cutoff distance) and the weight in the algorithm is studied: the threshold in the density formula is extracted automatically by computing the potential entropy of the data field, and the weight in the distance formula is defined using statistics that measure the clustering tendency of numeric datasets and categorical datasets. Finally, tests on three real mixed datasets show that, compared with the traditional k-prototypes algorithm, the proposed algorithm effectively improves clustering accuracy.
Source: Journal of Computer Applications (《计算机应用》), CSCD, Peking University Core, 2018, No. 2, pp. 483-490, 496 (9 pages)
Funding: Open Project of the Hebei Key Laboratory of Data Science and Application (20170320002).
Keywords: cluster analysis; mixed data; data field; clustering tendency; density peak
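The abstract gives the overall flow (a mixed distance, density-peak center selection, then assignment in decreasing density order) but not the formulas. The following is a minimal sketch of that flow under stated assumptions, not the authors' implementation. The distance is assumed to be Euclidean distance on the numeric attributes plus a weight `gamma` times the number of mismatched categorical attributes (the k-prototypes style), the density uses a Gaussian kernel, and `d_c`, `gamma`, and `n_centers` are passed in by hand, whereas the paper derives the threshold from potential entropy, the weight from clustering-tendency statistics, and the number of clusters from CFSFDP itself. The function names `mixed_distance` and `density_peaks` are hypothetical.

```python
import numpy as np


def mixed_distance(num, cat, gamma):
    """Pairwise distance for mixed data (assumed form, not from the paper):
    Euclidean on numeric columns + gamma * number of categorical mismatches."""
    d_num = np.sqrt(((num[:, None, :] - num[None, :, :]) ** 2).sum(axis=2))
    d_cat = (cat[:, None, :] != cat[None, :, :]).sum(axis=2)
    return d_num + gamma * d_cat


def density_peaks(dist, d_c, n_centers):
    """Density-peak clustering on a precomputed distance matrix."""
    n = dist.shape[0]
    # Local density with a Gaussian kernel; drop each point's self term exp(0) = 1.
    rho = np.exp(-(dist / d_c) ** 2).sum(axis=1) - 1.0
    order = np.argsort(-rho)  # point indices by decreasing density

    # delta[i]: distance to the nearest point that ranks higher in density.
    delta = np.zeros(n)
    nearest_higher = np.full(n, -1)
    for rank, i in enumerate(order):
        if rank == 0:
            delta[i] = dist[i].max()  # convention for the densest point
            continue
        higher = order[:rank]
        j = higher[np.argmin(dist[i, higher])]
        delta[i] = dist[i, j]
        nearest_higher[i] = j

    # Centers: largest rho * delta; the densest point is always kept as a center.
    score = rho * delta
    score[order[0]] = np.inf
    centers = np.argsort(-score)[:n_centers]

    # Remaining points inherit the label of their nearest higher-density
    # neighbour, visited in decreasing density order.
    labels = np.full(n, -1)
    labels[centers] = np.arange(n_centers)
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[nearest_higher[i]]
    return labels, centers, rho, delta


# Toy mixed dataset: one numeric and one categorical attribute per object.
num = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])
cat = np.array([["a"], ["a"], ["a"], ["b"], ["b"], ["b"]])
dist = mixed_distance(num, cat, gamma=1.0)
labels, centers, rho, delta = density_peaks(dist, d_c=0.5, n_centers=2)
print(labels, centers)
```

Because the sketch works on a precomputed distance matrix, swapping in a different mixed distance or a data-driven choice of `d_c` only changes the inputs to `density_peaks`, not the clustering step itself.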
