Abstract
Traditional fuzzy C-means clustering does not adapt well to large data sets, which are both voluminous and burdened with redundant attributes. To address this problem, a hybrid clustering algorithm for large data sets is proposed. The large data set is divided into several subsets, each subset is clustered separately, and the final clustering result is obtained by merging the subset results. Each subset is clustered with a hybrid algorithm that combines gene expression programming (GEP) with fuzzy C-means, which improves both the quality and the efficiency of clustering. Initial cluster centers are selected on the basis of similarity, and information entropy is used to reflect the importance of each attribute, which further optimizes clustering performance. Simulation experiments and analysis show that the algorithm has good global convergence and yields better clustering results.
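The divide, cluster, and merge flow described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the GEP component is omitted and plain fuzzy C-means stands in for the hybrid step, a farthest-first rule stands in for the similarity-based choice of initial centers, the entropy weighting (`1 - normalized histogram entropy`) is one common scheme chosen for illustration, and the merge step simply re-clusters the pooled subset centers. All function names and parameters (`entropy_weights`, `fcm`, `cluster_large`, `bins`, `n_subsets`) are hypothetical.

```python
import math
import random


def entropy_weights(data, bins=10):
    """Assumed scheme: lower histogram entropy -> larger attribute weight."""
    dims = len(data[0])
    scores = []
    for j in range(dims):
        col = [row[j] for row in data]
        lo, hi = min(col), max(col)
        width = (hi - lo) / bins or 1.0  # guard against a constant attribute
        counts = [0] * bins
        for v in col:
            counts[min(int((v - lo) / width), bins - 1)] += 1
        probs = [n / len(col) for n in counts if n]
        h = -sum(p * math.log(p) for p in probs) / math.log(bins)
        scores.append(max(0.0, 1.0 - h))
    total = sum(scores)
    if total == 0:
        return [1.0 / dims] * dims
    return [s / total for s in scores]


def wdist2(a, b, w):
    """Attribute-weighted squared Euclidean distance."""
    return sum(wk * (x - y) ** 2 for wk, x, y in zip(w, a, b))


def init_centers(points, c, w, rng):
    """Farthest-first seeding (a stand-in for similarity-based selection)."""
    centers = [rng.choice(points)]
    while len(centers) < c:
        centers.append(
            max(points, key=lambda p: min(wdist2(p, v, w) for v in centers))
        )
    return [list(v) for v in centers]


def fcm(points, c, w, m=2.0, iters=50, seed=0):
    """Plain fuzzy C-means with weighted distances; returns the c centers."""
    centers = init_centers(points, c, w, random.Random(seed))
    dims = len(w)
    for _ in range(iters):
        num = [[0.0] * dims for _ in range(c)]
        den = [0.0] * c
        for p in points:
            d = [max(wdist2(p, v, w), 1e-12) for v in centers]
            inv = [dk ** (-1.0 / (m - 1.0)) for dk in d]
            s = sum(inv)
            for k in range(c):
                u = (inv[k] / s) ** m  # membership raised to the fuzzifier
                den[k] += u
                for j in range(dims):
                    num[k][j] += u * p[j]
        centers = [[n / den[k] for n in num[k]] for k in range(c)]
    return centers


def cluster_large(data, c, n_subsets=4, seed=0):
    """Split the data, cluster each subset, then merge the subset centers."""
    w = entropy_weights(data)
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    local = []
    for i in range(n_subsets):
        local.extend(fcm(shuffled[i::n_subsets], c, w, seed=seed + i))
    return fcm(local, c, w, seed=seed)  # assumed merge: re-cluster the centers
```

Calling `cluster_large(data, c)` on a list of equal-length numeric rows returns `c` merged centers. Each subset is processed independently, so the per-subset loop could be parallelized without changing the merge step.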
Source
《计算机工程与设计》
CSCD
PKU Core Journal (北大核心)
2014, No. 6, pp. 2183-2187 (5 pages)
Computer Engineering and Design
Keywords
large data sets
fuzzy C-means
gene expression programming
attribute information entropy
clustering