Abstract
Existing data stream clustering algorithms cannot handle data streams with high-dimensional heterogeneous (mixed) attributes. To address this problem, this paper improves the offline and online clustering processes of the HPStream algorithm, using a frequency matrix to handle categorical attributes and an information-entropy-based categorical attribute selection method to reduce data dimensionality. Experimental results show that the algorithm effectively handles data sets with heterogeneous attributes and high dimensionality. Compared with the HPStream algorithm, its clustering precision improves by 5% to 15%.
Source
Computer Engineering (《计算机工程》)
CAS
CSCD
PKU Core Journal (北大核心)
2011, Issue 19, pp. 82-84, 87 (4 pages)
Keywords
data stream mining
heterogeneous attributes
frequency matrix
information entropy
dimension reduction