Abstract
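Cluster analysis of data streams is an important branch of data stream mining. Because data streams arrive dynamically, at high speed, and in massive volume, traditional static data mining techniques cannot meet the needs of online analysis. The core of data stream clustering is to design single-pass scan algorithms that keep a small amount of synopsis information in limited memory, enabling real-time, online cluster analysis of the stream. Adopting the sliding window model widely used in data stream processing, this paper proposes a new data stream synopsis algorithm based on the incremental discrete Fourier transform (DFT) and applies k-means clustering on top of it to mine data streams online. The clustering algorithm based on the incremental DFT synopsis reduces running time and saves memory, and real electricity load data demonstrate its effectiveness.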
Clustering data streams is one of the important branches of data stream mining. Because data streams are dynamic and massive, traditional data mining algorithms cannot satisfy the requirements of online analysis. The focus of data stream technologies is to design a one-pass scan algorithm over the data set and to incrementally maintain in memory an effective synopsis data structure (digest) that is far smaller than the whole data set. A novel algorithm for clustering data streams is presented in this paper. In this algorithm, the k-means method is used for subset division, the sliding window model is used to handle data changes and updates, and a DFT digest, which can be maintained incrementally, is used for data reduction. The algorithm saves main memory and running time and is suitable for online clustering. Experiments on clustering real electricity consumption data verify the effectiveness of the presented algorithm.
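The incremental maintenance of a DFT digest over a sliding window can be illustrated with the standard sliding-DFT recurrence: when the window of length N drops its oldest value and takes in a new one, each retained coefficient becomes (X_k - x_old + x_new) * exp(2*pi*i*k/N). The following is a minimal sketch under that assumption, not the paper's own implementation. The names `direct_dft`, `SlidingDFTDigest`, and `num_coeffs` are hypothetical, and the sketch keeps the raw window in a deque only so that the expiring value is available.

```python
import cmath
from collections import deque


def direct_dft(window, num_coeffs):
    """Compute the first num_coeffs DFT coefficients directly (O(N) each)."""
    n = len(window)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(window))
        for k in range(num_coeffs)
    ]


class SlidingDFTDigest:
    """Hypothetical digest: the first few DFT coefficients of a sliding window."""

    def __init__(self, initial_window, num_coeffs):
        self.window = deque(initial_window)
        self.n = len(self.window)
        # One full computation at start-up; later updates are incremental.
        self.coeffs = direct_dft(self.window, num_coeffs)
        # Per-coefficient rotation factors exp(2*pi*i*k/N), computed once.
        self.twiddles = [
            cmath.exp(2j * cmath.pi * k / self.n) for k in range(num_coeffs)
        ]

    def update(self, new_value):
        """Slide the window by one value, updating each coefficient in O(1)."""
        oldest = self.window.popleft()
        self.window.append(new_value)
        delta = new_value - oldest
        self.coeffs = [(c + delta) * w for c, w in zip(self.coeffs, self.twiddles)]
        return self.coeffs


# Usage: slide a length-4 window over a short stream and compare the
# incremental digest with a direct recomputation on the final window.
digest = SlidingDFTDigest([1.0, 2.0, 3.0, 4.0], num_coeffs=2)
for value in [5.0, 6.0, 7.0]:
    digest.update(value)
reference = direct_dft([4.0, 5.0, 6.0, 7.0], 2)
max_error = max(abs(a - b) for a, b in zip(digest.coeffs, reference))
```

In a clustering setting, the retained coefficients of each stream (for example, their real and imaginary parts) would serve as the low-dimensional feature vector passed to k-means; that step is omitted here.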
Source
《华北电力大学学报(自然科学版)》
CAS
Peking University Core Journals (北大核心)
2007, No. 5, pp. 85-89 (5 pages)
Journal of North China Electric Power University: Natural Science Edition