Abstract
In recent years, the wide application of computer and information processing technologies in industrial production, information processing, and other fields has led to the continuous generation of large volumes of sequence data that evolve over time, forming time series data streams. Examples include internet news corpus analysis, network intrusion detection, stock market analysis, and sensor network data analysis. Real-time clustering of data streams is a hot topic in current data stream mining research. Because such streams are high-speed and large-scale and require real-time analysis, data must often be analyzed on the fly. Single-pass scanning algorithms meet these requirements, but their effectiveness and scalability are limited by the lack of efficient clustering algorithms for identifying and distinguishing patterns. To address these problems, we propose DLCStream, a distributed data stream clustering algorithm based on locality-sensitive hashing (LSH) for cloud environments. By introducing the Map-Reduce framework and the LSH mechanism, DLCStream can quickly find clustering patterns in a data stream. Theoretical analysis and experimental results show that, compared with the traditional data stream clustering framework CluStream, DLCStream has advantages in efficient parallel processing, scalability, and clustering quality.
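The abstract does not give the details of DLCStream, so the following is only a minimal sketch of the general idea it names: hashing stream points with LSH so that nearby points tend to share a bucket, then summarizing each bucket in a map/reduce style. Everything here is an assumption made for illustration, including the random-projection hash family, the function names `make_lsh`, `map_phase`, and `reduce_phase`, the parameters `num_hashes` and `width`, and the count/centroid summary. It runs in a single process and is not the authors' implementation.

```python
import math
import random
from collections import defaultdict


def make_lsh(dim, num_hashes, width, seed=0):
    """Build a random-projection LSH signature function (illustrative choice).

    Each hash is floor((a . x + b) / width), with `a` drawn from a Gaussian
    and `b` drawn uniformly from [0, width).
    """
    rng = random.Random(seed)
    projections = [
        ([rng.gauss(0.0, 1.0) for _ in range(dim)], rng.uniform(0.0, width))
        for _ in range(num_hashes)
    ]

    def signature(point):
        return tuple(
            math.floor((sum(a_i * x_i for a_i, x_i in zip(a, point)) + b) / width)
            for a, b in projections
        )

    return signature


def map_phase(points, signature):
    """Map step: emit (bucket key, point) for every point in the batch."""
    for point in points:
        yield signature(point), point


def reduce_phase(pairs):
    """Reduce step: per bucket, keep a count and a linear sum, then a centroid."""
    buckets = defaultdict(lambda: [0, None])
    for key, point in pairs:
        entry = buckets[key]
        entry[0] += 1
        if entry[1] is None:
            entry[1] = list(point)
        else:
            entry[1] = [s + x for s, x in zip(entry[1], point)]
    return {
        key: {"count": n, "centroid": [s / n for s in linear_sum]}
        for key, (n, linear_sum) in buckets.items()
    }


# Hypothetical batch taken from a stream of 2-D points.
batch = [(0.0, 0.0), (0.0, 0.0), (100.0, 100.0)]
signature = make_lsh(dim=2, num_hashes=4, width=1.0, seed=1)
summaries = reduce_phase(map_phase(batch, signature))
```

In a real Map-Reduce deployment the map and reduce steps would run on separate workers, with the bucket key serving as the shuffle key; the per-bucket summaries could then be merged into final clusters by a later stage.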
Source
《计算机科学》
CSCD
Peking University Core Journal
2014, Issue 11, pp. 195-202 (8 pages)
Computer Science
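Funding
National Key Basic Research Program of China (973 Program) (2007CB310803)
Key Program of the National Natural Science Foundation of China (61035004)
National Natural Science Foundation of China (60875029)
Postdoctoral Fund of the Ministry of Science and Technology of China (2013M541005)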