Abstract
Traditional clustering algorithms can no longer meet the demands of big data processing because of the memory and computing limits of a single machine, so new solutions are urgently needed. To address the single-machine memory bottleneck, and taking into account the iterative nature of clustering algorithms, a clustering system based on the Spark platform is proposed and implemented. For two types of data sets, sparse and dense, the system first applies different preprocessing strategies. It then analyzes and compares the performance of different clustering algorithms on the Spark platform and gives the best scheme. Finally, data persistence is used to increase computing speed. Experimental results show that the proposed system can effectively meet the requirements of clustering analysis on massive data.
Authors
王磊
邹恩岑
曾诚
奚雪峰
陆悠
Wang Lei; Zou Encen; Zeng Cheng; Xi Xuefeng; Lu You (School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, 215009, China; Virtual Reality Key Laboratory of Intelligent Interaction and Application Technology of Suzhou, Suzhou, 215009, China; Big Data Key Laboratory of PuKai, Suzhou University of Science and Technology, Suzhou, 215009, China; Kunshan Public Security Bureau Command Center, Suzhou, 215300, China)
Source
《数据采集与处理》
CSCD
PKU Core Journal (北大核心)
2018, No. 6, pp. 1077-1085 (9 pages)
Journal of Data Acquisition and Processing
Funding
Supported by the National Natural Science Foundation of China (61673290, 61750110534, 61728205)
Supported by the Suzhou Science and Technology Development Program (SYG201707, SYG201817)