Abstract
This paper designs and implements a parallel KMeans clustering model on the Spark distributed computing framework and evaluates it through comparative training experiments on MovieLens datasets of different sizes. The results show that the parallel KMeans clustering model is well suited to running in a distributed cluster environment and achieves good parallel computing efficiency. In addition, using the repartition operator to load the data in partitions optimizes the parallel scheme and effectively reduces the training time of the model.
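The partition-then-aggregate idea behind a parallel KMeans can be illustrated with a minimal sketch in plain Python. This is not the authors' Spark implementation: it only emulates partitions with in-memory lists, and every name in it (`split_into_partitions`, `partial_stats`, `kmeans_partitioned`) is hypothetical. Each partition computes per-cluster sums and counts independently (the step a cluster would run on separate workers), and the partial results are then combined to update the centers.

```python
import random


def split_into_partitions(points, num_partitions):
    """Round-robin split; a stand-in for redistributing records evenly."""
    return [points[i::num_partitions] for i in range(num_partitions)]


def nearest(point, centers):
    """Index of the center closest to `point` (squared Euclidean distance)."""
    return min(
        range(len(centers)),
        key=lambda j: sum((p - c) ** 2 for p, c in zip(point, centers[j])),
    )


def partial_stats(partition, centers):
    """Map step: per-cluster coordinate sums and counts for one partition."""
    dim = len(centers[0])
    sums = [[0.0] * dim for _ in centers]
    counts = [0] * len(centers)
    for point in partition:
        j = nearest(point, centers)
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += point[d]
    return sums, counts


def kmeans_partitioned(points, k, num_partitions, iterations=10, seed=0):
    """Lloyd-style KMeans whose assignment step is computed per partition."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    partitions = split_into_partitions(points, num_partitions)
    dim = len(centers[0])
    for _ in range(iterations):
        # On a real cluster, each partition would be processed by a worker.
        results = [partial_stats(part, centers) for part in partitions]
        # Reduce step: merge the partial sums and counts, then update centers.
        for j in range(k):
            total = sum(counts[j] for _, counts in results)
            if total == 0:
                continue  # keep the old center if no point was assigned
            centers[j] = [
                sum(sums[j][d] for sums, _ in results) / total
                for d in range(dim)
            ]
    return centers
```

Because only sums and counts cross partition boundaries, the number of partitions affects how the work is divided but not the resulting centers, which is why tuning the partitioning can shorten training time without changing the model.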
Authors
侯敬儒
吴晟
李英娜
HOU Jingru; WU Sheng; LI Yingna (School of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500)
Source
《计算机与数字工程》
2018, No. 3, pp. 537-540, 555 (5 pages in total)
Computer & Digital Engineering