Abstract
The rapid growth of Chinese text data on the Internet has significantly prolonged the time needed to classify it. To address this problem, this paper proposes and implements a parallel naive Bayes algorithm (PNBA) for Chinese text classification based on Spark, an in-memory parallel computing platform for big data. Relying mainly on the resilient distributed dataset (RDD) programming model, the algorithm parallelizes the entire training and prediction process of the naive Bayes classifier. For comparison, a PNBA based on Hadoop MapReduce is also implemented. Test results show that, in the same computing environment and on Chinese text sets of the same size, the Spark PNBA clearly outperforms the Hadoop PNBA on key indicators such as speedup ratio and scalability. Spark-based parallel algorithms can therefore better meet the requirements of large-scale Chinese text data mining.
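The counting pattern that makes naive Bayes training parallelizable can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's PNBA: it uses plain Python lists as stand-ins for RDD partitions rather than Spark itself, assumes the documents are already segmented into tokens, and assumes a multinomial model with Laplace smoothing. All function names and the toy data are hypothetical.

```python
import math
from collections import Counter
from functools import reduce


def count_partition(partition):
    """Map step: count documents per class and (class, token) pairs in one partition."""
    class_counts, word_counts = Counter(), Counter()
    for label, tokens in partition:
        class_counts[label] += 1
        for tok in tokens:
            word_counts[(label, tok)] += 1
    return class_counts, word_counts


def merge_counts(a, b):
    """Reduce step: add up the partial counts from two partitions."""
    return a[0] + b[0], a[1] + b[1]


def train(partitions):
    """Aggregate per-partition counts, then derive log priors and smoothed likelihoods."""
    class_counts, word_counts = reduce(merge_counts, map(count_partition, partitions))
    vocab = {tok for (_, tok) in word_counts}
    total_docs = sum(class_counts.values())
    tokens_per_class = Counter()
    for (label, _), n in word_counts.items():
        tokens_per_class[label] += n
    log_prior = {c: math.log(n / total_docs) for c, n in class_counts.items()}

    def log_likelihood(label, tok):
        # Laplace (add-one) smoothing; an assumption, not taken from the paper.
        return math.log((word_counts[(label, tok)] + 1)
                        / (tokens_per_class[label] + len(vocab)))

    return log_prior, log_likelihood, vocab


def predict(model, tokens):
    """Score each class independently; tokens outside the vocabulary are ignored."""
    log_prior, log_likelihood, vocab = model
    scores = {
        c: lp + sum(log_likelihood(c, t) for t in tokens if t in vocab)
        for c, lp in log_prior.items()
    }
    return max(scores, key=scores.get)


# Toy data: two "partitions" of (label, pre-segmented tokens); purely illustrative.
demo_partitions = [
    [("sports", ["足球", "比赛"]), ("finance", ["股票", "市场"])],
    [("sports", ["比赛", "球队"]), ("finance", ["市场", "银行"])],
]
demo_model = train(demo_partitions)
```

Because the per-partition counts are simply added together, the map and reduce steps can run on separate workers. Prediction is also independent per document, so in principle both phases can be distributed.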
Funding
Project (KC18071) supported by the Application Foundation Research Program of Xuzhou, China
Projects (2017YFC0804401, 2017YFC0804409) supported by the National Key R&D Program of China