Parallel naive Bayes algorithm for large-scale Chinese text classification based on Spark (cited by: 22)

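Chinese title (translated): A parallel naive Bayes Spark algorithm for large-scale Chinese text classification (article in English)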
Abstract: The sharp increase in the volume of Chinese text data on the Internet has significantly prolonged the time needed to classify it. To address this problem, this paper proposes and implements a parallel naive Bayes algorithm (PNBA) for Chinese text classification based on Spark, an in-memory parallel computing platform for big data. The algorithm parallelizes the entire training and prediction process of the naive Bayes classifier, mainly by adopting the resilient distributed dataset (RDD) programming model. For comparison, a PNBA based on Hadoop MapReduce is also implemented. Test results show that, in the same computing environment and on the same text sets, the Spark PNBA clearly outperforms the Hadoop PNBA on key indicators such as speedup ratio and scalability. Spark-based parallel algorithms can therefore better meet the requirements of large-scale Chinese text data mining.
Authors: LIU Peng (刘鹏), ZHAO Hui-han (赵慧含), TENG Jia-yu (滕家雨), YANG Yan-yan (仰彦妍), LIU Ya-feng (刘亚峰), ZHU Zong-wei (朱宗卫)
Source: Journal of Central South University (English edition; indexed in SCIE, EI, CAS, CSCD), 2019, No. 1, pp. 1-12 (12 pages)
Funding: Project (KC18071) supported by the Application Foundation Research Program of Xuzhou, China; Projects (2017YFC0804401, 2017YFC0804409) supported by the National Key R&D Program of China
Keywords: Chinese text classification; naive Bayes; Spark; Hadoop; resilient distributed dataset; parallelization
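This record does not give the paper's actual algorithm, so the following is only a minimal sketch of how naive Bayes training and prediction can be split into map and reduce steps of the kind RDD operations provide. It uses plain Python in place of Spark, and everything in it is an assumption for illustration: the multinomial model with Laplace smoothing, the pre-segmented toy documents, and the hypothetical helper names `count_partition`, `merge_counts`, `train`, and `predict`.

```python
import math
from collections import Counter
from functools import reduce


def count_partition(docs):
    """Map step: count documents per class and (class, word) pairs in one partition."""
    class_counts, word_counts = Counter(), Counter()
    for label, tokens in docs:
        class_counts[label] += 1
        for tok in tokens:
            word_counts[(label, tok)] += 1
    return class_counts, word_counts


def merge_counts(a, b):
    """Reduce step: merge two partial results by summing the counts for each key."""
    return a[0] + b[0], a[1] + b[1]


def train(partitions):
    """Combine per-partition counts into log priors and smoothed log likelihoods."""
    class_counts, word_counts = reduce(merge_counts, map(count_partition, partitions))
    vocab = {word for _, word in word_counts}
    n_docs = sum(class_counts.values())
    totals = Counter()  # total number of tokens seen in each class
    for (label, _), n in word_counts.items():
        totals[label] += n
    log_prior = {c: math.log(n / n_docs) for c, n in class_counts.items()}

    def log_likelihood(label, word):
        # Laplace (add-one) smoothing; an assumption, not taken from the paper.
        return math.log((word_counts[(label, word)] + 1) / (totals[label] + len(vocab)))

    return log_prior, log_likelihood, vocab


def predict(model, tokens):
    """Score one document against every class; each document is independent,
    so this function could be applied to many documents in parallel."""
    log_prior, log_likelihood, vocab = model
    known = [t for t in tokens if t in vocab]  # ignore words never seen in training
    return max(log_prior, key=lambda c: log_prior[c] + sum(log_likelihood(c, t) for t in known))


# Toy data: two "partitions" of already word-segmented documents (hypothetical).
partitions = [
    [("sports", ["足球", "比赛", "进球"]), ("finance", ["股票", "市场", "上涨"])],
    [("sports", ["篮球", "比赛"]), ("finance", ["银行", "股票"])],
]
model = train(partitions)
print(predict(model, ["股票", "上涨"]))
```

In a Spark setting, the per-partition counting and the merge would roughly correspond to RDD transformations such as `flatMap` and `reduceByKey`, and prediction to a `map` over the test documents. How the authors actually structure these stages is not stated in this record.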