Abstract
To address the challenges that existing big-data computing frameworks face in scalable machine learning research, this paper proposes a distributed naive Bayes text classification method based on the MapReduce and Apache Spark frameworks. It explores the naive Bayes classifier (NBC) by studying the adaptability of MapReduce and Apache Spark, and reviews existing big-data computing frameworks. First, the training sample set is divided into m classes according to the naive Bayes text classification model. In the training phase, the output of each MapReduce job serves as the input of the next, and four MapReduce jobs are chained to derive the model, a design that makes full use of the parallelism of MapReduce. Finally, at test time, the classifier returns the class label to which the maximum value belongs. In experiments on the Newsgroups dataset, the method achieves classification results above 99% on all five newsgroup categories, each higher than those of the comparison algorithms, which demonstrates the accuracy of the proposed method.
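The chained training jobs and the maximum-value decision rule can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say what each of the four MapReduce jobs computes, so the decomposition below (documents per class, word counts per class, tokens per class, smoothed log-likelihoods) is an assumption, as are the in-memory `map_reduce` helper, the Laplace smoothing parameter `alpha`, and the toy documents. It uses plain Python rather than Hadoop or Spark APIs.

```python
import math
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run one in-memory MapReduce job: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [reducer(key, values) for key, values in sorted(groups.items())]


def sum_reducer(key, values):
    return key, sum(values)


def train(docs, alpha=1.0):
    """docs: list of (label, tokens). The four-job split is hypothetical."""
    # Job 1: number of training documents per class.
    class_docs = map_reduce(docs, lambda d: [(d[0], 1)], sum_reducer)
    # Job 2: occurrences of each (class, word) pair.
    pair_counts = map_reduce(
        docs, lambda d: [((d[0], w), 1) for w in d[1]], sum_reducer)
    # Job 3: takes job 2's output as input; total tokens per class.
    class_tokens = dict(map_reduce(
        pair_counts, lambda r: [(r[0][0], r[1])], sum_reducer))
    vocab = sorted({word for (_, word), _ in pair_counts})

    # Job 4: takes job 2's output as input; Laplace-smoothed log-likelihoods.
    def likelihood_mapper(record):
        (label, word), count = record
        denom = class_tokens[label] + alpha * len(vocab)
        return [((label, word), math.log((count + alpha) / denom))]

    log_likelihood = dict(map_reduce(
        pair_counts, likelihood_mapper, lambda key, values: (key, values[0])))
    n_docs = sum(count for _, count in class_docs)
    log_prior = {label: math.log(count / n_docs) for label, count in class_docs}
    return log_prior, log_likelihood, class_tokens, vocab


def classify(tokens, model, alpha=1.0):
    """Return the class label whose log-posterior score is the maximum."""
    log_prior, log_likelihood, class_tokens, vocab = model
    known = set(vocab)
    scores = {}
    for label, prior in log_prior.items():
        denom = class_tokens.get(label, 0) + alpha * len(vocab)
        unseen = math.log(alpha / denom)  # word never seen in this class
        scores[label] = prior + sum(
            log_likelihood.get((label, w), unseen) for w in tokens if w in known)
    return max(scores, key=scores.get)


# Toy data, invented for illustration only.
DOCS = [
    ("sport", ["goal", "match", "team"]),
    ("sport", ["team", "win"]),
    ("tech", ["cpu", "code"]),
    ("tech", ["code", "bug", "cpu"]),
]
model = train(DOCS)
print(classify(["team", "goal"], model))
print(classify(["cpu", "bug"], model))
```

Working in log space avoids numerical underflow when many word probabilities are multiplied. In a real deployment each `map_reduce` call would correspond to a separate distributed job whose output is written out and read by the next.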
Authors
Zang Yanhui (臧艳辉)
Zhao Xuezhang (赵雪章)
Xi Yunjiang (席运江)
Zang Yanhui; Zhao Xuezhang; Xi Yunjiang (Foshan Polytechnic, Foshan, Guangdong 528137, China; South China University of Technology, Guangzhou 510641, China)
Source
Application Research of Computers (《计算机应用研究》)
Indexed in: CSCD; Peking University Core Journals (北大核心)
2019, No. 12, pp. 3705-3708, 3712 (5 pages)
Funding
National Natural Science Foundation of China (71371077)
Foshan Science and Technology Plan Project (2015AB004241)