Text classification of big data using distributed NBC under Spark framework

Cited by: 6
Abstract: To address the challenges that existing big-data computing frameworks face in scalable machine learning research, this paper proposes a distributed naive Bayes text classification method based on the MapReduce and Apache Spark frameworks. It explores the naive Bayes classifier (NBC) by studying how well it adapts to MapReduce and Apache Spark, and reviews existing big-data computing frameworks. First, the training sample set is divided into m classes according to the naive Bayes text classification model. In the training phase, four MapReduce jobs are chained to derive the model, with the output of each job serving as the input of the next; this design makes full use of the parallelism of MapReduce. Finally, at test time, the classifier returns the class label to which the maximum value belongs. In experiments on the Newsgroups dataset, the method achieves classification results above 99% on all five news categories, higher than the comparison algorithms in every case, which demonstrates the accuracy of the proposed method.
Authors: Zang Yanhui; Zhao Xuezhang; Xi Yunjiang (Foshan Polytechnic, Foshan, Guangdong 528137, China; South China University of Technology, Guangzhou 510641, China)
Source: Application Research of Computers (《计算机应用研究》), CSCD, Peking University Core Journal, 2019, No. 12, pp. 3705-3708, 3712 (5 pages)
Funding: National Natural Science Foundation of China (71371077); Foshan Science and Technology Plan Project (2015AB004241)
Keywords: text classification; MapReduce; Spark framework; distributed; naive Bayesian classifier (NBC); machine learning
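
The abstract describes training as chained counting jobs followed by picking the class label with the maximum value. The following single-machine Python sketch illustrates that general idea only. It is not the authors' implementation: the abstract does not say what each of the four MapReduce jobs computes, so the split into one map step, one reduce step, and a model-building step, the Laplace smoothing, and all names (`map_doc`, `reduce_counts`, `train`, `classify`) are assumptions made for illustration.

```python
import math
from collections import Counter


def map_doc(label, text):
    """Map step (assumed): emit one class-count pair and one pair per token."""
    pairs = [(("class", label), 1)]
    pairs += [(("word", label, w), 1) for w in text.lower().split()]
    return pairs


def reduce_counts(pairs):
    """Reduce step (assumed): sum the values that share a key."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return totals


def train(docs):
    """Build a multinomial naive Bayes model from (label, text) pairs."""
    counts = reduce_counts(p for label, text in docs for p in map_doc(label, text))
    class_docs = {k[1]: v for k, v in counts.items() if k[0] == "class"}
    word_counts = {(k[1], k[2]): v for k, v in counts.items() if k[0] == "word"}
    vocab = {w for (_, w) in word_counts}
    # Total number of tokens seen in each class.
    class_tokens = Counter()
    for (label, _), n in word_counts.items():
        class_tokens[label] += n
    n_docs = sum(class_docs.values())
    log_prior = {c: math.log(n / n_docs) for c, n in class_docs.items()}
    return {"log_prior": log_prior, "word_counts": word_counts,
            "class_tokens": class_tokens, "vocab": vocab}


def classify(model, text):
    """Return the class label with the maximum log-posterior score."""
    scores = {}
    for c, prior in model["log_prior"].items():
        score = prior
        denom = model["class_tokens"][c] + len(model["vocab"])
        for w in text.lower().split():
            if w in model["vocab"]:  # ignore words never seen in training
                count = model["word_counts"].get((c, w), 0)
                score += math.log((count + 1) / denom)  # Laplace smoothing
        scores[c] = score
    return max(scores, key=scores.get)


# Toy data, invented for this example.
docs = [
    ("sports", "the team won the match"),
    ("sports", "great goal in the match"),
    ("tech", "new cpu and gpu released"),
    ("tech", "the gpu driver update"),
]
model = train(docs)
predicted = classify(model, "gpu update")
```

In a real MapReduce or Spark deployment the map and reduce steps would run in parallel across partitions of the corpus; here they run sequentially so the counting logic can be read in one place.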