Abstract
To address the limited text classification performance of the naive Bayes algorithm, this paper proposes an ACO-WNB classification algorithm based on improved information gain (IG). First, an adjustment factor is introduced according to how each feature word's frequency is distributed across the data set. The factor strengthens the contribution of helpful feature words and suppresses the interference of unhelpful ones, so that strongly discriminative features are selected to form the feature subset and the accuracy of IG on imbalanced data sets improves. Then, ant colony optimization (ACO) is combined with the weighted naive Bayes (WNB) model: ACO iteratively searches for globally optimal weights, yielding the ACO-WNB classifier and improving the classification efficiency on text data. The algorithms before and after improvement are compared on typical news data sets. The experiments show that the improved IG effectively removes redundant high-frequency features and has better feature selection ability on imbalanced data sets, while the ACO-WNB classifier achieves higher accuracy and therefore classifies real text data more effectively.
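The two building blocks named in the abstract can be illustrated with a minimal Python sketch. It is not the paper's method: the abstract does not give the form of the adjustment factor or the ACO update rule, so the code shows only the standard (unmodified) information gain of a term and a generic weighted naive Bayes decision rule. All function names, the toy documents, the probabilities, and the hand-set weights are assumptions made for illustration; in the paper the weights would be found by the ACO search rather than fixed by hand.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(docs, labels, term):
    """Standard IG of a term's presence/absence with respect to the class.
    No adjustment factor is applied (its form is not given in the abstract)."""
    present = [y for d, y in zip(docs, labels) if term in d]
    absent = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = (len(present) / n) * entropy(present) \
        + (len(absent) / n) * entropy(absent)
    return entropy(labels) - conditional

def wnb_predict(doc, priors, cond_prob, weights):
    """Weighted naive Bayes: argmax over c of
    log P(c) + sum_t w_t * log P(t | c), for terms t in the document."""
    scores = {}
    for c, prior in priors.items():
        score = math.log(prior)
        for t in doc:
            if t in weights:
                score += weights[t] * math.log(cond_prob[c][t])
        scores[c] = score
    return max(scores, key=scores.get)

# Hypothetical toy corpus: each document is a set of terms.
docs = [{"goal", "match"}, {"goal", "team"},
        {"stock", "market"}, {"stock", "team"}]
labels = ["sport", "sport", "finance", "finance"]

# Hypothetical smoothed estimates of P(term | class).
priors = {"sport": 0.5, "finance": 0.5}
cond_prob = {
    "sport":   {"goal": 0.4, "match": 0.2, "team": 0.2, "stock": 0.1, "market": 0.1},
    "finance": {"goal": 0.1, "match": 0.1, "team": 0.2, "stock": 0.4, "market": 0.2},
}
# Hand-set weights standing in for the values an ACO search would produce.
weights = {"goal": 1.0, "match": 1.0, "stock": 1.0, "market": 1.0, "team": 0.0}

ig_goal = information_gain(docs, labels, "goal")
ig_team = information_gain(docs, labels, "team")
predicted = wnb_predict({"goal", "team"}, priors, cond_prob, weights)
```

In this toy corpus "goal" separates the two classes perfectly while "team" occurs equally in both, so IG ranks "goal" above "team"; a weight of zero on "team" then removes its influence from the weighted naive Bayes score.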
Authors
邱宁佳 (QIU Ning-jia)
高鹏 (GAO Peng)
王鹏 (WANG Peng)
陶跃 (TAO Yue)
College of Computer Science and Technology, Changchun University of Science and Technology, Changchun, Jilin 130022, China
Source
Computer Simulation (《计算机仿真》)
Peking University Core Journal (北大核心)
2019, No. 1, pp. 295-299 (5 pages)
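Funding
Key Science and Technology Research Project of the Jilin Province Science and Technology Development Plan (20150204036GX)
Jilin Provincial Industrial Innovation Special Fund Project (2017C051)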
Keywords
Naive Bayes (NB)
Information gain (IG)
Feature subset
Ant colony optimization (ACO)