摘要
传统的文本分类方法仅使用一种模型进行分类,容易忽略不同类别特征词出现交叉的情况,影响分类性能。为提高文本分类的准确率,提出基于主题相似性聚类的文本分类算法。通过CHI和WordCount相结合的方法提取类特征词,利用K-means算法进行聚类并提取簇特征词构成簇特征词库。在此基础上,通过Adaptive Strategy算法自适应地选择fasttext、TextCNN或RCNN模型进行分类,得到最终分类结果。在AG News数据集上的实验结果表明,该算法可较好地解决不同类别特征词交叉的问题,与单独使用的fasttext、TextCNN、RCNN模型相比,其文本分类性能显著提升。
Traditional text classification method only uses one model for classification,so it is easy to ignore the overlapping of different categories of feature words,which affects the classification performance.To improve accuracy of text classification,this paper proposes a text classification algorithm based on topic similarity clustering.The algorithm combines CHI with WordCount to extract category feature words.Then it performs clustering using the K-means algorithm and extracts cluster feature words to constructs a cluster feature word library.On this basis,the Adaptive Strategy algorithm is used to adaptively choose fasttext,TextCNN or RCNN model for classification to obtain the final classification result.Experimental results on the AG News dataset show that the proposed algorithm can better solve overlapping of different categories of feature words,and significantly improves text classification performance compared with fasttext,TextCNN and RCNN models used alone.
作者
康雁
杨其越
李浩
梁文韬
李晋源
崔国荣
王沛尧
KANG Yan;YANG Qiyue;LI Hao;LIANG Wentao;LI Jinyuan;CUI Guorong;WANG Peiyao(School of Software,Yunnan University,Kunming 650500,China)
出处
《计算机工程》
CAS
CSCD
北大核心
2020年第3期93-98,共6页
Computer Engineering
基金
国家自然科学基金(61762092,61762089)
云南省软件工程重点实验室开放基金(2017SE204)。