Adaptive Text Classification Based on Topic Similarity Clustering (cited by: 7)
Abstract: Traditional text classification methods rely on a single model and tend to overlook cases where the feature words of different categories overlap, which degrades classification performance. To improve classification accuracy, this paper proposes a text classification algorithm based on topic similarity clustering. Class feature words are extracted by combining the CHI method with WordCount; the K-means algorithm is then used for clustering, and cluster feature words are extracted to build a cluster feature word library. On this basis, the Adaptive Strategy algorithm adaptively selects the fasttext, TextCNN, or RCNN model for classification to produce the final classification result. Experimental results on the AG News dataset show that the proposed algorithm handles the overlap of feature words across categories well and significantly improves text classification performance compared with the fasttext, TextCNN, and RCNN models used alone.
Authors: KANG Yan (康雁); YANG Qiyue (杨其越); LI Hao (李浩); LIANG Wentao (梁文韬); LI Jinyuan (李晋源); CUI Guorong (崔国荣); WANG Peiyao (王沛尧) (School of Software, Yunnan University, Kunming 650500, China)
Source: Computer Engineering (《计算机工程》), 2020, No. 3, pp. 93-98 (6 pages). Indexed in: CAS, CSCD, PKU Core Journals.
Funding: National Natural Science Foundation of China (61762092, 61762089); Open Fund of the Yunnan Key Laboratory of Software Engineering (2017SE204).
Keywords: text classification; CHI method; feature extraction; K-means algorithm; adaptive algorithm
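
The abstract names CHI combined with WordCount as the way class feature words are extracted, but it does not say how the two are combined. The following is a minimal Python sketch of one plausible reading, not the authors' implementation: terms are ranked per class by the standard chi-square statistic, and raw word counts are used only as a frequency filter and tie-breaker. The function names, the `top_k` and `min_count` parameters, and the toy corpus are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def chi_square(a, b, c, d):
    """Chi-square statistic for a 2x2 term/class contingency table.

    a: docs in the class that contain the term
    b: docs outside the class that contain the term
    c: docs in the class that lack the term
    d: docs outside the class that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def class_feature_words(docs, labels, top_k=3, min_count=2):
    """Pick up to top_k feature words per class.

    Assumed combination rule (not taken from the paper): keep terms that
    occur at least min_count times in the class and are positively
    associated with it, then rank by chi-square, breaking ties by count.
    """
    n_docs = len(docs)
    class_sizes = Counter(labels)
    doc_freq = Counter()                     # docs containing each term
    class_doc_freq = defaultdict(Counter)    # same, per class
    class_word_count = defaultdict(Counter)  # raw occurrences per class

    for tokens, label in zip(docs, labels):
        class_word_count[label].update(tokens)
        for term in set(tokens):
            doc_freq[term] += 1
            class_doc_freq[label][term] += 1

    features = {}
    for label, size in class_sizes.items():
        scored = []
        for term, count in class_word_count[label].items():
            if count < min_count:
                continue
            a = class_doc_freq[label][term]
            b = doc_freq[term] - a
            c = size - a
            d = n_docs - size - b
            if a * d <= b * c:  # skip terms not positively associated
                continue
            scored.append((chi_square(a, b, c, d), count, term))
        scored.sort(key=lambda item: (-item[0], -item[1], item[2]))
        features[label] = [term for _, _, term in scored[:top_k]]
    return features


# Toy corpus of pre-tokenized documents (hypothetical data).
docs = [
    ["stocks", "market", "rally"],
    ["market", "shares", "stocks"],
    ["team", "wins", "match"],
    ["match", "team", "score"],
]
labels = ["business", "business", "sports", "sports"]
features = class_feature_words(docs, labels, top_k=2, min_count=2)
print(features)
```

In this reading, the word count filter removes rare terms whose chi-square score is high only because they appear in a single document, which is a known weakness of the statistic on low-frequency terms.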