结合语义扩展和卷积神经网络的中文短文本分类方法 (Cited by: 19)

Chinese short text classification method by combining semantic expansion and convolutional neural network
Abstract: A Chinese news title usually contains anywhere from a single word to a few dozen words; the small number of characters and the resulting sparse features make it difficult to improve classification accuracy. To address this problem, a text semantic expansion method based on word embedding is proposed. First, each news title is expanded into a triple (title, subtitle, keywords): the subtitle is constructed from synonyms of the title words combined with part-of-speech filtering, and the keywords are extracted by semantically composing the words inside multi-scale sliding windows. Then, a Convolutional Neural Network (CNN) classification model is built for the expanded text; the model uses max pooling for feature filtering and random dropout to prevent overfitting. Finally, the title and subtitle are concatenated into a double-word representation and, together with the multi-keyword set, fed into the model as separate inputs. Experiments were conducted on the news title classification dataset of the 2017 Natural Language Processing & Chinese Computing evaluation (NLP&CC 2017). The results show that the triple expansion combined with the corresponding CNN model achieves a classification accuracy of 79.42% over 18 news title categories, 9.5% higher than the CNN model without expansion, and that the keyword expansion speeds up model convergence, which verifies the effectiveness of the triple expansion method and the constructed CNN classification model.
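The abstract describes a dual-input CNN over the expanded triple: one branch reads the concatenated title+subtitle token sequence, the other reads the extracted keyword set, and each branch applies multi-width convolutions, max-over-time pooling, and dropout before classification into 18 categories. Below is a minimal sketch of such a model in PyTorch; it is not the authors' released code, the expansion step (synonym subtitle construction and sliding-window keyword extraction) is not shown, and all names and hyperparameters (embedding size, filter widths, filter counts, dropout rate) are illustrative assumptions.

```python
import torch
import torch.nn as nn


class ExpandedTitleCNN(nn.Module):
    """Sketch of a dual-input CNN classifier for expanded news titles.

    One branch encodes the title+subtitle token sequence, the other the
    keyword set; each branch uses multi-width 1-D convolutions followed by
    max-over-time pooling, and the pooled features are concatenated,
    passed through dropout, and classified into 18 news categories.
    Hyperparameters are illustrative, not taken from the paper.
    """

    def __init__(self, vocab_size, emb_dim=300, filter_widths=(2, 3, 4),
                 num_filters=100, num_classes=18, dropout=0.5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.text_convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, num_filters, w) for w in filter_widths])
        self.kw_convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, num_filters, w) for w in filter_widths])
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(
            2 * num_filters * len(filter_widths), num_classes)

    def _branch(self, token_ids, convs):
        # (batch, seq_len) -> (batch, emb_dim, seq_len) as expected by Conv1d
        x = self.embedding(token_ids).transpose(1, 2)
        # max-over-time pooling keeps the strongest feature per filter
        pooled = [torch.relu(conv(x)).max(dim=2).values for conv in convs]
        return torch.cat(pooled, dim=1)

    def forward(self, title_subtitle_ids, keyword_ids):
        text_feat = self._branch(title_subtitle_ids, self.text_convs)
        kw_feat = self._branch(keyword_ids, self.kw_convs)
        features = self.dropout(torch.cat([text_feat, kw_feat], dim=1))
        return self.classifier(features)  # logits over the 18 classes


# Toy usage with random token ids (vocabulary size and lengths are arbitrary)
model = ExpandedTitleCNN(vocab_size=50000)
titles = torch.randint(1, 50000, (8, 20))    # title + subtitle tokens
keywords = torch.randint(1, 50000, (8, 6))   # extracted keyword tokens
logits = model(titles, keywords)             # shape: (8, 18)
```

The max pooling and dropout in this sketch mirror the feature filtering and overfitting prevention mentioned in the abstract; feeding the title+subtitle sequence and the keyword set as separate inputs mirrors the described dual-input design.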
Source: Journal of Computer Applications (《计算机应用》), CSCD, Peking University Core Journal, 2017, Issue 12, pp. 3498-3503 (6 pages).
Funding: Western Project of the National Social Science Fund of China (17XXW005); Science and Technology Research Project of Chongqing Municipal Education Commission (KJ1500903).
Keywords: news title classification; semantic expansion; Convolutional Neural Network (CNN); synonym; semantic composition