Abstract
Chinese news titles usually contain anywhere from one word to a few dozen words. Because they have few characters and sparse features, classification accuracy is difficult to improve. To address this problem, a text semantic expansion method based on word embedding is proposed. First, each news title is expanded into a triple of (title, subtitle, keywords): the subtitle is constructed from synonyms of the title words combined with part-of-speech filtering, and the keywords are extracted by semantically composing the words within multi-scale sliding windows. Then, a Convolutional Neural Network (CNN) classification model is constructed for the expanded text; the model uses max pooling for feature filtering and random dropout to prevent overfitting. Finally, the title and subtitle are concatenated into a double-word representation, which, together with the multi-keyword set, is fed into the model as a separate input. Experiments were conducted on the news title classification dataset of the 2017 Natural Language Processing and Chinese Computing evaluation (NLP&CC2017). The results show that triple expansion combined with the corresponding CNN model achieves a classification accuracy of 79.42% on news titles in 18 categories, 9.5% higher than the CNN model without expansion, and that keyword expansion speeds up model convergence. These results verify the effectiveness of the triple expansion method and the constructed CNN classification model.
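The multi-scale sliding window step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state how words in a window are composed or how keywords are then selected, so the averaging in `compose` is only an assumed stand-in, and the names `sliding_windows`, `compose`, and the toy `embeddings` table are hypothetical rather than taken from the paper.

```python
from typing import Dict, List, Sequence, Tuple

Vector = List[float]


def sliding_windows(tokens: Sequence[str], sizes: Sequence[int]) -> List[Tuple[str, ...]]:
    """Return every contiguous window of tokens for each window size (scale)."""
    windows: List[Tuple[str, ...]] = []
    for size in sizes:
        # range() is empty when the window is longer than the title.
        for start in range(len(tokens) - size + 1):
            windows.append(tuple(tokens[start:start + size]))
    return windows


def compose(window: Sequence[str], embeddings: Dict[str, Vector]) -> Vector:
    """Compose a window into one vector by averaging its word vectors.

    Averaging is an assumption made for illustration; the paper's actual
    composition function is not given in the abstract.
    """
    vectors = [embeddings[word] for word in window if word in embeddings]
    if not vectors:
        return []
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]


# Toy two-dimensional embeddings, invented for this example only.
embeddings = {"stock": [1.0, 0.0], "market": [0.0, 1.0], "rises": [1.0, 1.0]}
title = ["stock", "market", "rises"]
windows = sliding_windows(title, sizes=[1, 2])
composed = {window: compose(window, embeddings) for window in windows}
```

Each composed vector could then be compared against vocabulary embeddings to pick candidate keywords, though that selection step is likewise not specified in the abstract.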
Source
Journal of Computer Applications (《计算机应用》)
CSCD
PKU Core Journal (北大核心)
2017, Issue 12, pp. 3498-3503 (6 pages)
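Funding
National Social Science Fund of China, Western Project (17XXW005)
Science and Technology Research Project of the Chongqing Municipal Education Commission (KJ1500903)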
Keywords
news title classification
semantic expansion
Convolutional Neural Network (CNN)
synonym
semantic composition