Abstract
[Purpose/significance] With the rapid development of the Internet industry, a wide range of social media applications has emerged, and with them the volume of information in the form of colloquial short texts has grown rapidly. Mining the key content from these resources and classifying it automatically has become an important topic in text mining. [Method/process] Taking microblog posts as an example, this paper defines features at two granularities, words and characters, and applies information gain, information gain ratio, Word2vec, and feature frequency to reduce the feature dimension, focusing on the characteristics and roles of the two feature types in colloquial short text classification. [Result/conclusion] The experiments show that, even after word features are screened and extracted, their classification performance remains below that of character features on microblog text. Choosing character features may therefore be a practical and effective approach for colloquial short text classification.
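To make the information gain step concrete, here is a minimal Python sketch that scores binary presence/absence features at either granularity. It is not the authors' implementation: the helper names (`char_features`, `word_features`, `top_k_features`) are hypothetical, word segmentation is assumed to have been done beforehand (tokens separated by spaces), and the abstract does not state how the paper weights features or which classifier it uses.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(docs, labels, feature):
    """Information gain of a binary feature (present/absent in each doc).

    `docs` is a list of feature sets, aligned with `labels`.
    """
    present = [y for d, y in zip(docs, labels) if feature in d]
    absent = [y for d, y in zip(docs, labels) if feature not in d]
    n = len(labels)
    conditional = sum(
        len(part) / n * entropy(part) for part in (present, absent) if part
    )
    return entropy(labels) - conditional


def char_features(text):
    """Character granularity: every non-whitespace character is a feature."""
    return {ch for ch in text if not ch.isspace()}


def word_features(segmented_text):
    """Word granularity: assumes the text is already segmented by spaces."""
    return set(segmented_text.split())


def top_k_features(docs, labels, k):
    """Keep the k features with the highest information gain (ties by name)."""
    vocab = set().union(*docs)
    ranked = sorted(vocab, key=lambda f: (-information_gain(docs, labels, f), f))
    return ranked[:k]
```

The same `top_k_features` call works for both granularities; only the feature extractor changes, which is what makes a like-for-like comparison of word and character features possible.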
Authors
LIU Xiao-min (刘小敏)
WANG Hao (王昊)
LI Xin-lei (李心蕾)
DENG San-hong (邓三鸿)
School of Information Management, Nanjing University, Nanjing 210023, China; Jiangsu Key Laboratory of Data Engineering and Knowledge Service, Nanjing 210023, China
Source
Information Science (《情报科学》)
CSSCI
PKU Core Journal (北大核心)
2018, Issue 12, pp. 126-133 (8 pages)
Funding
National Natural Science Foundation of China (71503121)
Nanjing University "Zhongying Young Scholars" (仲英青年学者) program, among others
Keywords
feature granularity
short text
colloquial text
feature reduction