Abstract
Social media text is characterized by non-standard writing, so existing natural language processing tools perform poorly when applied to it directly, and keyword-based algorithms and applications likewise fall short of expectations. Research on how to better normalize social media text is therefore meaningful and valuable. Based on the assumption that non-standard words in social media text share similar contexts with their standard forms, we introduce a word embedding model to better capture context similarity and propose an improved graph-based method for social media text normalization. The method is unsupervised and language-independent, and can process large-scale unlabeled social media text in different languages. Experimental results show that the method remedies shortcomings of previous methods and achieves the best F-score in comparison experiments with related methods.
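The context-similarity assumption can be illustrated with a minimal sketch. This is not the paper's graph-based method, only a toy showing how embedding vectors could be used to rank candidate standard forms for a non-standard word. The vectors, the vocabulary, and the names `cosine` and `rank_candidates` are all hypothetical and chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical toy embeddings; real ones would be trained on unlabeled social media text.
EMBEDDINGS = {
    "tomorrow": [0.90, 0.10, 0.00],
    "2morrow": [0.85, 0.15, 0.05],
    "today": [0.10, 0.90, 0.00],
    "banana": [0.00, 0.10, 0.90],
}

# Hypothetical list of in-vocabulary (standard) words.
VOCAB = {"tomorrow", "today", "banana"}


def rank_candidates(word, embeddings, vocab):
    """Rank standard words by embedding similarity to a non-standard word."""
    query = embeddings[word]
    scored = [
        (cand, cosine(query, embeddings[cand]))
        for cand in vocab
        if cand != word and cand in embeddings
    ]
    # Highest similarity first.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranking = rank_candidates("2morrow", EMBEDDINGS, VOCAB)
print(ranking[0][0])
```

In this toy setup the non-standard word and its standard form were given nearly parallel vectors on purpose, so the standard form ranks first. How the paper combines such similarities with its graph is not described in the abstract.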
Source
《中文信息学报》
CSCD
Peking University Core Journal
2015, Issue 5, pp. 104-111 (8 pages)
Journal of Chinese Information Processing
Funding
Natural Science Foundation of Zhejiang Province (LY12F02010)
Sichuan Province Science Support Project (2014GZ0063)
Keywords
social media
text normalization
natural language processing
word embedding