
An Improving Method for Social Media Text Normalization (一种改进的社交媒体文本规范化方法)

Cited by: 1
Abstract: Social media text is informal and nonstandard, so existing natural language processing tools perform poorly when applied to it directly, and keyword-based algorithms and applications also fall short of expectations. Studying how to normalize social media text better is therefore both meaningful and valuable. Based on the assumption that a nonstandard word in social media text and its standard form share similar contexts, we introduce a word embedding model to better capture context similarity and propose an improved graph-based method for social media text normalization. The method is unsupervised and language-independent, so it can process large-scale unlabeled social media text in different languages. Experimental results show that the method remedies shortcomings of previous methods and achieves the best F-score in comparative experiments with related methods.
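Source: Journal of Chinese Information Processing (《中文信息学报》; indexed in CSCD and the Peking University Core list), 2015, No. 5, pp. 104-111 (8 pages)
Funding: Zhejiang Provincial Natural Science Foundation (LY12F02010); Sichuan Provincial Science Support Project (2014GZ0063)
Keywords: social media; text normalization; natural language processing; word embedding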
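
The abstract's core assumption, that a nonstandard word and its standard form appear in similar contexts, can be illustrated with a minimal sketch. This is not the paper's graph-based method; it only ranks standard-word candidates by the cosine similarity of their embeddings. The vectors, the lexicon, and the names `EMBEDDINGS`, `VOCAB`, and `rank_candidates` are all hypothetical and chosen for illustration.

```python
import math

# Hypothetical toy embeddings: made-up 3-d vectors, not taken from the paper.
EMBEDDINGS = {
    "tomorrow": [0.9, 0.1, 0.2],
    "tmrw": [0.85, 0.15, 0.25],
    "today": [0.5, 0.6, 0.1],
    "banana": [0.0, 0.2, 0.9],
}

# Assumed lexicon of standard (in-vocabulary) words.
VOCAB = {"tomorrow", "today", "banana"}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_candidates(word, embeddings=EMBEDDINGS, vocab=VOCAB):
    """Rank standard words by embedding similarity to a nonstandard word."""
    if word not in embeddings:
        return []  # no context information available for this word
    scored = [
        (cand, cosine(embeddings[word], embeddings[cand]))
        for cand in vocab
        if cand in embeddings and cand != word
    ]
    # Highest similarity first.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for cand, score in rank_candidates("tmrw"):
        print(f"{cand}\t{score:.3f}")
```

With these toy vectors, "tomorrow" ranks first for "tmrw" because their vectors point in nearly the same direction. A real system would learn the embeddings from a large unlabeled corpus and would need further evidence, such as string similarity, to separate true variants from merely related words.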

