An Improving Method for Social Media Text Normalization
(Chinese title: 一种改进的社交媒体文本规范化方法)

Cited by: 1

Abstract: Social media text is written in an informal, non-standard style. Existing natural language processing tools perform poorly when applied to it directly, and keyword-based algorithms and applications also fall short of expectations, so normalizing social media text is a worthwhile research problem. Based on the assumption that a non-standard word in social media text shares similar contexts with its standard form, this paper introduces a word embedding model to better capture context similarity and proposes an improved graph-based method for social media text normalization. The method is unsupervised and language-independent, and can process large-scale unlabeled social media text in different languages. Experimental results show that the method addresses shortcomings of previous methods and achieves the best F-score in comparative experiments with related methods.
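
The context-similarity assumption in the abstract can be illustrated with a small sketch. This is not the authors' graph-based method; it only shows how candidate standard forms might be ranked by cosine similarity between word embeddings. The vectors, the vocabulary, and the function names `cosine` and `rank_candidates` are hypothetical and chosen purely for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy, hand-made embeddings (assumption: real vectors would be learned
# from a large unlabeled social media corpus).
EMBEDDINGS = {
    "tmrw": [0.9, 0.1, 0.0],
    "tomorrow": [0.8, 0.2, 0.1],
    "today": [0.1, 0.9, 0.2],
    "banana": [0.0, 0.1, 0.9],
}

# Hypothetical lexicon of standard (in-vocabulary) words.
STANDARD_VOCAB = {"tomorrow", "today", "banana"}


def rank_candidates(word, embeddings, vocab):
    """Rank standard words by embedding similarity to a non-standard word."""
    target = embeddings[word]
    scored = [
        (cand, cosine(target, embeddings[cand]))
        for cand in vocab
        if cand in embeddings
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranking = rank_candidates("tmrw", EMBEDDINGS, STANDARD_VOCAB)
best = ranking[0][0]
print(best)
```

With these toy vectors the top-ranked candidate for "tmrw" is "tomorrow", because the two words were given nearly parallel vectors. A full system would combine such a similarity score with other evidence, which the abstract does not detail.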

Authors: Song Yajun (宋亚军), Yu Zhonghua (于中华), Chen Li (陈黎), Ding Gejian (丁革建), Luo Qian (罗谦)

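Affiliations: College of Computer Science, Sichuan University (四川大学计算机学院); College of Mathematics, Physics and Information Engineering, Zhejiang Normal University (浙江师范大学数理与信息工程学院); Information Technology Branch, The Second Research Institute of the Civil Aviation Administration of China (中国民用航空总局第二研究所信息技术分公司)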

Source: Journal of Chinese Information Processing (《中文信息学报》), indexed in CSCD and the Peking University Core Journals list, 2015, No. 5, pp. 104-111 (8 pages)

Funding: Natural Science Foundation of Zhejiang Province (LY12F02010); Sichuan Province Science Support Project (2014GZ0063)

Keywords: social media; text normalization; natural language processing; word embedding

Classification: TP391 [Automation and Computer Technology - Computer Application Technology]
