Extracting Terms from Russian Texts Based on Multi Strategies (original Chinese title: 多策略融合的俄语文本词语提取方法研究)
Abstract: Russian is one of the working languages of the United Nations and an official language of Russia and several other countries. With the advancement of the Belt and Road Initiative and the acceleration of globalization, Russian text data has become an important source of information for managerial decision-making in the organizations concerned, and Russian text mining has accordingly become an important decision-support method. However, research on Russian text mining methods is still far from mature. In particular, its key foundation, term extraction from Russian texts, performs poorly, which limits the accuracy of Russian text modeling. This paper therefore proposes a multi-strategy method for extracting terms from Russian texts, which combines Russian part-of-speech (POS) analysis, grammatical rules, and string frequency statistics to automatically extract Russian terms, including both single words and multiword expressions. Experiments on the United Nations Parallel Corpus and the Taiga Corpus show that the proposed method achieves a high accuracy of over 85% while maintaining a high recall rate, significantly outperforming the commonly used n-gram method. The resulting lexicons can support Russian text mining applications such as topic discovery, text classification, and text clustering.
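The abstract names three ingredients (POS analysis, grammatical rules, and string frequency statistics) without giving details. The following is only a minimal sketch of how such a combination might look, not the authors' implementation: the POS patterns in `ALLOWED_PATTERNS`, the function `extract_terms`, its parameters `max_len` and `min_freq`, and the hand-tagged sample sentences are all hypothetical and chosen purely for illustration.

```python
from collections import Counter

# Hypothetical POS patterns for candidate terms (an assumption, not the
# paper's rule set): a single noun, adjective + noun, or noun + noun.
ALLOWED_PATTERNS = {("NOUN",), ("ADJ", "NOUN"), ("NOUN", "NOUN")}


def extract_terms(tagged_sentences, max_len=2, min_freq=2):
    """Count word strings whose POS sequence matches an allowed pattern
    and keep those that occur at least min_freq times."""
    counts = Counter()
    for sentence in tagged_sentences:
        for n in range(1, max_len + 1):
            for i in range(len(sentence) - n + 1):
                window = sentence[i:i + n]
                tags = tuple(tag for _, tag in window)
                if tags in ALLOWED_PATTERNS:  # grammatical filter
                    counts[" ".join(word for word, _ in window)] += 1
    # Frequency filter: rare strings are discarded.
    return {term: c for term, c in counts.items() if c >= min_freq}


# Hand-tagged toy input; a real pipeline would obtain the tags from a
# Russian POS tagger instead.
SAMPLE = [
    [("генеральная", "ADJ"), ("ассамблея", "NOUN"),
     ("приняла", "VERB"), ("резолюцию", "NOUN")],
    [("генеральная", "ADJ"), ("ассамблея", "NOUN"),
     ("одобрила", "VERB"), ("доклад", "NOUN")],
]

terms = extract_terms(SAMPLE)
print(terms)
```

On this toy input, only the noun "ассамблея" and the phrase "генеральная ассамблея" pass both the pattern filter and the frequency threshold. Unlike a plain n-gram count, the POS filter rejects strings such as "ассамблея приняла" regardless of how often they occur.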
Authors: TANG Juxiang; SUN Yihui; LIAO Xiao; LIU Jianguo; YU Juan
Source: China Terminology (《中国科技术语》), 2021, No. 3, pp. 59-67 (9 pages)
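Funding: National Natural Science Foundation of China project "Research on Organizational Heterogeneous Data Fusion Methods Based on Ontology Learning and Ontology Mapping" (71771054).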
Keywords: Russian text mining; term extraction; POS tag; frequent word-string