
New Words Identification Based on Iterative Algorithm (cited by: 7)
Abstract: New word identification is an important foundation for Chinese information processing, but the strong word-forming capacity of Chinese characters makes new words difficult to detect. Inspired by the duality principle, this paper proposes a new word identification method based on an iterative algorithm. The target corpus is first segmented and tagged with parts of speech; string frequencies are then counted in two scanning passes and repetitive patterns are extracted. Taking word-structure characteristics into account, the method iteratively applies features of the repetitive patterns, namely mutual information (MI), left (right) entropy, and the right (left) average entropy of the left (right) neighbor, to produce a candidate list of new words. A final filtering pass against a Chinese word collocation library yields the final list of new words. Experimental results show that the method reaches a P@10 of 100% and raises P@100 to 90%, and that the right (left) average entropy of the left (right) neighbor can improve identification accuracy to some extent.
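The abstract names mutual information and left (right) entropy as features but does not give their formulas. The sketch below uses the common textbook definitions (pointwise mutual information between two parts of a string, and Shannon entropy over the characters adjacent to a candidate) as an assumption; it is not the authors' implementation. The function names are hypothetical, and the neighbor-average entropy feature is omitted because its definition is not stated here.

```python
import math
from collections import Counter


def find_all(text, pattern):
    """Return the start index of every (possibly overlapping) occurrence."""
    positions = []
    start = text.find(pattern)
    while start != -1:
        positions.append(start)
        start = text.find(pattern, start + 1)
    return positions


def shannon_entropy(counter):
    """Shannon entropy (in bits) of the distribution given by the counts."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counter.values())


def neighbor_entropies(text, candidate):
    """Left and right entropy of a candidate string.

    High values on both sides mean the candidate appears in varied
    contexts, which is typical of a free-standing word.
    """
    left, right = Counter(), Counter()
    for start in find_all(text, candidate):
        end = start + len(candidate)
        if start > 0:
            left[text[start - 1]] += 1   # character just before the candidate
        if end < len(text):
            right[text[end]] += 1        # character just after the candidate
    return shannon_entropy(left), shannon_entropy(right)


def pointwise_mutual_information(text, left_part, right_part):
    """PMI (in bits) between two adjacent parts of a candidate string.

    Probabilities are estimated as occurrence counts divided by the
    number of positions at which a string of that length could start.
    """
    def prob(s):
        return len(find_all(text, s)) / max(len(text) - len(s) + 1, 1)

    p_joint = prob(left_part + right_part)
    if p_joint == 0:
        return float("-inf")             # the parts never occur together
    return math.log2(p_joint / (prob(left_part) * prob(right_part)))
```

For instance, `neighbor_entropies("微博很火微博好玩刷微博", "微博")` inspects the characters around each occurrence of 微博; a candidate would be kept only if its entropies and PMI exceed thresholds, which would have to be tuned per corpus.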
Source: Computer Engineering (《计算机工程》), indexed in CAS and CSCD, 2014, Issue 7, pp. 154-158, 164 (6 pages)
Funding: National Natural Science Foundation of China (Grant No. 61272362)
Keywords: duality principle; new words identification; iterative algorithm; information entropy; repetitive pattern; library of Chinese words collocation
