
A Rapid Chinese Word Segmentation Method Based on Priority Special Names (Chinese title: 基于专有名词优先的快速中文分词)
Cited by: 5
Abstract: Chinese word segmentation is a key component of Chinese information processing systems, and topic-oriented information retrieval systems place special demands on both its speed and its accuracy. This paper addresses two questions in building the dictionary, namely where the entries come from and how they are stored, and proposes a rapid Chinese word segmentation method that gives priority to proper nouns (special names). The dictionary is indexed by a hash of each entry's first character, stored in layers by word length, and searched by binary search. Segmentation first cuts out the proper nouns, which splits the sentence into fragments; each fragment is then segmented mechanically twice, once forward and once backward (bidirectional maximum matching); finally, a fast and effective evaluation function selects the better result, which is then adjusted. Experiments show that on topic-information documents the method reaches a speed of 920,000 characters per second with an accuracy of 96%, indicating high performance for segmenting this kind of document.
Source: Computer Technology and Development (《计算机技术与发展》), 2008, No. 3, pp. 24-27 (4 pages)
Keywords: Chinese word segmentation; proper nouns (special names); dictionary mechanism
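
The abstract names the ingredients of the method but not their details, so the Python sketch below is only one plausible reading of it, not the authors' implementation. All names (`LayeredDict`, `forward_mm`, `backward_mm`, `score`, `segment`) are hypothetical. The scoring rule (fewer words, then fewer single-character words, ties going to the backward result) is a common heuristic assumed here in place of the paper's evaluation function, and the final adjustment step is omitted.

```python
from bisect import bisect_left
from collections import defaultdict


class LayeredDict:
    """Dictionary indexed by first character, then word length, then a sorted list."""

    def __init__(self, words):
        words = {w for w in words if w}
        buckets = defaultdict(lambda: defaultdict(list))
        for w in words:
            buckets[w[0]][len(w)].append(w)
        # Sort every layer once so lookups can use binary search.
        self.buckets = {first: {n: sorted(ws) for n, ws in layers.items()}
                        for first, layers in buckets.items()}
        self.max_len = max((len(w) for w in words), default=1)

    def __contains__(self, word):
        if not word:
            return False
        layer = self.buckets.get(word[0], {}).get(len(word))
        if not layer:
            return False
        i = bisect_left(layer, word)
        return i < len(layer) and layer[i] == word


def forward_mm(text, d):
    """Forward maximum matching; unknown characters become one-character words."""
    out, i = [], 0
    while i < len(text):
        for n in range(min(d.max_len, len(text) - i), 0, -1):
            if n == 1 or text[i:i + n] in d:
                out.append(text[i:i + n])
                i += n
                break
    return out


def backward_mm(text, d):
    """Backward maximum matching, scanning from the end of the text."""
    out, j = [], len(text)
    while j > 0:
        for n in range(min(d.max_len, j), 0, -1):
            if n == 1 or text[j - n:j] in d:
                out.append(text[j - n:j])
                j -= n
                break
    return out[::-1]


def score(seg):
    # Assumed heuristic: lower is better (word count, then single-character count).
    return (len(seg), sum(len(w) == 1 for w in seg))


def segment(text, proper, general):
    """Cut out proper nouns first, then segment each remaining fragment both ways."""
    result, frag, i = [], [], 0

    def flush():
        if frag:
            piece = "".join(frag)
            candidates = (backward_mm(piece, general), forward_mm(piece, general))
            result.extend(min(candidates, key=score))  # ties go to backward
            frag.clear()

    while i < len(text):
        longest = min(proper.max_len, len(text) - i)
        n = next((k for k in range(longest, 0, -1) if text[i:i + k] in proper), 0)
        if n:
            flush()
            result.append(text[i:i + n])
            i += n
        else:
            frag.append(text[i])
            i += 1
    flush()
    return result


if __name__ == "__main__":
    proper = LayeredDict(["北京大学"])
    general = LayeredDict(["研究", "研究生", "生命", "起源"])
    print(segment("北京大学研究生命的起源", proper, general))
```

In this toy example the proper noun 北京大学 is cut out first, leaving the fragment 研究生命的起源. Forward matching yields 研究生 / 命 / 的 / 起源, backward matching yields 研究 / 生命 / 的 / 起源, and the assumed score prefers the backward result because it contains fewer single-character words.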


