Abstract
Chinese word segmentation is a key component of Chinese information processing systems, and topic-oriented information retrieval places particular demands on both its speed and its accuracy. This paper addresses two questions in building the dictionary: where the entries come from and how they are stored. It then proposes a fast Chinese word segmentation method that gives priority to proper nouns. The dictionary is indexed by a hash of the first character, stored in layers by word length, and searched by binary search. Proper nouns are segmented first, which splits the sentence into fragments; each fragment is then mechanically segmented twice, once forward and once backward (bidirectional maximum matching); finally, a fast and effective scoring function selects the best result, which is then adjusted. Experiments show that on topic-information documents the method reaches a speed of 920,000 characters per second with an accuracy of 96%, indicating high performance for this class of documents.
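The dictionary layout and the two-pass matching described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `LayeredDictionary`, `forward_max_match`, `backward_max_match`, `score`, and `segment` are hypothetical, the scoring rule (fewer tokens, then fewer single-character tokens) is an assumed stand-in for the paper's evaluation function, and the proper-noun pre-segmentation and final adjustment steps are omitted.

```python
from bisect import bisect_left
from collections import defaultdict


class LayeredDictionary:
    """First character (hash key) -> word length (layer) -> sorted word list."""

    def __init__(self, words):
        words = {w for w in words if w}
        buckets = defaultdict(lambda: defaultdict(list))
        for w in words:
            buckets[w[0]][len(w)].append(w)
        # Sort each layer once so that lookups can use binary search.
        self.index = {
            first: {n: sorted(layer) for n, layer in layers.items()}
            for first, layers in buckets.items()
        }
        self.max_len = max((len(w) for w in words), default=1)

    def contains(self, word):
        layer = self.index.get(word[:1], {}).get(len(word))
        if not layer:
            return False
        i = bisect_left(layer, word)
        return i < len(layer) and layer[i] == word


def forward_max_match(text, d):
    """Greedy longest match scanning left to right."""
    out, i = [], 0
    while i < len(text):
        for n in range(min(d.max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            # A single character is always accepted as a fallback.
            if n == 1 or d.contains(piece):
                out.append(piece)
                i += n
                break
    return out


def backward_max_match(text, d):
    """Greedy longest match scanning right to left."""
    out, j = [], len(text)
    while j > 0:
        for n in range(min(d.max_len, j), 0, -1):
            piece = text[j - n:j]
            if n == 1 or d.contains(piece):
                out.append(piece)
                j -= n
                break
    return out[::-1]


def score(tokens):
    # Assumed heuristic (lower is better): fewer tokens,
    # then fewer single-character tokens.
    return (len(tokens), sum(1 for t in tokens if len(t) == 1))


def segment(text, d):
    """Run both passes and keep the better-scoring result."""
    fwd = forward_max_match(text, d)
    bwd = backward_max_match(text, d)
    return min((bwd, fwd), key=score)  # ties favor the backward pass


d = LayeredDictionary(["研究", "研究生", "生命", "起源"])
print(segment("研究生命的起源", d))
```

In this toy example the forward pass greedily takes 研究生 and leaves 命 stranded as a single character, while the backward pass recovers 研究 / 生命; the assumed scoring rule therefore prefers the backward result.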
Source
Computer Technology and Development (《计算机技术与发展》)
2008, Issue 3, pp. 24-27 (4 pages)
Keywords
Chinese word segmentation
proper nouns
dictionary mechanism