Abstract
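Rough word segmentation in the preprocessing stage is the foundation of Chinese lexical analysis, and it strongly affects the final recall, precision, and running efficiency. Rough segmentation must supply the later stages with a small set of intermediate results that has high recall. This paper proposes a rough segmentation model based on the N-shortest-paths method, which aims to combine high recall with high efficiency. On this basis, word-frequency statistics are introduced to improve the original model, yielding a more practical statistical model. The authors ran rough segmentation experiments on a one-month corpus from the People's Daily (185,192 sentences in total). Measured by sentence, the 2-shortest-paths non-statistical model reaches a recall of 99.73%; with the 10-shortest-paths statistical model, an average of 6.12 rough segmentation results gives a recall of 99.94%, which is 15% higher than the maximum matching method and at least 6.4% higher than the best previous segmentation method. The average number of rough segmentation results is smaller than that of full segmentation by a factor of 64. The experimental results show that the N-shortest-paths method is a practical and effective means of rough word segmentation in preprocessing.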
As the very first step of Chinese word segmentation, rough segmentation tries to cover the correct segmentation with as few candidates as possible. This paper presents a rough segmentation model based on the N-shortest-paths method to achieve this goal. In addition, a statistical model can easily be obtained by attaching frequencies to the edges of the word graph. Experiments were carried out on a one-month news corpus of 185,192 sentences from the People's Daily. Measured by sentence, the recall of the non-statistical model based on the 2-shortest-paths method is 99.73%. When the statistical model is applied, a recall as high as 99.94% can be reached with 6.12 candidates on average, nearly 6.4% higher than the best known approach and 15% higher than maximum matching segmentation. The average number of segmentation candidates is also reduced by a factor of 64 compared with full segmentation. The results show that the N-shortest-paths method is effective for the task of rough segmentation.
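The non-statistical model described in the abstract can be illustrated with a minimal sketch. It assumes that the word graph has one node per character boundary, that every dictionary word (and every single character) is an edge of cost 1, and that the k cheapest paths are kept at each node. The function name `rough_segment`, the toy lexicon, and the tie-handling are illustrative assumptions, not the authors' implementation; the paper's own bookkeeping for paths of equal length may differ.

```python
import heapq


def rough_segment(sentence, lexicon, k=2):
    """Return up to k segmentations of `sentence` that use the fewest words.

    Nodes are character boundaries 0..len(sentence); each lexicon word and
    each single character is an edge of cost 1 (non-statistical sketch).
    """
    n = len(sentence)
    max_len = max([1] + [len(w) for w in lexicon])
    # best[i] holds up to k (cost, words) pairs for the prefix sentence[:i]
    best = [[] for _ in range(n + 1)]
    best[0] = [(0, [])]
    for end in range(1, n + 1):
        candidates = []
        for start in range(max(0, end - max_len), end):
            word = sentence[start:end]
            # single characters are always edges, so the graph stays connected
            if len(word) == 1 or word in lexicon:
                for cost, words in best[start]:
                    candidates.append((cost + 1, words + [word]))
        # keep only the k cheapest partial paths that end at this boundary
        best[end] = heapq.nsmallest(k, candidates, key=lambda c: c[0])
    return [words for _, words in best[n]]


# Toy lexicon (hypothetical) for an ambiguous sentence
lexicon = {"的确", "确实", "实在", "在理"}
results = rough_segment("他说的确实在理", lexicon, k=3)
```

A statistical variant along the lines the abstract mentions would replace the unit cost with a weight derived from word frequency (for example a negative log probability); the abstract does not state the exact weighting, so that choice is left open here.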
Source
《中文信息学报》
CSCD
Peking University Core Journal
2002, Issue 5, pp. 1-7 (7 pages)
Journal of Chinese Information Processing
Funding
National Key Basic Research Program of China (G1998030507-4, G1998030510).