Abstract
New word identification is an important foundation for Chinese information processing, but the strong word-formation capacity of Chinese characters makes new words difficult to detect. Inspired by the duality principle, this paper proposes an iterative algorithm for new word identification. The target corpus is first segmented and tagged with parts of speech; string frequencies are then counted in two scans and repeated patterns are extracted. Drawing on features of word structure, the algorithm iteratively applies the mutual information (MI) of the repeated patterns, the left (right) entropy, and the right (left) average entropy of the left (right) neighbor to obtain a list of candidate new words. A final filtering pass against a Chinese word collocation library yields the final list of new words. Experimental results show that the method reaches a P@10 of 100% and raises P@100 to 90%, and that the right (left) average entropy of the left (right) neighbor improves the accuracy of new word identification to some extent.
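As a rough illustration of two of the features named in the abstract, the sketch below scores repeated character patterns by pointwise mutual information and by left and right neighbor entropy. It is not the paper's algorithm: it skips segmentation, part-of-speech tagging, the iterative loop, the neighbor-average entropy feature (which the abstract does not define), and the collocation filter. All function names, the thresholds, and the normalization of every n-gram count by the corpus length are assumptions made for this example.

```python
import math
from collections import Counter


def ngram_counts(text, max_len):
    """Count every substring of length 1..max_len (the repeated patterns)."""
    counts = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def entropy(counter):
    """Shannon entropy (in bits) of a neighbor-frequency distribution."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counter.values())


def neighbor_entropies(text, pattern):
    """Left and right entropy of the characters adjacent to `pattern`."""
    left, right = Counter(), Counter()
    start = text.find(pattern)
    while start != -1:
        if start > 0:
            left[text[start - 1]] += 1
        end = start + len(pattern)
        if end < len(text):
            right[text[end]] += 1
        # Step by one character so overlapping occurrences are also seen.
        start = text.find(pattern, start + 1)
    return entropy(left), entropy(right)


def min_pmi(pattern, counts, total_chars):
    """Smallest pointwise mutual information over all two-way splits.

    Assumption: every probability is approximated as count / total_chars.
    """
    p_whole = counts[pattern] / total_chars
    best = float("inf")
    for k in range(1, len(pattern)):
        p_a = counts[pattern[:k]] / total_chars
        p_b = counts[pattern[k:]] / total_chars
        best = min(best, math.log2(p_whole / (p_a * p_b)))
    return best


def candidate_patterns(text, max_len=4, min_freq=2, min_mi=1.0, min_entropy=0.5):
    """Keep patterns that are frequent, cohesive, and free on both sides.

    The threshold defaults are illustrative, not values from the paper.
    """
    counts = ngram_counts(text, max_len)
    results = []
    for pattern, freq in counts.items():
        if len(pattern) < 2 or freq < min_freq:
            continue
        mi = min_pmi(pattern, counts, len(text))
        left_h, right_h = neighbor_entropies(text, pattern)
        if mi >= min_mi and min(left_h, right_h) >= min_entropy:
            results.append((pattern, freq, mi, left_h, right_h))
    return sorted(results, key=lambda r: r[2], reverse=True)
```

A high mutual information value suggests the characters of a pattern belong together, while high entropy on both sides suggests the pattern appears in varied contexts and is therefore likely to be a complete word rather than a fragment of a longer one.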
Source
《计算机工程》
CAS
CSCD
2014, Issue 7, pp. 154-158, 164 (6 pages)
Computer Engineering
Funding
Supported by the National Natural Science Foundation of China (61272362)
Keywords
duality principle
new word identification
iterative algorithm
information entropy
repeated pattern
Chinese word collocation library