Abstract
Class-based statistical language models are an important approach to the data sparsity problem in statistical language modeling. The approach has two main bottlenecks: (1) Word clustering: it is still hard to find a mature clustering algorithm that has a moderate computational cost and converges well. (2) Class-based models often sacrifice some predictive power in order to adapt better to corpora from different domains. This paper addresses both bottlenecks. For word clustering, the authors propose a hierarchical word clustering algorithm based on the similarity between words in natural language. Experiments show that the algorithm clearly improves on the conventional greedy statistical clustering algorithm in both algorithmic complexity and clustering quality. To improve predictive power, the paper proposes a new method for building class-based variable-length (vari-gram) models; the resulting class-based vari-gram model predicts much better than the usual class-based n-gram model.
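To illustrate why grouping words into classes eases data sparsity, here is a minimal Python sketch of the textbook class-based bigram decomposition, P(w | w_prev) ≈ P(w | c) · P(c | c_prev). It is not the paper's clustering or vari-gram algorithm: the word-to-class map is supplied by hand, and the names `train_class_bigram`, `word2class`, and `prob` are hypothetical.

```python
from collections import Counter


def train_class_bigram(tokens, word2class):
    """Return a function estimating P(word | prev_word) through word classes.

    Assumption: each word belongs to exactly one class, given by word2class.
    Estimates are plain maximum likelihood with no smoothing.
    """
    classes = [word2class[w] for w in tokens]
    word_count = Counter(tokens)
    class_count = Counter(classes)
    class_bigram = Counter(zip(classes, classes[1:]))
    # How often each class occurs as the left element of a class bigram.
    history_count = Counter(classes[:-1])

    def prob(word, prev_word):
        c, c_prev = word2class[word], word2class[prev_word]
        if history_count[c_prev] == 0:
            return 0.0  # the previous class was never seen as a history
        p_word_given_class = word_count[word] / class_count[c]
        p_class_given_prev = class_bigram[(c_prev, c)] / history_count[c_prev]
        return p_word_given_class * p_class_given_prev

    return prob


# Toy corpus; the class labels are assigned by hand for illustration only.
tokens = "the cat sat the dog ran".split()
word2class = {"the": "DET", "cat": "NOUN", "dog": "NOUN",
              "sat": "VERB", "ran": "VERB"}
prob = train_class_bigram(tokens, word2class)

# "cat ran" never occurs in the corpus, yet it gets a non-zero estimate
# because NOUN -> VERB transitions were observed.
print(prob("ran", "cat"))  # 0.5
```

A word-level bigram model would assign the unseen pair "cat ran" probability zero. The class-level estimate shares statistics across all NOUN-VERB pairs, at the cost of no longer distinguishing "cat ran" from "dog ran", which is the loss of predictive power the abstract mentions.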
Source
《计算机学报》
EI
CSCD
PKU Core Journals (北大核心)
1999, No. 9, pp. 942-948 (7 pages)
Chinese Journal of Computers