摘要
通过对现有基于统计的停用词选取方法的考察,提出了一种新的停用词选取方法.用该方法分别计算词条在语料库中各个句子内发生的概率和包含该词条的句子在语料库中的概率,在此基础上计算它们的联合熵,依据联合熵选取停用词.将该方法与传统方法选取的停用词表进行了对比,并比较了将各种方法用于文本分类的预处理时对分类效果的影响.实验结果表明,该方法更好地避免了语料的行文格式对停用词选取的影响,比传统方法更适用于文本分类的预处理.
By investigating the methods of automatically selecting stop words based on statistical methods, a new method is proposed. The idea of this method is to calculate the probability that the word occurs in each sentence of corpus, and calculate the probability that the sentences include the word occuring in corpus, then calculate the entropy of these probabilities, and select stop words according to the entropy. The stoplist determined by this method is compared with that determined by the traditional methods, the effects of various preprocessing methods on the categorization are compared also. The experiments show that the method is better in avoiding the impact of the style or manner of writing in corpus on choosing the stoplist, and more suitable for preprocessing the text categorization than traditional methods.
出处
《北京理工大学学报》
EI
CAS
CSCD
北大核心
2005年第4期337-340,共4页
Transactions of Beijing Institute of Technology