
Automatic Selection of Chinese Stoplist (中文停用词表的自动选取)

Cited by: 35
Abstract: After surveying existing statistics-based methods for selecting stop words, the paper proposes a new selection method. For each word, the method computes the probability that the word occurs within each sentence of the corpus and the probability that sentences containing the word occur in the corpus, then computes the joint entropy of these probabilities and selects stop words according to that entropy. The resulting stoplist is compared with those produced by traditional methods, and the effect of each method on classification performance when used as a preprocessing step for text categorization is also compared. The experiments show that the method better avoids the influence of the corpus's writing style and format on stop word selection, and that it is more suitable than traditional methods for preprocessing in text categorization.
Source: Transactions of Beijing Institute of Technology (《北京理工大学学报》), 2005, No. 4, pp. 337-340 (4 pages). Indexed in EI, CAS, CSCD, and the Peking University Core Journals list (北大核心).
Keywords: stop word; Chinese stoplist; union entropy (joint entropy, 联合熵)
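The abstract describes the method only in outline, so the following is a minimal sketch of an entropy-based stop word ranking, not the paper's actual formulas. It makes three assumptions that the record does not confirm: the within-sentence probability of a word is its count in a sentence divided by its total count in the corpus; the sentence-level probability is the fraction of sentences containing the word, scored with a binary entropy; and the two entropies are simply added, which equals a joint entropy only if the two distributions are treated as independent. The names `stopword_scores` and `select_stopwords` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def stopword_scores(sentences):
    """Score each word in a corpus of tokenized sentences.

    sentences: list of lists of tokens (already word-segmented).
    Returns a dict mapping word -> entropy-based score (higher = more
    stop-word-like under the ASSUMED scoring described above).
    """
    n_sentences = len(sentences)
    # word -> list of its counts in each sentence that contains it
    counts_by_word = defaultdict(list)
    for tokens in sentences:
        for word, count in Counter(tokens).items():
            counts_by_word[word].append(count)

    scores = {}
    for word, counts in counts_by_word.items():
        total = sum(counts)
        # ASSUMPTION: entropy of how the word's occurrences are spread
        # across the sentences that contain it.
        h_within = -sum((c / total) * math.log2(c / total) for c in counts)
        # ASSUMPTION: binary entropy of "a sentence contains this word".
        q = len(counts) / n_sentences
        if q in (0.0, 1.0):
            h_sentence = 0.0
        else:
            h_sentence = -(q * math.log2(q) + (1 - q) * math.log2(1 - q))
        # ASSUMPTION: sum of the two entropies (joint entropy under independence).
        scores[word] = h_within + h_sentence
    return scores


def select_stopwords(sentences, k):
    """Return the k highest-scoring words, ties broken alphabetically."""
    scores = stopword_scores(sentences)
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]


if __name__ == "__main__":
    corpus = [
        ["我", "的", "书"],
        ["你", "的", "猫"],
        ["他", "的", "车"],
        ["猫", "跑"],
    ]
    print(select_stopwords(corpus, 2))
```

Because the score depends on how evenly a word is spread over sentences rather than on raw frequency alone, a word repeated many times inside a single formatted passage would score low in this sketch, which is in the spirit of the abstract's claim about reducing the influence of writing format.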

