Abstract
In the Vector Space Model used for traditional text clustering, the weight of a word is assumed to depend only on its frequency, not on the context in which it appears. This paper describes how to analyze the context of a document with an ontology dictionary organized by the relations between words, yielding the semantic relations among the words in the document. The frequencies of related words and the weights of those relations are then used to quantify how strongly the context supports a word, so that a word's weight reflects both its own frequency and its contextual support. The paper also describes how to construct the required dictionary automatically, so that the method can be applied to Chinese, where ontology dictionary resources are still relatively scarce. Experimental results show that the method effectively reduces noise and improves text clustering compared with the traditional method.
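The abstract does not give the weighting formula, so the following is only a minimal sketch of the general idea, under stated assumptions: a word's weight is taken to be its frequency plus a scaled sum of (relation weight × frequency) over related words found in the same document. The function name `context_weights`, the mixing parameter `alpha`, the linear combination, and the toy `relations` dictionary are all hypothetical and are not taken from the paper.

```python
from collections import Counter


def context_weights(tokens, relations, alpha=0.5):
    """Weight each word by its frequency plus contextual support.

    tokens:    list of words in one document (already segmented).
    relations: dict mapping word -> {related_word: relation_weight};
               a stand-in for the ontology dictionary (assumed shape).
    alpha:     hypothetical factor scaling the contextual support.
    """
    tf = Counter(tokens)
    weights = {}
    for word, freq in tf.items():
        # Support comes only from related words that occur in this document.
        support = sum(
            rel_weight * tf[other]
            for other, rel_weight in relations.get(word, {}).items()
            if other != word and other in tf
        )
        weights[word] = freq + alpha * support
    return weights


# Toy data, invented for illustration only.
relations = {
    "bank": {"loan": 0.8, "money": 0.6},
    "loan": {"bank": 0.8},
    "money": {"bank": 0.6},
}
tokens = ["bank", "loan", "loan", "money", "river"]
weights = context_weights(tokens, relations)
```

In this toy document, "river" has no related words present, so it keeps its raw frequency, while "bank" gains weight from "loan" and "money". This mirrors the noise-reduction effect the abstract describes, though the actual formula in the paper may differ.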
Source
《中文信息学报》
CSCD
Peking University Core Journals (北大核心)
2007, No. 6, pp. 109-115 (7 pages)
Journal of Chinese Information Processing
Keywords
computer application
Chinese information processing
text clustering
context
word weight
ontology dictionary