Abstract
Taking the characteristics of Chinese text into account, this paper proposes a method for computing the weights of text feature terms based on multi-factor weighting. The weight of a feature term reflects not only the probabilistic information of the words in the text but also the text's semantic information and other factors. The concrete algorithm is given. Comparative experiments against weighting methods based on term frequency and on TF-IDF show that the proposed method effectively improves the accuracy of text clustering.
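For orientation, the two baselines named in the abstract can be sketched as follows. This is a minimal illustration of standard term-frequency and TF-IDF weighting, not the paper's multi-factor method, whose factors are not specified in this record. The function names, the unsmoothed natural-log IDF, and the pre-segmented toy documents are all assumptions made for the example.

```python
import math
from collections import Counter


def term_frequency(tokens):
    """Relative frequency of each term within one document."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}


def tf_idf(docs):
    """TF-IDF weights per document, using idf = ln(N / df) without smoothing."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in docs:
        tf = term_frequency(tokens)
        weights.append(
            {term: tf[term] * math.log(n_docs / doc_freq[term]) for term in tf}
        )
    return weights


# Hypothetical, already word-segmented Chinese documents.
docs = [
    ["文本", "聚类", "特征"],
    ["文本", "权值"],
    ["聚类", "算法", "文本"],
]
weights = tf_idf(docs)
```

A term that occurs in every document, such as 文本 here, receives a TF-IDF weight of zero, which is one reason a method might bring in additional factors such as semantic information.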
Source
Computing Technology and Automation (《计算技术与自动化》)
2007, No. 1, pp. 81-83, 86 (4 pages)
Keywords
feature term (特征项)
text clustering (文本聚类)
Chinese text (中文文本)
natural language processing (自然语言处理)