Abstract
The document-term frequency matrix currently used in text classification is both very high-dimensional and extremely sparse, which makes computation difficult. To address this problem, a text topic partitioning method based on topic-word frequency features is proposed, motivated by how search engine users choose the documents they need. The method first selects the topic words of each text class by statistical means, and then clusters the documents with the fuzzy C-means (FCM) algorithm, using topic-word classes rather than individual words as features. Experiments yielded good topic partitioning results. The method was also compared, in several aspects of both process and results, with a text clustering method based on word clustering, leading to some useful conclusions about implementation and application settings.
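The two steps named in the abstract (topic-word-class frequencies as features, then FCM clustering) can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not specify the statistical filtering step, so the topic-word classes here are simply assumed to be given. The function names (`topic_class_features`, `fuzzy_c_means`), the toy documents, the fuzzifier `m = 2`, and the initialisation from randomly chosen data points are all illustrative assumptions.

```python
import random

def topic_class_features(doc_tokens, topic_classes):
    """Map each document to the relative frequency of each topic-word class."""
    features = []
    for tokens in doc_tokens:
        counts = [sum(1 for t in tokens if t in words) for words in topic_classes]
        total = sum(counts) or 1  # guard against documents with no topic words
        features.append([c / total for c in counts])
    return features

def fuzzy_c_means(points, c, m=2.0, iters=100, tol=1e-6, seed=0):
    """Plain fuzzy C-means; returns (membership matrix, cluster centers)."""
    rng = random.Random(seed)
    n, d = len(points), len(points[0])
    # Assumption: initial centers are c distinct data points chosen at random.
    centers = [list(p) for p in rng.sample(points, c)]
    u = [[0.0] * c for _ in range(n)]
    for _ in range(iters):
        # Membership update: u_ij = 1 / sum_l (d_ij / d_il)^(2 / (m - 1))
        new_u = []
        for p in points:
            dist = [max(sum((p[k] - ctr[k]) ** 2 for k in range(d)) ** 0.5, 1e-12)
                    for ctr in centers]
            new_u.append([1.0 / sum((dist[j] / dist[l]) ** (2.0 / (m - 1.0))
                                    for l in range(c)) for j in range(c)])
        shift = max(abs(new_u[i][j] - u[i][j]) for i in range(n) for j in range(c))
        u = new_u
        # Center update: mean of the points weighted by u_ij^m
        centers = []
        for j in range(c):
            w = [u[i][j] ** m for i in range(n)]
            centers.append([sum(w[i] * points[i][k] for i in range(n)) / sum(w)
                            for k in range(d)])
        if shift < tol:
            break
    return u, centers

# Hypothetical topic-word classes and tokenised documents, for illustration only.
topic_classes = [{"team", "goal", "match"}, {"stock", "market", "bank"}]
docs = [
    ["team", "goal", "match", "goal"],
    ["team", "match", "goal", "market"],
    ["stock", "market", "bank", "stock"],
    ["bank", "market", "stock", "team"],
]
features = topic_class_features(docs, topic_classes)
memberships, centers = fuzzy_c_means(features, c=2)
```

Each row of `memberships` gives one document's degree of membership in every cluster and sums to 1. The feature vector has one dimension per topic-word class rather than one per vocabulary word, which is how this representation avoids the dimensionality and sparsity problems described above.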
Source
《计算机应用》
CSCD
Peking University Core Journal (北大核心)
2006, No. 8, pp. 1993-1995 (3 pages)
Journal of Computer Applications
Funding
Supported by the Xiamen University 985 Project Phase II Information Innovation Platform (0000-X07204)
Keywords
search engine
document clustering
fuzzy C-means (FCM)
topic word filtering