Abstract
A knowledge-based text categorization method is proposed. It introduces domain knowledge and uses domain features as text features to strengthen text representation, and it treats text categorization as an aggregation computation. Text indexing uses an improved feature selection and weighting technique. A learning algorithm based on mutual information automatically learns the domain feature aggregation formulas from a labeled training corpus. Experimental results show that, overall, the domain-knowledge-based method outperforms the conventional naive Bayes classifier built on the bag-of-words model, and that applying domain knowledge effectively improves the classification of similar and opposing topics.
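The abstract does not give the aggregation formulas or the exact mutual information estimate, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes each document is already reduced to a set of domain features, scores each feature-category pair with pointwise mutual information estimated from document counts, and classifies by summing those scores per category. The function names `mutual_information` and `classify`, the toy features, and the choice to score unseen pairs as 0 are all illustrative assumptions.

```python
import math
from collections import Counter


def mutual_information(docs):
    """Estimate pointwise mutual information for each (feature, label) pair.

    docs: list of (set_of_features, label) pairs from a labeled corpus.
    Returns a dict mapping (feature, label) to log(P(f, c) / (P(f) * P(c))).
    Only pairs that co-occur at least once receive a score.
    """
    n = len(docs)
    feature_count = Counter()
    label_count = Counter()
    joint_count = Counter()
    for features, label in docs:
        label_count[label] += 1
        for f in set(features):
            feature_count[f] += 1
            joint_count[(f, label)] += 1

    scores = {}
    for (f, label), joint in joint_count.items():
        # Probabilities are document-frequency estimates; n cancels into the ratio.
        scores[(f, label)] = math.log(
            joint * n / (feature_count[f] * label_count[label])
        )
    return scores


def classify(features, scores, labels):
    """Aggregate per-feature scores for each label and return the best label.

    Assumption: a (feature, label) pair never seen in training contributes 0.
    """
    totals = {
        c: sum(scores.get((f, c), 0.0) for f in features) for c in labels
    }
    return max(totals, key=totals.get)
```

A call such as `classify({"bank", "loan"}, scores, ["economy", "sports"])` sums the learned scores for each category and picks the largest. Summation is used here only because it is the simplest aggregation; the paper learns its aggregation formulas from the training corpus.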
Source
Journal of Northeastern University (Natural Science)
EI
CAS
CSCD
Peking University Core Journals
2005, No. 8, pp. 733-735 (3 pages)
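Funding
Supported by the National Natural Science Foundation of China (60203019), a joint funding project with Microsoft Research Asia (60473140), and the Key Project of Science and Technology Research of the Ministry of Education of China (104065).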
Keywords
domain knowledge
text categorization
aggregation computation
machine learning
naive Bayes model