Abstract
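Word co-occurrence information can contribute to text classification, but current text classification research has not made full use of it. This paper proposes a strategy that uses association features to improve the performance of the naive Bayes text classifier. It gives a method for constructing the association feature set, and designs and implements an algorithm for pruning redundant association features and an algorithm for selecting association features, so that every feature in the feature space has strong discriminative power. Experiments show that the processed association feature set improves the performance of the naive Bayes text classifier.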
Word co-occurrence information can contribute to automatic text classification, but current text classifiers fail to take full advantage of it. We define the association feature to describe this information. To make association features good discriminators, we propose a technique for constructing the association feature set. First, we generate association features with an Apriori-like algorithm. Second, we propose an algorithm that evaluates the discriminative ability of association features in order to prune redundant ones. Third, we propose a feature selection algorithm, based on information gain (IG), to further reduce the dimensionality of the feature set. Experimental results on the Reuters-21578 dataset show that applying association features raises the Macro F1 of the naive Bayes text classifier from 72% to 83.5%. This result indicates that association features can improve the performance of naive Bayes text classifiers.
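The first and third steps of the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's algorithms, so the sketch makes several assumptions: an association feature is taken to be a pair of words that occur in the same document, the Apriori-like step is reduced to a single pass over word pairs, and IG is computed for the binary feature "both words are present". All function names (`frequent_pairs`, `information_gain`, `select_association_features`) and parameters (`min_support`, `top_k`) are hypothetical, and the redundancy pruning step is omitted because the abstract does not describe its criterion.

```python
from collections import Counter
from itertools import combinations
from math import log2


def frequent_pairs(docs, min_support):
    """Apriori-style pass: keep word pairs whose document frequency >= min_support.

    Only pairs built from individually frequent words are counted, because a
    pair can never occur in more documents than either of its words.
    """
    word_df = Counter(w for doc in docs for w in set(doc))
    frequent_words = {w for w, c in word_df.items() if c >= min_support}
    pair_df = Counter()
    for doc in docs:
        kept = sorted(set(doc) & frequent_words)
        pair_df.update(combinations(kept, 2))
    return {p for p, c in pair_df.items() if c >= min_support}


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(docs, labels, pair):
    """IG of the binary feature 'both words of the pair occur in the document'."""
    words = set(pair)
    present = [y for doc, y in zip(docs, labels) if words <= set(doc)]
    absent = [y for doc, y in zip(docs, labels) if not words <= set(doc)]
    n = len(labels)
    conditional = (len(present) / n) * entropy(present) \
        + (len(absent) / n) * entropy(absent)
    return entropy(labels) - conditional


def select_association_features(docs, labels, min_support=2, top_k=10):
    """Rank frequent pairs by IG (ties broken alphabetically) and keep top_k."""
    pairs = frequent_pairs(docs, min_support)
    ranked = sorted(pairs, key=lambda p: (-information_gain(docs, labels, p), p))
    return ranked[:top_k]
```

Each document is passed as a list of tokens, and `labels` holds one class label per document. The selected pairs could then be appended to the single-word features that a naive Bayes classifier already uses, which is one plausible reading of how the association feature set is applied.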
Source
《西北工业大学学报》
EI
CAS
CSCD
PKU Core (Peking University Core Journals)
2004, No. 4, pp. 413-416 (4 pages)
Journal of Northwestern Polytechnical University
Funding
Supported by the National Natural Science Foundation of China (60073055)
Keywords
naive Bayes classifier
association feature
feature selection
Algorithms
Classification (of information)
Data mining
Discriminators
Information analysis
Performance