Abstract
A category homogenizing method was developed to reduce the effect of uneven category distribution in the training set on text categorization performance. The original training set is reassembled category by category so that the category distribution of the new training set is as uniform as possible. Training and classification are then carried out on the balanced categories, which reduces the unfair treatment of small categories during training. The method was applied to the Fudan University corpus with Naive Bayes and Rocchio classifiers. With Naive Bayes, the macro-averaged F1 rose from 48.62% to 80.99%; with Rocchio, the macro-averaged F1 rose from 64.58% to 80.26% and the micro-averaged F1 from 73.99% to 80.47%. The experimental results show that the category homogenizing method significantly improves classification performance.
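The gap between macro-averaged and micro-averaged F1 reported above is typical of skewed category distributions: macro-F1 weights every category equally, so a neglected small category pulls it down sharply, while micro-F1 is dominated by the large categories. The following is a minimal sketch of the two measures on invented toy data. The function name `f1_scores` and the labels are hypothetical, and the sketch does not reproduce the paper's method or figures.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (macro_f1, micro_f1) for single-label predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted category gets a false positive
            fn[t] += 1  # true category gets a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Macro: average the per-category F1 values, one vote per category.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    # Micro: pool the counts over all categories, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


# Toy data (invented): 9 documents in a large category, 1 in a small one,
# and a classifier that always predicts the large category.
y_true = ["big"] * 9 + ["small"]
y_pred = ["big"] * 10
macro, micro = f1_scores(y_true, y_pred)
print(f"macro-F1={macro:.3f}, micro-F1={micro:.3f}")
```

On this toy input the small category contributes an F1 of zero, so the macro average falls well below the micro average even though nine of ten documents are classified correctly.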
Source
Journal of Tsinghua University (Science and Technology)
EI
CAS
CSCD
Peking University Core Journals
2005, Issue S1, pp. 1802-1805 (4 pages)
Funding
Supported by the Teaching and Research Award Program for Outstanding Young Teachers in Higher Education Institutions
Keywords
text categorization
training set
category homogenizing