An Experimental Comparison of Feature Selection Methods for Document Classification (cited by: 6)

An Empirical Study on Feature Selection Methods for Centroid-based Text Classification
Abstract: Because most information, including most of what is accessible on the Internet, is expressed as text, document classification is one of the core tasks of automated information processing. Among classification algorithms, the centroid-based document classification method has been shown in a recent study to be very efficient and effective, and it is widely used as a result. Its performance, however, depends heavily on the quality of the feature space used to represent the text. This paper applies four feature selection methods based on markedly different principles as a preprocessing step to build a feature space suited to centroid-based classification, and analyzes their performance. Experiments show that the feature selection method based on support vector machines (SVM) not only achieves a good minimum error rate but is also insensitive to the number of features selected, making it the most stable and effective of the four. The authors therefore recommend SVM-based feature selection as the preprocessing step for centroid-based document classification in practical applications.
Source: Journal of Guangxi Normal University: Natural Science Edition (indexed in CAS and the Peking University Core Journals list), 2008, No. 3, pp. 181-184 (4 pages)
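Funding: National 863 High-Tech Program of China (2006AA01Z143, 2006AA01Z139); National Natural Science Foundation of China (60673043); National Social Science Foundation of China (07BYY051)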
Keywords: document classification; feature selection; information gain; Relief; random forests; support vector machine (SVM)
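
The pipeline summarized in the abstract (select features first, then classify by nearest class centroid) can be illustrated with a minimal sketch. This is not the authors' implementation. It uses information gain, one of the four methods named in the keywords, only because it needs nothing beyond the Python standard library. The toy corpus, the function names, the L2-normalized term-frequency vectors, and the cosine similarity are all assumptions made for illustration; this record does not state the paper's weighting scheme or similarity measure.

```python
import math
from collections import Counter, defaultdict

def entropy(counts):
    """Shannon entropy (bits) of a collection of class counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def information_gain(docs, labels, term):
    """Information gain of the event 'term occurs in the document'."""
    n = len(docs)
    present = Counter(l for d, l in zip(docs, labels) if term in d)
    absent = Counter(l for d, l in zip(docs, labels) if term not in d)
    gain = entropy(Counter(labels).values())
    for part in (present, absent):
        size = sum(part.values())
        if size:
            gain -= size / n * entropy(part.values())
    return gain

def select_features(docs, labels, k):
    """Keep the k terms with the highest information gain (ties broken alphabetically)."""
    vocab = sorted({t for d in docs for t in d})
    ranked = sorted(vocab, key=lambda t: (-information_gain(docs, labels, t), t))
    return set(ranked[:k])

def vectorize(tokens, features):
    """L2-normalized term-frequency vector restricted to the selected features."""
    tf = Counter(t for t in tokens if t in features)
    norm = math.sqrt(sum(v * v for v in tf.values()))
    return {t: v / norm for t, v in tf.items()} if norm else {}

def train_centroids(docs, labels, features):
    """Average the document vectors of each class to obtain its centroid."""
    sums = defaultdict(Counter)
    counts = Counter(labels)
    for tokens, label in zip(docs, labels):
        for t, v in vectorize(tokens, features).items():
            sums[label][t] += v
    return {c: {t: v / counts[c] for t, v in sums[c].items()} for c in counts}

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def classify(tokens, centroids, features):
    """Assign the class whose centroid is most similar to the document."""
    x = vectorize(tokens, features)
    return max(sorted(centroids), key=lambda c: cosine(x, centroids[c]))

# Hypothetical toy corpus: each document is a list of tokens.
docs = [
    ["stock", "market", "the", "rises"],
    ["market", "stock", "the", "falls"],
    ["team", "wins", "the", "match"],
    ["team", "loses", "the", "match"],
]
labels = ["finance", "finance", "sports", "sports"]

features = select_features(docs, labels, k=4)
centroids = train_centroids(docs, labels, features)
print(classify(["the", "stock", "market", "crashes"], centroids, features))
```

In this toy setting the uninformative term "the" is discarded because it occurs in every document, so it cannot pull the two class centroids toward each other. Swapping in a different ranking function inside `select_features` is the only change needed to try another selection criterion.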

References (14)

  • 1. SALTON G. Automatic text processing: the transformation, analysis, and retrieval of information by computer[M]. Boston, MA: Addison-Wesley, 1989.
  • 2. HAN E, KARYPIS G. Centroid-based document classification: analysis & experimental results[C]//Proceedings of the 4th European Conference on the Principles of Data Mining and Knowledge Discovery (PKDD). Berlin: Springer, 2000: 424-431.
  • 3. JING Liping, HUANG Houkuan, SHI Hongbo. The feature selection method TFIDF for text mining and its improvement[J]. Journal of Guangxi Normal University (Natural Science Edition), 2003, 21(A01): 142-145. Cited by: 23
  • 4. WU Jianjun, KANG Yaohong. Comparison and improvement of feature selection methods in text classification[J]. Journal of Zhengzhou University (Natural Science Edition), 2007, 39(2): 110-113. Cited by: 16
  • 5. QUINLAN J R. C4.5: programs for machine learning[M]. San Mateo, CA: Morgan Kaufmann, 1993.
  • 6. KIRA K, RENDELL L A. A practical approach to feature selection[C]//Proceedings of the International Conference on Machine Learning (ICML). San Francisco, CA: Morgan Kaufmann, 1992: 249-256.
  • 7. KONONENKO I. Estimating attributes: analysis and extensions of Relief[C]//Proceedings of the European Conference on Machine Learning (ECML). Berlin: Springer, 1994: 171-182.
  • 8. BREIMAN L. Random forests[J]. Machine Learning, 2001, 45(1): 5-32.
  • 9. WITTEN I H, FRANK E. Data mining: practical machine learning tools with Java implementations[M]. San Francisco, CA: Morgan Kaufmann, 2000.
  • 10. GUYON I, WESTON J, BARNHILL S, et al. Gene selection for cancer classification using support vector machines[J]. Machine Learning, 2002, 46: 389-422.

