
Study on the Influences of Text Categorization Performance Based on Corpus Information Measurement (基于语料信息度量的文本分类性能影响研究)

Cited by: 5
Abstract: Different classification algorithms often perform differently on corpora with very different characteristics. By examining the underlying reasons why classification algorithms perform differently on specialized corpora and on self-built corpora, this article proposes a new way to improve classification performance. Starting from a comparison of automatic classification results on different corpora, it defines three indicators to measure corpus information: category clustering density, category complexity, and category clarity. Multi-factor analysis of variance is used to examine the relationship between the three indicators and classification performance, which yields the influence of each indicator on the performance of different classification algorithms. An overlapping-class text classification method based on category clarity is also proposed to verify the validity of the indicators. Experiments show that all three indicators affect the classification performance of the algorithms to varying degrees: the higher the clustering density, the lower the complexity, and the higher the clarity of the corpus categories, the better the classification results.
Source: Journal of Intelligence (《情报杂志》), CSSCI, PKU Core Journal, 2014, No. 9, pp. 157-162, 180 (7 pages)
Keywords: corpus; self-built corpus; category information; categorization algorithm; categorization performance
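
The abstract names the three corpus indicators but this record does not give their formulas. The sketch below is therefore not the authors' method. It only illustrates the general idea with two stand-in measures chosen here as assumptions: "density" as the mean cosine similarity of a category's documents to the category centroid, and "clarity" as one minus the highest centroid similarity to any other category. The function names `class_density` and `class_clarity` and the term-frequency representation are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors):
    """Average of a non-empty list of sparse term-frequency vectors."""
    total = Counter()
    for vec in vectors:
        total.update(vec)
    return {term: count / len(vectors) for term, count in total.items()}


def class_density(docs_by_class):
    """ASSUMED proxy for clustering density (not the paper's definition):
    mean cosine similarity of each document to its category centroid."""
    result = {}
    for label, docs in docs_by_class.items():
        vectors = [Counter(tokens) for tokens in docs]
        center = centroid(vectors)
        result[label] = sum(cosine(v, center) for v in vectors) / len(vectors)
    return result


def class_clarity(docs_by_class):
    """ASSUMED proxy for category clarity (not the paper's definition):
    1 minus the highest centroid similarity to any other category."""
    centers = {
        label: centroid([Counter(tokens) for tokens in docs])
        for label, docs in docs_by_class.items()
    }
    result = {}
    for label, center in centers.items():
        overlaps = [cosine(center, other)
                    for other_label, other in centers.items()
                    if other_label != label]
        result[label] = 1.0 - max(overlaps, default=0.0)
    return result


if __name__ == "__main__":
    # Tiny made-up corpus of pre-tokenized documents, for illustration only.
    corpus = {
        "sports": [["ball", "goal", "team"], ["team", "goal", "match"]],
        "finance": [["stock", "bank", "market"], ["bank", "market", "loan"]],
    }
    print(class_density(corpus))
    print(class_clarity(corpus))
```

Under these assumed measures, a category whose documents share most of their vocabulary scores a higher density, and a category whose centroid overlaps little with the others scores a higher clarity, which matches the direction of the relationship the abstract reports.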
