An Unsupervised Text Classification Algorithm Based on k-Nearest Neighbors (cited by: 2)
Abstract: k-nearest-neighbor (KNN) classification is a widely used text classification method, but it is ill-suited to unevenly distributed data sets and is sensitive to the value of k. This paper analyzes the shortcomings of the traditional KNN method and their root causes, and proposes an unsupervised KNN text classification algorithm (UKNNC). The method first uses the sum-of-squared-error criterion to adaptively select, from each category represented among the k nearest neighbors, the neighbors that fall in the same cluster as the input document, and uses them as references. It then classifies the input document according to the degree of disturbance that the document causes to the kernel density of each category's reference neighbors. Experimental results indicate that UKNNC achieves higher classification quality, with a significant improvement on imbalanced corpora, that it handles data sets with complex distributions effectively, and that its results are insensitive to k.
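The two steps named in the abstract can be illustrated with a rough sketch. This is not the authors' implementation: the abstract gives no formulas, so every concrete choice below is an assumption made for illustration. Those assumptions are Euclidean distance, a one-dimensional two-group split of the sorted neighbor distances as the "sum-of-squared-error criterion", a Gaussian kernel with a hypothetical `bandwidth` parameter, and "disturbance" read as the average kernel contribution the input adds to the density at each reference neighbor. The names `sse_split` and `uknnc_sketch` are likewise hypothetical.

```python
import numpy as np


def sse_split(sorted_dists):
    """Return how many of the nearest neighbors to keep as references.

    Assumed criterion: try every split of the ascending distances into a
    near group and a far group, and keep the near group of the split with
    the smallest total within-group sum of squared errors. All neighbors
    are kept only when no split improves on the unsplit group.
    """
    d = np.asarray(sorted_dists, dtype=float)
    best_m = len(d)
    best_sse = np.sum((d - d.mean()) ** 2)  # SSE with no split
    for m in range(1, len(d)):
        near, far = d[:m], d[m:]
        sse = np.sum((near - near.mean()) ** 2) + np.sum((far - far.mean()) ** 2)
        if sse < best_sse:
            best_m, best_sse = m, sse
    return best_m


def uknnc_sketch(X, y, q, k=5, bandwidth=1.0):
    """Classify q from its k nearest neighbors (illustrative only)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    q = np.asarray(q, dtype=float)

    dists = np.linalg.norm(X - q, axis=1)  # assumed: Euclidean distance
    nn = np.argsort(dists)[:k]             # indices of the k nearest, ascending

    scores = {}
    for label in np.unique(y[nn]):
        idx = nn[y[nn] == label]           # this class's neighbors, still ascending
        keep = idx[:sse_split(dists[idx])]  # reference neighbors for this class
        # Assumed disturbance measure: mean Gaussian-kernel contribution of q
        # to the density estimate at each reference neighbor.
        contrib = np.exp(-dists[keep] ** 2 / (2.0 * bandwidth ** 2))
        scores[label] = float(contrib.mean())

    predicted = max(scores, key=scores.get)
    return predicted, scores


if __name__ == "__main__":
    X = [[0, 0], [0.1, 0], [0, 0.1], [3, 3], [3.1, 3], [5, 5]]
    y = [0, 0, 0, 1, 1, 1]
    label, scores = uknnc_sketch(X, y, q=[0.05, 0.05], k=5)
    print(label, scores)
```

Because each class is scored by an average over its own reference neighbors rather than by a vote count, a class is not favored merely for contributing more of the k neighbors, which is the behavior the abstract attributes to UKNNC on imbalanced data. Whether the paper scores classes this way is not stated in the abstract.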
Source: Journal of the China Society for Scientific and Technical Information (《情报学报》), indexed in CSSCI and the Peking University Core Journals list, 2008, No. 4, pp. 550-555 (6 pages)
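Funding: Ministry of Education Key Research Project "Research on the Planning, Management, and Utilization of Digital Information Resources" (No. JZD20050024).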
Keywords: k-nearest neighbor; kernel density estimation; sum-of-squared-error criterion; text classification