Abstract
The κ-nearest-neighbor classifier (KNN) is widely used in text categorization, but it presumes that training data are evenly distributed among categories, and it is sensitive to the parameter κ. This paper analyzes these shortcomings of the traditional KNN method and their root causes, and proposes an unsupervised KNN text-classification algorithm (UKNNC). The method first applies the sum-of-squared-error criterion to adaptively select, from each category present among the κ nearest neighbors, the subset of neighbors lying in the same cluster as the input document, and uses them as reference neighbors; it then classifies the input document according to the degree to which it disturbs the kernel density of each category's reference neighbors. Experiments show that the method achieves higher classification quality, works effectively on corpora with complex or imbalanced distributions, and is insensitive to the value of κ.
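The two-stage procedure described in the abstract can be sketched in code. The following is a minimal, illustrative sketch only: the paper's exact sum-of-squared-error selection rule and its kernel-density "disturbance" measure are not given here, so the greedy SSE subset selection and the use of the mean Gaussian kernel value at the input document as the per-class score are stand-in assumptions; all function names and the bandwidth `h` are hypothetical.

```python
import numpy as np

def gaussian_kernel_density(points, x, h=1.0):
    """Mean Gaussian kernel value of x w.r.t. a set of reference points
    (a simple proxy for how strongly x perturbs their density estimate)."""
    d2 = np.sum((points - x) ** 2, axis=1)
    return np.mean(np.exp(-d2 / (2.0 * h * h)))

def select_same_cluster(points, x):
    """Greedy SSE-based selection (illustrative stand-in for the paper's
    criterion): grow the subset of neighbors nearest to x and keep the
    prefix whose cluster around x has the lowest mean squared error."""
    order = np.argsort(np.sum((points - x) ** 2, axis=1))
    best, best_sse = order[:1], np.inf
    for m in range(1, len(points) + 1):
        subset = np.vstack([points[order[:m]], x])
        centroid = subset.mean(axis=0)
        sse = np.sum((subset - centroid) ** 2) / (m + 1)
        if sse < best_sse:
            best_sse, best = sse, order[:m]
    return points[best]

def uknnc_predict(X_train, y_train, x, k=15, h=1.0):
    """Classify x: per category among the k nearest neighbors, pick the
    same-cluster reference subset, then score by kernel density at x."""
    idx = np.argsort(np.sum((X_train - x) ** 2, axis=1))[:k]
    scores = {}
    for c in np.unique(y_train[idx]):
        pts = X_train[idx][y_train[idx] == c]
        ref = select_same_cluster(pts, x)
        scores[c] = gaussian_kernel_density(ref, x, h)
    return max(scores, key=scores.get)
```

Because each category is scored only against its own same-cluster reference neighbors, a large majority class cannot outvote a small one purely by count, which is the imbalance problem the abstract targets.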
Source
Journal of the China Society for Scientific and Technical Information (《情报学报》)
Indexed in CSSCI and the Peking University Core Journal list (北大核心)
2008, No. 4, pp. 550-555 (6 pages)
Funding
Ministry of Education key research project "Planning, Management and Utilization of Digital Information Resources" (No. JZD20050024).
Keywords
κ-nearest neighbor
kernel density estimation
sum-of-squared-error criterion
text classification