Abstract
The κ-nearest-neighbor classifier (KNN) is widely used in text categorization, but it presumes that training data are evenly distributed among categories, and it is sensitive to the parameter κ. This paper analyzes these shortcomings of the traditional KNN method and their root causes, and proposes an unsupervised KNN text-classification algorithm (UKNNC). The method first applies the sum-of-squared-error criterion to adaptively select, from each category present among the κ nearest neighbors, the subset of neighbors lying in the same cluster as the input document, and uses them as reference neighbors; it then classifies the input document according to the degree to which it disturbs the kernel density of each category's reference neighbors. Experiments show that the method achieves higher classification quality, works effectively on corpora with complex or imbalanced distributions, and is insensitive to the value of κ.
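The two-stage procedure described in the abstract can be sketched in code. The following is a minimal, illustrative sketch only: the paper's exact sum-of-squared-error selection rule and its kernel-density "disturbance" measure are not given here, so the greedy SSE subset selection and the use of the mean Gaussian kernel value at the input document as the per-class score are stand-in assumptions; all function names and the bandwidth `h` are hypothetical.

```python
import numpy as np

def gaussian_kernel_density(points, x, h=1.0):
    """Mean Gaussian kernel value of x w.r.t. a set of reference points
    (a simple proxy for how strongly x perturbs their density estimate)."""
    d2 = np.sum((points - x) ** 2, axis=1)
    return np.mean(np.exp(-d2 / (2.0 * h * h)))

def select_same_cluster(points, x):
    """Greedy SSE-based selection (illustrative stand-in for the paper's
    criterion): grow the subset of neighbors nearest to x and keep the
    prefix whose cluster around x has the lowest mean squared error."""
    order = np.argsort(np.sum((points - x) ** 2, axis=1))
    best, best_sse = order[:1], np.inf
    for m in range(1, len(points) + 1):
        subset = np.vstack([points[order[:m]], x])
        centroid = subset.mean(axis=0)
        sse = np.sum((subset - centroid) ** 2) / (m + 1)
        if sse < best_sse:
            best_sse, best = sse, order[:m]
    return points[best]

def uknnc_predict(X_train, y_train, x, k=15, h=1.0):
    """Classify x: per category among the k nearest neighbors, pick the
    same-cluster reference subset, then score by kernel density at x."""
    idx = np.argsort(np.sum((X_train - x) ** 2, axis=1))[:k]
    scores = {}
    for c in np.unique(y_train[idx]):
        pts = X_train[idx][y_train[idx] == c]
        ref = select_same_cluster(pts, x)
        scores[c] = gaussian_kernel_density(ref, x, h)
    return max(scores, key=scores.get)
```

Because each category is scored only against its own same-cluster reference neighbors, a large majority class cannot outvote a small one purely by count, which is the imbalance problem the abstract targets.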
Source
Journal of the China Society for Scientific and Technical Information (《情报学报》)
Indexed in CSSCI and the Peking University Core Journal list (北大核心)
2008, No. 4, pp. 550-555 (6 pages)
Funding
Ministry of Education key research project "Planning, Management and Utilization of Digital Information Resources" (No. JZD20050024).
Keywords
κ-nearest neighbor
kernel density estimation
sum-of-squared-error criterion
text classification