Abstract
Traditional text classification algorithms built on the vector space model suffer from high-dimensional, sparse vectors and ignore the semantic correlation between features, which leads to low classification efficiency and poor accuracy. To address this, the paper uses HowNet as a knowledge base to construct a Semantic Concept Vector Model (SCVM) for representing text, merging synonyms and disambiguating polysemous words according to concept semantics and context. It then proposes TCABCC (Text Classification Algorithm Based on the Concept of Clusters), which improves on traditional KNN by representing the training samples of each category as concept clusters, so that similarity is computed between a text's concept vector and the category concept clusters. Experimental results show that the classifier built with this algorithm is considerably better than traditional KNN in both efficiency and performance.
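The idea of comparing a text's concept vector with one concept cluster per category can be illustrated with a minimal sketch. This is not the paper's TCABCC implementation: the word-to-concept dictionary `TOY_CONCEPTS` is a hypothetical stand-in for a HowNet lookup, the function names are invented for illustration, a cluster is assumed to be the summed concept counts of a category's training documents, cosine similarity is an assumed choice of measure, and word sense disambiguation is omitted entirely.

```python
import math
from collections import Counter

# Hypothetical word -> concept table standing in for a HowNet lookup.
# Synonyms map to the same concept, which is how they get merged.
TOY_CONCEPTS = {
    "car": "vehicle", "automobile": "vehicle", "bus": "vehicle",
    "doctor": "medicine", "physician": "medicine", "hospital": "medicine",
}

def concept_vector(tokens, concepts=TOY_CONCEPTS):
    """Count concept frequencies for a token list; unknown words are skipped."""
    return Counter(concepts[t] for t in tokens if t in concepts)

def build_cluster(docs, concepts=TOY_CONCEPTS):
    """Assumed cluster form: the sum of the concept vectors of one category's documents."""
    cluster = Counter()
    for tokens in docs:
        cluster.update(concept_vector(tokens, concepts))
    return cluster

def cosine(a, b):
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def classify(tokens, clusters, concepts=TOY_CONCEPTS):
    """Return the label whose concept cluster is most similar to the text."""
    vec = concept_vector(tokens, concepts)
    return max(clusters, key=lambda label: cosine(vec, clusters[label]))

# Toy training data: one cluster per category instead of one vector per sample.
clusters = {
    "transport": build_cluster([["car", "bus"], ["automobile"]]),
    "health": build_cluster([["doctor", "hospital"]]),
}
label = classify(["physician", "automobile", "hospital"], clusters)
```

In this sketch a new text is compared with one cluster per category rather than with every training sample, which is where an efficiency gain over plain KNN would be expected to come from.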
Source
Library and Information Service (《图书情报工作》)
CSSCI
Peking University Core Journals (北大核心)
2013, No. 15, pp. 132-136, 82 (6 pages in total)
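Funding
One of the research outputs of the Jiangsu Provincial Department of Education Philosophy and Social Science Project for Universities, "Research on Personalized Information Service Models for Network Resources" (Project No. 2012SJD870001)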
Keywords
text classification
semantic concept vector
concept cluster
KNN
HowNet