Abstract
An automatic text classification system cannot directly understand the semantics of a text and classify it. The text must first be preprocessed to extract keywords that express its topic, and these keywords are stored in a structured form that serves as the representation of the text. Exploiting the fact that text data contain a large number of word co-occurrences, this paper proposes a context-based text classification method. The method uses the contextual relations between words to define word similarity and word weights, which represents the semantics of a word within a category more soundly and thereby improves classification quality. Experimental results show that the method classifies markedly better than the traditional simple vector distance classification method.
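The abstract does not give the paper's formulas for word similarity or word weights, so the following is only a minimal sketch of the general idea of deriving word similarity from co-occurrence context. It assumes a symmetric context window and cosine similarity between co-occurrence count vectors. The function names, the `window` parameter, and the toy documents are all hypothetical and are not taken from the paper.

```python
from collections import Counter, defaultdict
from math import sqrt


def cooccurrence_vectors(docs, window=2):
    """Map each word to a Counter of the words seen within `window` positions of it.

    Assumption: a symmetric sliding window defines "context"; the paper may
    define context differently.
    """
    vectors = defaultdict(Counter)
    for tokens in docs:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


# Toy tokenized documents (hypothetical data, for illustration only).
docs = [
    ["stock", "market", "rises"],
    ["share", "market", "rises"],
    ["team", "wins", "match"],
]

vectors = cooccurrence_vectors(docs)
sim_related = cosine(vectors["stock"], vectors["share"])    # identical contexts
sim_unrelated = cosine(vectors["stock"], vectors["team"])   # no shared context
```

In this toy data, "stock" and "share" never appear together but share the same neighbours, so their context vectors coincide, while "stock" and "team" have no neighbours in common. A plain vector distance classifier that matches terms literally would treat "stock" and "share" as unrelated features.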
Source
Computer Technology and Development (《计算机技术与发展》)
2011, No. 8, pp. 145-148, 152 (5 pages)
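Funding
Science and Technology Plan Project of Huai'an City, Jiangsu Province (HAG09061)
Key Fund Project of Huaiyin Institute of Technology (HGA0907)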
Keywords
word co-occurrence
context
word similarity
text classification