Clustering-Based PU Active Text Classification Method

Cited by: 24
Abstract: Text classification is a key problem in information retrieval. Extracting more reliable negative examples and building accurate, efficient classifiers are the two central problems in PU (positive and unlabeled) text classification. However, many existing methods extract only a small number of reliable negative examples, which keeps the resulting classifiers from reaching high accuracy. This paper proposes a clustering-based semi-supervised active classification method that addresses both steps. Unlike traditional extraction methods, it uses clustering together with the observation that positive and negative documents should share as few feature terms as possible: it removes as many probable positive examples from the unlabeled set as it can, which leaves more reliable negative examples. The classifier is built by combining SVM active learning with an improved Rocchio method, and features are weighted with TFIPNDF (term frequency inverse positive-negative document frequency), an improved variant of TFIDF (term frequency inverse document frequency). Experiments on three datasets (RCV1, Reuters-21578, and 20 Newsgroups) show that the clustering-based method extracts more reliable negative examples than the baseline algorithms while keeping the error rate low, and that adding SVM active learning also significantly improves classification accuracy.
Source: Journal of Software (《软件学报》), indexed in EI, CSCD, and the Peking University Core Journals list; 2013, No. 11, pp. 2571-2583 (13 pages)
Funding: National Natural Science Foundation of China (60903098, 60973040)
Keywords: positive and unlabeled (PU) text classification; clustering; TFIPNDF (term frequency inverse positive-negative document frequency); active learning; reliable negative example; improved Rocchio
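
The reliable-negative extraction idea stated in the abstract (cluster the unlabeled documents, then discard clusters that share many terms with the positive set) can be illustrated with a minimal sketch. This is not the paper's algorithm: the greedy leader clustering, the Jaccard similarity, the function names, and both thresholds (`sim_threshold`, `max_shared`) are assumptions chosen only to keep the example small and self-contained.

```python
from typing import FrozenSet, List

Doc = FrozenSet[str]  # a document reduced to its set of terms (an assumption)


def jaccard(a: Doc, b: Doc) -> float:
    """Jaccard similarity between two term sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def leader_clusters(docs: List[Doc], sim_threshold: float) -> List[List[Doc]]:
    """Greedy single-pass clustering (a stand-in for the paper's clustering).

    Each document joins the first cluster whose leader (first member) is
    similar enough; otherwise it starts a new cluster.
    """
    clusters: List[List[Doc]] = []
    for doc in docs:
        for cluster in clusters:
            if jaccard(doc, cluster[0]) >= sim_threshold:
                cluster.append(doc)
                break
        else:
            clusters.append([doc])
    return clusters


def reliable_negatives(
    positive: List[Doc],
    unlabeled: List[Doc],
    sim_threshold: float = 0.3,  # hypothetical value
    max_shared: float = 0.3,     # hypothetical value
) -> List[Doc]:
    """Keep unlabeled clusters that share few terms with the positive set."""
    positive_terms = set().union(*positive)
    kept: List[Doc] = []
    for cluster in leader_clusters(unlabeled, sim_threshold):
        cluster_terms = set().union(*cluster)
        if not cluster_terms:
            continue  # empty documents carry no evidence either way
        shared = len(cluster_terms & positive_terms) / len(cluster_terms)
        if shared <= max_shared:
            kept.extend(cluster)  # probable negatives
        # clusters above the threshold are treated as probable positives
    return kept


positive = [
    frozenset({"stock", "market", "trade"}),
    frozenset({"market", "shares", "profit"}),
]
unlabeled = [
    frozenset({"stock", "market", "profit"}),  # hidden positive
    frozenset({"match", "goal", "league"}),
    frozenset({"goal", "league", "coach"}),
    frozenset({"stock", "market", "trade"}),   # hidden positive
]
negatives = reliable_negatives(positive, unlabeled)
```

In this toy input the two finance documents form one cluster whose terms all appear in the positive set, so that cluster is dropped, while the two sports documents share no terms with the positives and are kept. A real pipeline would presumably work on weighted term vectors rather than raw term sets.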