
Three random under-sampling based ensemble classifiers for Web spam detection (cited by: 8)
Abstract: To address the slightly imbalanced classification problem in Web spam detection, three ensemble classifiers based on random under-sampling are proposed: Random Under-Sampling once without replacement (RUS-once), Random Under-Sampling multiple times without replacement (RUS-multiple), and Random Under-Sampling with replacement (RUS-replacement). First, the imbalanced training dataset is converted into several balanced datasets using one of the three under-sampling techniques. Second, a Classification And Regression Tree (CART) classifier is trained on each balanced dataset. Finally, the CART classifiers are combined into an ensemble by simple voting, and the ensemble is used to classify the test samples. The experimental results show that all three random under-sampling based ensemble classifiers achieve good classification results, and that RUS-multiple and RUS-replacement perform better than RUS-once. Compared with CART, Bagging with CART, and AdaBoost with CART, RUS-multiple and RUS-replacement improve the AUC value by about 10% on WEBSPAM UK-2006 and by about 25% on WEBSPAM UK-2007. Compared with several state-of-the-art baseline classification models, RUS-multiple and RUS-replacement achieve the best AUC values.
Authors: CHEN Musheng (陈木生), LU Xiaoyong (卢晓勇) (School of Information Engineering, Nanchang University, Nanchang, Jiangxi 330031, China; School of Software, Nanchang University, Nanchang, Jiangxi 330047, China)
Source: Journal of Computer Applications (《计算机应用》), indexed in CSCD and the Peking University Core Journals list, 2017, No. 2, pp. 535-539, 558 (6 pages)
Funding: Science and Technology Support Program of Jiangxi Province (20131102040039)
Keywords: Web spam detection; imbalanced classification; ensemble learning; under-sampling; Classification And Regression Tree (CART)
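
The pipeline summarized in the abstract (under-sample into several balanced sets, train one classifier per set, combine by simple voting) can be sketched roughly as follows. This is an illustrative reading, not the authors' implementation: the abstract does not specify how each sampler is built, so the assumption here is that RUS-once shuffles the majority class once and splits it into disjoint minority-sized chunks, RUS-multiple draws each set independently without replacement, and RUS-replacement draws each set with replacement. The names `rus_once`, `rus_multiple`, `rus_replacement`, `train_ensemble`, `predict`, and `fit_stump` are hypothetical, and `fit_stump` is only a one-feature threshold stand-in for a CART learner so that the example needs nothing beyond the Python standard library.

```python
import random
from collections import Counter

# A sample is a (feature, label) pair; label 1 = spam (minority), 0 = normal.

def rus_once(majority, minority, rng):
    """Assumed RUS-once: shuffle the majority class once, then cut it into
    disjoint chunks of minority size (leftover samples are dropped)."""
    pool = majority[:]
    rng.shuffle(pool)
    k = len(minority)
    chunks = [pool[i:i + k] for i in range(0, len(pool) - k + 1, k)]
    return [chunk + minority for chunk in chunks]

def rus_multiple(majority, minority, n_sets, rng):
    """Assumed RUS-multiple: each set is an independent draw without
    replacement, so sets may overlap but no set repeats a sample."""
    return [rng.sample(majority, len(minority)) + minority for _ in range(n_sets)]

def rus_replacement(majority, minority, n_sets, rng):
    """Assumed RUS-replacement: each set is drawn with replacement,
    so a majority sample may appear more than once within a set."""
    return [rng.choices(majority, k=len(minority)) + minority for _ in range(n_sets)]

def fit_stump(samples):
    """Hypothetical stand-in for CART: threshold halfway between class means."""
    xs0 = [x for x, y in samples if y == 0]
    xs1 = [x for x, y in samples if y == 1]
    threshold = (sum(xs0) / len(xs0) + sum(xs1) / len(xs1)) / 2
    return lambda x: 1 if x > threshold else 0

def train_ensemble(balanced_sets, fit):
    """Train one base classifier per balanced dataset."""
    return [fit(samples) for samples in balanced_sets]

def predict(models, x):
    """Simple voting: return the label predicted by most base classifiers."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

# Toy data with a 3:1 class ratio (made up for illustration only).
rng = random.Random(0)
normal = [(i / 100, 0) for i in range(60)]
spam = [(0.8 + i / 100, 1) for i in range(20)]

models = train_ensemble(rus_multiple(normal, spam, 5, rng), fit_stump)
print(predict(models, 0.10), predict(models, 0.95))
```

To get closer to the setup the abstract describes, `fit_stump` could be swapped for a real decision-tree learner such as scikit-learn's `DecisionTreeClassifier`. Using an odd number of balanced sets avoids ties in the vote.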
