Abstract
To address the slightly imbalanced classification problem in Web spam detection, three ensemble classifiers based on random under-sampling are proposed: Random Under-Sampling once without replacement (RUS-once), Random Under-Sampling multiple times without replacement (RUS-multiple), and Random Under-Sampling with replacement (RUS-replacement). First, the imbalanced training dataset is converted into several balanced datasets using one of the under-sampling techniques. Second, a Classification And Regression Tree (CART) classifier is trained on each balanced dataset. Finally, the CART classifiers are combined by simple voting into an ensemble classifier, which is used to classify the test samples. Experimental results show that all three random under-sampling ensemble classifiers achieve good classification results, and that RUS-multiple and RUS-replacement outperform RUS-once. Compared with CART, Bagging with CART, and Adaboost with CART, the AUC values of RUS-multiple and RUS-replacement increase by about 10% on WEBSPAM UK-2006 and by about 25% on WEBSPAM UK-2007; compared with several state-of-the-art baseline classification models, RUS-multiple and RUS-replacement achieve the best AUC values.
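The sampling and voting steps described in the abstract can be sketched roughly as follows. The abstract does not define the three schemes in detail, so the readings used here are assumptions: RUS-once shuffles the majority class once and cuts it into disjoint subsets, RUS-multiple draws each subset independently without replacement, and RUS-replacement draws each subset with replacement. All function names are hypothetical, and training a CART classifier on each balanced set is left out.

```python
import random
from collections import Counter


def rus_once(majority, size, n_subsets, rng):
    # Assumed reading: shuffle once, then cut disjoint chunks,
    # so no majority sample appears in more than one subset.
    pool = list(majority)
    rng.shuffle(pool)
    n_subsets = min(n_subsets, len(pool) // size)
    return [pool[i * size:(i + 1) * size] for i in range(n_subsets)]


def rus_multiple(majority, size, n_subsets, rng):
    # Assumed reading: each subset is an independent draw without
    # replacement, so subsets may overlap but have no internal repeats.
    return [rng.sample(list(majority), size) for _ in range(n_subsets)]


def rus_replacement(majority, size, n_subsets, rng):
    # Assumed reading: each subset is drawn with replacement,
    # so a sample may repeat within a subset.
    return [rng.choices(list(majority), k=size) for _ in range(n_subsets)]


def balanced_sets(majority, minority, sampler, n_subsets, seed=0):
    # Pair each under-sampled majority subset with the full minority class.
    rng = random.Random(seed)
    subsets = sampler(majority, len(minority), n_subsets, rng)
    return [subset + list(minority) for subset in subsets]


def simple_vote(predictions):
    # Simple voting: return the label predicted by the most base classifiers.
    return Counter(predictions).most_common(1)[0][0]
```

In a full pipeline, one tree would be trained per balanced set (for example with scikit-learn's `DecisionTreeClassifier`, a CART-style learner), and `simple_vote` would combine the per-tree predictions for each test sample.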
Authors
CHEN Musheng (陈木生)
LU Xiaoyong (卢晓勇)
Affiliations: School of Information Engineering, Nanchang University, Nanchang, Jiangxi 330031, China; School of Software, Nanchang University, Nanchang, Jiangxi 330047, China
Source
Journal of Computer Applications (《计算机应用》)
Indexed in: CSCD; Peking University Core Journals (北大核心)
2017, Issue 2, pp. 535-539, 558 (6 pages)
Funding
Jiangxi Provincial Science and Technology Support Program (20131102040039)
Keywords
Web spam detection
imbalanced classification
ensemble learning
under-sampling
Classification And Regression Tree (CART)