Abstract
Most online spam identification methods do not effectively distinguish how interested users are in different email content, which limits spam identification precision. This paper proposes a new online spam identification method based on the support vector machine (SVM). Drawing on traditional incremental learning and active learning theory, the method first randomly selects representative samples and, among them, finds the samples with the most uncertain classification for manual labeling. It then introduces the concept of user interest degree and proposes a new sample labeling model and a new criterion for evaluating algorithm performance. Finally, it uses the "roulette" method to add the labeled samples to the training set. Comparative experiments show that the proposed method achieves high spam identification precision, trains on samples quickly, and selects samples for labeling quickly, which makes it well suited to online use.
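The selection steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the abstract gives no formulas, so the function names, the use of the absolute SVM decision value as the uncertainty measure, and the selection weights (which a user interest score might supply) are all assumptions made for illustration.

```python
import random
from typing import Callable, List, Sequence


def pick_most_uncertain(pool: Sequence, decision: Callable[[object], float],
                        n_candidates: int, n_query: int,
                        rng: random.Random) -> List[int]:
    """Randomly draw candidate indices from the pool, then return those
    whose decision value is closest to 0 (assumed to mean "nearest the
    SVM hyperplane", i.e. most uncertain)."""
    candidates = rng.sample(range(len(pool)), min(n_candidates, len(pool)))
    candidates.sort(key=lambda i: abs(decision(pool[i])))
    return candidates[:n_query]


def roulette_select(weights: Sequence[float], k: int,
                    rng: random.Random) -> List[int]:
    """Roulette-wheel selection without replacement: each index is drawn
    with probability proportional to its weight (hypothetical weights;
    zero or negative weights are never selected)."""
    remaining = [i for i, w in enumerate(weights) if w > 0]
    chosen: List[int] = []
    while remaining and len(chosen) < k:
        total = sum(weights[i] for i in remaining)
        spin = rng.uniform(0, total)
        cumulative = 0.0
        for pos, i in enumerate(remaining):
            cumulative += weights[i]
            if spin <= cumulative:
                chosen.append(remaining.pop(pos))
                break
        else:
            # Guard against floating-point round-off: take the last slot.
            chosen.append(remaining.pop())
    return chosen
```

In this sketch, `decision` stands in for a trained SVM's signed distance to the separating hyperplane; a caller would label the samples returned by `pick_most_uncertain` and then pass per-sample weights to `roulette_select` to decide which labeled samples join the training set.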
Source
《华南理工大学学报(自然科学版)》
EI
CAS
CSCD
PKU Core (Peking University Core Journals)
2014, No. 7, pp. 21-27 (7 pages)
Journal of South China University of Technology (Natural Science Edition)
Funding
National Science and Technology Achievement Transformation Project (Caijian [2011] No. 329, Caijian [2012] No. 258)
Keywords
spam
support vector machines
incremental learning
active learning
user interest