Abstract
To speed up spam identification without significantly reducing its accuracy, a novel method for fast online spam identification is proposed. First, the concepts of a user's positive and negative interest sets are introduced, and emails are classified by combining these interest sets with a support vector machine. Second, following active learning theory, the sample density of the training set is combined with an improved angle diversity method to find the samples whose classification is most uncertain, and these samples are recommended to the user for labeling. Finally, the labeled samples and the most confidently classified samples are added to the training set, and a new sample value evaluation function is used to eliminate redundant samples and generate a new training set. Experiments show that the proposed method places a small labeling burden on users and identifies spam with high accuracy and speed, which makes it valuable for online application.
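The query-selection step can be illustrated with a small sketch. The abstract does not give the paper's density term or its improved angle diversity formula, so the code below substitutes the standard batch selection rule for a linear SVM, which trades off the margin |w·x + b| against the largest absolute cosine to samples already chosen. The names `select_queries`, `abs_cosine`, and the weight `lam` are hypothetical and are not taken from the paper.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def abs_cosine(u, v):
    """Absolute cosine of the angle between u and v (0.0 if either is zero)."""
    norm = math.sqrt(dot(u, u) * dot(v, v))
    return abs(dot(u, v)) / norm if norm > 0 else 0.0


def select_queries(w, b, pool, batch_size, lam=0.5):
    """Pick indices of unlabeled samples to recommend for user labeling.

    Assumption: a linear decision function f(x) = w.x + b stands in for
    the trained SVM. Each round picks the candidate that minimizes
    lam * |f(x)| + (1 - lam) * max |cos(x, already chosen)|,
    so samples near the boundary and unlike earlier picks are preferred.
    """
    margins = [abs(dot(w, x) + b) for x in pool]
    chosen = []
    candidates = set(range(len(pool)))
    while candidates and len(chosen) < batch_size:
        def score(i):
            redundancy = max(
                (abs_cosine(pool[i], pool[j]) for j in chosen), default=0.0
            )
            return lam * margins[i] + (1 - lam) * redundancy

        # sorted() makes tie-breaking deterministic (lowest index wins).
        best = min(sorted(candidates), key=score)
        chosen.append(best)
        candidates.remove(best)
    return chosen


if __name__ == "__main__":
    pool = [[0.1, 1.0], [0.12, 1.2], [0.3, -0.05], [2.0, 0.0]]
    print(select_queries([1.0, 0.0], 0.0, pool, batch_size=2))
```

Setting `lam=1.0` reduces the rule to plain margin-based uncertainty sampling, while smaller values penalize candidates that point in nearly the same direction as samples already selected.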
Source
Acta Electronica Sinica (《电子学报》)
EI
CAS
CSCD
Peking University Core Journal
2015, No. 10, pp. 1963-1970 (8 pages)
Funding
National Science and Technology Achievement Transformation Project (Caijian [2011] 329; Caijian [2012] 258)
Keywords
spam
user interest set
support vector machine
active learning
online application