Chinese Spam Filtering Based on Stacked Denoising Autoencoders (基于堆叠式降噪自编码器的中文垃圾邮件过滤)

Cited by: 3
Abstract: Traditional feature selection methods applied to Chinese spam filtering extract features unclearly and filter with low accuracy. To address this, the paper proposes a Chinese spam filtering method based on the Stacked Denoising Autoencoder (SDA). First, the preprocessed corpus is trained with the Continuous Bag-of-Words (CBOW) model from the Word2vec toolkit to obtain word vectors. Next, with the word vectors as input, a stacked denoising autoencoder deep network extracts features through unsupervised learning. Finally, an improved Softmax classifier is used for supervised fine-tuning of the network. The method is tested on the TREC06C dataset, with accuracy, precision, recall, and the F1 score (which better reflects binary classification performance) as the evaluation criteria. Compared against a Bayesian model, the KNN classification algorithm, SVM, and the traditional stacked denoising autoencoder, the method reaches an accuracy of 93.5%, a precision of 94.8%, a recall of 92%, and an F1 score of 93.2%, showing better binary classification performance and robustness in Chinese spam filtering.
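The four evaluation criteria named in the abstract can be computed from the counts of a binary confusion matrix. The following is a minimal sketch, not the authors' evaluation code: the function name `binary_metrics`, the choice of spam as the positive class (label 1), and the toy label lists are all assumptions made for illustration, and the numbers it produces are unrelated to the TREC06C results reported above.

```python
from dataclasses import dataclass


@dataclass
class BinaryMetrics:
    accuracy: float
    precision: float
    recall: float
    f1: float


def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for one positive class.

    Hypothetical helper: `positive` marks the spam label (an assumption).
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("inputs must not be empty")

    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1  # spam correctly flagged
        elif pred == positive:
            fp += 1  # legitimate mail flagged as spam
        elif truth == positive:
            fn += 1  # spam that slipped through
        else:
            tn += 1  # legitimate mail correctly passed

    accuracy = (tp + tn) / len(y_true)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return BinaryMetrics(accuracy, precision, recall, f1)


# Toy labels (made up): 1 = spam, 0 = legitimate mail.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Because F1 is the harmonic mean of precision and recall, it stays low whenever either one is low. This is one reason it is often preferred over accuracy alone for a two-class task such as spam filtering.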
Authors: ZHANG Liu-yan (张柳艳); NIE Yun-feng (聂云峰); DUAN Sheng-yue (段生月); ZHANG Gui-chang (张贵昌) (School of Information Engineering, Nanchang HangKong University, Nanchang 330063, China)
Source: Mathematics in Practice and Theory (《数学的实践与认识》), PKU Core Journal, 2020, No. 1, pp. 105-114 (10 pages)
Funding: National Natural Science Foundation of China (41101426).
Keywords: Chinese spam; stacked denoising autoencoder; unsupervised learning; word vectors
