Abstract
This paper presents the preprocessing stage of a spam filtering system. An information gain feature selection algorithm is implemented, and comparative experiments are used to determine a suitable feature dimension for each corpus in the PU series. Feature weights are computed with the TF-IDF (term frequency-inverse document frequency) formula, and the standard e-mail corpus is then converted into a vector space model representation that a support vector machine can process directly.
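The pipeline named in the abstract (information gain ranking, then TF-IDF weighting into a vector space model) can be sketched roughly as follows. This is an illustration under assumptions, not the paper's implementation: the function names, the binary term-presence definition of information gain, and the unsmoothed tf * ln(N/df) weighting are all choices made here, since the abstract does not give the exact formulas. Each document is assumed to be a list of tokens.

```python
import math
from collections import Counter


def information_gain(docs, labels, term):
    """Information gain of a binary 'term present / absent' feature.

    Assumption: IG = H(class) - H(class | term presence), in bits.
    """
    def entropy(counts):
        total = sum(counts)
        return -sum(c / total * math.log2(c / total) for c in counts if c)

    n = len(docs)
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]

    h_class = entropy(list(Counter(labels).values()))
    h_cond = 0.0
    for part in (with_term, without_term):
        if part:  # skip an empty split to avoid dividing by zero
            h_cond += len(part) / n * entropy(list(Counter(part).values()))
    return h_class - h_cond


def select_features(docs, labels, k):
    """Return the k terms with the highest information gain."""
    vocab = sorted({t for d in docs for t in d})
    ranked = sorted(vocab,
                    key=lambda t: information_gain(docs, labels, t),
                    reverse=True)
    return ranked[:k]


def tfidf_vectors(docs, features):
    """Map each document to a TF-IDF vector over the selected features.

    Assumption: weight = raw term frequency * ln(N / document frequency).
    """
    n = len(docs)
    df = {t: sum(1 for d in docs if t in d) for t in features}
    vectors = []
    for d in docs:
        tf = Counter(d)
        vectors.append([tf[t] * math.log(n / df[t]) if df[t] else 0.0
                        for t in features])
    return vectors
```

The feature dimension k is the quantity the paper tunes experimentally per corpus; here it is simply a parameter. The resulting vectors would still need to be written out in whatever input format the chosen support vector machine implementation expects.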
Source
Journal of Hubei University of Automotive Technology (《湖北汽车工业学院学报》)
2007, No. 3, pp. 40-43 (4 pages)
Keywords
spam filtering
preprocessing
feature selection