Abstract
The naive Bayes (NB) algorithm consumes large amounts of system and network resources during the training stage that precedes classification, which seriously reduces classification efficiency. To address this, a tree structure is proposed to replace the array that NB uses to maintain the occurrence counts of feature words in the training samples. NB also classifies poorly when mail samples have a large number of attributes; to address this, the root of each feature word's conditional probability is taken, which increases the system's sensitivity to high-frequency words. Experimental results show that, compared with the NB algorithm, the improved algorithm performs better in training time, precision, and F-measure. The misclassification rate for spam is lowered by adjusting the root order z, and the experiments found that the classification performance indicators all reach a satisfactory level when z = 3.
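The two ideas in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which tree structure is used, so a character-level trie is assumed here, and the "rooting" step is assumed to mean replacing each conditional probability P(w|c) with P(w|c)^(1/z). The class names `WordCountTrie` and `RootedNaiveBayes`, the Laplace smoothing, and the toy training data are all hypothetical.

```python
import math
from collections import defaultdict


class TrieNode:
    """One node of a character-level trie (assumed tree structure)."""
    __slots__ = ("children", "counts")

    def __init__(self):
        self.children = {}               # char -> TrieNode
        self.counts = defaultdict(int)   # label -> occurrences of the word ending here


class WordCountTrie:
    """Stores per-class occurrence counts of feature words in a trie."""

    def __init__(self):
        self.root = TrieNode()

    def add(self, word, label):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.counts[label] += 1

    def count(self, word, label):
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return 0
        return node.counts.get(label, 0)


class RootedNaiveBayes:
    """Multinomial NB whose conditional probabilities are z-th rooted (assumption)."""

    def __init__(self, z=3):
        self.z = z
        self.trie = WordCountTrie()
        self.doc_counts = defaultdict(int)   # label -> number of training mails
        self.word_totals = defaultdict(int)  # label -> total feature words seen
        self.vocab = set()

    def train(self, words, label):
        self.doc_counts[label] += 1
        for w in words:
            self.trie.add(w, label)
            self.word_totals[label] += 1
            self.vocab.add(w)

    def log_score(self, words, label):
        n_docs = sum(self.doc_counts.values())
        score = math.log(self.doc_counts[label] / n_docs)
        denom = self.word_totals[label] + len(self.vocab)
        for w in words:
            p = (self.trie.count(w, label) + 1) / denom  # Laplace smoothing
            score += math.log(p) / self.z                # log(p ** (1 / z))
        return score

    def predict(self, words):
        return max(self.doc_counts, key=lambda c: self.log_score(words, c))


# Toy usage with made-up training mails.
clf = RootedNaiveBayes(z=3)
clf.train(["free", "prize", "click"], "spam")
clf.train(["meeting", "agenda", "project"], "ham")
print(clf.predict(["free", "prize"]))
```

In log space, taking the z-th root amounts to dividing each word's log-likelihood by z, so the class prior carries relatively more weight as z grows. With z = 1 the sketch reduces to ordinary multinomial NB.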
Authors
王鹿
李志伟
朱成德
李永久
WANG Lu; LI Zhiwei; ZHU Chengde; LI Yongjiu (School of Electronic and Electrical Engineering, Shanghai University of Engineering Science, Shanghai 201620, China; School of Materials Engineering, Shanghai University of Engineering Science, Shanghai 201620, China)
Source
《传感器与微系统》
CSCD
2020, No. 9, pp. 46-48, 52 (4 pages in total)
Transducer and Microsystem Technologies
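Funding
National Natural Science Foundation of China (61705127)
Shanghai Municipal Commission of Economy and Informatization, Special Fund for Industrial Transformation and Upgrading, Industry-Research Cooperation Project (沪CXY-2016-009).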