Abstract
Gray-industry services are increasingly rampant on the Internet, and traditional webpage filtering algorithms cannot filter the webpages of such services accurately and efficiently. To address this problem, we propose an improved method of webpage feature selection and weight calculation based on TF*IDF, and then classify webpages using Factorization Machines. Taking surrogacy websites as an example, we conduct experiments and evaluations. The experimental results show that the method achieves a precision of 98.89% and a recall of 98.63%, and that filtering a massive number of webpages completes in linear time, which greatly improves both the accuracy and the efficiency of filtering gray-industry service information.
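The abstract names two building blocks, TF*IDF term weighting and Factorization Machines, without giving their formulas. The sketch below is only an illustration of the textbook versions of both: it is not the improved weighting scheme the authors propose, whose details are not stated here. The function names (`tfidf_vectors`, `fm_score`, `fm_score_naive`) and all parameter values are hypothetical. It also shows one well-known reason a Factorization Machine can score a page in time linear in the number of features: the pairwise interaction term can be rewritten so that no explicit loop over feature pairs is needed.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Textbook TF*IDF (NOT the paper's improved variant).

    docs: list of non-empty token lists. Returns (vocab, dense vectors).
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vocab = sorted(df)
    index = {term: i for i, term in enumerate(vocab)}
    vectors = []
    for doc in docs:
        vec = [0.0] * len(vocab)
        for term, count in Counter(doc).items():
            tf = count / len(doc)
            idf = math.log(n_docs / df[term])
            vec[index[term]] = tf * idf
        vectors.append(vec)
    return vocab, vectors


def fm_score(x, w0, w, V):
    """Second-order FM score in O(k * n) time.

    Uses the identity
      sum_{i<j} <V_i, V_j> x_i x_j
        = 0.5 * sum_f [(sum_i V_if x_i)^2 - sum_i (V_if x_i)^2].
    """
    linear = sum(wi * xi for wi, xi in zip(w, x))
    pairwise = 0.0
    for f in range(len(V[0])):
        s = sum(V[i][f] * x[i] for i in range(len(x)))
        s_sq = sum((V[i][f] * x[i]) ** 2 for i in range(len(x)))
        pairwise += 0.5 * (s * s - s_sq)
    return w0 + linear + pairwise


def fm_score_naive(x, w0, w, V):
    """Same score with an explicit O(k * n^2) loop over feature pairs."""
    n = len(x)
    total = w0 + sum(w[i] * x[i] for i in range(n))
    for i in range(n):
        for j in range(i + 1, n):
            dot = sum(a * b for a, b in zip(V[i], V[j]))
            total += dot * x[i] * x[j]
    return total
```

In practice the parameters `w0`, `w`, and `V` would be learned from labeled pages, and the sign or a threshold on the score would decide whether a page is filtered; training is omitted from this sketch.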
Authors
FU Qiang (付强); PEI Pei (裴佩); DING Yong-gang (丁永刚)
(Wuhan Marine Communication Institute, Wuhan 430072, China; School of Computer, Central China Normal University, Wuhan 430079, China; School of Education, Hubei University, Wuhan 430062, China)
Source
Software Guide (《软件导刊》)
2019, No. 9, pp. 150-153, 157 (5 pages)
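Funding
Project of the Performance Evaluation and Management Research Center, a Key Research Base of Humanities and Social Sciences of Hubei Higher Education Institutions (2015JX01)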
Keywords
gray industry service
webpage filtering
feature selection
weight calculation
factorization machines