Abstract
To improve the accuracy of phishing website identification, this paper analyzes the uniform resource locator (URL) text data of phishing websites, combines it with the network topology features formed by the link relationships within phishing websites, and proposes FAUFL, a phishing website identification algorithm based on URL text features and link relationships. The algorithm works as follows. With URL text features as input, a random forest algorithm generates a phishing discriminator based on URL text features. With link relationships as input, related web page groups are constructed, and a related-web-page-group algorithm based on maximum-flow cutting generates a phishing discriminator based on link relationships. The results of these two discriminators are then taken as input and evaluated further with the Bagging algorithm. Test results show that FAUFL achieves an identification accuracy of 99.2%, which is 3.9% higher than the algorithm based on URL text features and 5.0% higher than the algorithm based on link relationships.
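The abstract does not say how the link graph is built or how the cut is chosen, so the following is only a minimal sketch of the general idea behind a maximum-flow cut over link relationships, not the FAUFL implementation. It assumes links are undirected with unit capacity, that one known phishing page serves as the seed and one known benign page as the sink, and it uses hypothetical names (`build_capacity`, `related_page_group`) and made-up page labels. Pages that remain reachable from the seed after the maximum flow is found are treated as the related page group.

```python
from collections import deque


def build_capacity(links):
    """Turn (page_a, page_b) link pairs into a symmetric unit-capacity graph."""
    capacity = {}
    for a, b in links:
        capacity.setdefault(a, {})
        capacity.setdefault(b, {})
        capacity[a][b] = capacity[a].get(b, 0) + 1
        capacity[b][a] = capacity[b].get(a, 0) + 1
    return capacity


def related_page_group(links, seed, sink):
    """Return (max_flow, pages on the seed side of a minimum cut)."""
    residual = build_capacity(links)
    flow = 0
    while True:
        # Breadth-first search for a shortest augmenting path (Edmonds-Karp).
        parent = {seed: None}
        queue = deque([seed])
        while queue and sink not in parent:
            node = queue.popleft()
            for nxt, cap in residual[node].items():
                if cap > 0 and nxt not in parent:
                    parent[nxt] = node
                    queue.append(nxt)
        if sink not in parent:
            # No augmenting path is left: the pages reached by the search
            # form the seed side of a minimum cut.
            return flow, set(parent)
        # Find the bottleneck capacity along the path found.
        bottleneck = float("inf")
        node = sink
        while parent[node] is not None:
            bottleneck = min(bottleneck, residual[parent[node]][node])
            node = parent[node]
        # Push the bottleneck amount of flow along the path.
        node = sink
        while parent[node] is not None:
            prev = parent[node]
            residual[prev][node] -= bottleneck
            residual[node][prev] += bottleneck
            node = prev
        flow += bottleneck


# Hypothetical toy graph: a phishing cluster (p1..p3) and a benign
# cluster (b1..b3) joined by a single link between p3 and b1.
links = [
    ("p1", "p2"), ("p2", "p3"), ("p1", "p3"),
    ("p3", "b1"),
    ("b1", "b2"), ("b2", "b3"), ("b1", "b3"),
]
flow, group = related_page_group(links, seed="p1", sink="b3")
print(flow, sorted(group))  # prints: 1 ['p1', 'p2', 'p3']
```

In this toy graph the single link between the two clusters is the minimum cut, so the seed's group is exactly the densely linked phishing cluster. A real system would need a principled way to pick seeds, sinks, and capacities, which the abstract leaves unspecified.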
Authors
Zhao Dunyu (赵蹲宇)
Zhang Zhaoxin (张兆心)
School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001
Source
Chinese High Technology Letters (《高技术通讯》)
Peking University Core Journal
2017, No. 8, pp. 708-717 (10 pages)
Funding
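Supported by:
National Key Research and Development Program of China (SQ2017YFGX110125-01)
National Natural Science Foundation of China (61370215, 61370211, 61402137)
National Key Technology R&D Program of China (2012BAH45B01)
National Information Security Program (2017A065, 2017A111)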