Study of Web Spam Detection Based on the Support Vector Machine

Cited by: 5

Abstract: With the widespread use of search engines, some web pages cheat the search engines in order to raise their rankings in the search results. Such pages are called web spam. Web spam detection is generally treated as a binary classification problem: a classifier built with a machine learning classification algorithm divides web pages into two classes, normal and spam. Existing content-based detection models ignore the link relationships between web pages. This paper therefore constructs a soft-margin support vector machine classifier that takes the content features of web pages as the support vectors and defines a penalty function based on the observation that linked web pages tend to be similar. A linear support vector machine web page classifier is learned from the sample set, and its classification performance is tested. The experimental results show that the support vector machine-based classifier performs significantly better than a decision tree classifier built from content features.
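The abstract does not give the exact form of the link-based penalty, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes a linear soft-margin SVM trained by subgradient descent, with an added term that penalizes the squared difference between the scores of two linked pages (a common graph-regularization form). The function names, the hyperparameters `lam`, `gamma`, `lr`, `epochs`, and the toy data are all hypothetical.

```python
import numpy as np


def train_link_regularized_svm(X, y, edges, lam=0.01, gamma=0.1, lr=0.05, epochs=500):
    """Subgradient descent on: hinge loss + L2 penalty + link-smoothness penalty.

    X: (n, d) content-feature matrix; y: (n,) labels in {-1, +1};
    edges: list of (i, j) index pairs for pages joined by a hyperlink.
    The link penalty (assumed form) is the mean of (w.x_i - w.x_j)^2 over edges.
    """
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    src = np.array([i for i, _ in edges], dtype=int)
    dst = np.array([j for _, j in edges], dtype=int)
    diff = X[src] - X[dst]  # feature differences across each link

    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1  # samples inside the margin or misclassified
        # Subgradient of the averaged hinge loss plus the L2 term.
        grad_w = lam * w - (y[viol][:, None] * X[viol]).sum(axis=0) / n
        grad_b = -y[viol].sum() / n
        if len(edges) > 0:
            gap = diff @ w  # score difference between linked pages
            grad_w += 2.0 * gamma * (diff.T @ gap) / len(edges)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def predict(X, w, b):
    """Return +1 (spam) or -1 (normal) for each row of X."""
    return np.where(X @ w + b >= 0, 1, -1)


# Hypothetical toy data: two content features per page, +1 = spam, -1 = normal.
X_demo = np.array([[2.0, 2.0], [3.0, 2.0], [2.0, 3.0],
                   [-2.0, -2.0], [-3.0, -2.0], [-2.0, -3.0]])
y_demo = np.array([1, 1, 1, -1, -1, -1])
edges_demo = [(0, 1), (1, 2), (3, 4), (4, 5)]  # links stay within each class

w_demo, b_demo = train_link_regularized_svm(X_demo, y_demo, edges_demo)
pred_demo = predict(X_demo, w_demo, b_demo)
```

Setting `gamma=0` reduces the sketch to a plain linear soft-margin SVM on content features alone, which makes it easy to compare the two settings on the same data.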

Authors: Jia Zhiyang (贾志洋), Li Weiwei (李伟伟), Gao Wei (高炜), Xia Youming (夏幼明)

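Affiliations: Tourism and Culture College, Yunnan University; Department of Computer Science, Ningde Vocational and Technical College; School of Information, Yunnan Normal University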

Source: Journal of Yunnan Minzu University: Natural Sciences Edition (云南民族大学学报（自然科学版）), CAS, 2011, No. 3, pp. 173-176 (4 pages)

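Funding: National Natural Science Foundation of China (60903131); Scientific Research Fund of the Yunnan Provincial Department of Education (2010Y108)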

Keywords: web spam; web spam detection; machine learning; web page classification; support vector machine

Classification number: TP391.3 [Automation and Computer Technology: Computer Application Technology]


