
Study of the Web Spam Detection Based on the Support Vector Machine

Cited by: 5
Abstract: With the widespread use of search engines, some web pages deliberately deceive search engines in order to raise their ranking in the search results; such pages are called web spam. Web spam detection is generally treated as a binary classification problem: a classifier built with a machine learning classification algorithm assigns each page to one of two classes, Normal or Spam. Existing content-based detection models ignore the link relationships between web pages. This paper therefore constructs a soft-margin support vector machine classifier that takes the content features of web pages as the support vectors and defines a penalty function based on the observation that linked pages tend to be similar. A linear support vector machine web page classifier is learned from a sample set, and its classification performance is tested. The experimental results show that the support vector machine-based classifier performs significantly better than a decision tree classifier built on content features.
Source: Journal of Yunnan Minzu University: Natural Sciences Edition (《云南民族大学学报(自然科学版)》), CAS, 2011, No. 3, pp. 173-176 (4 pages)
Funding: National Natural Science Foundation of China (60903131); Scientific Research Fund of the Yunnan Provincial Department of Education (2010Y108)
Keywords: web spam; web spam detection; machine learning; web page classification; support vector machine
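
The abstract describes a linear soft-margin support vector machine whose penalty term reflects the similarity of linked pages, but it does not give the exact penalty function. The sketch below is therefore only one plausible reading, not the authors' method: it assumes the penalty is the squared difference between the classifier scores of two linked pages (a graph-regularization style term), and it trains the model with plain subgradient descent. The function names, hyperparameters (`lam`, `gamma`, `lr`, `epochs`), and the toy data are all hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def train_graph_svm(X, y, edges, lam=0.01, gamma=0.1, lr=0.05, epochs=500):
    """Linear soft-margin SVM with an assumed link penalty.

    X     : list of content-feature vectors, one per page
    y     : labels, +1 = spam, -1 = normal, 0 = unlabeled page
    edges : list of (i, j) index pairs for pages joined by a hyperlink

    Minimizes (assumed objective, not taken from the paper):
      lam/2 * ||w||^2
      + mean hinge loss over labeled pages
      + gamma * mean over edges of (score_i - score_j)^2
    """
    d = len(X[0])
    w, b = [0.0] * d, 0.0
    labeled = [i for i, label in enumerate(y) if label != 0]
    for _ in range(epochs):
        scores = [dot(w, x) + b for x in X]
        grad_w = [lam * wk for wk in w]  # gradient of the L2 regularizer
        grad_b = 0.0
        # Hinge-loss subgradient: only pages inside the margin contribute.
        for i in labeled:
            if y[i] * scores[i] < 1:
                for k in range(d):
                    grad_w[k] -= y[i] * X[i][k] / len(labeled)
                grad_b -= y[i] / len(labeled)
        # Link penalty: pull the scores of linked pages toward each other.
        # The bias b cancels in score_i - score_j, so only w is affected.
        for i, j in edges:
            diff = scores[i] - scores[j]
            for k in range(d):
                grad_w[k] += 2 * gamma * diff * (X[i][k] - X[j][k]) / len(edges)
        w = [wk - lr * gk for wk, gk in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Return +1 (spam) or -1 (normal) for one feature vector."""
    return 1 if dot(w, x) + b >= 0 else -1


# Made-up toy data: two content features per page, two unlabeled pages
# (indices 2 and 5) that are linked to labeled neighbors.
X = [[2.0, 0.1], [1.8, 0.3], [1.9, 0.2], [-2.0, 0.2], [-1.7, 0.1], [-1.9, 0.3]]
y = [1, 1, 0, -1, -1, 0]
edges = [(0, 2), (1, 2), (3, 5), (4, 5)]

w, b = train_graph_svm(X, y, edges)
print([predict(w, b, x) for x in X])
```

Setting `gamma=0` reduces the sketch to an ordinary content-only linear SVM, which makes it easy to compare the two variants on the same data.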

