Abstract
As search engines have become widely used, some web pages try to deceive them in order to rank higher in the search results; such pages are called web spam. Web spam detection is generally treated as a binary classification problem: a classifier built with a machine learning classification algorithm divides web pages into two classes, normal and spam. Existing content-based detection models ignore the link relationships between web pages. This paper therefore builds a soft-margin support vector machine classifier that takes the content features of web pages as the support vectors and defines a penalty function based on the observation that linked pages tend to be similar. A linear support vector machine classifier is learned from a sample set, and its classification performance is tested. The experimental results show that the support vector machine-based classifier performs significantly better than a decision tree classifier built on content features.
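The abstract does not give the exact objective, so the following is only a minimal sketch of one plausible reading: a linear soft-margin SVM on content feature vectors, with an added penalty that discourages linked pages from receiving very different scores. The objective used here, 0.5*||w||^2 + C * sum of hinge losses + gamma * sum over links of (f(x_i) - f(x_j))^2, the subgradient-descent training loop, the names `train_link_svm` and `predict`, all parameter values, and the two toy features are assumptions made for illustration and are not taken from the paper.

```python
from typing import List, Tuple

Vector = List[float]


def score(w: Vector, b: float, x: Vector) -> float:
    """Linear decision value f(x) = w . x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def train_link_svm(
    X: List[Vector],
    y: List[int],                  # +1 = spam, -1 = normal (assumed encoding)
    links: List[Tuple[int, int]],  # (i, j): page i and page j are linked
    C: float = 1.0,                # weight of the hinge (soft-margin) loss
    gamma: float = 0.1,            # weight of the assumed link penalty
    lr: float = 0.01,
    epochs: int = 500,
) -> Tuple[Vector, float]:
    """Subgradient descent on the assumed link-penalized SVM objective."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = list(w)  # gradient of 0.5 * ||w||^2
        grad_b = 0.0
        f = [score(w, b, x) for x in X]
        # Hinge loss: only pages inside the margin contribute a subgradient.
        for xi, yi, fi in zip(X, y, f):
            if yi * fi < 1.0:
                for k in range(n_features):
                    grad_w[k] -= C * yi * xi[k]
                grad_b -= C * yi
        # Link penalty: pull the scores of linked pages toward each other.
        # The bias cancels in f[i] - f[j], so only w is affected.
        for i, j in links:
            diff = f[i] - f[j]
            for k in range(n_features):
                grad_w[k] += 2.0 * gamma * diff * (X[i][k] - X[j][k])
        w = [wk - lr * gk for wk, gk in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w: Vector, b: float, x: Vector) -> int:
    """Return +1 (spam) or -1 (normal)."""
    return 1 if score(w, b, x) >= 0 else -1


# Toy data with two made-up content features per page.
X = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
y = [1, 1, -1, -1]
links = [(0, 1), (2, 3)]  # hypothetical links between similar pages

w, b = train_link_svm(X, y, links)
predictions = [predict(w, b, x) for x in X]
```

Setting `gamma` to 0 reduces the sketch to an ordinary linear soft-margin SVM that uses content features only, which makes it easy to see what the link term contributes.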
Source
《云南民族大学学报(自然科学版)》
CAS
2011, No. 3, pp. 173-176 (4 pages)
Journal of Yunnan Minzu University:Natural Sciences Edition
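Funding
National Natural Science Foundation of China (60903131)
Scientific Research Fund of the Yunnan Provincial Department of Education (2010Y108)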
Keywords
web spam
web spam detection
machine learning
web page classification
support vector machine