Abstract
This paper presents a content-rule-based algorithm for removing noise from Web pages. It has two parts. First, the authors propose an iterative algorithm that compares tables at the same level of the page's table tree and peels off noise content layer by layer. Second, to judge how relevant a page's anchor text is to the page's topic, they propose an algorithm that computes the topical similarity of anchor text using a modified edit distance, which takes the semantics of the page into account to some extent. The authors report that the algorithm is more accurate than a purely rule-based approach while keeping time complexity low. Experimental results indicate that it performs well when purifying large volumes of Web pages.
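The abstract does not specify how the edit distance is modified, so the following is only a minimal sketch of the underlying idea: it uses the standard Levenshtein distance and normalizes it into a similarity score between an anchor text and a page topic string. The function names `levenshtein` and `topic_similarity` and the length-based normalization are assumptions made for illustration, not the authors' method.

```python
def levenshtein(a: str, b: str) -> int:
    """Standard edit distance: insert, delete, substitute, each with cost 1."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop to save memory
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def topic_similarity(anchor_text: str, page_topic: str) -> float:
    """Hypothetical normalization (an assumption, not from the paper):
    1.0 means identical strings, 0.0 means nothing in common by position."""
    longest = max(len(anchor_text), len(page_topic))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(anchor_text, page_topic) / longest
```

Under this sketch, an anchor whose similarity to the page topic falls below some chosen threshold would be a candidate for removal as noise; the threshold itself would have to be tuned and is not given in the source.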
Source
《现代图书情报技术》
CSSCI
PKU Core (Peking University Core Journals)
2008, No. 3, pp. 51-54 (4 pages)
New Technology of Library and Information Service
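Funding
One of the research outputs of the National Key Technology R&D Program project "Research and Implementation of the Integration and Service System of Knowledge Organization Systems" (Grant No. 2006BAH03B03-01)
Keywords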
Web page purification (noise reduction in Web pages)
Edit distance (Levenshtein distance)