Abstract
This paper studies web page extraction and proposes a new method for extracting the main content of a web page. The method uses the page's layout features and its visual hot zone to locate the main content. First, a part of the page region is selected as the visual hot zone, and candidate content blocks are obtained through the Document Object Model (DOM). On this basis, a significance function over the candidate content blocks is defined and used to determine the main content of the page. Experimental results indicate that the proposed method performs well.
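The abstract does not give the form of the significance function or the way the hot zone is chosen, so the following is only a minimal sketch of the general idea, under stated assumptions. It assumes each candidate block already carries a bounding box (for example, obtained from a rendering engine for a DOM subtree) plus text and link-text character counts. The names `Rect`, `Block`, `significance`, `pick_content`, the hot-zone coordinates, and the scoring formula (hot-zone coverage multiplied by non-link text volume) are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Axis-aligned rectangle in page pixels: top-left corner plus size."""
    x: float
    y: float
    w: float
    h: float

    def area(self) -> float:
        return max(self.w, 0.0) * max(self.h, 0.0)

    def overlap(self, other: "Rect") -> float:
        """Area of the intersection of two rectangles (0 if they are disjoint)."""
        dx = min(self.x + self.w, other.x + other.w) - max(self.x, other.x)
        dy = min(self.y + self.h, other.y + other.h) - max(self.y, other.y)
        return dx * dy if dx > 0 and dy > 0 else 0.0


@dataclass(frozen=True)
class Block:
    """A candidate content block, e.g. one DOM subtree such as a <div>."""
    name: str
    box: Rect
    text_chars: int  # characters of visible text in the block
    link_chars: int  # characters of that text that sit inside <a> elements


def significance(block: Block, hot_zone: Rect) -> float:
    """Hypothetical score: fraction of the block inside the hot zone,
    multiplied by the amount of non-link text it holds."""
    if block.box.area() == 0 or block.text_chars == 0:
        return 0.0
    coverage = block.box.overlap(hot_zone) / block.box.area()
    non_link = max(block.text_chars - block.link_chars, 0)
    return coverage * non_link


def pick_content(blocks, hot_zone):
    """Return the highest-scoring block, or None if there are no blocks."""
    return max(blocks, key=lambda b: significance(b, hot_zone), default=None)


# Assumed hot zone: a central region of a 1000 px wide page (illustrative only).
HOT_ZONE = Rect(x=200, y=150, w=600, h=600)

# Made-up blocks standing in for DOM subtrees of a typical article page.
BLOCKS = [
    Block("nav", Rect(0, 0, 1000, 100), text_chars=300, link_chars=300),
    Block("sidebar", Rect(0, 100, 200, 800), text_chars=500, link_chars=400),
    Block("article", Rect(200, 100, 600, 800), text_chars=4000, link_chars=100),
]

best = pick_content(BLOCKS, HOT_ZONE)
print(best.name)
```

In this toy layout the navigation bar and sidebar lie outside the assumed hot zone and score zero, so the text-heavy central block is selected. A real implementation would need the paper's actual hot-zone definition and significance function in place of these stand-ins.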
Source
《计算机应用与软件》
CSCD
Peking University Core Journal (北大核心)
2012, No. 6, pp. 199-201 (3 pages)
Computer Applications and Software
Keywords
Layout features
Visual hot zone
Document object model
Candidate content blocks
Significance function