Abstract
Extracting the main content of web pages is an important step in Web mining. Traditional extraction methods first segment a page into blocks and then identify which blocks hold the main content. This paper proposes a method that instead combines the content similarity and the tag similarity of text lines. It avoids the block-segmentation step of traditional algorithms: after the page is normalized, the largest text line is extracted, the content similarity and tag similarity between each text line and that largest line are computed, and the two similarities are combined to extract the main content. In experiments on randomly sampled web pages, the accuracy of the method was close to 95%, which indicates that it is effective in practice.
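The pipeline described in the abstract can be illustrated with a minimal Python sketch. The abstract does not say how the two similarities are defined or combined, so everything concrete here is an assumption made for illustration: Jaccard similarity over word sets for content, Jaccard similarity over tag-name sets for tags, a weighted sum with weight `alpha`, and a cut-off `threshold`. The function names are hypothetical, and the input is assumed to be already normalized to one block-level element per line.

```python
import re
from typing import List, Set

# Captures the tag name of opening and closing tags, e.g. "p" from "<p>" or "</p>".
TAG_NAME_RE = re.compile(r"<\s*/?\s*([a-zA-Z][a-zA-Z0-9]*)[^>]*>")
ANY_TAG_RE = re.compile(r"<[^>]+>")


def line_text(line: str) -> str:
    """Return the visible text of one HTML line with all tags removed."""
    return ANY_TAG_RE.sub(" ", line).strip()


def line_tags(line: str) -> Set[str]:
    """Return the set of lowercase tag names that occur in one HTML line."""
    return {name.lower() for name in TAG_NAME_RE.findall(line)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def extract_main_text(lines: List[str], alpha: float = 0.5,
                      threshold: float = 0.3) -> List[str]:
    """Keep the lines whose combined similarity to the largest text line
    reaches the threshold. alpha and threshold are illustrative values,
    not taken from the paper."""
    if not lines:
        return []
    texts = [line_text(line) for line in lines]
    # Step 1: the "largest text line" is taken to be the one with the most text.
    anchor = max(range(len(lines)), key=lambda i: len(texts[i]))
    anchor_words = set(texts[anchor].lower().split())
    anchor_tags = line_tags(lines[anchor])

    kept = []
    for line, text in zip(lines, texts):
        if not text:
            continue  # lines without visible text cannot be main content
        # Step 2: content similarity and tag similarity against the anchor line.
        content_sim = jaccard(set(text.lower().split()), anchor_words)
        tag_sim = jaccard(line_tags(line), anchor_tags)
        # Step 3: combine the two similarities (assumed weighted sum).
        score = alpha * content_sim + (1 - alpha) * tag_sim
        if score >= threshold:
            kept.append(text)
    return kept


if __name__ == "__main__":
    sample = [
        '<div><a href="/">Home</a> <a href="/news">News</a></div>',
        "<p>Web page content extraction is an important step in web mining and search.</p>",
        "<p>The method compares each line with the largest text line in the page.</p>",
        "<div><span>Copyright 2010</span></div>",
    ]
    for paragraph in extract_main_text(sample):
        print(paragraph)
```

On the sample page, the navigation and copyright lines share neither words nor tags with the largest line and are dropped, while the second paragraph is kept mainly because its tag set matches. This illustrates why tag similarity is a useful complement to content similarity: paragraphs of the same article often share little vocabulary.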
Source
《西南科技大学学报》
CAS
2010, Issue 1, pp. 80-84 (5 pages)
Journal of Southwest University of Science and Technology
Funding
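Start-up Fund for Returned Overseas Scholars, Ministry of Personnel of China (07ZD0105)
Start-up Fund for Returned Overseas Scholars, Southwest University of Science and Technology (07ZX0102)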