Abstract
Because HTML is weak at self-description, web pages inevitably contain a large amount of noise. Based on an analysis of the HTML document structure of web pages and of the types of noise they contain, this article presents a method for extracting text information from web pages and suppressing noise, together with its implementation. It also tentatively uses the signal-to-noise ratio (SNR) as the criterion for judging the quality of text extraction and de-noising results. Experimental results show that the extraction and de-noising effect is evident, and that SNR can serve as a criterion for evaluating web page de-noising results.
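The abstract does not say how the article computes SNR, so the following is only an illustrative sketch under stated assumptions, not the authors' method. It assumes SNR is the decibel ratio of signal characters to noise characters in the extracted text, and that `script` and `style` content is always noise. The names `TextExtractor`, `extract_text`, and `snr_db` are hypothetical.

```python
import math
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text chunks, skipping tags assumed to hold only noise."""

    # Assumption: these tags never carry readable page content.
    NOISE_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.NOISE_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.NOISE_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    """Return the list of non-empty text chunks found in an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


def snr_db(signal_chars, noise_chars):
    """Assumed definition: 10 * log10(signal / noise), in decibels."""
    if signal_chars == 0:
        return float("-inf")
    if noise_chars == 0:
        return float("inf")
    return 10 * math.log10(signal_chars / noise_chars)


if __name__ == "__main__":
    page = (
        "<html><head><style>p{}</style><script>var a=1;</script></head>"
        "<body><div>Home | Login</div><p>Main story text here.</p></body></html>"
    )
    chunks = extract_text(page)
    # Hand-labelled ground truth for this toy page (an assumption).
    signal = sum(len(c) for c in chunks if c == "Main story text here.")
    noise = sum(len(c) for c in chunks) - signal
    print(chunks, round(snr_db(signal, noise), 2))
```

Under this assumed definition, a de-noising step that removes more navigation or advertising text while keeping the main text raises the SNR, which is what would make it usable for comparing results.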
Source
Microcomputer Applications (《微计算机应用》)
2007, No. 9, pp. 921-924 (4 pages)
Keywords
Signal-to-noise ratio
Information extraction
Web page de-noising