Web Page Text Information Extraction and Result Estimation (original title: 网页文本信息提取及结果评价)
Cited by: 10
Abstract: Because HTML is weak at describing its own content, Web pages inevitably contain a large amount of noise. Based on an analysis of HTML document structure and the types of noise found in Web pages, this article presents a method for extracting the text information of a Web page and suppressing noise, and describes how the method is implemented. It also tentatively uses the signal-to-noise ratio (SNR) as the basis for judging the quality of text extraction and de-noising results. The experimental results show that the extraction and de-noising effect is clear, and that SNR can serve as a criterion for judging the quality of Web page de-noising results.
Source: Microcomputer Applications (《微计算机应用》), 2007, No. 9, pp. 921-924 (4 pages)
Keywords: signal-to-noise ratio (SNR); information extraction; Web page de-noising
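
The abstract does not give the article's SNR formula, so the following is only an illustrative sketch of how an SNR-style score for extracted text might be computed. It assumes, hypothetically, that signal is the number of characters in text chunks known to belong to the main content (for example, hand-labelled), that noise is the number of characters in all other extracted chunks, and that the score is reported in decibels. The names `TextExtractor`, `extract_text`, and `snr_db` are made up for this example and are not taken from the article.

```python
import math
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text chunks, skipping <script> and <style> content."""

    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self._skip_depth == 0 and text:
            self.chunks.append(text)


def extract_text(html):
    """Return the list of non-empty visible text chunks in an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


def snr_db(chunks, signal_chunks):
    """Assumed definition: 10 * log10(signal chars / noise chars)."""
    signal = sum(len(c) for c in chunks if c in signal_chunks)
    noise = sum(len(c) for c in chunks if c not in signal_chunks)
    if noise == 0:
        return math.inf   # nothing but signal
    if signal == 0:
        return -math.inf  # nothing but noise
    return 10 * math.log10(signal / noise)


if __name__ == "__main__":
    page = (
        "<html><head><style>p{}</style><script>var x=1;</script></head>"
        "<body><div>Home | Login</div><p>Main article text here.</p></body></html>"
    )
    labelled_signal = {"Main article text here."}
    chunks = extract_text(page)
    print(chunks)
    print(snr_db(chunks, labelled_signal))
```

Under this assumed definition, a de-noising step that removes the navigation chunk raises the score, which matches the abstract's claim that a higher SNR indicates a better de-noising result.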