
Web Information Extraction Based on Mutual Information Metric (cited by: 5)
Abstract: Extracting valuable information from complex, cluttered web pages is an important problem in information retrieval and Web data mining. Exploiting the distributional characteristics of information across a set of web pages, we propose a Web information extraction method based on a mutual information metric that automatically identifies noise and retains key information. The method parses each web page into a DOM tree and computes the mutual information value of every leaf node. The leaf nodes are then aggregated into blocks according to the DOM tree structure, and the mutual information value of the <body> tag is computed recursively from the bottom up and used as the threshold that separates noise from non-noise. Experiments and comparisons on several well-known Chinese websites demonstrate the effectiveness of the proposed method.
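Source: Computer Applications and Software (《计算机应用与软件》), CSCD, Peking University Core Journal, 2013, No. 12, pp. 15-18 (4 pages)
Funding: National Natural Science Foundation of China (61070033, 61100148); Natural Science Foundation of Guangdong Province (9251009001000005, S2011040004804)
Keywords: information extraction; DOM; mutual information; threshold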
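
The pipeline outlined in the abstract (score the leaves, aggregate by block, use the <body> value as the threshold) can be illustrated with a minimal sketch. The abstract does not give the paper's formulas, so everything concrete here is an assumption rather than the authors' method: the leaf score is a pointwise mutual information between a leaf's text and its page, log(N * c(t, p) / c(t)), the score of a block is the plain mean of its children's scores, and the names `Node`, `leaf_scores`, `block_score` and `extract` are hypothetical. Pages are built by hand instead of being parsed from HTML.

```python
import math
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Node:
    """A simplified DOM node; `text` is only meaningful for leaves."""
    tag: str
    text: str = ""
    children: list = field(default_factory=list)


def leaves(node):
    """Yield every leaf of the tree in document order."""
    if not node.children:
        yield node
    else:
        for child in node.children:
            yield from leaves(child)


def leaf_scores(pages):
    """Return one {leaf text: score} dict per page.

    Assumed score: log(N * c(t, p) / c(t)), where N is the number of pages,
    c(t, p) the count of text t in page p and c(t) its count in the whole set.
    Text repeated evenly on every page scores 0; text unique to one page
    scores log(N).
    """
    n_pages = len(pages)
    per_page = [Counter(leaf.text for leaf in leaves(p)) for p in pages]
    total = sum(per_page, Counter())
    return [
        {t: math.log(n_pages * c / total[t]) for t, c in counts.items()}
        for counts in per_page
    ]


def block_score(node, scores):
    """Aggregate upward: a block's score is the mean of its children's scores."""
    if not node.children:
        return scores[node.text]
    return sum(block_score(c, scores) for c in node.children) / len(node.children)


def extract(page, scores):
    """Keep the leaves whose score reaches the score of the root (<body>)."""
    threshold = block_score(page, scores)
    return [leaf.text for leaf in leaves(page) if scores[leaf.text] >= threshold]


def make_page(article_text):
    """Build a toy page: shared navigation and footer around one unique paragraph."""
    return Node("body", children=[
        Node("div", children=[Node("a", "Home"), Node("a", "About")]),
        Node("div", children=[Node("p", article_text)]),
        Node("div", children=[Node("span", "Copyright")]),
    ])


pages = [
    make_page("Mutual information separates noise from content."),
    make_page("DOM trees group leaf nodes into blocks."),
    make_page("The body score serves as the threshold."),
]
scores = leaf_scores(pages)

for page, page_scores in zip(pages, scores):
    print(extract(page, page_scores))
```

On this toy set the shared navigation and footer strings score 0 and each unique paragraph scores log(3), so the <body> mean falls between the two groups and only the paragraph is kept. Real pages would call for a proper HTML parser and, very likely, a different aggregation rule than the simple mean assumed here.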

References (11)

  • 1. Byeong H K, Yang S K. Noise elimination from the Web documents by using URL paths and information redundancy [C]//The 2006 International Conference on Information & Knowledge Engineering, 2006: 135-141.
  • 2. Chang C H, Kayed M, Girgis R, et al. A survey of web information extraction systems [J]. IEEE Transactions on Knowledge and Data Engineering, 2006, 18(10): 1411-1428.
  • 3. Chen Zhao, Zhang Dongmei. A survey of Web information extraction technology [J]. Application Research of Computers, 2010, 27(12): 4401-4405. Cited by: 22
  • 4. Wang Qi, Tang Shiwei, Yang Dongqing, Wang Tengjiao. DOM-based automatic extraction of topical information from web pages [J]. Journal of Computer Research and Development, 2004, 41(10): 1786-1792. Cited by: 81
  • 5. Chen Zhi'ang, Zhou Zhiyu, Li Daxue. A fast template-based algorithm for automatic web page text extraction [J]. Application Research of Computers, 2009, 26(7): 2646-2649. Cited by: 11
  • 6. Weninger T, Hsu W H, Han J. CETR: content extraction via tag ratios [C]//Proceedings of the 19th International Conference on World Wide Web. Raleigh: ACM Press, 2010: 971-980.
  • 7. Sun F, Song D, Liao L. DOM based content extraction via text density [C]//Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval. Beijing: ACM Press, 2011: 245-254.
  • 8. Cover T M, Thomas J A. Elements of Information Theory [M]. 2nd ed. Hoboken, New Jersey: John Wiley & Sons, Inc., 2006.
  • 9. Pinto D, Branstein M, Coleman R, et al. QuASM: a system for question answering using semi-structured data [C]//Proceedings of the 2nd ACM/IEEE-CS Joint Conference on Digital Libraries, 2002: 46-55.
  • 10. Gottron T. Content code blurring: a new approach to content extraction [C]//DEXA'08: Proceedings of the 19th International Conference on Database and Expert Systems Application, 2008: 29-33.

