Content Extraction of Uighur Web Based on Content Correlativity (cited by: 2)

Original Chinese title: 基于正文相关度的维吾尔网页正文提取
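Abstract (translated from the Chinese): The main information of a web page is usually buried in a large amount of unrelated structure and text, so the main content cannot be retrieved quickly, which lowers the efficiency of text detection. To address this, and taking into account characteristics of Uighur web pages such as non-standard encoding and a large share of forum-style pages, this paper proposes a content extraction algorithm based on content correlativity, and builds mathematical models of contextual content density and of the content ratio between nodes to improve the algorithm. Experiments on a large number of Uighur web pages show that the algorithm achieves good precision and recall for content extraction and can effectively extract the required main content from Uighur web pages.
Abstract (authors' English version): In addition to the main content, most Uighur web pages contain noise, such as navigation panels and advertisements, that is unrelated to the main content. To improve the efficiency of security detection, this paper presents a content extraction algorithm for Uighur web pages based on web text correlativity, and designs models of text density and content scale to improve the algorithm. Experimental results show that the algorithm can extract the main content from Uighur web pages efficiently.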
Source: Computer Engineering (《计算机工程》), indexed in CAS and CSCD, 2012, No. 21, pp. 153-156, 160 (5 pages in total)
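Funding: High-Technology Research and Development Fund of the Xinjiang Uygur Autonomous Region (201012112); Special Fund for Electronics Development of the Xinjiang Uygur Autonomous Region (XJDZZXZJ20109)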
Keywords: content extraction; content correlativity; information security; natural language processing; content density
References (10)

  • 1. Rahman A R, Alam H, Hartono R. Content Extraction from HTML Documents[C]//Proc. of the 1st International Workshop on Web Document Analysis. Seattle, USA: [s. n.], 2001: 7-10.
  • 2. Liu Ling, Pu C, Han Wei. XWRAP: An XML-enabled Wrapper Construction System for Web Information Sources[C]//Proc. of International Conference on Data Engineering. San Diego, USA: [s. n.], 2000: 611-621.
  • 3. Wu Qi, Chen Xingshu, Tan Jun. Web Page Content Extraction Algorithm Based on Weight Optimization[J]. Journal of South China University of Technology (Natural Science Edition), 2011, 39(4): 32-37. (In Chinese; cited by: 8)
  • 4. Cai Deng, Yu Shipeng, Wen Jirong, et al. Extracting Content Structure for Web Pages Based on Visual Representation[C]//Proc. of the 5th Asian-Pacific Web Conference. Xi'an, China: [s. n.], 2003: 406-417.
  • 5. Cai Deng, Yu Shipeng, Wen Jirong, et al. VIPS: A Vision Based Page Segmentation Algorithm[R]. Microsoft Research, Technical Report: MSR-TR-2003-79, 2003.
  • 6. Sun Fei, Song Dandan, Liao Lejian. DOM Based Content Extraction via Text Density[C]//Proc. of the 34th Annual ACM SIGIR Conference. [S. l.]: ACM Press, 2011: 245-254.
  • 7. Wang Qi, Tang Shiwei, Yang Dongqing, Wang Tengjiao. DOM-Based Automatic Extraction of Topical Information from Web Pages[J]. Journal of Computer Research and Development, 2004, 41(10): 1786-1792. (In Chinese; cited by: 81)
  • 8. Hegaret P L. W3C Document Object Model[EB/OL]. (2009-01-06). http://www.w3.org/DOM.
  • 9. Weninger T, Hsu W H, Han J. CETR: Content Extraction via Tag Ratios[C]//Proc. of the 10th International World Wide Web Conference. New York, USA: [s. n.], 2010: 971-980.
  • 10. Zong Chengqing. Statistical Natural Language Understanding[M]. Beijing: Tsinghua University Press, 2008. (In Chinese)

Second-level references (28)

  • 1. Ou Jianwen, Dong Shoubin, Cai Bin. Method for Extracting Topical Information from Template-Based Web Pages[J]. Journal of Tsinghua University (Science and Technology), 2005, 45(S1): 1743-1747. (In Chinese; cited by: 70)
  • 2. Jing Tao, Zuo Wanli. Web Page Noise Removal Algorithm Based on Visual Layout Information[J]. Journal of South China University of Technology (Natural Science Edition), 2004, 32(z1): 84-87. (In Chinese; cited by: 21)
  • 3. Wang J Y, Lochovsky F H. Data-rich section extraction from HTML pages[C]//Proc of the 3rd International Conference on Web Information Systems Engineering. Singapore: IEEE Computer Society Press, 2002: 313-322.
  • 4. W3C DOM IG. Document object model[EB/OL]. (2010-6-5). http://www.w3.org/DOM/.
  • 5. Lin S H, Ho J M. Discovering informative content blocks from web documents[C]//Proc of the ACM SIGKDD'02. Alberta: ACM, 2002: 190-195.
  • 6. Lan Y, Liu B, Li X L. Eliminating noisy information in web pages for data mining[C]//Proc of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Washington: ACM, 2003: 296-305.
  • 7. Debnath S, Mitra P, Pal N, et al. Automatic identification of informative sections of web pages[J]. IEEE Trans. on Knowledge and Data Engineering, 2005, 17(9): 1233-1246.
  • 8. Suhit G, Gail K, David N, et al. DOM-based content extraction of HTML documents[C]//Proc of the 12th International World Wide Web Conference. Budapest: ACM, 2003: 207-217.
  • 9. Cai Deng, He Xiao-fei, Wen Ji-rong, et al. Block-level link analysis[C]//Proc of SIGIR'04. Sheffield: ACM, 2004: 134-142.
  • 10. Song Rui-hua, Liu Hai-feng, Wen Ji-rong, et al. Learning block importance models for web pages[C]//Proc of World Wide Web Conference. New York: ACM, 2004: 343-348.

Co-citing literature (87)

Co-cited literature (25)

Citing literature (2)

Second-level citing literature (4)
