
An Algorithm for Extracting Text from Multi-text Websites Based on Statistics (cited by: 2)
Abstract: As the Internet has spread, the volume of online data has grown explosively. Search engines have become the main way people obtain information online, so improving their search accuracy is a pressing research problem. This paper first surveys the state of research on text summarization in China and abroad, reviews the existing results in the field, and compares the main algorithms. It then proposes a statistics-based algorithm for extracting text from websites that host many text pages, such as research websites. The algorithm uses a breadth-first crawler to fetch the site's HTML source, analyzes the structure of the source and parses it into a DOM tree, and finally extracts the site's text with a statistical method. Validation shows that the algorithm extracts the combined text needed for website summarization reasonably well.
Author: Wang Qing (王晴), Xuzhou Open University, Xuzhou 221116, China
Affiliation: Xuzhou Open University
Source: Journal of Anhui Vocational College of Electronics & Information Technology (《安徽电子信息职业技术学院学报》), 2021, No. 4, pp. 6-12 (7 pages)
Funding: 2019 Jiangsu Open University (Jiangsu City Vocational College) 13th Five-Year Plan research project "An Exploration of a SPOC-Based Blended Teaching Model for Higher Vocational Education" (19TXZC-10).
Keywords: automatic text summarization; webpage text extraction; breadth-first search; DOM tree; ROUGE evaluation
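
The pipeline outlined in the abstract (breadth-first crawl, structural parse of the HTML, statistical selection of text) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the paper's implementation: the record does not say which statistic the author uses, so the sketch substitutes a common heuristic (minimum block length plus maximum link density), and it stands in for a full DOM tree with a flat list of block-level elements. The names `BlockParser`, `select_blocks`, `crawl_bfs`, the thresholds, and the in-memory `SITE` are all hypothetical.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class BlockParser(HTMLParser):
    """Collect hyperlinks and, per block-level element, its text and link-text size."""
    BLOCK_TAGS = {"p", "div", "li", "td", "h1", "h2", "h3"}
    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.links = []            # href values in document order
        self.blocks = []           # (text, number of characters inside <a>)
        self._text, self._link_chars = "", 0
        self._in_link = self._skip = 0

    def _flush(self):
        text = self._text.strip()
        if text:
            self.blocks.append((text, self._link_chars))
        self._text, self._link_chars = "", 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip += 1
        elif tag in self.BLOCK_TAGS:
            self._flush()
        elif tag == "a":
            self._in_link += 1
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip:
            self._skip -= 1
        elif tag in self.BLOCK_TAGS:
            self._flush()
        elif tag == "a" and self._in_link:
            self._in_link -= 1

    def handle_data(self, data):
        if self._skip:
            return                 # ignore script and style bodies
        self._text += data
        if self._in_link:
            self._link_chars += len(data.strip())

    def close(self):
        super().close()
        self._flush()              # keep trailing text outside any block tag


def select_blocks(blocks, min_chars=20, max_link_density=0.3):
    """Assumed statistic: keep long blocks whose text is mostly not link text."""
    return [text for text, link_chars in blocks
            if len(text) >= min_chars and link_chars / len(text) <= max_link_density]


def crawl_bfs(start_url, fetch, max_pages=50):
    """Visit same-site pages in breadth-first order; return (visit order, kept text)."""
    site = urlparse(start_url).netloc
    seen, queue = {start_url}, deque([start_url])
    order, texts = [], []
    while queue and len(order) < max_pages:
        url = queue.popleft()      # FIFO queue gives breadth-first order
        html = fetch(url)
        if html is None:
            continue
        order.append(url)
        parser = BlockParser()
        parser.feed(html)
        parser.close()
        texts.extend(select_blocks(parser.blocks))
        for href in parser.links:
            nxt = urljoin(url, href)
            if urlparse(nxt).netloc == site and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return order, texts


# Hypothetical four-page site held in memory so the sketch runs without a network.
SITE = {
    "http://example.org/": '<div><a href="/a">Lab A</a> <a href="/b">Lab B</a></div>'
                           '<p>This institute studies automatic text summarization methods.</p>',
    "http://example.org/a": '<p>Lab A works on webpage text extraction from research sites.</p>'
                            '<a href="/">Home</a><a href="/c">C</a>',
    "http://example.org/b": '<p>Lab B evaluates summaries with ROUGE on crawled pages.</p>'
                            '<script>var x = 1;</script>',
    "http://example.org/c": '<p>short</p>',
}

order, texts = crawl_bfs("http://example.org/", SITE.get)
print(order)
print(texts)
```

Here `fetch` is any callable that returns a page's HTML or `None`; passing `SITE.get` keeps the example offline, and a real crawler would presumably wrap an HTTP client instead. Using a FIFO queue is what makes the traversal breadth-first: both pages linked from the home page are visited before the page that is two links away.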
