Abstract
With the spread of the Internet, the volume of online data has grown explosively. Search engines have become the main way people obtain information from the web, so their retrieval accuracy is a pressing research problem. This paper first reviews the state of research on text summarization in China and abroad, examines the results achieved so far, and compares the existing algorithms. It then proposes a statistics-based algorithm for extracting text from websites, aimed at text-heavy sites such as scientific research websites. The algorithm uses a crawler with a breadth-first search strategy to fetch a site's HTML source, analyzes the structure of that source and parses it into a DOM tree, and finally applies a statistical method to extract the site's text. Verification shows that the algorithm performs well at extracting the combined text used for website summarization.
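The pipeline named in the abstract (breadth-first crawl, DOM tree, statistical extraction) can be illustrated with a minimal Python sketch. The abstract does not say which statistics the paper uses, so the text-length and link-density thresholds below are assumptions made only for illustration, as are all names (`Node`, `DomBuilder`, `extract_text`, `crawl_bfs`) and the injected `fetch` callable that stands in for an HTTP client.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}
SKIP_TAGS = {"script", "style"}
BLOCK_TAGS = {"p", "div", "li", "td", "h1", "h2", "h3"}  # assumed candidate set


class Node:
    """One DOM node; text is stored as child nodes tagged '#text'."""
    def __init__(self, tag, parent=None, data=""):
        self.tag, self.parent, self.data = tag, parent, data
        self.children, self.href = [], None


class DomBuilder(HTMLParser):
    """Builds a simplified DOM tree from HTML source."""
    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        node.href = dict(attrs).get("href")
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        node = self.current  # close the nearest open element with this tag
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.children.append(Node("#text", self.current, data.strip()))


def iter_nodes(node):
    yield node
    for child in node.children:
        yield from iter_nodes(child)


def text_stats(node, in_link=False):
    """Return (total characters, characters inside <a>) for a subtree."""
    if node.tag in SKIP_TAGS:
        return 0, 0
    if node.tag == "#text":
        return len(node.data), (len(node.data) if in_link else 0)
    in_link = in_link or node.tag == "a"
    total = linked = 0
    for child in node.children:
        t, l = text_stats(child, in_link)
        total, linked = total + t, linked + l
    return total, linked


def subtree_text(node):
    if node.tag in SKIP_TAGS:
        return ""
    if node.tag == "#text":
        return node.data
    return " ".join(filter(None, (subtree_text(c) for c in node.children)))


def has_block_descendant(node):
    return any(c.tag in BLOCK_TAGS or has_block_descendant(c) for c in node.children)


def extract_text(html, min_chars=20, max_link_density=0.3):
    """Keep innermost blocks that are long enough and not dominated by links."""
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    kept = []
    for node in iter_nodes(builder.root):
        if node.tag not in BLOCK_TAGS or has_block_descendant(node):
            continue
        total, linked = text_stats(node)
        if total > 0 and total >= min_chars and linked / total <= max_link_density:
            kept.append(subtree_text(node))
    return builder.root, kept


def crawl_bfs(start_url, fetch, max_pages=50):
    """Breadth-first crawl; fetch(url) returns HTML or None (hypothetical hook)."""
    queue, seen, results = deque([start_url]), {start_url}, {}
    while queue and len(results) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        root, results[url] = extract_text(html)
        for node in iter_nodes(root):
            if node.tag == "a" and node.href:
                link = urljoin(url, node.href)
                if link not in seen:
                    seen.add(link)
                    queue.append(link)
    return results
```

Passing `fetch` in as a parameter keeps the sketch free of network code: a dictionary's `get` method is enough for experiments, and a real HTTP client can be swapped in later. The thresholds would need tuning against real pages.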
Author
Wang Qing (王晴), Xuzhou Open University, Xuzhou 221116, China
Source
Journal of Anhui Vocational College of Electronics & Information Technology (《安徽电子信息职业技术学院学报》)
2021, No. 4, pp. 6-12 (7 pages)
Funding
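2019 Jiangsu Open University (City Vocational College of Jiangsu) 13th Five-Year Plan research planning project "Exploration of an SPOC-Based Blended Teaching Model for Higher Vocational Education" (19TXZC-10).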