Abstract
To collect topic-relevant Web pages as accurately as possible, a Web crawler usually draws on several kinds of Web information to predict the value of the links it has yet to visit. To improve the accuracy of this link-value prediction, this paper proposes a Web crawler that adjusts the importance of each kind of Web information on its own, based on the pages it has already crawled. The crawler has a learning ability: by crawling a training set, it analyzes how important each kind of Web information is for predicting link value, then adjusts the combination weights of that information during crawling accordingly, yielding a better search strategy that matches actual Web conditions. With "computer" as the crawl topic, the algorithm is compared with the traditional algorithm that combines Web information using fixed weights. The experimental results show that a crawler using the proposed algorithm achieves higher Web search precision than the traditional crawler.
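The abstract does not say which kinds of Web information are combined or how the weights are learned, so the following is only a minimal sketch of the general idea, not the paper's algorithm. The feature names (`anchor_text`, `parent_page`, `url_tokens`), the functions `learn_weights` and `link_value`, and the learning rule itself (mean score on on-topic links minus mean score on off-topic links) are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: feature name -> relevance score in [0, 1].
Features = Dict[str, float]


def learn_weights(samples: List[Tuple[Features, bool]]) -> Dict[str, float]:
    """Estimate feature weights from a crawled training set.

    Each sample pairs a link's feature scores with whether the page it led
    to turned out to be on topic. Assumed rule (not from the paper): a
    feature's raw weight is its mean score on on-topic links minus its mean
    score on off-topic links, clipped at zero, then normalized to sum to 1.
    """
    names = sorted({name for feats, _ in samples for name in feats})
    on_topic = [feats for feats, relevant in samples if relevant]
    off_topic = [feats for feats, relevant in samples if not relevant]

    def mean(rows: List[Features], name: str) -> float:
        return sum(r.get(name, 0.0) for r in rows) / len(rows) if rows else 0.0

    raw = {n: max(0.0, mean(on_topic, n) - mean(off_topic, n)) for n in names}
    total = sum(raw.values())
    if total == 0.0:
        # No feature separates the two classes: fall back to equal weights,
        # which corresponds to a fixed-combination crawler.
        return {n: 1.0 / len(names) for n in names} if names else {}
    return {n: v / total for n, v in raw.items()}


def link_value(features: Features, weights: Dict[str, float]) -> float:
    """Predict a link's value as the weighted sum of its feature scores."""
    return sum(weights.get(name, 0.0) * score for name, score in features.items())


# Toy training set: only the anchor text clearly separates the two classes.
training = [
    ({"anchor_text": 0.9, "parent_page": 0.6, "url_tokens": 0.5}, True),
    ({"anchor_text": 0.8, "parent_page": 0.7, "url_tokens": 0.5}, True),
    ({"anchor_text": 0.2, "parent_page": 0.6, "url_tokens": 0.5}, False),
    ({"anchor_text": 0.1, "parent_page": 0.5, "url_tokens": 0.5}, False),
]
weights = learn_weights(training)

# Rank two uncrawled links with the learned weights.
link_a = {"anchor_text": 0.9, "parent_page": 0.2, "url_tokens": 0.1}
link_b = {"anchor_text": 0.2, "parent_page": 0.9, "url_tokens": 0.9}
ranked = sorted([("a", link_a), ("b", link_b)],
                key=lambda item: link_value(item[1], weights), reverse=True)
```

In this toy data the anchor-text score receives most of the weight, so `link_a` ranks ahead of `link_b` even though `link_b` scores higher on the other two features. A fixed equal-weight combination would rank them the other way round.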
Source
Computer Technology and Development (《计算机技术与发展》)
2013, Issue 11, pp. 216-219 (4 pages)
Funding
Funded project of the Hunan Provincial Education Scientific Research Program (09C231)