
Research on Self-learning of Information Combination in Web Collecting
Abstract: To collect topic-relevant pages as accurately as possible, a Web crawler usually predicts the value of each uncrawled link from several kinds of Web information. To improve the accuracy of this prediction, this paper proposes a Web crawler that adjusts the importance of each kind of Web information on its own, based on the pages it has already crawled. The crawler has a learning capability: by crawling a training set, it analyzes how important each kind of Web information is for predicting link value, adjusts the combination weights of the Web information used during crawling accordingly, and obtains a better search strategy that matches actual Web conditions. With "computer" as the crawl topic, the algorithm is compared with the traditional algorithm that combines Web information with fixed weights. The experimental results show that the crawler using this algorithm achieves higher Web search precision than the traditional crawler.
Source: Computer Technology and Development (《计算机技术与发展》), 2013, No. 11, pp. 216-219 (4 pages)
Funding: Hunan Provincial Education Scientific Research Program (09C231)
Keywords: Web crawler; link value; topic search; search strategy; Web information combination
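The abstract describes the idea only at a high level: estimate how important each kind of Web information is from a crawled training set, then use those importances as combination weights when scoring uncrawled links. The sketch below illustrates that idea under stated assumptions and is not the paper's algorithm. The signal names (`anchor_text`, `url_tokens`, `parent_page`), the function names, and the learning rule (importance = 1 minus the mean absolute prediction error of a signal, normalized to sum to 1) are all hypothetical, since the abstract specifies none of them.

```python
from typing import Dict, List, Tuple

# One link's per-signal relevance estimates, each assumed to lie in [0, 1].
# The signal names used with this type are hypothetical examples.
Signals = Dict[str, float]
# A training example: the signals seen for a link, and the true relevance
# (in [0, 1]) of the page that link led to once it was crawled.
Example = Tuple[Signals, float]


def learn_weights(training: List[Example], names: List[str]) -> Dict[str, float]:
    """Estimate a combination weight per signal from a crawled training set.

    Assumed rule (not taken from the paper): a signal's raw importance is
    1 minus its mean absolute error as a stand-alone predictor of page
    relevance; raw importances are then normalized to sum to 1.
    """
    uniform = {name: 1.0 / len(names) for name in names}
    if not training:
        return uniform
    raw: Dict[str, float] = {}
    for name in names:
        errors = [abs(signals.get(name, 0.0) - relevance)
                  for signals, relevance in training]
        raw[name] = max(0.0, 1.0 - sum(errors) / len(errors))
    total = sum(raw.values())
    if total == 0.0:
        return uniform
    return {name: value / total for name, value in raw.items()}


def link_score(signals: Signals, weights: Dict[str, float]) -> float:
    """Predicted link value: weighted sum of the per-signal estimates."""
    return sum(w * signals.get(name, 0.0) for name, w in weights.items())


def rank_links(candidates: Dict[str, Signals],
               weights: Dict[str, float]) -> List[str]:
    """Order candidate URLs by predicted value, highest first."""
    return sorted(candidates,
                  key=lambda url: link_score(candidates[url], weights),
                  reverse=True)
```

A fixed-weight crawler corresponds to skipping `learn_weights` and passing hand-chosen weights to `rank_links`; the learned variant replaces those constants with values derived from the training crawl.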
