Abstract
This paper focuses on two technologies of a vertical search system for networked manufacturing resources: the topic crawler and Chinese word segmentation. A page evaluation module is added to the topic crawler so that links found in pages with high similarity to the topic are crawled first, which improves crawling efficiency. With the segmentation dictionary stored in layers, an improved and concise dictionary matching algorithm raises the speed and precision of word segmentation, reduces the size of the index, and improves responsiveness to users.
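The crawling strategy described above, where links inherit priority from the page they were found on, can be sketched as a best-first frontier. This is a minimal illustration and not the paper's implementation: the cosine similarity over term counts, the `Frontier` class, the `crawl` function, and the caller-supplied `fetch` callback are all hypothetical names and choices, since the abstract does not specify how pages are evaluated.

```python
import heapq
import itertools
import re
from collections import Counter
from math import sqrt


def cosine_similarity(text, topic_terms):
    """Cosine similarity between a page's term counts and a set of topic terms.

    Assumed scoring function; the paper's page evaluation method is not given.
    """
    words = Counter(re.findall(r"\w+", text.lower()))
    if not words or not topic_terms:
        return 0.0
    dot = sum(words[t] for t in topic_terms)
    norm_page = sqrt(sum(c * c for c in words.values()))
    norm_topic = sqrt(len(topic_terms))
    return dot / (norm_page * norm_topic)


class Frontier:
    """Priority queue of URLs; URLs with higher scores are popped first."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._counter = itertools.count()  # tie-breaker: earlier pushes win

    def push(self, url, score):
        if url not in self._seen:
            self._seen.add(url)
            # heapq is a min-heap, so the score is negated.
            heapq.heappush(self._heap, (-score, next(self._counter), url))

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __bool__(self):
        return bool(self._heap)


def crawl(seeds, fetch, topic_terms, max_pages=100):
    """Best-first crawl. `fetch(url)` must return (page_text, list_of_links)."""
    frontier = Frontier()
    for url in seeds:
        frontier.push(url, 1.0)
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.pop()
        text, links = fetch(url)
        visited.append(url)
        # Every outgoing link inherits the similarity score of its parent page.
        score = cosine_similarity(text, topic_terms)
        for link in links:
            frontier.push(link, score)
    return visited
```

In this sketch `fetch` can be any function that downloads and parses a page, so the ordering logic can be exercised against an in-memory dictionary of pages before it is connected to a real HTTP client and HTML parser.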
Source
Journal of Computer Applications (《计算机应用》)
CSCD
PKU Core (北大核心)
2007, No. 5, pp. 1116-1118 (3 pages)
Funding
Supported by the National Natural Science Foundation of China (5047185)
Keywords
networked manufacturing
manufacturing resource
vertical search engine
page parsing (HTML parser)