Abstract
To make full use of the information contained in blog posts, this paper proposes a basic approach to building a Chinese blog search engine (CBSS). The system architecture of the CBSS is designed from an analysis of the technical characteristics of blogs and the working principles of blog search engines. On this basis, the traditional web crawler is improved using rule definitions and regular expressions combined with Really Simple Syndication (RSS), which largely solves the problem that blog content is difficult to collect and index. RSS is also used to put blog information into a structured format, which speeds up collection. A Chinese blog search engine is then implemented with the Lucene.net full-text search library by extending Chinese word segmentation. Experimental results show that the proposed design and techniques are feasible.
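As an illustration of the crawler idea summarized above, the following is a minimal sketch, not the paper's implementation: a regular-expression rule locates a blog's RSS feed in its HTML, and the feed is then parsed into uniform records ready for indexing. The rule `FEED_LINK_RULE`, the function names, and the sample data are all hypothetical, and the rule assumes that the `type` attribute appears before `href` in the `<link>` tag.

```python
import re
import xml.etree.ElementTree as ET
from typing import Dict, List

# Hypothetical rule: match a <link> tag that advertises an RSS feed.
# Assumes the type attribute precedes href, which real pages may not honor.
FEED_LINK_RULE = re.compile(
    r'<link[^>]*type=["\']application/rss\+xml["\'][^>]*href=["\']([^"\']+)["\']',
    re.IGNORECASE,
)


def find_feed_urls(html: str) -> List[str]:
    """Return every RSS feed URL that the rule finds in a blog page."""
    return FEED_LINK_RULE.findall(html)


def parse_rss_items(rss_xml: str) -> List[Dict[str, str]]:
    """Turn the <item> elements of an RSS 2.0 document into flat records."""
    root = ET.fromstring(rss_xml)
    records = []
    for item in root.iter("item"):
        records.append({
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
            "description": (item.findtext("description") or "").strip(),
            "pubDate": (item.findtext("pubDate") or "").strip(),
        })
    return records


# Made-up sample data for demonstration only.
SAMPLE_HTML = (
    '<html><head><link rel="alternate" type="application/rss+xml" '
    'href="http://example.com/blog/rss.xml"></head></html>'
)
SAMPLE_RSS = """<rss version="2.0"><channel>
  <title>Example blog</title>
  <item>
    <title>First post</title>
    <link>http://example.com/blog/1</link>
    <description>Hello</description>
  </item>
  <item>
    <title>Second post</title>
    <link>http://example.com/blog/2</link>
    <description>World</description>
  </item>
</channel></rss>"""

if __name__ == "__main__":
    print(find_feed_urls(SAMPLE_HTML))
    for record in parse_rss_items(SAMPLE_RSS):
        print(record["title"], record["link"])
```

Because every feed yields the same fields, the records can be passed to an indexer without per-site HTML parsing, which is one plausible reading of why RSS speeds up collection.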
Source
Computer Engineering and Design (《计算机工程与设计》)
CSCD
Peking University Core Journal (北大核心)
2010, No. 8, pp. 1718-1721 (4 pages)