Abstract
This paper presents a prototype of a twice-gathering information interaction system for e-government that prevents repeated information gathering when the Intranet and the Extranet are physically isolated. Through client software, users in the Extranet gather a second time the content behind the links to the latest relevant web pages that the web alert function has collected from the Internet. The gathered results are then carried by ferry-type transport devices to a storage device and synchronized with the web platform built in the Intranet, where Intranet users can browse them directly. Both information gathering in the Extranet and data synchronization between the Extranet and the Intranet require comparing information fingerprints extracted from the web pages, so that repeated gathering and copying are avoided. The prototype stores the information fingerprints in a HashTrie. In testing, HashTrie-based fingerprint extraction was 2.28 times faster than Darts (double-array Trie), the fastest structure among current patent applications. A new Hash function is also proposed, and 12 existing high-speed Hash functions are implemented for use with HashTrie. When the dictionary holds more than 5×10^5 words, the PJWHash or SuperFastHash function can be adopted; when it holds 10^5 words, the CalcStrCRC32 and ELFHash functions can be adopted.
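The fingerprint comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the prototype's implementation: the abstract does not describe the internal layout of HashTrie or how fingerprints are extracted, so the sketch substitutes a plain chained hash table for HashTrie and assumes a fingerprint is the SHA-1 digest of whitespace-normalized page text. The names `elf_hash`, `page_fingerprint`, and `FingerprintStore` are hypothetical. Only the ELF hash itself follows the widely published algorithm that ELFHash-style functions are based on.

```python
import hashlib


def elf_hash(key: str) -> int:
    """Classic ELF hash over the UTF-8 bytes of key, kept to 32 bits."""
    h = 0
    for byte in key.encode("utf-8"):
        h = ((h << 4) + byte) & 0xFFFFFFFF
        high = h & 0xF0000000
        if high:
            h ^= high >> 24
        h &= ~high & 0xFFFFFFFF  # clear the top four bits
    return h


def page_fingerprint(text: str) -> str:
    """Assumed fingerprint: SHA-1 of the whitespace-normalized page text."""
    normalized = " ".join(text.split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


class FingerprintStore:
    """Chained hash table standing in for HashTrie (an assumption)."""

    def __init__(self, bucket_count: int = 1 << 16) -> None:
        self._buckets = [[] for _ in range(bucket_count)]

    def add_if_new(self, fingerprint: str) -> bool:
        """Store the fingerprint; return False if it was already present."""
        index = elf_hash(fingerprint) % len(self._buckets)
        bucket = self._buckets[index]
        if fingerprint in bucket:
            return False  # duplicate page: skip gathering or copying
        bucket.append(fingerprint)
        return True
```

A gatherer or synchronizer built along these lines would call `add_if_new(page_fingerprint(text))` for each page and fetch or copy the page only when the call returns `True`. Swapping `elf_hash` for another function such as a PJW-style hash changes only how fingerprints are distributed across buckets, not the result of the duplicate check.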
Funding
The National Basic Research Program of China (973 Program) (No. 2007CB310806)