Abstract
To address the large number of duplicate pages encountered in Web information mining, this paper analyzes several key problems from the perspective of how Web information is organized and proposes a keyword-based algorithm for removing partially similar pages, the Web multi-abstract duplicate removal method (multi abstract remove repeat, MARR). MARR modifies the traditional Web information database built on a word table and inverted files by adding a field that records the number of the abstract block corresponding to each keyword. It applies a text summarization algorithm, indexes the result in inverted-file fashion, and, during retrieval, filters or marks partial duplication related to the query term according to how similar the content is with respect to that term, yielding a more reasonable organization of retrieval results. MARR also replaces the traditional ordering by PageRank value with a tree-structured organization, to make information discovery easier for users. The method has been applied in a prototype Web retrieval system based on the MES intelligent information agent of the Sanming Iron and Steel Group.
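The abstract gives no implementation details, so the following is only a minimal Python sketch of the general idea, not the authors' MARR algorithm. It assumes fixed-size word blocks as a stand-in for the text summarization step, Jaccard similarity as the similarity measure, and a two-level grouping (representative page, then its near-duplicates) as a stand-in for the tree-structured organization. All function names, the block size, and the threshold are hypothetical.

```python
from collections import defaultdict


def split_blocks(text, size=8):
    """Split text into fixed-size word blocks (assumed stand-in for abstract blocks)."""
    words = text.lower().split()
    return [words[i:i + size] for i in range(0, len(words), size)]


def build_index(pages, size=8):
    """Build an inverted file whose postings carry an extra block-number field.

    pages: dict mapping page_id -> text.
    Returns (index, blocks) where index maps term -> [(page_id, block_no), ...]
    and blocks maps (page_id, block_no) -> frozenset of the block's words.
    """
    index = defaultdict(list)
    blocks = {}
    for page_id, text in pages.items():
        for block_no, words in enumerate(split_blocks(text, size)):
            blocks[(page_id, block_no)] = frozenset(words)
            for term in set(words):
                index[term].append((page_id, block_no))
    return index, blocks


def jaccard(a, b):
    """Jaccard similarity of two word sets (assumed similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def search(term, index, blocks, threshold=0.8):
    """Group hits for a term so near-duplicate pages hang under one representative.

    Only the first block of each page that contains the term is compared,
    which is a simplification made for brevity.
    Returns [(representative_page, [duplicate_pages]), ...].
    """
    groups = []  # entries: (rep_page, rep_block_words, duplicate_pages)
    seen = set()
    for page_id, block_no in index.get(term.lower(), []):
        if page_id in seen:
            continue
        seen.add(page_id)
        words = blocks[(page_id, block_no)]
        for _rep_page, rep_words, dups in groups:
            if jaccard(words, rep_words) >= threshold:
                dups.append(page_id)  # mark as duplicate of this representative
                break
        else:
            groups.append((page_id, words, []))
    return [(rep, dups) for rep, _words, dups in groups]
```

In this sketch the comparison is made per block containing the query term rather than per whole page, which is one way to read the abstract's "partial duplication related to the query term": two pages that differ in headers or footers can still be grouped if the block around the keyword matches.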
Source
Computer Engineering and Design (《计算机工程与设计》)
CSCD
PKU Core Journal (北大核心)
2006, No. 23, pp. 4521-4524, 4555 (5 pages)
Keywords
information retrieval
duplicate removal method
text summarization
inverted file
tree-structured organization