A Multi-Abstract Duplicate Removal Method Based on Duplicate Detection (cited by: 1)

Multi abstract remove repeat method for web mining research

Abstract: Web information mining currently suffers from a large number of duplicate pages. This paper analyzes several key problems from the perspective of how Web information is organized and proposes a keyword-based algorithm for removing partially similar pages: the Web multi-abstract remove repeat method (MARR). MARR modifies the traditional Web information database, which is built on a word table and an inverted file, by adding a field that records the number of the abstract block in which each keyword occurs. It applies a text-summarization algorithm and indexes the result in inverted-file form. During retrieval, it filters or marks partial, within-page duplication related to the query term, judged by how similar the content is with respect to that term, so that the retrieval results are organized more reasonably. MARR also replaces the traditional listing in PageRank order with a tree-structured organization, which makes information discovery easier for users. The method has been applied in a prototype Web retrieval system based on the MES intelligent information agent of the Sanming Iron and Steel Group.
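The abstract gives only an outline of MARR, so the following is a minimal Python sketch of the general idea, not the authors' implementation. Several parts are assumptions made for illustration: fixed-size sentence groups stand in for the text-summarization step, Jaccard similarity over word sets stands in for the unspecified similarity measure, and the names `split_blocks`, `build_index`, `search_grouped`, and the `threshold` value are all hypothetical. It shows an inverted index whose postings carry an abstract block number, and a query step that groups pages with near-identical matching blocks under one tree node.

```python
from collections import defaultdict

def split_blocks(text, size=2):
    # Assumed stand-in for a summarizer: every `size` sentences
    # form one "abstract block".
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    return [" ".join(sentences[i:i + size])
            for i in range(0, len(sentences), size)]

def build_index(pages, size=2):
    # Inverted file whose postings carry the extra block-number field:
    # keyword -> {(page_id, block_no), ...}
    index = defaultdict(set)
    blocks = {}
    for page_id, text in pages.items():
        for block_no, block in enumerate(split_blocks(text, size)):
            words = set(block.lower().split())
            blocks[(page_id, block_no)] = words
            for word in words:
                index[word].add((page_id, block_no))
    return index, blocks

def jaccard(a, b):
    # Assumed similarity measure; the abstract does not name one.
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def search_grouped(query, index, blocks, threshold=0.8):
    # Maps each representative (page_id, block_no) to the list of
    # other pages whose matching block is a near-duplicate of it.
    tree = {}
    for posting in sorted(index.get(query.lower(), ())):
        for rep, children in tree.items():
            if jaccard(blocks[posting], blocks[rep]) >= threshold:
                if posting[0] != rep[0] and posting[0] not in children:
                    children.append(posting[0])
                break
        else:
            # No similar representative found: start a new node.
            tree[posting] = []
    return tree

# Made-up sample pages; p1 and p2 share their first two sentences.
pages = {
    "p1": "Steel output rose this year. The mill added a new furnace. "
          "Weather was mild. Traffic was light.",
    "p2": "Steel output rose this year. The mill added a new furnace. "
          "Prices fell sharply. Exports grew fast.",
    "p3": "Steel prices are set by global demand. Analysts expect changes.",
}
index, blocks = build_index(pages, size=2)
print(search_grouped("steel", index, blocks))
```

In this sample, a query for "steel" places p2 under the node for p1, because their matching blocks are identical, while p3 forms its own node. Comparing only the blocks that contain the query term, rather than whole pages, is what lets partially duplicated pages be grouped even when the rest of their content differs.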

Authors: 程菲 (Cheng Fei), 汪建海 (Wang Jianhai), 罗键 (Luo Jian)

Affiliation: Department of Automation, Xiamen University

Source: 《计算机工程与设计》 (Computer Engineering and Design), CSCD, Peking University Core Journal, 2006, No. 23, pp. 4521-4524, 4555 (5 pages)

Keywords: information retrieval; duplicate removal (remove repeat) method; text abstract; inverted file; tree structure

Classification: TP393.09 [Automation and Computer Technology - Computer Application Technology]
