A New Method for Detecting Duplicate Records by Clustering in High-Dimensional Space (cited by: 4)

New approach for clustering similar duplicate records based on high dimensions
Abstract: Data cleaning is an important research area in building data warehouses, and detecting approximately duplicate records is one of its most important tasks. This paper proposes a new clustering-based method for detecting approximately duplicate records. The method uses N-grams to map the records of a relational table into a high-dimensional space, then clusters approximately duplicate records with IDS, an improved DBSCAN algorithm with adjustable density. Experimental results demonstrate the effectiveness of the method.
Authors: Cao Qujiang (曹渠江), Dong Ming (董明)
Source: Computer Engineering and Applications (《计算机工程与应用》), CSCD, PKU Core Journal, 2008, No. 29, pp. 171-173 (3 pages)
Keywords: approximately duplicate records; N-gram; Intrusion Detection System (IDS)
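
The pipeline outlined in the abstract (map each record to an N-gram vector, then cluster by density) can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: it uses character bigrams, cosine distance, and textbook DBSCAN with fixed `eps` and `min_pts`, whereas the paper's IDS variant adjusts density in a way the abstract does not specify. The function names and the sample records are hypothetical.

```python
from collections import Counter
from math import sqrt


def ngram_vector(text, n=2):
    """Count the character n-grams of a record; the counts act as a sparse vector."""
    s = text.lower()
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def cosine_distance(a, b):
    """1 - cosine similarity between two sparse n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return 1.0 - dot / norm if norm else 1.0


def dbscan(vectors, eps, min_pts):
    """Textbook DBSCAN (assumption: not the paper's IDS); -1 marks noise."""
    labels = [None] * len(vectors)

    def neighbours(i):
        # Brute-force range query; a point is always its own neighbour.
        return [j for j in range(len(vectors))
                if cosine_distance(vectors[i], vectors[j]) <= eps]

    cluster = -1
    for i in range(len(vectors)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1           # provisional noise; may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point joins as border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            more = neighbours(j)
            if len(more) >= min_pts:  # j is itself a core point, so keep expanding
                queue.extend(more)
    return labels


# Hypothetical sample records: two groups of near-duplicates and one singleton.
records = [
    "John Smith, 12 Oak Street, Boston",
    "Jon Smith, 12 Oak St., Boston",
    "John Smith, 12 Oak Street Boston",
    "Mary Jones, 4 Elm Road, Denver",
    "Mary Jones, 4 Elm Rd, Denver",
    "Zhang Wei, 88 Nanjing Road, Shanghai",
]
labels = dbscan([ngram_vector(r) for r in records], eps=0.3, min_pts=2)
print(labels)  # [0, 0, 0, 1, 1, -1]
```

Records sharing a label are candidate duplicates, and `-1` marks a record with no close neighbour. In this sketch `eps` is the density knob: a smaller value demands closer matches, which is presumably the kind of trade-off an adjustable-density variant is meant to tune.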

References (6)

  • 1. Hernandez M, Stolfo S. The merge/purge problem for large databases[C]//Proc ACM SIGMOD International Conference on Management of Data, 1995: 127-138.
  • 2. Hernandez M A, Stolfo S J. Real-world data is dirty: data cleansing and the merge/purge problem[J]. Data Mining and Knowledge Discovery, 1999, 2(1): 9-37.
  • 3. Qiu Y F, Tian Z P, Ji W Y, et al. An efficient approach for detecting approximately duplicate database records[J]. Chinese J of Computers, 2001, 24(1).
  • 4. Chaudhuri S, Ganjam K, Ganti V, et al. Robust and efficient fuzzy match for online data cleaning[C]//Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data. New York, USA: ACM Press, 2003: 313-324.
  • 5. Han Jingyu, Xu Lizhen, Dong Yisheng. A method for detecting similar records in large data volumes[J]. Journal of Computer Research and Development, 2005, 42(12): 2206-2212. Cited by: 32
  • 6. Han Jiawei, Kamber M. Data mining: concepts and techniques[M]. Beijing: China Machine Press, 2004: 223-263.

Secondary references (16)

  • 1. Mauricio Hernandez, Salvatore Stolfo. The merge/purge problem for large databases. In: ACM SIGMOD Record. New York: ACM Press, 1995. 127-138.
  • 2. Alvaro Monge, Charles Elkan. An efficient domain-independent algorithm for detecting approximately duplicate database records. Workshop on Research Issues on Data Mining and Knowledge Discovery (DMKD'97), Tucson, AZ, 1997.
  • 3. Karen Kukich. Techniques for automatically correcting words in text. ACM Computing Surveys, 1992, 24(4): 377-439.
  • 4. Liang Jin, Chen Li, Sharad Mehrotra. Efficient record linkage in large data sets. The 8th Int'l Conf. Database Systems for Advanced Applications, Kyoto, Japan, 2003.
  • 5. Surajit Chaudhuri, Kris Ganjam, Venkatesh Ganti, et al. Robust and efficient fuzzy match for online data cleaning. In: Proc. 2003 ACM SIGMOD Int'l Conf. Management of Data. New York: ACM Press, 2003. 313-324.
  • 6. Sunita Sarawagi, Anuradha Bhamidipaty. Interactive deduplication using active learning. In: Proc. 8th ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining. New York: ACM Press, 2002. 269-278.
  • 7. Wai Lup Low, Mong Li Lee, Tok Wang Ling. A knowledge-based approach for duplicate elimination in data cleaning. Information Systems, 2001, 26(8): 585-606.
  • 8. Rohit Ananthakrishna, Surajit Chaudhuri, Venkatesh Ganti. Eliminating fuzzy duplicates in data warehouses. In: Proc. 28th VLDB. San Francisco: Morgan Kaufmann, 2002. 586-597.
  • 9. Christos Faloutsos, King-Ip Lin. FastMap: A fast algorithm for indexing, data-mining and visualization of traditional and multimedia datasets. In: ACM SIGMOD Record. New York: ACM Press, 1995. 163-174.
  • 10. Gísli R. Hjaltason, Hanan Samet. Properties of embedding methods for similarity searching in metric spaces. IEEE Trans. Pattern Analysis and Machine Intelligence, 2003, 25(5): 530-549.

Co-citing documents (31)

Co-cited documents (39)

  • 1. Ge Li. A process neural network based on hybrid genetic algorithm learning[J]. Journal of Harbin Institute of Technology, 2005, 37(7): 986-988. Cited by: 21
  • 2. Han Jingyu, Xu Lizhen, Dong Yisheng. A method for detecting similar records in large data volumes[J]. Journal of Computer Research and Development, 2005, 42(12): 2206-2212. Cited by: 32
  • 3. Zhu Hengmin, Wang Ningsheng. An improved method for detecting approximately duplicate records[J]. Control and Decision, 2006, 21(7): 805-808. Cited by: 12
  • 4. Cheng Fei, Wang Jianhai, Luo Jian. A multi-summary deduplication method based on duplicate detection[J]. Computer Engineering and Design, 2006, 27(23): 4521-4524. Cited by: 1
  • 5. Elmagarmid A K, Ipeirotis P G, Verykios V S. Duplicate record detection: a survey[J]. IEEE Transactions on Knowledge and Data Engineering, 2007, 19(1): 1-16.
  • 6. Li Huang, Hai Jin, Pingpeng Yuan, et al. Duplicate records cleansing with length filtering and dynamic weighting[C]. Fourth International Conference on Semantics, Knowledge and Grid. Beijing: IEEE Press, 2008: 95-102.
  • 7. Coelho L S. Gaussian quantum-behaved particle swarm optimization approaches for constrained engineering design problems[J]. Expert Systems with Applications, 2010, 37(2): 1676-1683.
  • 8. Sun J, Fang W, Xu X J, et al. Quantum-Behaved Particle Swarm Optimization: Analysis of the Individual Particle's Behavior and Parameter Selection[J]. Evolutionary Computation, 2012, 20(3): 349-393.
  • 9. Liang Jin, Chen Li, Mehrotra S. Efficient record linkage in large data sets. Proc. of the 8th Int'l Conf. on Database Systems for Advanced Applications. 2003
  • 10. Elmagarmid A K, Ipeirotis P G, et al. Duplicate record detection: a survey. IEEE Transactions on Knowledge and Data Engineering. 2007

Citing documents (4)

Secondary citing documents (17)
