
Improvement on the Algorithm of Data Cleaning Based on MPN
Abstract: Eliminating approximately duplicate records is an important part of data cleaning; its goal is to remove redundant data. This paper introduces a popular algorithm for the problem, the Multi-Pass Sorted Neighborhood (MPN) algorithm. MPN removes approximately duplicate records reasonably well, but it has two shortcomings. First, the window size is fixed during detection, and the choice of window size strongly affects the result. Second, it relies on transitive closure, which easily leads to false matches. An improved algorithm based on MPN is proposed, and experimental results show that it outperforms MPN in both recall and precision.
Authors: Li Jian (李坚), Zheng Ning (郑宁)
Source: Computer Applications and Software (《计算机应用与软件》), indexed in CSCD and the Peking University Core Journals list, 2008, Issue 2, pp. 245-247 (3 pages)
Funding: Key Scientific Research Project for Social Development, Science and Technology Department of Zhejiang Province (2006C23060)
Keywords: data cleaning; approximately duplicate records; MPN
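
The baseline MPN procedure that the abstract criticizes can be sketched as follows. This is a minimal illustration of a multi-pass sorted neighborhood scan with a fixed window and transitive closure, not the authors' improved algorithm, which the abstract does not describe in detail. The record layout, the `similar` matcher built on `difflib.SequenceMatcher`, the 0.8 threshold, and the sort keys are all assumptions made for the example.

```python
from difflib import SequenceMatcher


def similar(a, b, threshold=0.8):
    """Hypothetical matcher: compare the space-joined fields of two records."""
    return SequenceMatcher(None, " ".join(a), " ".join(b)).ratio() >= threshold


def find(parent, i):
    """Union-find lookup with path halving."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def mpn(records, key_funcs, window=3, threshold=0.8):
    """Return groups of record indices judged to be duplicates."""
    parent = list(range(len(records)))
    for key in key_funcs:  # one pass per sort key
        order = sorted(range(len(records)), key=lambda i: key(records[i]))
        for pos, i in enumerate(order):
            # Fixed window: compare only with the previous window-1 records.
            for j in order[max(0, pos - window + 1):pos]:
                if similar(records[i], records[j], threshold):
                    # Merging sets is what makes matches transitive.
                    parent[find(parent, i)] = find(parent, j)
    groups = {}
    for i in range(len(records)):
        groups.setdefault(find(parent, i), []).append(i)
    return sorted(g for g in groups.values() if len(g) > 1)


# Illustrative data: records 0 and 2 are near-duplicates.
records = [
    ("John Smith", "12 Oak Street"),
    ("Johnny Adams", "9 Pine Avenue"),
    ("Jon Smith", "12 Oak Street"),
]
by_name = lambda r: r[0]
by_address = lambda r: r[1]

print(mpn(records, [by_name], window=2))
print(mpn(records, [by_name], window=3))
print(mpn(records, [by_name, by_address], window=2))
```

In this toy data, sorting by name places an unrelated record between the two near-duplicates, so a single pass with `window=2` misses the pair, while a larger window or a second pass on the address key finds it. That sensitivity to the window size is the first shortcoming named in the abstract. The union step shows where the second one comes from: if A matches B and B matches C, all three land in one group even when A and C are dissimilar.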


