Abstract
Eliminating approximately duplicate records, that is, removing redundant data, is an important task in data cleaning. This paper introduces a popular algorithm for the task, the Multi-Pass Sorted Neighborhood (MPN) algorithm. MPN removes approximately duplicate records reasonably well, but it has two shortcomings. First, the window size is fixed during detection, and the choice of window size strongly affects the results. Second, it relies on transitive closure, which easily leads to false matches. An improved algorithm based on MPN is proposed, and experimental results show that it outperforms MPN in both recall and precision.
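For orientation, the baseline MPN procedure that the abstract criticizes can be sketched roughly as follows. This is a minimal illustration of the commonly described multi-pass sorted neighborhood scheme, not the paper's improved algorithm, whose details the abstract does not give. The matcher `similar`, its `threshold`, and the sort-key functions passed in `key_funcs` are hypothetical choices made only for this example.

```python
from difflib import SequenceMatcher


def similar(a, b, threshold=0.8):
    """Hypothetical matcher: string similarity ratio over the joined fields."""
    return SequenceMatcher(None, " ".join(a), " ".join(b)).ratio() >= threshold


def find(parent, i):
    # Union-find lookup with path halving.
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def mpn(records, key_funcs, window=3, threshold=0.8):
    """Multi-pass sorted neighborhood with a fixed window and transitive closure."""
    parent = list(range(len(records)))
    for key in key_funcs:  # one pass per sort key
        order = sorted(range(len(records)), key=lambda i: key(records[i]))
        for pos, i in enumerate(order):
            # Compare each record only with the previous (window - 1) records.
            for j in order[max(0, pos - window + 1):pos]:
                if similar(records[i], records[j], threshold):
                    parent[find(parent, i)] = find(parent, j)  # union
    # Transitive closure: records sharing a root form one duplicate group.
    groups = {}
    for i in range(len(records)):
        groups.setdefault(find(parent, i), []).append(i)
    return sorted(g for g in groups.values() if len(g) > 1)
```

Both weaknesses named in the abstract are visible in this sketch: `window` is a fixed parameter, so true duplicates that sort more than `window - 1` positions apart are never compared, and the union step merges records transitively, so a single false match can chain two unrelated groups together.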
Source
《计算机应用与软件》
CSCD
Peking University Core Journals (北大核心)
2008, No. 2, pp. 245-247 (3 pages)
Computer Applications and Software
Funding
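Key Scientific Research and Social Development Project of the Zhejiang Provincial Department of Science and Technology (2006C23060)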
Keywords
Data cleaning
Approximately duplicate records
MPN