Detection of Approximately Duplicated Records Based on Data Grouping Matching (cited by: 6)
Abstract: Approximately duplicated records are one of the key factors affecting data quality in multi-source data integration. To improve both identification accuracy and detection efficiency, this paper proposes a data grouping algorithm based on optimal selection of record feature attributes. The method first computes the variance of each feature attribute to determine that attribute's weight. Following the idea of data grouping, it then selects the attributes with larger weights to split the data set into disjoint smaller data sets, and identifies approximately duplicated records within each smaller data set using a fuzzy matching algorithm. Theoretical analysis and experimental results show that the method achieves relatively high identification efficiency and detection accuracy.
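The pipeline described in the abstract (variance-based attribute weights, grouping on a high-weight attribute, fuzzy matching inside each group) might be sketched roughly as follows. This is only an illustration under several assumptions that the abstract does not confirm: weights are taken as each attribute's population variance normalized to sum to 1, only the single highest-weight attribute is used for grouping, groups are formed by exact value of that attribute, and the fuzzy matcher is Python's `difflib.SequenceMatcher` with a hypothetical threshold of 0.85. The function names, the record layout, and the sample data are all made up for the example.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations
from statistics import pvariance


def attribute_weights(records, numeric_attrs):
    """Assumed weighting: population variance per attribute, normalized to sum to 1."""
    variances = {a: pvariance([r[a] for r in records]) for a in numeric_attrs}
    total = sum(variances.values())
    if total == 0:
        # All attributes are constant: fall back to equal weights.
        return {a: 1.0 / len(numeric_attrs) for a in numeric_attrs}
    return {a: v / total for a, v in variances.items()}


def group_by_attribute(records, attr):
    """Partition record indices into disjoint groups by the exact value of attr."""
    groups = defaultdict(list)
    for idx, rec in enumerate(records):
        groups[rec[attr]].append(idx)
    return groups


def find_duplicates(records, numeric_attrs, text_attr, threshold=0.85):
    """Compare records only within a group; return the grouping attribute and matched index pairs."""
    weights = attribute_weights(records, numeric_attrs)
    best_attr = max(weights, key=weights.get)  # attribute with the largest weight
    pairs = []
    for members in group_by_attribute(records, best_attr).values():
        for i, j in combinations(members, 2):
            # Stand-in fuzzy matcher (assumption): similarity ratio of the text fields.
            score = SequenceMatcher(None, records[i][text_attr], records[j][text_attr]).ratio()
            if score >= threshold:
                pairs.append((i, j))
    return best_attr, pairs


# Hypothetical sample data, not taken from the paper.
records = [
    {"name": "Zhang Wei", "year": 1999, "dept": 1},
    {"name": "Zhang Wie", "year": 1999, "dept": 1},
    {"name": "Li Na", "year": 2005, "dept": 2},
    {"name": "Li Na.", "year": 2005, "dept": 1},
    {"name": "Wang Fang", "year": 2010, "dept": 2},
]
best_attr, duplicate_pairs = find_duplicates(records, ["year", "dept"], "name")
```

Because comparisons happen only inside each group, the number of pairwise comparisons falls from all pairs in the data set to the sum of pairs within groups, which is the efficiency argument behind grouping. The trade-off in this sketch is that two duplicates whose grouping attribute differs land in different groups and are never compared.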
Source: Computer Engineering (《计算机工程》), 2010, No. 12, pp. 104-106 (3 pages). Indexed in: CAS, CSCD, Peking University Core Journals.
Funding: Scientific Research Fund of Hunan Provincial Higher Education Institutions (Grant No. 09C339)
Keywords: multi-source data sets; properties optimization; data grouping matching; approximately duplicated records