Abstract
Data cleaning is an important research area in building data warehouses, and detecting approximately duplicate records is one of its most important tasks. This paper proposes a new clustering-based method for detecting approximately duplicate records. The method uses N-grams to map the records of a relational table into a high-dimensional space, then clusters them with IDS, an improved DBSCAN algorithm with adjustable density, to detect approximate duplicates. Experimental results demonstrate the effectiveness of the method.
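The two steps named in the abstract, N-gram mapping followed by density-based clustering, can be illustrated with a minimal sketch. It rests on several assumptions that the abstract does not confirm: each record is treated as a single string, character bigrams are counted, cosine distance is used, and plain DBSCAN with fixed `eps` and `min_pts` stands in for IDS, whose adjustable-density mechanism is not described here. The function names and sample records are hypothetical.

```python
from collections import Counter
from math import sqrt


def ngram_vector(record, n=2):
    """Map a record string to a sparse vector of character n-gram counts."""
    text = record.lower().strip()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return 1.0 - dot / norm if norm else 1.0


def dbscan(vectors, eps, min_pts):
    """Plain DBSCAN (not the paper's IDS). Returns one label per vector; -1 is noise."""
    labels = [None] * len(vectors)

    def neighbors(i):
        # Brute-force neighborhood query; includes the point itself.
        return [j for j in range(len(vectors))
                if cosine_distance(vectors[i], vectors[j]) <= eps]

    cluster = -1
    for i in range(len(vectors)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a core point
                continue
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # j is a core point, so keep expanding
    return labels


# Hypothetical records; fields are assumed to be concatenated into one string.
records = [
    "John Smith Boston",
    "Jon Smith Boston",
    "Mary Jones Denver",
    "Marry Jones Denver",
    "Wei Zhang Seattle",
]
labels = dbscan([ngram_vector(r) for r in records], eps=0.3, min_pts=2)
print(labels)  # → [0, 0, 1, 1, -1]
```

Records that share a cluster label are candidate approximate duplicates, and a label of -1 marks a record with no close match. The fixed `eps` here is exactly the parameter that an adjustable-density variant such as IDS would presumably tune.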
Source
《计算机工程与应用》
CSCD
PKU Core Journal (北大核心)
2008, Issue 29, pp. 171-173 (3 pages)
Computer Engineering and Applications
Keywords
approximately duplicate records
N-gram
Intrusion Detection System (IDS)