Abstract
Existing data deduplication methods suffer from low time efficiency and low detection accuracy. To address this, a Stepwise Clustering Data Elimination (SCDE) method is proposed, based on the characteristics of Web information integration. The full record set is first divided into small subsets by key-attribute partitioning followed by Canopy clustering, and approximately duplicate records are then detected precisely within each subset. A fuzzy entity matching strategy based on dynamic weights is also proposed: weights are assigned dynamically to reduce the effect of missing attributes on the record similarity calculation, and names (such as company names) receive special handling to improve matching accuracy. Experimental results show that the method outperforms traditional algorithms in both time efficiency and detection accuracy, with precision improved by 12.6%. The method has been applied in a forestry yellow pages system, where it performs well.
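The two stages summarized in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the attribute names, base weights, thresholds, and both similarity functions (`cheap_name_similarity`, `field_similarity`) are assumptions chosen for illustration, since the abstract does not specify them. The sketch groups records into canopies with a cheap similarity measure, then scores record pairs with weights renormalized over the attributes that both records actually contain.

```python
from difflib import SequenceMatcher

# Hypothetical base weights; the abstract does not give the real attributes or values.
BASE_WEIGHTS = {"name": 0.5, "address": 0.3, "phone": 0.2}


def cheap_name_similarity(r1, r2):
    """Cheap Jaccard similarity on name tokens (an assumed stand-in measure)."""
    t1 = set(r1.get("name", "").lower().split())
    t2 = set(r2.get("name", "").lower().split())
    if not t1 or not t2:
        return 0.0
    return len(t1 & t2) / len(t1 | t2)


def canopy_clusters(records, cheap_sim, loose, tight):
    """Group record indices into (possibly overlapping) canopies.

    Records at least `loose`-similar to the center join its canopy; records
    at least `tight`-similar are removed from further consideration.
    Assumes tight >= loose.
    """
    remaining = list(range(len(records)))
    canopies = []
    while remaining:
        center = remaining[0]
        canopy = [i for i in remaining
                  if i == center or cheap_sim(records[center], records[i]) >= loose]
        canopies.append(canopy)
        # Always drop the center so the loop terminates.
        remaining = [i for i in remaining
                     if i != center and cheap_sim(records[center], records[i]) < tight]
    return canopies


def field_similarity(a, b):
    """String similarity in [0, 1]; the paper's actual measure may differ."""
    return SequenceMatcher(None, a, b).ratio()


def record_similarity(r1, r2, base_weights=BASE_WEIGHTS):
    """Weighted similarity with weights renormalized over shared attributes.

    Attributes missing from either record are skipped, and the remaining
    weights are rescaled to sum to 1, so a missing field does not lower the score.
    """
    present = [k for k in base_weights if r1.get(k) and r2.get(k)]
    total = sum(base_weights[k] for k in present)
    if total == 0:
        return 0.0
    return sum(base_weights[k] / total * field_similarity(r1[k], r2[k])
               for k in present)


records = [
    {"name": "Green Forest Co", "address": "1 Pine Rd", "phone": ""},
    {"name": "Green Forest Co", "address": "1 Pine Rd", "phone": "555-0100"},
    {"name": "Blue Ocean Ltd", "address": "9 Bay St", "phone": "555-0199"},
]
canopies = canopy_clusters(records, cheap_name_similarity, loose=0.3, tight=0.6)
score = record_similarity(records[0], records[1])
```

In this sketch the expensive pairwise comparison would only be run inside each canopy, which is where the time saving would come from. With fixed weights, the first two records would be penalized for the empty phone field; renormalizing over the shared attributes avoids that.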
Source
Journal of Computer Applications (《计算机应用》)
CSCD
Peking University Core Journal (北大核心)
2013, No. 9, pp. 2493-2496 (4 pages)
Funding
Supported by the Fundamental Research Funds for the Central Universities (BLYX200928)
Keywords
Web information integration
approximately duplicate records
dynamic weight
fuzzy entity matching