摘要
Entity resolution is a key aspect in data quality and data integration, identifying which records correspond to the same real world entity in data sources. Many existing ap- proaches require manually designed match rules to solve the problem, which always needs domain knowledge and is time consuming. We propose a novel genetic algorithm based en- tity resolution approach via active learning. It is able to learn effective match rules by logically combining several different attributes' comparisons with proper thresholds. We use ac- tive learning to reduce manually labeled data and speed up the learning process. The extensive evaluation shows that the proposed approach outperforms the sate-of-the-art entity res- olution approaches in accuracy.
Entity resolution is a key aspect in data quality and data integration, identifying which records correspond to the same real world entity in data sources. Many existing ap- proaches require manually designed match rules to solve the problem, which always needs domain knowledge and is time consuming. We propose a novel genetic algorithm based en- tity resolution approach via active learning. It is able to learn effective match rules by logically combining several different attributes' comparisons with proper thresholds. We use ac- tive learning to reduce manually labeled data and speed up the learning process. The extensive evaluation shows that the proposed approach outperforms the sate-of-the-art entity res- olution approaches in accuracy.
基金
The authors thank anonymous reviewers for their in- spiting doubts and helpful suggestions during the reviewing process. This work was supported by the National Basic Research Program of China (973 Program) (2012CB316201), the Fundamental Research Funds for the Cen- tral Universities (N 120816001) and the National Natural Science Foundation of China (Grant Nos. 61472070, 61402213).