Abstract: Entity resolution (ER) is the problem of identifying and grouping different manifestations of the same real-world object. Many algorithmic approaches have been developed, and for most tasks supervised learning offers superior performance. However, the prohibitive cost of labeling training data remains a major obstacle to detecting duplicate query records from online sources. Furthermore, the unique combinations of noisy data and missing elements make ER tasks more challenging. To address this, transfer learning has been adopted to adaptively share the learned common structures of similarity scoring problems between multiple sources. Although such techniques reduce the labeling cost so that it is linear with respect to the number of sources, their random sampling strategy does not adequately handle the common problem of sample imbalance. In this paper, we present a novel multi-source active transfer learning framework that jointly selects fewer data instances from all sources to train classifiers with constant precision/recall. The intuition behind our approach is to actively label the most informative samples while adaptively transferring collective knowledge between sources. In this way, the learned classifiers can be both label-economical and flexible, even for imbalanced or quality-diverse sources. We compare our method with state-of-the-art approaches on real-world datasets. Our experimental results demonstrate that our active transfer learning algorithm can achieve impressive performance with far fewer labeled samples for record matching with numerous and varied sources.
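The abstract does not state how "the most informative samples" are chosen. As a minimal sketch only, the idea of jointly selecting instances across sources can be illustrated with plain uncertainty sampling, which is an assumption here and not necessarily the criterion used in the paper. The function name `select_most_uncertain`, the source names, and the probabilities are all hypothetical.

```python
from typing import Dict, List, Tuple


def select_most_uncertain(
    match_probs: Dict[str, List[float]], budget: int
) -> List[Tuple[str, int]]:
    """Pick `budget` candidate record pairs, pooled over all sources,
    whose predicted match probability is closest to 0.5.

    Assumption: uncertainty sampling stands in for the paper's
    (unspecified) informativeness measure.
    """
    pooled = [
        (abs(p - 0.5), source, idx)
        for source, probs in match_probs.items()
        for idx, p in enumerate(probs)
    ]
    # Smallest distance from 0.5 first; ties are broken by source name, then index.
    pooled.sort()
    return [(source, idx) for _, source, idx in pooled[:budget]]


# Hypothetical match probabilities for candidate record pairs from two sources.
probs = {
    "source_a": [0.02, 0.48, 0.97],
    "source_b": [0.55, 0.10, 0.91],
}
queries = select_most_uncertain(probs, budget=2)
print(queries)
```

Pooling all sources into one ranking, rather than sampling each source separately, is what lets a single labeling budget go to whichever source currently has the most ambiguous pairs.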