
EntityManager: Managing Dirty Data Based on Entity Resolution (cited by: 2)

Abstract: Data quality is important in many data-driven applications, such as decision making, data analysis, and data mining. Recent studies focus on data cleaning techniques that delete or repair the dirty data, which may cause information loss and introduce new inconsistencies. To avoid these problems, we propose EntityManager, a general system to manage dirty data without data cleaning. The system takes the real-world entity as the basic storage unit and retrieves query results according to the quality requirements of users. It is able to handle all kinds of inconsistencies recognized by entity resolution. We elaborate on the EntityManager system, covering its architecture, data model, and query processing techniques. To process queries efficiently, the system adopts novel indices, a similarity operator, and query optimization techniques. Finally, we verify the efficiency and effectiveness of the system and present future research challenges.
Source: Journal of Computer Science & Technology (计算机科学技术学报, English edition), indexed in SCIE, EI, and CSCD; 2017, Issue 3, pp. 644-662 (19 pages)
Funding: This work was partially supported by the National Key Technology Research and Development Program of the Ministry of Science and Technology of China under Grant No. 2015BAH10F01, the National Natural Science Foundation of China under Grant Nos. U1509216, 61472099, and 61133002, the Scientific Research Foundation for the Returned Overseas Chinese Scholars of Heilongjiang Province of China under Grant No. LC2016026, and the Ministry of Education (MOE)-Microsoft Key Laboratory of Natural Language Processing and Speech, Harbin Institute of Technology.
Keywords: dirty data, entity resolution, uncertain attribute, query processing, query optimization
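
The abstract describes storing data per real-world entity and returning results according to a user's quality requirement. The following is only a minimal sketch of that idea, not the paper's data model or algorithms: the `Entity` class, the per-value probability weights, the bigram Jaccard measure (standing in for the similarity operator), and the `select` function with its `min_quality` threshold are all hypothetical names and assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class Entity:
    """One real-world entity; conflicting observed values are kept, not cleaned."""
    entity_id: str
    # attribute name -> {observed value: assumed probability that it is correct}
    attributes: Dict[str, Dict[str, float]] = field(default_factory=dict)


def bigrams(text: str) -> Set[str]:
    """Character bigrams of a lower-cased string; very short strings map to themselves."""
    t = text.lower()
    return {t[i:i + 2] for i in range(len(t) - 1)} or {t}


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the two strings' bigram sets (an assumed stand-in measure)."""
    x, y = bigrams(a), bigrams(b)
    return len(x & y) / len(x | y)


def match_score(entity: Entity, attr: str, query: str) -> float:
    """Probability-weighted similarity between the query and an uncertain attribute."""
    values = entity.attributes.get(attr, {})
    return sum(p * jaccard(v, query) for v, p in values.items())


def select(entities: List[Entity], attr: str, query: str,
           min_quality: float) -> List[Tuple[str, float]]:
    """Return (entity_id, score) pairs whose score meets min_quality, best first."""
    scored = [(e.entity_id, match_score(e, attr, query)) for e in entities]
    return sorted((s for s in scored if s[1] >= min_quality), key=lambda s: -s[1])


# Hypothetical data: two entities, one with two conflicting name values.
entities = [
    Entity("e1", {"name": {"Wei Wang": 0.7, "W. Wang": 0.3}}),
    Entity("e2", {"name": {"Wei Zhang": 1.0}}),
]
strict = select(entities, "name", "Wei Wang", min_quality=0.8)
loose = select(entities, "name", "Wei Wang", min_quality=0.4)
```

Under these assumptions, raising `min_quality` narrows the answer to entities whose stored values agree closely with the query, while lowering it admits weaker matches; no value is deleted or repaired in either case.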
