EntityManager: Managing Dirty Data Based on Entity Resolution (cited by: 2)

Abstract: Data quality is important in many data-driven applications, such as decision making, data analysis, and data mining. Recent studies focus on data cleaning techniques that delete or repair dirty data, which may cause information loss and introduce new inconsistencies. To avoid these problems, we propose EntityManager, a general system for managing dirty data without data cleaning. The system takes the real-world entity as the basic storage unit and retrieves query results according to the quality requirements of users. It is able to handle all kinds of inconsistencies recognized by entity resolution. We describe the EntityManager system in detail, covering its architecture, data model, and query processing techniques. To process queries efficiently, the system adopts novel indices, a similarity operator, and query optimization techniques. Finally, we verify the efficiency and effectiveness of the system and present future research challenges.

Authors: Xue-Li Liu, Hong-Zhi Wang, Jian-Zhong Li, Hong Gao

Affiliation: Massive Data Computing Laboratory

Source: Journal of Computer Science & Technology (计算机科学技术学报, English edition), indexed in SCIE, EI, and CSCD; 2017, Issue 3, pp. 644-662 (19 pages).

Funding: This work was partially supported by the National Key Technology Research and Development Program of the Ministry of Science and Technology of China under Grant No. 2015BAH10F01, the National Natural Science Foundation of China under Grant Nos. U1509216, 61472099, and 61133002, the Scientific Research Foundation for the Returned Overseas Chinese Scholars of Heilongjiang Province of China under Grant No. LC2016026, and the Ministry of Education (MOE)-Microsoft Key Laboratory of Natural Language Processing and Speech, Harbin Institute of Technology.

Keywords: dirty data, entity resolution, uncertain attribute, query processing, query optimization

Classification: TP [Automation and Computer Technology]

