CrowdOLA: Online Aggregation on Duplicate Data Powered by Crowdsourcing (Cited by: 3)

Abstract: Recently, there has been an increasing need for interactive, human-driven analysis of large volumes of data. Online aggregation (OLA), which provides a quick sketch of massive data before the long wait for the final accurate query result, has drawn significant research attention. However, running OLA directly on duplicate data leads to incorrect query answers, since sampling from duplicate records overrepresents the duplicated data in the sample. This violates the uniform-distribution prerequisite of most statistical theories. In this paper, we propose CrowdOLA, a novel framework for integrating online aggregation processing with deduplication. Instead of cleaning the whole dataset, CrowdOLA continuously retrieves block-level samples from the dataset and employs a crowd-based entity resolution approach to detect duplicates in the sample in a pay-as-you-go fashion. After the sample is cleaned, an unbiased estimator is provided to address the bias introduced by the duplication. We evaluate CrowdOLA on both real-world and synthetic workloads. Experimental results show that CrowdOLA provides a good balance between efficiency and accuracy.
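The bias described in the abstract can be illustrated with a minimal sketch. It is not the estimator from the paper: the toy data, the function names, and the assumption that each sampled record's duplicate count is already known (standing in for the crowd-based entity resolution step) are all hypothetical. It only shows why a plain average over sampled records drifts toward heavily duplicated entities, and how weighting each record by the inverse of its duplicate count counteracts that.

```python
import random

# Hypothetical toy data: (value, number_of_copies) per real-world entity.
ENTITIES = [(10.0, 1), (20.0, 1), (100.0, 8)]


def build_records(entities):
    """Expand each entity into `copies` duplicate records."""
    records = []
    for entity_id, (value, copies) in enumerate(entities):
        records.extend([(entity_id, value, copies)] * copies)
    return records


def naive_avg(sample):
    """Plain mean over sampled records; duplicates are counted repeatedly."""
    return sum(value for _, value, _ in sample) / len(sample)


def reweighted_avg(sample):
    """Mean with each record weighted by 1 / (copies of its entity).

    Assumes the duplicate count of each sampled entity is known, which
    stands in for entity resolution; this is not the paper's estimator.
    """
    weights = [1.0 / copies for _, _, copies in sample]
    total = sum(w * value for w, (_, value, _) in zip(weights, sample))
    return total / sum(weights)


records = build_records(ENTITIES)
true_avg = sum(value for value, _ in ENTITIES) / len(ENTITIES)

# Uniform sample of records (with replacement), as a basic OLA step might draw.
rng = random.Random(0)
sample = rng.choices(records, k=1000)

print("true AVG over entities:  ", true_avg)
print("naive AVG on sample:     ", naive_avg(sample))
print("reweighted AVG on sample:", reweighted_avg(sample))
```

Over all ten records the naive mean is 83.0, whereas the mean over the three distinct entities is about 43.33; the reweighted mean over the full record set recovers the latter. On a random sample the reweighted value is a ratio estimate, so it approaches the entity-level mean as the sample grows rather than matching it exactly.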

Authors: An-Zhen Zhang, Jian-Zhong Li, Hong Gao, Yu-Biao Chen, Heng-Zhao Ma, Mohamed Jaward Bah

Affiliation: School of Computer Science and Technology

Source: Journal of Computer Science & Technology (计算机科学技术学报, English edition); indexed in SCIE, EI, CSCD; 2018, Issue 2, pp. 366-379 (14 pages)

Funding: This work was supported by the National Natural Science Foundation of China under Grant Nos. 61502121, 61472099, and 61602129.

Keywords: online aggregation, entity resolution, crowdsourcing, cloud computing

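Classification: TP391.4 [Automation and Computer Technology: Computer Application Technology]; P618.130.1 [Astronomy and Earth Sciences: Mineral Deposit Geology]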
