1. CrowdOLA: Online Aggregation on Duplicate Data Powered by Crowdsourcing (cited by: 3)
Authors: An-Zhen Zhang, Jian-Zhong Li, Hong Gao, Yu-Biao Chen, Heng-Zhao Ma, Mohamed Jaward Bah. Journal of Computer Science & Technology (SCIE, EI, CSCD), 2018, No. 2, pp. 366-379 (14 pages)
There has recently been an increasing need for interactive, human-driven analysis of large volumes of data. Online aggregation (OLA), which provides a quick sketch of massive data long before the final, accurate query result is available, has drawn significant research attention. However, running OLA directly on duplicate data leads to incorrect query answers, because sampling from duplicate records overrepresents the duplicated data in the sample. This violates the uniform-distribution prerequisite of most statistical theories. In this paper, we propose CrowdOLA, a novel framework for integrating online aggregation processing with deduplication. Instead of cleaning the whole dataset, CrowdOLA continuously retrieves block-level samples from the dataset and employs a crowd-based entity resolution approach to detect duplicates in the sample in a pay-as-you-go fashion. After the sample is cleaned, an unbiased estimator is provided to address the bias introduced by the duplication. We evaluate CrowdOLA on both real-world and synthetic workloads. Experimental results show that CrowdOLA provides a good balance between efficiency and accuracy.
Keywords: online aggregation, entity resolution, crowdsourcing, cloud computing
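The abstract above says that sampling from duplicate records overrepresents duplicated entities. The following is a minimal sketch of that effect for a SUM query, together with one simple correction (inverse duplicate-count weighting). It is not CrowdOLA's estimator, which the abstract does not spell out. The record layout, the function names, and the assumption that each sampled record's duplicate count is already known (for instance from entity resolution) are all hypothetical.

```python
from typing import Iterable, Tuple

# Hypothetical record layout: (value, number of records that refer to the same entity).
Record = Tuple[float, int]


def naive_sum_estimate(sample: Iterable[Record], total_records: int) -> float:
    """Scale the sample sum up to the dataset size, ignoring duplicates.

    An entity stored d times is d times as likely to be sampled, so this
    estimates the sum over all records, not over distinct entities.
    """
    sample = list(sample)
    return total_records / len(sample) * sum(value for value, _ in sample)


def dedup_sum_estimate(sample: Iterable[Record], total_records: int) -> float:
    """Weight each sampled record by 1 / duplicate count before scaling.

    Under uniform sampling of records, the expectation of this estimate is
    the sum over distinct entities (assumes the duplicate counts are exact).
    """
    sample = list(sample)
    return total_records / len(sample) * sum(value / dup for value, dup in sample)
```

For example, take three entities with values 10, 20, and 30 that are stored 3, 1, and 2 times. Applied to all six records, the naive estimate is 110, while the weighted estimate recovers the distinct-entity sum of 60 (up to floating-point rounding).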
2. Interval Estimation for Aggregate Queries on Incomplete Data (cited by: 1)
Authors: An-Zhen Zhang, Jian-Zhong Li, Hong Gao. Journal of Computer Science & Technology (SCIE, EI, CSCD), 2019, No. 6, pp. 1203-1216 (14 pages)
Incomplete data has been a longstanding issue in the database community, and the subject remains poorly handled in both theory and practice. One common way to cope with missing values is to impute (fill in) them as a preprocessing step before analysis. Unfortunately, no single imputation method can impute all missing values correctly in all cases, and users can hardly trust a query result computed on such completed data without any confidence guarantee. In this paper, we propose to estimate the aggregate query result directly on the incomplete data, rather than imputing the missing values. An interval estimate, composed of the upper and lower bounds of the aggregate query result over all possible interpretations of the missing values, is presented to end users. The ground-truth aggregate result is guaranteed to lie within the interval. We believe that decision support applications could benefit significantly from the estimation, since they can tolerate inexact answers as long as clearly defined semantics and guarantees are associated with the results. Our main techniques are parameter-free and do not assume prior knowledge about the distribution or the missingness mechanisms. Experimental results are consistent with the theoretical results and suggest that the estimation is invaluable for better assessing the results of aggregate queries on incomplete data.
Keywords: incomplete data, aggregate query, interval estimation, data quality
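The idea of bounding an aggregate over all possible interpretations of the missing values can be illustrated with a small sketch for SUM and AVG. This is not the paper's technique, which the abstract describes only at a high level. It assumes that every missing value is known to fall within a closed range [lo, hi]; that range and the function names are hypothetical, and missing values are represented as None.

```python
from typing import Optional, Sequence, Tuple


def sum_interval(
    values: Sequence[Optional[float]], lo: float, hi: float
) -> Tuple[float, float]:
    """Bounds on SUM when each missing value (None) may be anything in [lo, hi]."""
    if lo > hi:
        raise ValueError("lo must not exceed hi")
    known = sum(v for v in values if v is not None)
    missing = sum(1 for v in values if v is None)
    # Lower bound: every missing value takes lo; upper bound: every one takes hi.
    return known + missing * lo, known + missing * hi


def avg_interval(
    values: Sequence[Optional[float]], lo: float, hi: float
) -> Tuple[float, float]:
    """Bounds on AVG over all rows, derived from the SUM bounds."""
    if not values:
        raise ValueError("values must not be empty")
    low, high = sum_interval(values, lo, hi)
    return low / len(values), high / len(values)
```

With the column [4, None, 6, None] and an assumed range of [0, 10], the SUM interval is (10, 30) and the AVG interval is (2.5, 7.5). Whatever the true missing values are, the true aggregate falls inside these bounds as long as the range assumption holds.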