期刊文献+

海量不完整数据的核心数据选择问题的研究 被引量:6

Research on Core-Sets Selection on Massive Incomplete Data
下载PDF
导出
摘要 在大数据时代,越来越多的带有缺失值的数据需要处理,因而数据不完整成为一种常见的数据质量问题.不完整的数据给大数据的查询、挖掘和分析带来了困难.在某些情况下,数据中的很多缺失值是无法被确定的.只能根据用户的需求,在不完整的数据上选择一部分用户感兴趣的核心数据集合,来提高不完整数据的可用性.完整度较高,规模较小,在用户感兴趣的属性上给出更多完整信息的核心数据集合,能够支持高效的查询处理,提高查询结果的准确性和完整性.该文形式化了核心数据选择问题,证明了这至少是一个NP-难问题.由于需要同时优化核心数据集合的完整度、集合的规模以及对于感兴趣属性的覆盖性,现有的基于集合覆盖问题的方法无法解决文中提出的问题.该文提出了一个采用贪心策略,具有理论保证的近似核心数据选择算法ACS.ACS首先判断当前的数据集合是否存在一个满足覆盖性要求的子集合.当这样的子集合存在时,ACS尽量选择完整的元组来组成核心数据集合,当使用完整元组无法满足覆盖性的要求时,ACS选择较少的不完整元组.ACS通过限制选择的次数来获得一个集合大小的上界是运行次数常数倍的子集合,并且保证了对于感兴趣的属性的覆盖比例.通过理论分析可知,ACS能够在近似线性的时间内,找到一个大小至多在给定的大小对数因子内的近似核心数据集合,其中被覆盖的感兴趣的属性的比例至少为(1-1/e),包含的不完整元组的个数至多为给定的核心数据集合的大小,其中e是自然对数的底数.通过在DBLP和NBA球员信息这两个真实数据集合上的实验,表明了所提出的算法ACS的有效性和高效性;通过在规模更大的合成数据上的实验,表明了ACS的良好的扩展性. In the big data era,more and more data with missing values are to be handled,which makes data incompleteness be a common problem of data quality.Incomplete data brings challenges to querying,mining and analyzing on massive data.In some scenarios,many missing values cannot be determined.According to a user’s requirement,only a subset of entire data set with missing values,called core-sets,has to be selected to improve the data usability of the entire incomplete data set.With higher data completeness,smaller size,and more complete information on attributes of a user’s interest,core-sets can support efficient query processing and provide more accurate and complete answers to queries.Core-sets selection problem is formalized in this paper,and such problem is proven to be at least NP-Hard.Because there are three objects to be optimized at the same time:the core-set’s completeness,size and coverage on attributes of a user’s interest,existing methods based on set cover problems cannot solve the core-sets selection problem proposed in this paper.An approximate core-set selection algorithm with theoretical guarantee based on greedy,called ACS,is proposed in this paper.The proposed algorithm ACS firstly determines whether or not there is a subset of the entire data set with required coverage on the attributes of a user’s interest.If such subsets satisfying requirement on coverage exist,ACS selects tuples containing complete values as many as possible to obtain an approximate core-set.But if required coverage cannot be satisfied by only tuples containing complete values,ACS will select a few tuples containing missing values to cover the remaining attributes,but the number of such tuples containing missing values is small.The number of selections in ACS is limited so that ACS can obtain a subset of tuples of which the size is bounded by a constant factor of the number of the selections,and the ratio of attributes of a user’s interest covered is guaranteed as well.By theoretical analysis,within nearly linear time,ACS can select an approximate core-set of size within the logarithm factor to the optimal given size,where the ratio of attributes of interest covered is at least(1-1/e),with incomplete tuples of at most the size of the required core-set,where e is the base of the natural logarithm.Experimental results conducted on two real-world datasets of DBLP and NBA players show effectiveness and efficiency of the proposed algorithm ACS,and experimental results on synthetic datasets with larger scale show scalability of ACS.
作者 刘永楠 李建中 高宏 LIU Yong-Nan ;LI Jian-Zhong ;GAO Hong(School of Computer Science and Technology,Harbin Institute of Technology,Harbin 150001)
出处 《计算机学报》 EI CSCD 北大核心 2018年第4期915-930,共16页 Chinese Journal of Computers
基金 国家自然科学基金(61632010 61502116 61370217 U1509216) 科技部重点研发计划(2016YFB1000703)资助
关键词 数据质量 数据完整性 不完整数据 核心数据选择 近似算法 data quality data completeness incomplete data core-sets selection approximate algorithms
  • 相关文献

同被引文献47

引证文献6

二级引证文献16

相关作者

内容加载中请稍等...

相关机构

内容加载中请稍等...

相关主题

内容加载中请稍等...

浏览历史

内容加载中请稍等...
;
使用帮助 返回顶部