Abstract
The t-closeness model is an effective way to protect published data against similarity attacks and skewness attacks. However, the earth mover's distance (EMD) that it uses does not account for the stability of the sensitive-attribute distribution between an equivalence class and the whole table, so it cannot fully measure the distance between distributions. When the difference in stability between the distributions is too large, the risk of privacy disclosure rises sharply. To address this limitation, the SABuk t-closeness model is proposed. Building on traditional t-closeness, it combines EMD with Kullback-Leibler (KL) divergence to construct a distance measure that captures the distance between distributions more accurately. In addition, it partitions the table into buckets by semantic similarity according to the hierarchy tree of the sensitive attribute (SA), uses a greedy strategy to generate minimal equivalence classes that satisfy the distance requirement, and applies the k-nearest-neighbour idea to select tuples with similar quasi-identifier (QI) values when forming each equivalence class. Experimental results show that, at the cost of a small amount of extra running time, the SABuk t-closeness model reduces information loss, protecting sensitive information effectively while maintaining high data utility.
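The motivation for pairing EMD with KL divergence can be illustrated with a small sketch. The abstract does not state how the paper combines the two measures, so the code below only computes them side by side. It assumes a categorical sensitive attribute with equal ground distance between values, in which case EMD reduces to half the L1 distance. The function names and the three example distributions are hypothetical and are not taken from the paper.

```python
import math


def emd_equal_distance(p, q):
    """EMD between two categorical distributions, assuming every pair of
    distinct values is at ground distance 1 (half the L1 distance)."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue
        if qi == 0:
            return math.inf  # p puts mass where q has none
        total += pi * math.log(pi / qi)
    return total


# Hypothetical sensitive-attribute distributions (illustrative values only).
table = [0.5, 0.3, 0.2]    # whole table
class_a = [0.6, 0.2, 0.2]  # one equivalence class
class_b = [0.5, 0.4, 0.1]  # another equivalence class

for name, dist in (("class_a", class_a), ("class_b", class_b)):
    emd = emd_equal_distance(dist, table)
    kl = kl_divergence(dist, table)
    print(name, round(emd, 4), round(kl, 4))
```

In this example both classes are at EMD 0.1 from the table, yet class_b has the larger KL divergence, so a threshold on EMD alone would treat them identically. A measure that also looks at KL divergence can tell them apart, which is the kind of gap the combined measure is meant to close.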
Source
Journal of Computer Research and Development (《计算机研究与发展》)
EI
CSCD
PKU Core (北大核心)
2014, No. 1, pp. 126-137 (12 pages)
Funding
National Natural Science Foundation of China (61073041, 61073043)
Specialized Research Fund for the Doctoral Program of Higher Education of China (20112304110011, 20122304110012)
Harbin Special Fund for Science and Technology Innovation Talents (Outstanding Academic Leader) (2011RFXXG015)