Abstract
Data sharing and publication unlock the value of data and, in the era of digital intelligence, drive scientific progress and socio-economic development. Protecting data copyright and personal privacy while sharing data, however, remains a major challenge. Differentially private data synthesis is an effective means of protecting data privacy: data holders release synthetic data in place of real data, which protects privacy while also improving the generality and usability of the data. To address the low utility of image samples synthesized by existing differentially private generative models, this paper proposes a two-stage differentially private generative model based on a latent-space diffusion model. First, differential-privacy-aware information compression is applied to the original images, projecting them from pixel space into a latent space to obtain desensitized latent vector representations of the original sensitive data. The latent vectors are then fed into a diffusion model, which gradually transforms them into a prior distribution, and samples are drawn through a denoising process. Finally, the model is trained and used for data synthesis on the MNIST and Fashion MNIST datasets; the results show marked improvements in Fréchet inception distance (FID) and downstream task accuracy over state-of-the-art models such as DP-Sinkhorn.
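The second stage described in the abstract, in which latent vectors are gradually transformed into a prior distribution, can be illustrated with the standard closed-form forward process of a denoising diffusion model. The following is only a minimal sketch under stated assumptions, not the authors' implementation: the linear noise schedule, the 1000 steps, the 32-dimensional latent size, and the function names `linear_beta_schedule` and `forward_diffuse` are all hypothetical choices made for illustration, and random vectors stand in for the desensitized latents that the stage-one encoder would produce.

```python
import numpy as np


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Assumed linear noise schedule; the paper may use a different one."""
    return np.linspace(beta_start, beta_end, num_steps)


def forward_diffuse(z0, t, alpha_bar, rng):
    """Sample z_t from q(z_t | z_0) = N(sqrt(abar_t) * z_0, (1 - abar_t) * I)."""
    noise = rng.standard_normal(z0.shape)
    z_t = np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * noise
    return z_t, noise


rng = np.random.default_rng(0)
betas = linear_beta_schedule(1000)
alpha_bar = np.cumprod(1.0 - betas)  # cumulative signal-retention factor

# Hypothetical stand-in for stage-one output: 512 latent vectors of size 32.
z0 = rng.normal(loc=3.0, scale=0.5, size=(512, 32))

z_early, _ = forward_diffuse(z0, 10, alpha_bar, rng)    # still close to z0
z_final, _ = forward_diffuse(z0, 999, alpha_bar, rng)   # close to N(0, I)

print("early mean/std:", z_early.mean(), z_early.std())
print("final mean/std:", z_final.mean(), z_final.std())
```

At the last step the latents are statistically close to a standard Gaussian prior, which is what allows sampling to start from pure noise and run the learned denoising process in reverse. The denoising network itself and any privacy accounting are deliberately left out of this sketch.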
Authors
葛胤池
张辉
孙浩航
GE Yinchi; ZHANG Hui; SUN Haohang (State Key Laboratory of Complex & Critical Software Environment, Beihang University, Beijing 100191, China)
Source
《计算机科学》
CSCD
PKU Core Journal (北大核心)
2024, Issue 3, pp. 30-38 (9 pages)
Computer Science
Keywords
Differential privacy
Data synthesis
Generative models
Autoencoder
Diffusion models