Privacy Amplification by Missing Data
Privacy preservation is a fundamental requirement in many high-stakes domains such as medicine and finance, where sensitive personal data must be analyzed without compromising individual confidentiality. At the same time, these applications often involve datasets with missing values due to non-response, data corruption, or deliberate anonymization. Missing data is traditionally viewed as a limitation because it reduces the information available to analysts and can degrade model performance. In this work, we take an alternative perspective and study missing data from a privacy preservation standpoint. Intuitively, when features are missing, less information is revealed about individuals, suggesting that missingness could inherently enhance privacy. We formalize this intuition by analyzing missing data as a privacy amplification mechanism within the framework of differential privacy. We show, for the first time, that incomplete data can yield privacy amplification for differentially private algorithms.
💡 Research Summary
The paper “Privacy Amplification by Missing Data” investigates how the presence of missing values in a dataset can serve as a privacy‑enhancing mechanism when combined with differentially private (DP) algorithms. The authors begin by formalizing missing data as a stochastic masking process. For each record, a binary mask m ∈ {0,1}^d determines which features are observed and which are replaced by a special “NA” symbol. The mask is generated by a random feature‑wise mechanism F, and the masks for all n records are drawn independently, yielding a global missing‑data mechanism D = F⊗n. Two standard missingness regimes are considered: Missing Completely At Random (MCAR), where the mask is independent of the data, and Missing At Random (MAR), where the mask may depend on the observed part of the record but not on the unobserved part.
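The masking process above can be sketched in a few lines. This is a minimal illustration of the MCAR regime only (the mask is drawn independently of the data); the names `p_obs`, `sample_mask`, and `apply_missingness` are illustrative, not from the paper, and `None` stands in for the "NA" symbol.

```python
import random

NA = None  # stand-in for the special "NA" symbol

def sample_mask(d, p_obs, rng):
    """Feature-wise mechanism F: each of the d features is observed
    independently with probability p_obs (mask bit 1 = observed)."""
    return [1 if rng.random() < p_obs else 0 for _ in range(d)]

def apply_missingness(dataset, p_obs, seed=0):
    """Global mechanism D = F^(tensor n): draw one independent mask
    per record and replace each unobserved feature with NA."""
    rng = random.Random(seed)
    masked = []
    for record in dataset:
        m = sample_mask(len(record), p_obs, rng)
        masked.append([x if bit else NA for x, bit in zip(record, m)])
    return masked
```

A MAR mechanism would differ only in that `sample_mask` could condition on the features it has already decided to observe, but never on the unobserved values.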
The central construction is the composition of a DP algorithm A (which satisfies (ε, δ)-DP on complete data) with the missing‑data mechanism D, denoted A_D. The process is: (1) sample a mask m ∼ D(z) for the original dataset z; (2) apply the mask to obtain an incomplete dataset z̃(m); (3) run the DP algorithm A on z̃(m). The authors adapt the definition of DP to the space of incomplete datasets Z_miss, preserving the standard neighboring‑dataset relation (datasets differing in a single record).
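The three-step composition can be sketched as follows, again for the MCAR case. Here `noisy_complete_count` is a hypothetical toy (ε, 0)-DP algorithm standing in for A (it must accept incomplete data), and the inline masking stands in for D; none of these names come from the paper.

```python
import math
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) by inverse-CDF transform."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def noisy_complete_count(masked_dataset, eps=1.0, rng=random):
    """Toy DP algorithm A on incomplete data: count fully observed
    records and add Laplace(1/eps) noise (sensitivity 1 under the
    single-record neighboring relation)."""
    count = sum(1 for rec in masked_dataset if None not in rec)
    return count + laplace_noise(1.0 / eps, rng)

def compose_with_missingness(dp_algorithm, p_obs, seed=None):
    """Build A_D: (1) sample a mask m ~ D, (2) form the incomplete
    dataset, (3) run the DP algorithm A on the incomplete data."""
    def a_d(dataset):
        rng = random.Random(seed)
        masked = [[x if rng.random() < p_obs else None for x in rec]
                  for rec in dataset]
        return dp_algorithm(masked)
    return a_d
```

The design point the paper exploits is that the randomness of D is internal to A_D, so an observer of A_D's output faces both the mask randomness and A's own noise; the amplification theorem quantifies how much the former tightens the (ε, δ) guarantee.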
The main theoretical contribution is a privacy‑amplification theorem for MAR (and consequently MCAR) missingness. They prove that there exists a constant p* ∈