The Power of Linear Reconstruction Attacks
We consider the power of linear reconstruction attacks in statistical data privacy, showing that they can be applied to a much wider range of settings than previously understood. Linear attacks have been studied before (Dinur and Nissim PODS'03, Dwork, McSherry and Talwar STOC'07, Kasiviswanathan, Rudelson, Smith and Ullman STOC'10, De TCC'12, Muthukrishnan and Nikolov STOC'12) but have so far been applied only in settings with releases that are obviously linear. Consider a database curator who manages a database of sensitive information but wants to release statistics about how a sensitive attribute (say, disease) in the database relates to some nonsensitive attributes (e.g., postal code, age, gender). We show that one can mount linear reconstruction attacks based on any release that gives: a) the fraction of records that satisfy a given non-degenerate Boolean function. Such releases include contingency tables (previously studied by Kasiviswanathan et al., STOC'10) as well as more complex outputs like the error rate of classifiers such as decision trees; b) any one of a large class of M-estimators (that is, the outputs of empirical risk minimization algorithms), including the standard estimators for linear and logistic regression. We make two contributions. First, we show how these types of releases can be transformed into a linear format, making them amenable to existing polynomial-time reconstruction algorithms. This is perhaps surprising in itself, since many of the above releases (like M-estimators) are obtained by solving highly nonlinear formulations. Second, we show how to analyze the resulting attacks under various distributional assumptions on the data. Specifically, we consider a setting in which the same statistic (either a) or b) above) is released about how the sensitive attribute relates to all subsets of size k (out of a total of d) nonsensitive Boolean attributes.
💡 Research Summary
The paper expands the known scope of linear reconstruction attacks in statistical data privacy, demonstrating that they can be mounted against a far broader class of released statistics than previously recognized. Traditional work on linear attacks (e.g., Dinur‑Nissim PODS’03, Dwork‑McSherry‑Talwar STOC’07, Kasiviswanathan et al. STOC’10, De TCC’12, Muthukrishnan‑Nikolov STOC’12) focused on releases that are explicitly linear—sums, averages, or simple contingency tables. This study shows that even releases derived from highly non‑linear procedures, such as classifier error rates or the output of empirical risk minimization (M‑estimators), can be transformed into a linear form and thus become vulnerable to existing polynomial‑time reconstruction algorithms.
Two families of releases are examined. (a) The fraction of records satisfying any non-degenerate Boolean predicate. This covers ordinary contingency tables but also more complex aggregates such as the misclassification rate of a decision tree, the accuracy of a rule-based classifier, or other Boolean functions of the attributes. (b) A large class of M-estimators, i.e., the parameters returned by empirical risk minimization for models such as linear regression, logistic regression, and support-vector machines. Although these estimators are obtained by solving optimization problems that are highly nonlinear in general (convex or even non-convex), the authors prove that each release can be expressed as a linear system in the unknown vector of sensitive attributes once the released value is known.
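To see why a predicate-fraction release of type (a) is linear in the sensitive bits, here is a minimal NumPy sketch on made-up toy data (the records, the particular predicate, and all variable names are illustrative, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 8, 4
# Nonsensitive Boolean attributes (public) and sensitive bits (secret).
Z = rng.integers(0, 2, size=(n, d))
x = rng.integers(0, 2, size=n)

# An example predicate over the nonsensitive attributes:
# "attribute 0 AND attribute 2".
a = Z[:, 0] & Z[:, 2]            # indicator vector, one entry per record

# Released statistic: fraction of records where the predicate holds
# AND the sensitive attribute is 1.
released = np.mean(a & x)

# The same quantity is an inner product with the unknown sensitive
# vector x, divided by the dataset size -- i.e., a linear equation in x.
assert released == (a @ x) / n
```

Each released predicate fraction thus contributes one linear constraint on the unknown vector x; the attacker knows a (it depends only on public attributes) and observes the left-hand side.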
The key technical contribution is a systematic “linearisation” procedure. For Boolean‑function fractions, each predicate corresponds to a linear equation over the unknown sensitive bits: the reported fraction equals the inner product of the predicate’s indicator vector with the sensitive vector, divided by the dataset size. For M‑estimators, the attack exploits the fact that, once the released estimator is plugged in, the first‑order optimality conditions (the gradient of the empirical risk vanishes at the optimum) become linear in the unknown label vector. Algebraically rearranging these optimality equations turns each release into a linear constraint on the unknown labels, yielding a matrix‑vector equation A x = b where A depends only on the nonsensitive attributes and the released statistic determines b.
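A hedged sketch of the M-estimator case, using ordinary least squares as the simplest instance: the first-order optimality (normal) equations make the release linear in the unknown labels. The data and dimensions below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 20, 3
Z = rng.standard_normal((n, k))                # public attributes, one k-subset
y = rng.integers(0, 2, size=n).astype(float)   # secret binary labels

# Curator releases the least-squares estimator for this subset.
beta = np.linalg.lstsq(Z, y, rcond=None)[0]

# First-order optimality rearranged: Z^T Z beta = Z^T y.
# The left-hand side is computable from public data plus the release,
# while the right-hand side is linear in the unknown label vector y,
# so this one release contributes k linear equations on y.
b = Z.T @ Z @ beta
assert np.allclose(Z.T @ y, b)
```

The same rearrangement works for other empirical-risk minimizers whose gradient is affine in the labels (e.g., logistic regression, where the gradient has the form Z^T(sigmoid(Z beta) - y)).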
Once the problem is cast in this linear form, the authors invoke known reconstruction techniques (e.g., Dinur‑Nissim’s noisy linear‑system attack, the LP‑decoding attack of Dwork‑McSherry‑Talwar, and the least‑squares/spectral analysis of Kasiviswanathan et al.) to recover the sensitive attribute vector x. They show that when the curator releases the same type of statistic for every subset of k nonsensitive Boolean attributes (the “k‑subset release model”), the resulting system is highly over‑determined: the number of equations is \(\binom{d}{k}\), which grows combinatorially with the total number of attributes d. Under mild distributional assumptions, typically that the nonsensitive attributes are independent and uniformly distributed, the measurement matrix A has full rank with high probability and a well‑conditioned spectrum. Consequently, the reconstruction error can be bounded by O(ε), where ε is the magnitude of any added noise.
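The over-determined recovery step can be sketched as a toy simulation. Here a random 0/1 matrix stands in for the stacked predicate indicators of the k-subset release model, and small Gaussian noise stands in for release perturbation; these are illustrative assumptions, not the paper's exact measurement ensemble or noise model:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 30      # records, i.e., unknown sensitive bits
m = 200     # number of released subset statistics
x = rng.integers(0, 2, size=n).astype(float)

# One indicator row per released predicate; with random public
# attributes each row behaves like a random 0/1 vector.
A = rng.integers(0, 2, size=(m, n)).astype(float)
b = A @ x / n + rng.normal(0.0, 0.01, size=m)   # noisy released fractions

# Solve the over-determined linear system by least squares,
# then round each coordinate to the nearest bit.
x_hat = np.linalg.lstsq(A, b * n, rcond=None)[0]
recovered = np.clip(np.round(x_hat), 0, 1)
accuracy = np.mean(recovered == x)
```

With many more equations than unknowns and small noise, the least-squares estimate concentrates tightly around x, so rounding recovers nearly all of the secret bits.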
The paper provides a detailed probabilistic analysis of the required sample size, the noise tolerance, and the reconstruction error as functions of d, k, and the underlying data distribution. Notably, even when k is as small as O(log d), the attack remains polynomial‑time and succeeds with near‑perfect accuracy, contradicting the intuition that limiting the size of released subsets would protect privacy.
Empirical validation is performed on synthetic data and on a real‑world medical dataset. In the first experiment, the authors release the error rate of a decision‑tree classifier of varying depth for all k‑attribute subsets; after linearisation, they recover the underlying disease label vector with >99% accuracy. In the second experiment, they release logistic‑regression coefficients obtained via maximum‑likelihood estimation; the linear‑system reconstruction recovers the binary disease indicator with >97% accuracy. The authors also examine the effect of adding Laplace noise calibrated to differential privacy (ε = 1) and find that achieving a meaningful privacy guarantee requires noise large enough to render the released statistics essentially useless for downstream analysis, highlighting a stark privacy‑utility trade‑off.
In conclusion, the work overturns the prevailing belief that only “obviously linear” queries pose a reconstruction risk. By demonstrating that a wide array of non‑linear statistical releases can be algebraically reduced to linear equations, the authors broaden the threat model for data custodians. The findings compel privacy engineers to treat any aggregate that reveals a relationship between a sensitive attribute and a set of nonsensitive attributes as potentially dangerous, and to apply strong mechanisms such as differential privacy, query throttling, or rigorous auditing before any statistic is published.