Privacy via the Johnson-Lindenstrauss Transform

Suppose that party A collects private information about its users, where each user’s data is represented as a bit vector. Suppose that party B has a proprietary data mining algorithm, such as clustering or nearest neighbors, that requires estimating the distance between users. We ask whether party A can publish some information about each user so that B can estimate the distance between users without being able to infer any private bit of a user. Our method projects each user’s representation into a random, lower-dimensional space via a sparse Johnson-Lindenstrauss transform and then adds Gaussian noise to each entry of the lower-dimensional representation. We show that the method preserves differential privacy: the more privacy is desired, the larger the variance of the Gaussian noise. Further, we show how to approximate the true distances between users from only the lower-dimensional, perturbed data. Finally, we consider other perturbation methods, such as randomized response, and draw comparisons to sketch-based methods. While the goal of releasing user-specific data to third parties is broader than preserving distances, this work shows that distance computation with privacy is an achievable goal.


💡 Research Summary

The paper tackles a practical privacy problem: a data‑holding party (A) wants to release user‑specific information that enables a third party (B) to run distance‑based analytics (clustering, nearest‑neighbor search) without revealing any individual bit of a user’s private binary profile. The authors propose a two‑step mechanism. First, each user’s d‑dimensional binary vector is projected into a much lower‑dimensional space using a sparse Johnson‑Lindenstrauss (JL) transform. The transform matrix is random: each row contains only s non‑zero entries drawn from {±1/√s}, which yields an O(s·nnz) multiplication cost while preserving pairwise Euclidean distances up to a (1±ε) factor with high probability. Second, independent Gaussian noise N(0,σ²) is added to every coordinate of the projected vector. By bounding the ℓ₂‑sensitivity of the JL mapping (Δ = √(2·s)/√k for a k‑dimensional output) and applying the standard Gaussian mechanism, the authors prove that choosing σ ≥ Δ·√(2 ln(1.25/δ))/ε guarantees (ε,δ)‑differential privacy.
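The two steps above can be sketched in a few lines of NumPy. This is a minimal, illustrative implementation of one plausible row‑sparse construction: the function names (`sparse_jl_matrix`, `release`) and the exact way non‑zero positions are sampled are assumptions for illustration, not taken from the paper.

```python
import numpy as np

def sparse_jl_matrix(k, d, s, rng):
    """Random k x d transform: each row has s non-zero entries in {±1/√s}.
    Illustrative construction; the paper's exact distribution may differ."""
    P = np.zeros((k, d))
    for i in range(k):
        cols = rng.choice(d, size=s, replace=False)            # s distinct columns
        P[i, cols] = rng.choice([-1.0, 1.0], size=s) / np.sqrt(s)
    return P

def release(x, P, sigma, rng):
    """Project a user's bit vector and perturb every coordinate with N(0, σ²)."""
    return P @ x + rng.normal(0.0, sigma, size=P.shape[0])
```

Because each row touches only s of the d coordinates, multiplying a sparse representation of x costs O(s·nnz) rather than O(k·d).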

The privacy proof is straightforward: adjacent datasets differ in a single bit, which changes the JL output by at most Δ in ℓ₂ norm; the Gaussian mechanism then provides the desired privacy guarantee. Because the JL matrix is sparse, Δ can be made small, allowing σ to be modest even for strong privacy parameters.
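The calibration step can be written out directly. This is a sketch using the sensitivity bound quoted above; `jl_sensitivity` and `gaussian_sigma` are illustrative names, and the Gaussian-mechanism bound is the standard one (valid for ε ≤ 1).

```python
import math

def jl_sensitivity(s, k):
    """ℓ₂-sensitivity bound quoted in the summary: Δ = √(2·s)/√k."""
    return math.sqrt(2.0 * s) / math.sqrt(k)

def gaussian_sigma(eps, delta, sensitivity):
    """Standard Gaussian mechanism: σ ≥ Δ·√(2 ln(1.25/δ))/ε."""
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / eps
```

For example, with s = 4, k = 64, ε = 1, and δ = 10⁻⁵, this gives σ ≈ 1.71: increasing k or tightening ε directly translates into the noise scale, mirroring the "more privacy, more variance" trade-off stated in the abstract.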

For utility, the paper shows how to recover an unbiased estimate of the true squared distance between two users i and j from the noisy low‑dimensional representations y_i and y_j:

E[‖y_i − y_j‖²] = ‖Φ(x_i − x_j)‖² + 2kσ²,  so  D̂²(i,j) = ‖y_i − y_j‖² − 2kσ²,

where Φ denotes the JL transform. Subtracting the noise bias 2kσ² yields an unbiased estimate of the projected squared distance, which by the JL guarantee approximates the true squared distance ‖x_i − x_j‖² up to a (1±ε) factor (and the transform’s normalization).
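A small simulation illustrates the debiased estimator. This is a sketch under assumed normalization: the d/k rescaling matches the particular row‑sparse construction used here (for which E‖P·u‖² = (k/d)·‖u‖²), not necessarily the paper's; parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, s, sigma = 200, 50, 4, 0.5        # illustrative parameters

def sparse_jl(k, d, s, rng):
    # Each row: s distinct columns, entries ±1/√s.
    P = np.zeros((k, d))
    for i in range(k):
        cols = rng.choice(d, size=s, replace=False)
        P[i, cols] = rng.choice([-1.0, 1.0], size=s) / np.sqrt(s)
    return P

# Two private bit vectors and their true squared distance.
x_i = rng.integers(0, 2, size=d).astype(float)
x_j = rng.integers(0, 2, size=d).astype(float)
true_sq = float(np.sum((x_i - x_j) ** 2))

# Average the debiased estimator over fresh transforms and noise draws.
ests = []
for _ in range(2000):
    P = sparse_jl(k, d, s, rng)
    y_i = P @ x_i + rng.normal(0.0, sigma, k)
    y_j = P @ x_j + rng.normal(0.0, sigma, k)
    # Subtract the noise bias 2kσ²; rescale by d/k to undo this
    # construction's E‖P·u‖² = (k/d)·‖u‖² shrinkage.
    ests.append((d / k) * (np.sum((y_i - y_j) ** 2) - 2 * k * sigma ** 2))
mean_est = float(np.mean(ests))
```

Averaged over many draws, `mean_est` concentrates around `true_sq`, confirming that party B can recover distances from only the noisy, low‑dimensional releases.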