Achieving Both Valid and Secure Logistic Regression Analysis on Aggregated Data from Different Private Sources
Preserving the privacy of individual databases when carrying out statistical calculations has a long history in statistics and has been the focus of much recent attention in machine learning. In this paper, we present a protocol for computing logistic regression when the data are held by separate parties, without actually combining information sources, by exploiting results from the literature on multi-party secure computation. We provide only the final result of the calculation, in contrast with other methods that share intermediate values and thus present an opportunity for compromise of values in the combined database. Our paper has two themes: (1) the development of a secure protocol for computing the logistic regression parameters, and a demonstration of its performance in practice, and (2) an amended protocol that speeds up the computation of the logistic function. We illustrate the nature of the calculations and their accuracy using an extract of data from the Current Population Survey divided between two parties.
💡 Research Summary
The paper addresses a fundamental problem in privacy‑preserving data analysis: how to fit a logistic regression model when the underlying data are split across several parties that are unwilling or legally prohibited from sharing their raw records. The authors propose a complete protocol that enables the parties to obtain the maximum‑likelihood estimate of the logistic regression coefficients while revealing nothing beyond the final parameter vector. The solution is built on well‑established techniques from secure multi‑party computation (MPC), in particular additive secret sharing, and on two different ways of securely evaluating the non‑linear logistic (sigmoid) function.
Problem setting and security model
The authors assume P ≥ 2 parties, each holding an additive share of the full design matrix X (n × d) and response vector y (binary). This “additive share” formulation subsumes horizontal, vertical, and even overlapping partitions of the data. The security goal follows the semi‑honest (or “honest‑but‑curious”) model: parties follow the protocol correctly but may keep a transcript of all messages they receive. A protocol is considered secure if a polynomial‑time simulator can generate a view indistinguishable from the real view given only that party’s input and output. Under this definition, the protocol leaks no information about any other party’s private data beyond what can be inferred from the final β̂.
Core cryptographic primitives
- Additive secret sharing – each secret a is split into random shares a₁,…,a_P such that Σ a_j = a (mod B). The shares are uniformly random, so any strict subset reveals no information about a.
- Secure addition and multiplication – addition of shared values is trivial (local addition of shares). Multiplication uses pre‑generated Beaver triples, allowing the parties to compute a·b from shares of a and b with only a few communication rounds.
- Fixed‑point representation – real‑valued quantities (e.g., gradient components) are scaled to integers modulo a large B (e.g., 2⁶⁴) to avoid floating‑point leakage and to keep the arithmetic within a finite field.
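The first three primitives can be made concrete with a small plaintext simulation. The sketch below is our own illustration, not the authors' implementation: it splits two integers into additive shares between two parties and multiplies them with a pre-generated Beaver triple, working modulo B = 2⁶⁴ as in the fixed-point representation above. All function names are illustrative.

```python
import secrets

B = 2 ** 64  # modulus for fixed-point integer arithmetic

def share(secret, n_parties):
    """Split `secret` into n additive shares that sum to `secret` mod B."""
    shares = [secrets.randbelow(B) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % B)
    return shares

def reconstruct(shares):
    """Recover the secret by summing all shares mod B."""
    return sum(shares) % B

def beaver_multiply(x_shares, y_shares, a_shares, b_shares, c_shares):
    """Multiply two shared values using a pre-generated triple (a, b, c = a*b).

    Each party locally computes its share of d = x - a and e = y - b; these
    masked differences are then opened (publicly reconstructed), which is the
    only communication the multiplication needs."""
    d = reconstruct([(x - a) % B for x, a in zip(x_shares, a_shares)])
    e = reconstruct([(y - b) % B for y, b in zip(y_shares, b_shares)])
    # Each party forms its share of x*y = c + d*b + e*a + d*e;
    # the public d*e term is added by party 0 only.
    z_shares = [(c + d * b + e * a) % B
                for a, b, c in zip(a_shares, b_shares, c_shares)]
    z_shares[0] = (z_shares[0] + d * e) % B
    return z_shares

# Demo: two parties multiply 7 and 6 without either seeing both inputs.
a, b = secrets.randbelow(B), secrets.randbelow(B)
c = (a * b) % B
z = beaver_multiply(share(7, 2), share(6, 2), share(a, 2), share(b, 2), share(c, 2))
print(reconstruct(z))  # 42
```

Because the opened values d and e are masked by the uniformly random triple components, they reveal nothing about x or y, which is why the multiplication stays within the semi-honest security model.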
Logistic function approximation
The Newton‑Raphson iteration for logistic regression requires evaluating σ(z) = 1/(1 + e^{‑z}) and its derivative σ(z)(1‑σ(z)). Directly computing σ(z) in an MPC setting is costly because it is non‑linear. The authors propose two alternatives:
1. Yao‑based approximation – They embed a “greater‑than” circuit (Yao’s millionaire protocol) to compare z with zero, then use a piecewise linear or binary‑search style approximation of the sigmoid. This method can achieve high accuracy but incurs O(b) encryptions per comparison, where b is the bit‑length of the fixed‑point representation. Consequently, the overall round complexity grows linearly with b, making it impractical for high‑dimensional data.
2. Taylor‑expansion/Euler method – They treat the sigmoid as the solution of the differential equation dσ/dz = σ(1‑σ) with σ(0)=0.5. By applying a second‑order Taylor expansion around the current estimate and using Euler’s method, σ(z) can be approximated using only additions and multiplications of shared values. The approximation error can be made arbitrarily small by increasing the number of Taylor terms or by iterating the update. This approach eliminates the need for any comparison circuit, drastically reducing communication and computation overhead.
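What the comparison circuits in alternative 1 buy is, in effect, a piecewise evaluation of σ. The deliberately crude plaintext sketch below uses our own illustrative choices (three pieces with breakpoints at ±4); the paper's approximation is finer, and in the actual protocol each comparison would be evaluated inside a garbled "greater-than" circuit rather than as a plaintext `if`:

```python
def sigmoid_piecewise(z):
    """Three-piece linear stand-in for sigma(z) = 1/(1 + e^-z).

    The two comparisons against the breakpoints are the operations that
    alternative 1 realizes with Yao-style greater-than circuits; the
    breakpoints +-4 and the middle slope 1/8 are illustrative only."""
    if z <= -4.0:
        return 0.0
    if z >= 4.0:
        return 1.0
    return 0.5 + z / 8.0
```

Accuracy improves with more pieces, but each extra breakpoint costs another secure comparison, which is exactly the O(b)-encryptions-per-comparison overhead the summary describes.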
Both approximations are proved to preserve the additive‑share structure, allowing the Newton‑Raphson update β_{t+1} = β_t + H^{-1} g (with gradient g = Xᵀ(y − σ(Xβ_t)) and Hessian H = XᵀWX, so that the step is added) to be performed entirely on secret‑shared values. After convergence, the parties jointly reconstruct β̂ by exchanging their final shares.
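The ODE view behind the second alternative can be previewed in plaintext. The sketch below is ours and uses a first-order Euler step for simplicity, whereas the paper applies a second-order Taylor expansion; the point is that each update needs only additions and multiplications, exactly the operations available on secret shares:

```python
import math

def sigmoid_euler(z, steps=1000):
    """Approximate sigma(z) by integrating d sigma/dt = sigma * (1 - sigma)
    from t = 0 (where sigma = 0.5) to t = z with `steps` Euler steps.

    No comparisons or exponentials are needed, so every operation maps
    directly onto secure shared-value arithmetic."""
    h = z / steps
    s = 0.5
    for _ in range(steps):
        s += h * s * (1.0 - s)  # one Euler step: s <- s + h * s * (1 - s)
    return s

# Compare against the exact sigmoid at z = 2.
print(abs(sigmoid_euler(2.0) - 1.0 / (1.0 + math.exp(-2.0))))
```

The error shrinks as `steps` grows (and faster with the paper's second-order expansion), so accuracy can be traded off against the number of secure multiplications per sigmoid evaluation.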
Protocol outline
The protocol proceeds in iterative rounds, each corresponding to a Newton step:
- Each party locally computes its contribution to Xᵀ·(y − σ(Xβ_t)) and to the Hessian Xᵀ·W·X, where W is a diagonal matrix with entries σ(z_i)(1‑σ(z_i)).
- Using secure addition, the parties aggregate these contributions into secret‑shared gradient g and Hessian H.
- Secure matrix inversion (or, more efficiently, solving H·Δ = g via a secure linear‑system solver) yields the Newton direction Δ, again as secret shares.
- β_t is then updated locally by adding Δ (still in secret‑shared form).
The only non‑linear step is the evaluation of σ(z_i) for each case i, which is handled by one of the two approximations described above.
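The iteration the protocol performs on shares can be previewed in plaintext. The sketch below is our own, for the simplest case of one predictor plus an intercept; in the protocol every accumulated quantity would be secret-shared, the sigmoid replaced by one of the two secure approximations, and the 2×2 solve done with secure linear algebra:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def newton_logistic(xs, ys, iters=25):
    """Plaintext Newton-Raphson for a one-predictor logistic model
    (intercept b0 plus slope b1)."""
    b0 = b1 = 0.0
    for _ in range(iters):
        # Accumulate gradient X^T (y - sigma) and Hessian X^T W X case by case;
        # in the protocol each party would contribute shares of these sums.
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(b0 + b1 * x)
            w = p * (1.0 - p)          # diagonal entry of W
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        # Solve H * delta = g for the 2x2 Hessian, then take the Newton step.
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
    return b0, b1

b0, b1 = newton_logistic([0, 1, 2, 3, 4, 5], [0, 0, 1, 0, 1, 1])
print(round(b0, 3), round(b1, 3))
```

At convergence the score equations hold (e.g., the fitted probabilities sum to the number of successes), which gives a simple plaintext correctness check before moving to the shared version.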
Implementation and experimental results
The authors implement the protocol in a prototype system using standard cryptographic libraries (e.g., Paillier encryption for additive homomorphism, and a custom Beaver‑triple generator). They evaluate on a subset of the Current Population Survey (CPS) data, split evenly between two parties. Key findings:
- Accuracy – The β̂ obtained from the secure protocol matches the coefficients from a conventional centralized logistic regression to within 10^{-4} (for the Yao‑based approximation) and 10^{-3} (for the Taylor‑based approximation).
- Performance – For a problem with n ≈ 10 000 cases and d ≈ 20 predictors, the Yao‑based method required roughly 45 minutes of total wall‑clock time (dominated by the comparison circuits), whereas the Taylor‑based method completed in under 8 minutes on the same hardware. Communication volume scaled linearly with n·d for both methods, but the Yao approach incurred an additional factor proportional to the bit‑length b.
- Scalability – Experiments varying n and d show that the Taylor‑based protocol scales roughly O(n·d) in both computation and communication, making it viable for larger data sets, while the Yao‑based protocol’s cost grows faster due to the O(b) factor per sigmoid evaluation.
Discussion and extensions
The paper distinguishes its threat model from differential‑privacy approaches (e.g., Chaudhuri & Monteleoni) that focus on limiting information leakage from the final output. Here the focus is on preventing leakage during the computation itself. Nevertheless, the authors acknowledge that combining their secure computation with differential privacy (by adding calibrated noise to the final β̂) would provide end‑to‑end privacy guarantees.
They also discuss “weak security” alternatives, where less stringent cryptographic primitives (e.g., homomorphic encryption without full secret sharing) could be employed for faster but less provably secure protocols. Moreover, the authors suggest that the same framework can be adapted to other generalized linear models, support vector machines, or even deep neural networks, provided an appropriate secure approximation of the required non‑linear activation functions is devised.
Conclusion
By integrating additive secret sharing, secure linear algebra primitives, and two practical approximations of the sigmoid function, the authors deliver a complete, provably secure protocol for logistic regression on distributed data. The protocol reveals only the final model parameters, protects intermediate values from semi‑honest adversaries, and, with the Taylor‑based approximation, achieves performance suitable for realistic data sizes. This work offers a concrete pathway for privacy‑sensitive domains—such as healthcare, finance, and public policy—to collaboratively train predictive models without exposing raw individual records.