From Inexact Gradients to Byzantine Robustness: Acceleration and Optimization under Similarity
Standard federated learning algorithms are vulnerable to adversarial nodes, a.k.a. Byzantine failures. To address this issue, robust distributed learning algorithms have been developed, which typically replace parameter averaging with robust aggregation rules. While generic conditions on these aggregations exist to guarantee the convergence of (Stochastic) Gradient Descent (SGD), the analyses remain rather ad hoc. This hinders the development of more complex robust algorithms, such as accelerated ones. In this work, we show that Byzantine-robust distributed optimization can, under standard generic assumptions, be cast as optimization with an inexact gradient oracle (with both additive and multiplicative error terms), an active field of research. This allows us, for instance, to show directly that GD on top of standard robust aggregation procedures attains the optimal asymptotic error in the Byzantine setting. Going further, we propose two optimization schemes that speed up convergence. The first is a Nesterov-type accelerated scheme whose proof derives directly from accelerated inexact-gradient results applied to our formulation. The second hinges on Optimization under Similarity, in which the server leverages an auxiliary loss function that approximates the global loss. Both approaches drastically reduce the communication complexity compared to previous methods, as we show theoretically and empirically.
💡 Research Summary
The paper addresses the vulnerability of federated learning to Byzantine (adversarial) clients and proposes a unifying theoretical framework that casts Byzantine‑robust distributed optimization as an instance of optimization with inexact gradient oracles. The authors first formalize the standard federated setting with n clients, of which f are Byzantine, and define the global objective L_H(x) as the average loss over the honest clients. They assume each local loss L_i is μ‑strongly convex and L‑smooth, and introduce a (G,B)‑heterogeneity condition that quantifies the dispersion of honest gradients both absolutely (G) and relatively (B). Existing lower‑bound results show that, because of heterogeneity and the presence of Byzantine nodes, any algorithm’s achievable error is fundamentally limited by G, B, and the fraction f/n.
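The (G,B)-heterogeneity condition mentioned above is commonly formalized as a bound on the average dispersion of honest gradients; the following is the standard form used in the Byzantine-robust literature (notation assumed — 𝓗 denotes the set of honest clients; the paper's exact statement may differ in constants):

```latex
\frac{1}{|\mathcal{H}|}\sum_{i\in\mathcal{H}}
  \bigl\lVert \nabla L_i(x) - \nabla L_{\mathcal{H}}(x) \bigr\rVert^2
  \;\le\; G^2 \;+\; B^2 \bigl\lVert \nabla L_{\mathcal{H}}(x) \bigr\rVert^2
  \qquad \text{for all } x .
```

Here G captures the absolute (additive) dispersion of honest gradients and B the relative (multiplicative) dispersion, which is what allows the reduction to an oracle with both additive and multiplicative error terms.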
The key technical contribution is the reduction of this setting to a (ζ²,α)-inexact gradient oracle model. Using the notion of (f,ν)-robust aggregation rules (e.g., geometric median, coordinate-wise trimmed mean, Krum), the paper proves that the aggregated gradient ∇̃L_H(x) satisfies
‖∇̃L_H(x) − ∇L_H(x)‖² ≤ ν G² + ν B² ‖∇L_H(x)‖²,
where ν is the robustness coefficient of the aggregation rule. By setting ζ² = ν G² and α = ν B², the aggregated gradient becomes a (ζ²,α)-inexact oracle for the true global gradient. This mapping is tight: leveraging results from Ajalloeian & Stich (2020), the authors show that gradient descent with such an oracle converges to a neighborhood of radius ζ²/(2μ(1−α)). Moreover, known bounds on ν (the lower bound ν ≥ f/(n−2f), attained up to constants since ν = O(f/(n−f)) is achievable) imply that the asymptotic error matches the previously established Byzantine lower bound, confirming that no information is lost in the reduction.
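The GD-with-robust-aggregation pipeline can be sketched on a toy problem. The snippet below is a minimal illustration, not the paper's experimental setup: it uses the coordinate-wise trimmed mean (one of the (f,ν)-robust rules listed above) on quadratic local losses, with all numbers and names hypothetical. The iterate converges to a neighborhood of the honest optimum whose size is governed by the gradient dispersion, as the inexact-oracle view predicts.

```python
import numpy as np

def trimmed_mean(grads, f):
    """Coordinate-wise trimmed mean: per coordinate, drop the f smallest
    and f largest values across clients, then average the rest."""
    g = np.sort(grads, axis=0)
    return g[f:len(grads) - f].mean(axis=0)

# Toy setup (hypothetical): n clients, f Byzantine, quadratic local losses
# L_i(x) = 0.5 * ||x - b_i||^2, so that grad L_i(x) = x - b_i.
rng = np.random.default_rng(0)
n, f, d = 10, 2, 5
b = rng.normal(size=(n - f, d))                # honest clients' targets
x_star = b.mean(axis=0)                        # minimizer of the honest average loss
x0 = np.full(d, 10.0)                          # start far from the optimum

x = x0.copy()
for _ in range(200):
    honest_grads = x - b                       # shape (n - f, d)
    byz_grads = 1e3 * rng.normal(size=(f, d))  # arbitrary Byzantine vectors
    g = trimmed_mean(np.vstack([honest_grads, byz_grads]), f)
    x = x - 0.5 * g                            # plain GD on the robust aggregate

# x lands in a neighborhood of x_star whose radius scales with the
# dispersion of the honest gradients (the G term in the oracle bound).
print(np.linalg.norm(x - x_star))
```

The Byzantine vectors are huge here, so trimming removes them almost every round; what remains is a small bias from also trimming extreme honest values, which is exactly the ζ² = νG² additive error of the oracle.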
Having established the equivalence, the paper imports two families of accelerated algorithms from the inexact-gradient literature. The first is a Nesterov-type accelerated method (Devolder et al., 2014) adapted to the Byzantine setting. By carefully choosing the momentum parameters γ_k and Γ_k, the authors obtain linear convergence with communication complexity O(p √(L/μ) log(1/ε)), where p depends on the robustness coefficient ν. This is the first provably accelerated first-order method that tolerates Byzantine failures under moderate heterogeneity.
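The flavor of such a scheme can be conveyed with the classical constant-momentum form of Nesterov's method driven by a (ζ²,α)-inexact oracle. This is a simplified sketch, not the paper's algorithm (which uses the schedule γ_k, Γ_k mentioned above); the oracle here injects synthetic additive and multiplicative errors into exact gradients of a quadratic, and all parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 20
mu, L = 0.1, 10.0
H = np.diag(np.linspace(mu, L, d))           # Hessian: mu-strongly convex, L-smooth
x_star = rng.normal(size=d)
exact_grad = lambda x: H @ (x - x_star)

zeta, alpha = 1e-3, 1e-3                     # additive / multiplicative error levels

def inexact_grad(x):
    """A (zeta^2, alpha)-inexact oracle: exact gradient plus a relative
    error of norm sqrt(alpha)*||g|| and an additive error of scale zeta."""
    g = exact_grad(x)
    e = rng.normal(size=d)
    e *= np.sqrt(alpha) * np.linalg.norm(g) / np.linalg.norm(e)
    return g + e + (zeta / np.sqrt(d)) * rng.normal(size=d)

kappa = L / mu
beta = (np.sqrt(kappa) - 1.0) / (np.sqrt(kappa) + 1.0)   # momentum coefficient

x = y = np.zeros(d)
for _ in range(300):
    x_next = y - inexact_grad(y) / L         # gradient step at the extrapolated point
    y = x_next + beta * (x_next - x)         # Nesterov extrapolation
    x = x_next

print(np.linalg.norm(x - x_star))            # small: inside the error neighborhood
```

With exact gradients this method needs O(√(L/μ) log(1/ε)) iterations instead of O((L/μ) log(1/ε)) for GD; the inexact-oracle analysis shows how much of that speedup survives the ζ² and α error terms.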
The second acceleration strategy leverages “optimization under similarity.” The server is assumed to possess an auxiliary loss L̃ that approximates the true global loss L_H, with the Hessians of the two functions differing by at most Δ in operator norm. The proposed Prox-Inexact Gradient under Similarity (PIGS) algorithm performs proximal updates on the auxiliary loss while still using the robustly aggregated gradient of L_H. The analysis, extending Woodworth et al. (2023) to handle both additive (ζ²) and multiplicative (α) errors, shows that PIGS converges linearly with communication complexity O((Δ/μ) log(1/ε)). Since Δ can be much smaller than the smoothness constant L, this yields a substantial reduction in the number of communication rounds compared to plain GD.
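The core mechanism can be sketched in the quadratic case, where the proximal subproblem has a closed form. This is an illustrative reduction, not the PIGS algorithm itself: for quadratics, the step x_{k+1} = argmin_x { L̃(x) + ⟨g_k − ∇L̃(x_k), x⟩ } collapses to the preconditioned update x_{k+1} = x_k − H̃⁻¹ g_k, whose per-round contraction factor is roughly Δ/μ rather than GD's 1 − μ/L. All matrices and constants below are hypothetical, and the (here exact) gradient stands in for the robust aggregate.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 30
# True Hessian H with mu = 1, L = 100, and an auxiliary Hessian H_aux
# within Delta of H in operator norm (the "similarity" assumption).
Q = np.linalg.qr(rng.normal(size=(d, d)))[0]
H = Q @ np.diag(np.linspace(1.0, 100.0, d)) @ Q.T
Delta = 0.5
H_aux = H + Delta * np.eye(d)                # ||H_aux - H|| = Delta
x_star = rng.normal(size=d)
grad = lambda x: H @ (x - x_star)            # stands in for the robust aggregate

x = np.zeros(d)
for _ in range(20):                          # each step = one communication round
    # Proximal step on the auxiliary loss, corrected by the global gradient;
    # for quadratics this is x <- x - H_aux^{-1} grad(x).
    x = x - np.linalg.solve(H_aux, grad(x))

print(np.linalg.norm(x - x_star))            # near machine precision after 20 rounds
```

The error contracts by ‖I − H̃⁻¹H‖ = ‖Δ·H̃⁻¹‖ ≤ Δ/μ per round, so with Δ ≪ L a handful of communication rounds suffices where plain GD would need O(L/μ) of them, which is the source of the communication savings claimed above.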
Empirical evaluation on MNIST and CIFAR-10 demonstrates that both accelerated schemes reach the same final test accuracy as state-of-the-art Byzantine-robust methods while requiring 2–5× fewer communication rounds. The Nesterov-type method excels under moderate heterogeneity, whereas PIGS provides the greatest gains when a high-quality similarity model is available (small Δ). The experiments also confirm robustness with up to 30% Byzantine clients.
In summary, the paper provides a clean, modular reduction of Byzantine-robust federated learning to the well-studied inexact-gradient framework, enabling the direct transfer of advanced optimization techniques such as Nesterov acceleration and similarity-based proximal methods. The theoretical guarantees match known lower bounds, and the proposed algorithms achieve provably faster convergence and lower communication complexity, a significant step toward secure and efficient federated learning.