Converge Faster, Talk Less: Hessian-Informed Federated Zeroth-Order Optimization
Zeroth-order (ZO) optimization enables dimension-free communication in federated learning (FL), making it attractive for fine-tuning large language models (LLMs) thanks to significant communication savings. However, existing ZO-FL methods largely overlook curvature information, despite its well-established benefits for accelerating convergence. To address this, we propose HiSo, a Hessian-informed ZO federated optimization method that accelerates convergence by leveraging global diagonal Hessian approximations, while strictly preserving scalar-only communication and transmitting no second-order information. Theoretically, for non-convex functions, we show that HiSo can achieve an accelerated convergence rate independent of both the Lipschitz constant $L$ and the model dimension $d$ under suitable Hessian approximation assumptions, offering a plausible explanation for the observed phenomenon that ZO convergence is often much faster than its worst-case $\mathcal{O}(d)$ bound suggests. Empirically, across diverse LLM fine-tuning benchmarks, HiSo delivers a $1\sim 5\times$ speedup in communication rounds over existing state-of-the-art ZO-FL baselines. This faster convergence not only cuts communication costs but also provides strong empirical evidence that Hessian information acts as an effective accelerator in federated ZO optimization. Our source code is available at https://github.com/ZidongLiu/DeComFL.
💡 Research Summary
The paper addresses a critical bottleneck in federated fine‑tuning of large language models (LLMs): the massive communication cost incurred when transmitting high‑dimensional model updates. Recent work (DeComFL) showed that zeroth‑order (ZO) optimization can achieve “dimension‑free” communication by representing each ZO gradient with only two scalars—a gradient scalar and a random seed that deterministically generates the perturbation direction. While this reduces communication from terabytes to megabytes, ZO‑SGD’s reliance on uniformly random search directions leads to painfully slow convergence, especially for heterogeneous, anisotropic loss landscapes typical of LLMs.
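The two-scalar protocol described above can be sketched as follows. This is a minimal illustration, not DeComFL's actual implementation: `loss_fn`, the smoothing step `mu`, and the learning rate `lr` are illustrative placeholders, and the key point is that only `(seed, g)` ever needs to be transmitted, since the seed deterministically regenerates the perturbation direction.

```python
import numpy as np

def zo_grad_scalar(loss_fn, x, seed, mu=1e-3):
    """Forward-difference ZO gradient along a seeded random direction.

    Returns only (seed, g): the scalar finite-difference coefficient plus
    the seed that regenerates the direction. Nothing high-dimensional is
    communicated.
    """
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(x.shape)              # perturbation direction
    g = (loss_fn(x + mu * u) - loss_fn(x)) / mu   # directional derivative est.
    return seed, g

def apply_scalar_update(x, seed, g, lr=0.1):
    """Any party holding (seed, g) rebuilds u and applies the same step."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(x.shape)              # identical direction
    return x - lr * g * u
```

Because the generator is deterministic, server and clients reconstruct identical directions from the shared seed, which is what makes the megabyte-scale communication possible.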
HiSo (Hessian‑informed ZO with Scalar‑only communication) proposes a principled way to accelerate ZO‑FL without breaking the scalar‑only communication constraint. The key idea is to incorporate a global diagonal approximation of the Hessian matrix as a preconditioner. Instead of using the raw random direction u∼N(0,I), each client transforms it by H⁻¹/², where H is the diagonal Hessian estimate maintained at the server. The update becomes
Δx_i = (f_i(x_i + μ H⁻¹/² u) − f_i(x_i)) / μ · H⁻¹/² u,
which in expectation approximates H⁻¹∇f_i(x_i), i.e., a Newton-style step (up to an O(μ) smoothing bias). Because H is diagonal, its inverse square root can be applied element-wise, requiring only scalar multiplications; no full-matrix communication is needed. The server aggregates locally computed diagonal Hessian entries (each derived from two additional function evaluations per direction) as scalars, preserving the dimension-free property.
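A minimal sketch of the two ingredients above, under simplifying assumptions: `h_diag` stands in for the server-maintained positive diagonal Hessian estimate, and the function names and step size `mu` are hypothetical, not the paper's exact procedure.

```python
import numpy as np

def hiso_zo_grad(loss_fn, x, h_diag, seed, mu=1e-3):
    """Hessian-informed ZO gradient estimate: perturb along H^{-1/2} u.

    Since H is diagonal, H^{-1/2} u is just element-wise scaling by
    1/sqrt(h_diag); no matrix factorization or communication is needed.
    """
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(x.shape)
    d = u / np.sqrt(h_diag)                       # H^{-1/2} u, element-wise
    g = (loss_fn(x + mu * d) - loss_fn(x)) / mu   # scalar to communicate
    return g * d                                  # preconditioned estimate

def curvature_along(loss_fn, x, u, mu=1e-3):
    """Second-difference curvature estimate u^T H u from two extra evals."""
    return (loss_fn(x + mu * u) - 2 * loss_fn(x) + loss_fn(x - mu * u)) / mu**2
```

For a quadratic objective the second difference recovers u^T H u exactly, which is the kind of scalar curvature information clients can report back without breaking the scalar-only protocol.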
The authors provide a rigorous theoretical analysis under low‑effective‑rank and whitening assumptions. They bound the variance of the Hessian‑informed ZO gradient estimator and prove that, for non‑convex objectives, HiSo attains a convergence rate that is independent of both the Lipschitz constant L and the model dimension d. This is the first result showing dimension‑independent convergence for ZO methods in federated settings, and it extends the DeComFL analysis to multiple local steps (k‑step updates), which were previously unsupported.
Empirically, HiSo is evaluated on several LLM fine‑tuning tasks (e.g., OPT‑1.3B, LLaMA‑7B) across classification, QA, and summarization benchmarks. Compared with DeComFL, HiSo reduces the number of communication rounds by a factor of 1.5–5× while achieving equal or higher test accuracy, BLEU, and ROUGE scores. Against first‑order federated baselines such as FedAvg and FedAdam, HiSo offers communication savings on the order of 10⁸–10⁹×, demonstrating that the scalar‑only ZO framework can be competitive in both efficiency and performance.
In summary, HiSo demonstrates that curvature information can be leveraged effectively in federated zeroth‑order optimization without incurring any additional communication overhead. By marrying Hessian‑based preconditioning with a scalar‑only communication protocol, the method delivers faster convergence, lower communication cost, and strong empirical performance, establishing a new paradigm for scalable, privacy‑preserving fine‑tuning of massive models.