Heterogeneous Stochastic Momentum ADMM for Distributed Nonconvex Composite Optimization
This paper investigates the distributed stochastic nonconvex and nonsmooth composite optimization problem. Existing stochastic methods typically rely on a uniform step size strictly bounded by global network parameters, such as the maximum node degree or spectral radius. This dependency creates a severe performance bottleneck, particularly in heterogeneous network topologies where the step size must be conservatively reduced to ensure stability. To overcome this limitation, we propose a novel Heterogeneous Stochastic Momentum Alternating Direction Method of Multipliers (HSM-ADMM). By integrating a recursive momentum estimator (STORM), HSM-ADMM achieves the optimal oracle complexity of $\mathcal{O}(\varepsilon^{-1.5})$ to reach an $\varepsilon$-stationary point, utilizing a strictly single-loop structure and an $\mathcal{O}(1)$ mini-batch size. The core innovation lies in a node-specific adaptive step-size strategy, which scales the proximal term according to local degree information. We theoretically demonstrate that this design completely decouples algorithmic stability from global network properties, enabling robust and accelerated convergence across arbitrary connected topologies without requiring any global structural knowledge. Furthermore, HSM-ADMM requires transmitting only a single primal variable per iteration, significantly reducing communication bandwidth compared to state-of-the-art gradient tracking algorithms. Extensive numerical experiments on distributed nonconvex learning tasks validate the superior efficiency of the proposed HSM-ADMM algorithm.
💡 Research Summary
This paper addresses the challenging problem of distributed stochastic non‑convex composite optimization, where each agent in a connected network possesses a smooth (possibly non‑convex) loss function and a convex but possibly nonsmooth regularizer. Existing distributed methods either rely on uniform step‑sizes that must be bounded by global network parameters (such as the maximum node degree or the spectral gap) or require multiple communication rounds per iteration. Both aspects create severe bottlenecks in heterogeneous networks, where highly connected “hub” nodes force the entire system to adopt a conservatively small step‑size, slowing convergence dramatically.
To overcome these limitations, the authors propose Heterogeneous Stochastic Momentum ADMM (HSM‑ADMM). The algorithm integrates three key ideas:
- Recursive Momentum Estimation (STORM) – By employing the STORM variance-reduction technique, the algorithm builds a low-variance recursive estimator of the stochastic gradient using only a single mini-batch of size $\mathcal{O}(1)$. This eliminates the need for periodic full-gradient passes or double-loop structures, while still achieving the optimal oracle complexity of $\tilde{\mathcal{O}}(\varepsilon^{-1.5})$ for reaching an $\varepsilon$-stationary point.
- Node-Specific Adaptive Step-Sizes – Each agent $i$ uses a step-size $\eta_i$ that scales with its local degree $d_i$ (e.g., $\eta_i = \eta_0/(d_i+1)$). Because the smallest singular value of the constraint matrix $A$ equals one, this design completely decouples algorithmic stability from any global spectral property of the graph. Consequently, the “straggler effect” caused by heterogeneous connectivity disappears; every node can update at a rate appropriate to its own connectivity.
- Communication-Efficient ADMM Structure – The primal-dual updates are arranged so that only a single primal variable $x_i$ needs to be exchanged with neighbors at each iteration. No separate gradient tracker or auxiliary variable is transmitted, halving the per-iteration communication load compared with gradient-tracking methods.
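The STORM-style recursive momentum update described in the first bullet can be sketched in a few lines. This is a generic STORM recursion, not code from the paper; the function name and scalar interface are illustrative.

```python
def storm_update(grad_new, grad_old, d_prev, beta):
    """STORM-style recursive momentum gradient estimator.

    d_t = g(x_t; xi_t) + (1 - beta) * (d_{t-1} - g(x_{t-1}; xi_t)),
    where both stochastic gradients are evaluated on the SAME fresh
    mini-batch xi_t (size O(1)), so no full-gradient pass is needed.
    """
    return grad_new + (1.0 - beta) * (d_prev - grad_old)
```

Note that setting `beta = 1` recovers the plain stochastic gradient, while smaller `beta` blends in the correction term `d_prev - grad_old` that drives the variance reduction.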
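The degree-based step-size rule quoted in the second bullet ($\eta_i = \eta_0/(d_i+1)$) can be illustrated on a small star graph, where it shows why leaf nodes are no longer throttled by the hub; the graph and constants below are hypothetical examples, not taken from the paper.

```python
def local_step_size(eta0, degree):
    # Degree-based rule from the summary: eta_i = eta0 / (d_i + 1).
    # Each node needs ONLY its own degree -- no global graph knowledge.
    return eta0 / (degree + 1)

# Star graph on 5 nodes: hub node 0 has degree 4, leaves have degree 1.
star_degrees = {0: 4, 1: 1, 2: 1, 3: 1, 4: 1}
etas = {i: local_step_size(0.2, d) for i, d in star_degrees.items()}
```

Here only the hub's step size shrinks (to $\eta_0/5$), while every leaf keeps the larger $\eta_0/2$; a uniform rule bounded by the maximum degree would force all five nodes down to the hub's rate.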
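The single-variable communication pattern of the third bullet can be mirrored in a toy synchronous round. This is a hedged sketch of a generic consensus-ADMM-style update, not the paper's exact recursion: the names `x`, `lam`, `d` and the update order are assumptions. The one property it does reflect is that node $i$ transmits only its primal variable $x_i$.

```python
def hsm_admm_round(x, lam, d, neighbors, eta, rho, prox):
    """One synchronous round of a consensus-ADMM-style scheme (sketch).

    x    : dict of primal variables (the ONLY quantities exchanged)
    lam  : dict of dual variables (kept local, never transmitted)
    d    : dict of STORM momentum directions (kept local)
    eta  : dict of node-specific step sizes, e.g. eta0 / (degree + 1)
    prox : proximal operator of the nonsmooth regularizer g
    """
    x_new = {}
    for i, nbrs in neighbors.items():
        avg = sum(x[j] for j in nbrs) / len(nbrs)   # uses received x_j only
        # primal step: momentum direction, dual variable, consensus pull
        v = x[i] - eta[i] * (d[i] + lam[i] + rho * (x[i] - avg))
        x_new[i] = prox(v, eta[i])
    lam_new = {}
    for i, nbrs in neighbors.items():
        avg = sum(x_new[j] for j in nbrs) / len(nbrs)
        lam_new[i] = lam[i] + rho * (x_new[i] - avg)  # dual ascent on gap
    return x_new, lam_new
```

A gradient-tracking method would additionally ship a tracker variable per round; here the per-iteration payload is one primal vector per neighbor, matching the "single variable transmitted" claim.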
The paper provides a rigorous convergence analysis under standard assumptions: smoothness of the expected loss, bounded variance of stochastic gradients, and a connected undirected graph. The analysis shows that with properly chosen penalty parameter $\rho$ and momentum decay $\beta$, the augmented Lagrangian decreases geometrically, leading to the claimed sample complexity. Importantly, the analysis does not require a bounded data-heterogeneity term (i.e., the variance among local gradients), making the method robust to highly non-IID data distributions.
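For concreteness, a standard consensus-ADMM augmented Lagrangian of the kind such analyses track (the notation below is generic, not lifted from the paper) is

$$\mathcal{L}_\rho(x, z, \lambda) \;=\; \sum_{i=1}^{n} f_i(x_i) \;+\; g(z) \;+\; \langle \lambda,\, Ax - Bz \rangle \;+\; \frac{\rho}{2}\,\|Ax - Bz\|^2,$$

where $f_i$ is agent $i$'s smooth (possibly nonconvex) loss, $g$ is the shared nonsmooth regularizer, and $A$, $B$ encode the consensus constraints; the convergence argument monitors the per-iteration decrease of $\mathcal{L}_\rho$ (or a Lyapunov variant of it).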
Empirical evaluations cover a range of network topologies (random Erdős–Rényi, star, chain) and non-convex learning tasks (deep neural network training with $\ell_1$ regularization). HSM-ADMM consistently outperforms state-of-the-art baselines such as DSGT, SPPDM, ProxGT-SA, ProxGT-SR-O, DEEPSTORM, and Prox-DASA. The gains are twofold: faster convergence (often 2–3× fewer iterations) and reduced communication (only one vector per round). The advantage is especially pronounced in heterogeneous graphs where hub nodes would otherwise dictate a tiny global step-size.
In summary, HSM-ADMM delivers the first distributed ADMM-based method that simultaneously (i) attains the optimal stochastic non-convex oracle complexity with $\mathcal{O}(1)$ mini-batch size, (ii) removes any dependence on global graph parameters through degree-based adaptive step-sizes, and (iii) halves communication overhead by transmitting a single variable per iteration. The work opens avenues for future extensions to asynchronous updates, time-varying graphs, and privacy-preserving protocols.