On the Limits of Momentum in Decentralized and Federated Optimization

Reading time: 5 minutes

📝 Abstract

Recent works have explored the use of momentum in local methods to enhance distributed SGD. This is particularly appealing in Federated Learning (FL), where momentum intuitively appears as a solution to mitigate the effects of statistical heterogeneity. Despite recent progress in this direction, it is still unclear if momentum can guarantee convergence under unbounded heterogeneity in decentralized scenarios, where only some workers participate at each round. In this work we analyze momentum under cyclic client participation, and theoretically prove that it remains inevitably affected by statistical heterogeneity. Similarly to SGD, we prove that decreasing step-sizes do not help either: in fact, any schedule decreasing faster than $\Theta(1/t)$ leads to convergence to a constant value that depends on the initialization and the heterogeneity bound. Numerical results corroborate the theory, and deep learning experiments confirm its relevance for realistic settings.

📄 Content

Modern deep learning applications demand intensive training on large amounts of data, often distributed across decentralized silos or users' personal devices. To address such system constraints and comply with data regulations, learning algorithms have evolved towards more advanced and flexible systems that enable decentralized training at a global scale. In such systems, not all workers participate at each training step, due to local faults, network issues, or simply temporary unavailability. Moreover, workers usually cannot exchange their data, for efficiency or privacy reasons. These are the main premises of Federated Learning (FL), a paradigm focused on privacy-preserving training from decentralized data. Algorithms of this kind usually consist of an iterative two-step process: 1) local training on the client side, each client on its own private data, and 2) global optimization at the server, using the aggregated local updates. While this scheme promotes efficiency through looser synchronization, statistical heterogeneity among clients' data and partial client participation expose the optimization to client drift and biased server updates.

Aiming for an effective solution to these problems, research has recently shifted towards extending momentum [1] to distributed algorithms. For example, a plethora of momentum-based FL algorithms have been proposed to overcome the adverse effects of data heterogeneity [2]-[9]. Similarly, momentum is appealing in distributed learning to reduce the overall communication overhead [10], and has recently been scaled up to more decentralized environments [11]. However, on a theoretical level, we only have a partial understanding of how momentum affects convergence in a decentralized regime. [8] proved that momentum can converge under unbounded heterogeneity when all clients participate at each round (full participation). [9] went a step further, proposing a novel Generalized Heavy-Ball Momentum (GHBM) formulation that achieves the same convergence guarantees under a more general cyclic partial participation assumption. Yet, it is unclear whether the same result extends to classical momentum under the same cyclic partial participation assumption without bounded heterogeneity.

This work provides a clear answer to this question: can (classical) momentum enable convergence under unbounded heterogeneity in decentralized settings with partial participation?

The answer is negative: even with (classical) momentum, the convergence rate still depends on the heterogeneity bound. This further confirms that GHBM [9] is, to the best of our knowledge, the only momentum-based distributed algorithm circumventing this limitation.

Contributions. We summarize our main results below.

• We formally prove that, under cyclic sampling of clients, momentum does not eliminate the effect of data heterogeneity, a well-recognized problem in decentralized and federated learning.

• We further consider decreasing step-sizes, revealing that any schedule decaying faster than Θ(1/t) leads to convergence to a constant depending on the initialization and on the heterogeneity bound.

• We validate the theory with numerical results on our theoretical problem, and extend the experimentation to deep learning problems, showing the relevance of our findings for realistic scenarios.
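The second point can be illustrated with a minimal numerical sketch (our own toy example, not the paper's construction): plain gradient descent on the quadratic $f(x) = x^2/2$ with the fast-decaying schedule $\eta_t = 1/t^2$. Since the iterate satisfies $x_{t+1} = (1 - \eta_t) x_t$, the limit is $x_0 \prod_t (1 - 1/t^2) = x_0/2$, a constant determined by the initialization rather than the minimizer $x^* = 0$.

```python
# Toy sketch: gradient descent on f(x) = x^2/2 with eta_t = 1/t^2,
# a schedule decaying faster than Theta(1/t).
# The recursion x_{t+1} = (1 - eta_t) x_t converges to
# x_0 * prod_{t>=2} (1 - 1/t^2) = x_0 / 2, not to the minimizer x* = 0.

def gd_fast_decay(x0: float, num_steps: int) -> float:
    x = x0
    for t in range(2, num_steps + 2):  # start at t = 2 so that eta_t < 1
        eta = 1.0 / t**2               # decays faster than Theta(1/t)
        x = x - eta * x                # gradient of x^2/2 is x
    return x

x_final = gd_fast_decay(x0=1.0, num_steps=100_000)
print(x_final)  # ~0.5: the limit depends on x_0, not on the optimum x* = 0
```

Doubling the initialization doubles the limit, matching the claim that the constant depends on the initialization.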

Gradient Descent (GD) and its variants have long been an object of study in the context of finite-sum optimization problems. By restricting the gradient computation to a single function component (i.e., a small subset of the data) at each iteration, these methods trade noisier updates for computational efficiency. Most analyses address SGD or shuffling gradient methods [12]-[14]. [15] provides sharp lower bounds on SGD with decreasing step-sizes, while [16] proves dimension-independent lower bounds over all possible sequences of diminishing step-sizes. The recent work of [17] studies the convergence rate of Incremental Gradient Descent (IGD) at small iteration counts.
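To make the IGD setting concrete, here is a minimal sketch (our own illustrative example, with hypothetical names) on a two-component finite sum $f(x) = \frac{1}{2}[(x-1)^2 + (x+1)^2]/2$, whose minimizer is $x^* = 0$. Cycling deterministically through the components with a constant step-size leaves a persistent $O(\eta)$ bias: the iterates settle onto a small cycle displaced from $x^*$.

```python
# Minimal Incremental Gradient Descent (IGD) sketch on two heterogeneous
# components f_i(x) = (x - a_i)^2 / 2 with a = (1, -1); the average objective
# is minimized at x* = 0. With a constant step-size, the end-of-cycle iterate
# converges to -eta / (2 - eta), a nonzero O(eta) bias away from x*.

def igd(eta: float, num_cycles: int, targets=(1.0, -1.0)) -> float:
    x = 0.0
    for _ in range(num_cycles):
        for a in targets:          # deterministic cyclic order, not sampling
            x -= eta * (x - a)     # gradient of (x - a)^2 / 2 is (x - a)
    return x

x_end = igd(eta=0.1, num_cycles=1_000)
print(x_end)  # ~ -0.0526 = -eta/(2 - eta): biased away from x* = 0
```

The bias shrinks with $\eta$, which is exactly why decreasing step-sizes are the natural remedy to examine, and why their failure modes (schedules faster than Θ(1/t)) matter.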

While in all cases a heterogeneity bound is necessary, the above works consider algorithms without momentum. Since momentum has been proved to have a variance-reduction effect [18], it is not clear i) whether the fundamental reliance on the heterogeneity bound remains even with momentum, and ii) whether decreasing step-sizes play a role. In this work we analyze the simplest setting in which momentum could intuitively bring an advantage w.r.t. heterogeneous objectives: as we show, this corresponds to an instance of the IGD algorithm with momentum.

We study the effect of momentum in heterogeneous settings by considering a minimal setup with two heterogeneous clients. Our analysis models the algorithm dynamics as a discrete-time linear system, and it reveals a clear decomposition: the zero-input response captures the objective components shared by all clients, while the zero-state response isolates the heterogeneous ones. This formulation unveils the source of the convergence limitations.
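The decomposition can be sketched numerically with a toy two-client instance (our own illustrative setup, not the paper's exact system): heavy-ball momentum applied incrementally to $f_i(x) = (x - a_i)^2/2$ with cyclic client visits. Each step is a linear system $s_{t+1} = A s_t + b_t$ with state $s = (m, x)$: the zero-input part decays (the eigenvalues of $A$ lie inside the unit circle), but with $a_1 \neq a_2$ the heterogeneous forcing $b_t$ keeps driving the state, so the iterates never settle at the shared optimum.

```python
# Heavy-ball momentum run incrementally over two clients with objectives
# f_i(x) = (x - a_i)^2 / 2, visited cyclically. When the targets coincide
# (homogeneous clients), the zero-input response decays and x converges.
# When they differ, the zero-state (forced) response persists: x oscillates
# at a fixed distance from the shared optimum x* = 0.

def momentum_igd(targets, eta=0.1, beta=0.5, num_cycles=1_000):
    x, m = 1.0, 0.0
    for _ in range(num_cycles):
        for a in targets:              # cyclic client participation
            m = beta * m + (x - a)     # heavy-ball momentum buffer
            x = x - eta * m
    return x

x_homog = momentum_igd(targets=(0.0, 0.0))   # homogeneous: converges to 0
x_heter = momentum_igd(targets=(1.0, -1.0))  # heterogeneous: residual offset

print(abs(x_homog))  # ~0: the zero-input response vanishes
print(abs(x_heter))  # ~0.034: the forced response never dies out
```

Shrinking `eta` shrinks the residual offset but, as the theory above indicates, no fixed step-size removes it, and schedules decaying faster than Θ(1/t) freeze the iterates before the bias is eliminated.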

This content is AI-processed based on ArXiv data.
