Effectiveness of Distributed Gradient Descent with Local Steps for Overparameterized Models

Notice: This research summary and analysis were automatically generated using AI. For absolute accuracy, please refer to the Original Paper Viewer below or the original arXiv source.

In distributed training of machine learning models, gradient descent with local iterative steps, commonly known as Local (Stochastic) Gradient Descent (Local-(S)GD) or Federated Averaging (FedAvg), is a popular method for mitigating the communication burden. In this method, gradient steps based on local datasets are taken independently at distributed compute nodes to update the local models, which are then aggregated intermittently. In the interpolation regime, Local-GD can converge to zero training loss. However, with many potential solutions corresponding to zero training loss, it is not known which solution Local-GD converges to. In this work, we answer this question by analyzing the implicit bias of Local-GD for classification tasks with linearly separable data. For the interpolation regime, our analysis shows that the aggregated global model obtained from Local-GD, with an arbitrary number of local steps, converges exactly "in direction" to the model that would be obtained if all data were in one place (the centralized model). Our result gives the exact rate of convergence to the centralized model with respect to the number of local steps. We also obtain the same implicit bias with a learning rate independent of the number of local steps using a modified version of the Local-GD algorithm. Our analysis provides a new view of why Local-GD can still perform well with a very large number of local steps, even for heterogeneous data. Lastly, we discuss the extension of our results to Local-SGD and non-separable data.


💡 Research Summary

This paper investigates the implicit bias of distributed training algorithms that perform multiple local updates before communication, specifically Local Gradient Descent (Local‑GD) and its stochastic variant (Local‑SGD), in the over‑parameterized regime. The authors focus on binary classification with linearly separable data, where many parameter vectors achieve zero training loss, and ask which of these solutions the global model converges to when using Local‑GD with an arbitrary number of local steps \(L\).

The main contributions are:

  1. Implicit bias characterization – Building on the classic result that centralized gradient descent on linearly separable data converges to the max‑margin direction (Soudry et al., 2018), the authors prove that the aggregated global model obtained by Local‑GD also converges to the same direction, regardless of the number of local steps. The direction converges at rate \(O(1/\log(Lk))\), where \(k\) is the communication round, and the training loss decays as \(O(1/(Lk))\) when the learning rate is set to \(\eta = \Theta(1/L)\).

  2. Learning‑rate‑independent variant – By solving each local sub‑problem (or a weakly regularized version) to near optimality, a modified Local‑GD algorithm achieves the same implicit bias without requiring \(\eta\) to depend on \(L\). This shows that massive local computation can be performed without sacrificing the final solution.

  3. Extension to Local‑SGD – When each local step samples a mini‑batch without replacement, the same bias holds because each mini‑batch is a subset of the global data, preserving the projection structure used in the analysis.

  4. Linear regression intuition – In the over‑parameterized linear regression case, each node’s local GD converges to the minimum‑norm interpolant within its data subspace. The global model’s deviation from the centralized minimum‑norm solution is repeatedly projected onto the orthogonal complement of the union of the node subspaces, leading to exponential decay of the error (Theorem 1).

  5. Empirical validation – Experiments on synthetic over‑parameterized linear and logistic regression, as well as fine‑tuning the last layer of large language models, confirm the theoretical predictions. Even with heterogeneous data distributions and hundreds of local steps, the global model aligns with the centralized max‑margin solution and achieves comparable test performance.

  6. Practical implications – Prior convergence analyses for federated learning often require the number of local steps to be \(O(\sqrt{T})\) to retain optimal communication complexity. This work shows that in the over‑parameterized setting, that restriction can be lifted: arbitrarily many local updates still lead to the same implicit bias, explaining why federated averaging works well in practice even with large \(L\).
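The headline claim in items 1 and 6 is easy to probe numerically. Below is a minimal sketch (not the authors' code; the toy data, seed, and step counts are all illustrative assumptions) that runs Local-GD with \(L = 50\) local steps and \(\eta \propto 1/L\) on separable synthetic data, then checks that the aggregated global model points in nearly the same direction as centralized GD with the same total step budget:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative toy setup: 2 nodes, each with its own separable data slice.
d, n_per_node, nodes = 20, 10, 2
w_true = rng.normal(size=d)
X = [rng.normal(size=(n_per_node, d)) for _ in range(nodes)]
y = [np.sign(Xi @ w_true) for Xi in X]  # labels in {-1, +1}: linearly separable

def grad(w, Xi, yi):
    """Gradient of the mean logistic loss: -mean(y * x * sigmoid(-y * <w, x>))."""
    s = 1.0 / (1.0 + np.exp(yi * (Xi @ w)))  # sigmoid(-margin)
    return -(Xi * (yi * s)[:, None]).mean(axis=0)

def local_gd(rounds, L, eta):
    """FedAvg-style Local-GD: L local gradient steps per node, then average."""
    w = np.zeros(d)
    for _ in range(rounds):
        local_models = []
        for Xi, yi in zip(X, y):
            wi = w.copy()
            for _ in range(L):
                wi -= eta * grad(wi, Xi, yi)
            local_models.append(wi)
        w = np.mean(local_models, axis=0)  # intermittent aggregation
    return w

def centralized_gd(steps, eta):
    """GD on the pooled dataset, as if all data were in one place."""
    Xa, ya = np.vstack(X), np.concatenate(y)
    w = np.zeros(d)
    for _ in range(steps):
        w -= eta * grad(w, Xa, ya)
    return w

L = 50
w_local = local_gd(rounds=200, L=L, eta=0.1 / L)     # eta = Theta(1/L)
w_cent = centralized_gd(steps=200 * L, eta=0.1 / L)  # same total step budget

cos = w_local @ w_cent / (np.linalg.norm(w_local) * np.linalg.norm(w_cent))
print(f"cosine(local, centralized) = {cos:.4f}")
```

Increasing `L` while shrinking `eta` proportionally should leave the final direction essentially unchanged, which is the point of the paper's "arbitrary number of local steps" result.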

Overall, the paper provides a rigorous answer to the “which solution does Local‑GD converge to?” question, establishing that in over‑parameterized linear models the global model obtained after any number of local updates converges to the same max‑margin (or minimum‑norm) direction as centralized gradient descent. This insight bridges a gap between empirical success of federated learning with many local steps and its theoretical understanding, offering new guidance for algorithm design and hyper‑parameter selection in distributed and federated training scenarios.
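The projection argument behind the regression intuition (item 4) can also be sketched. Assuming, as a simplification of the learning-rate-independent variant, that each node solves its local least-squares problem exactly, a node's update from the current global model \(w\) is the minimum-norm correction \(X_i^{+}(y_i - X_i w)\), i.e. a projection onto that node's affine solution set; averaging these projections should shrink the deviation from the centralized minimum-norm interpolant geometrically. The dimensions and seed below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative over-parameterized setup: 2 nodes, 5 samples each, 30 features.
d, n, nodes = 30, 5, 2
X = [rng.normal(size=(n, d)) for _ in range(nodes)]
y = [rng.normal(size=n) for _ in range(nodes)]

# Centralized minimum-norm interpolant of the pooled data.
Xa, ya = np.vstack(X), np.concatenate(y)
w_star = np.linalg.pinv(Xa) @ ya

def local_exact_solve(w, Xi, yi):
    """Limit of local GD run to convergence from w: the closest point to w
    that interpolates node i's data (projection onto an affine subspace)."""
    return w + np.linalg.pinv(Xi) @ (yi - Xi @ w)

w = np.zeros(d)
errs = []
for _ in range(100):
    # One communication round: exact local solves, then FedAvg aggregation.
    w = np.mean([local_exact_solve(w, Xi, yi) for Xi, yi in zip(X, y)], axis=0)
    errs.append(np.linalg.norm(w - w_star))

# Per-round contraction of the distance to the centralized min-norm solution.
ratios = [errs[t + 1] / errs[t] for t in range(len(errs) - 1)]
print(f"first-round error {errs[0]:.3e}, final error {errs[-1]:.3e}, "
      f"worst per-round ratio {max(ratios):.3f}")
```

Every per-round ratio stays below 1 because the error is repeatedly multiplied by an averaged projection whose spectral norm is strictly less than 1 on the relevant subspace, matching the exponential decay described in the summary of Theorem 1.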

