A Distributed Frank-Wolfe Algorithm for Communication-Efficient Sparse Learning
Learning sparse combinations is a frequent theme in machine learning. In this paper, we study its associated optimization problem in the distributed setting, where the elements to be combined are not centrally located but spread over a network. We address the key challenges of balancing communication costs and optimization errors. To this end, we propose a distributed Frank-Wolfe (dFW) algorithm. We obtain theoretical guarantees on the optimization error $\epsilon$ and communication cost that do not depend on the total number of combining elements. We further show that the communication cost of dFW is optimal by deriving a lower bound on the communication cost required to construct an $\epsilon$-approximate solution. We validate our theoretical analysis with empirical studies on synthetic and real-world data, which demonstrate that dFW outperforms both baselines and competing methods. We also study the performance of dFW when the conditions of our analysis are relaxed, and show that dFW is fairly robust.
💡 Research Summary
The paper addresses the problem of learning a sparse linear combination of “atoms” (features, dictionary elements, etc.) when these atoms are distributed across a network of machines. Formally, given a matrix A ∈ ℝ^{d×n} whose columns are the atoms and a convex, differentiable loss g, the goal is to minimize f(α)=g(Aα) subject to an ℓ₁‑norm constraint ‖α‖₁ ≤ β. In a centralized setting the Frank‑Wolfe (FW) algorithm solves this problem by repeatedly selecting the coordinate with the largest absolute gradient entry and performing a convex combination update; it converges to an ε‑approximate solution in O(1/ε) iterations, producing a solution with O(1/ε) non‑zero entries (a coreset).
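The centralized FW iteration described above can be sketched in a few lines. The sketch below assumes a least-squares loss g(z) = ½‖z − y‖² as a concrete instance (the paper allows any convex, differentiable g); the function name and step-size schedule γ = 2/(k+2) are the standard textbook choices, not taken verbatim from the paper.

```python
import numpy as np

def frank_wolfe_l1(A, y, beta, n_iters=100):
    """Centralized Frank-Wolfe for min_alpha g(A @ alpha) with ||alpha||_1 <= beta,
    illustrated with the least-squares loss g(z) = 0.5 * ||z - y||^2."""
    d, n = A.shape
    alpha = np.zeros(n)
    for k in range(n_iters):
        grad = A.T @ (A @ alpha - y)        # gradient of f(alpha) = g(A @ alpha)
        j = int(np.argmax(np.abs(grad)))    # coordinate with largest |gradient| entry
        s = np.zeros(n)
        s[j] = -np.sign(grad[j]) * beta     # best vertex of the l1 ball of radius beta
        gamma = 2.0 / (k + 2)               # standard FW step size
        alpha = (1 - gamma) * alpha + gamma * s
    return alpha
```

Because each update moves toward a single vertex of the ℓ₁ ball, after k iterations the iterate has at most k non-zero entries, which is the coreset property mentioned above.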
When the atoms are spread over N nodes connected by an undirected graph G, naïvely gathering the full gradient at each iteration would incur prohibitive communication costs, especially when n (the total number of atoms) is huge. The authors propose a Distributed Frank‑Wolfe algorithm (dFW) that preserves the essential structure of FW while drastically reducing communication. In each round: (1) every node computes the index j_i(k) of the largest‑magnitude component of its local gradient and broadcasts the corresponding gradient value g_i(k) and a partial sum S_i(k) used for the stopping criterion; (2) all nodes determine the node i(k) with the globally largest |g_i(k)|, and that node broadcasts the selected atom a_{j(k)} and its index; (3) all nodes perform the standard FW update α←(1−γ)α+γ·sgn(−g_{i(k)}(k))·β·e_{j(k)}. The communication per round consists of a few real numbers plus the selected atom (a vector of dimension d).
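The three-step round above can be simulated on one machine to make the communication pattern explicit. The sketch below again assumes a least-squares loss (so ∇g(z) = z − y) and represents each node by its local atom matrix and coefficient block; the function and variable names are illustrative, not from the paper.

```python
import numpy as np

def dfw_round(local_A, local_alpha, z, y, beta, gamma):
    """One simulated dFW round. local_A[i] is node i's d x n_i atom matrix,
    local_alpha[i] its coefficient block (updated in place); every node
    maintains z = A @ alpha. Least-squares loss, so grad of g at z is z - y."""
    grad_z = z - y
    # Step 1: each node finds its locally best atom; only scalars are shared.
    best = []
    for A_i in local_A:
        g_i = A_i.T @ grad_z                # local block of the gradient
        j_i = int(np.argmax(np.abs(g_i)))
        best.append((abs(g_i[j_i]), g_i[j_i], j_i))
    # Step 2: all nodes agree on the winning node; it broadcasts its atom.
    i_star = max(range(len(best)), key=lambda i: best[i][0])
    _, g_star, j_star = best[i_star]
    atom = local_A[i_star][:, j_star]       # the only d-dimensional vector sent
    # Step 3: every node applies the same FW update to its own block.
    for a in local_alpha:
        a *= (1 - gamma)
    local_alpha[i_star][j_star] += gamma * (-np.sign(g_star)) * beta
    z = (1 - gamma) * z + gamma * (-np.sign(g_star)) * beta * atom
    return z
```

Note that only the per-node scalars in `best` and the single winning atom would cross the network; everything else is node-local, matching the O(d + N) numbers exchanged per round.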
Theoretical analysis yields two main results. Theorem 2 shows that dFW terminates after O(1/ε) rounds and uses a total of O((B·d + N·B)/ε) real numbers, where B is the cost of broadcasting a single scalar. Crucially, the bound does not depend on n, meaning the algorithm scales to massive atom sets. The authors also prove a matching lower bound of Ω(d/ε) on the communication required by any deterministic algorithm to obtain an ε‑approximation, establishing that dFW is optimal in the communication‑complexity sense.
To mitigate synchronization bottlenecks caused by heterogeneous computational resources, the paper introduces an approximate variant. Each node clusters its local atoms using the greedy m‑center algorithm, retaining only the cluster centers as candidate atoms. The FW steps are then performed on this reduced set. Lemma 1 shows that if the cluster radius r_opt(m) decreases as O(1/k), the additional error introduced by the approximation is O(1/k), which does not affect the overall ε‑optimality. This variant reduces per‑iteration computation and balances load across nodes, while preserving the same communication guarantees.
The authors illustrate three concrete applications: (i) distributed‑feature LASSO regression, where each node holds a subset of features for all training examples; (ii) kernel SVM with distributed examples, where the atoms live in a high‑dimensional (or infinite‑dimensional) kernel space but only the original training points need to be exchanged; and (iii) ℓ₁‑AdaBoost. In all cases the algorithm selects a single atom per iteration, keeping the model sparse.
Extensive experiments on synthetic data and real‑world datasets (including a sensor‑network LASSO and a large‑scale kernel SVM) compare dFW against baseline local‑greedy schemes and ADMM. Results demonstrate that dFW achieves the same prediction accuracy while transmitting significantly fewer bytes, especially when the solution is sparse. In a realistic distributed cluster, dFW attains notable speed‑ups over a centralized implementation and remains robust under random communication drops and asynchronous updates. The approximate variant further cuts runtime on heterogeneous clusters without sacrificing accuracy.
In summary, the paper contributes a communication‑optimal distributed algorithm for sparse learning, provides rigorous upper and lower bounds that match, and validates the method on practical large‑scale problems. The work bridges the gap between the elegant theoretical properties of Frank‑Wolfe and the practical constraints of modern distributed machine‑learning systems, offering a ready‑to‑use tool for scenarios where bandwidth is limited but sparsity is essential.