Near-Optimal Regret for Distributed Adversarial Bandits: A Black-Box Approach
We study distributed adversarial bandits, where $N$ agents cooperate to minimize the global average loss while observing only their own local losses. We show that the minimax regret for this problem is $\tilde{\Theta}(\sqrt{(\rho^{-1/2}+K/N)T})$, where $T$ is the horizon, $K$ is the number of actions, and $\rho$ is the spectral gap of the communication matrix. Our algorithm, based on a novel black-box reduction to bandits with delayed feedback, requires agents to communicate only through gossip. It achieves an upper bound that significantly improves over the previous best bound of $\tilde{O}(\rho^{-1/3}(KT)^{2/3})$ due to Yi and Vojnovic (2023). We complement this result with a matching lower bound, showing that the problem's difficulty decomposes into a communication cost $\rho^{-1/4}\sqrt{T}$ and a bandit cost $\sqrt{KT/N}$. We further demonstrate the versatility of our approach by deriving first-order and best-of-both-worlds bounds in the distributed adversarial setting. Finally, we extend our framework to distributed linear bandits in $\mathbb{R}^d$, obtaining a regret bound of $\tilde{O}(\sqrt{(\rho^{-1/2}+1/N)dT})$, achieved with only $O(d)$ communication cost per agent per round via a volumetric spanner.
💡 Research Summary
The paper tackles the problem of distributed adversarial multi-armed bandits, where $N$ agents cooperate to minimize the average loss over the network while each agent observes only its own local loss. The authors first establish the minimax regret for this setting as $\tilde{\Theta}\big(\sqrt{(\rho^{-1/2}+K/N)T}\big)$, where $T$ is the horizon, $K$ the number of actions, and $\rho$ the spectral gap of the gossip matrix governing communication. This result shows that the difficulty splits into a communication term $\rho^{-1/4}\sqrt{T}$ and a classic bandit term $\sqrt{KT/N}$.
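To make the role of $\rho$ concrete, here is a small illustrative sketch (not code from the paper) that builds a doubly stochastic gossip matrix for a ring of agents and computes its spectral gap with NumPy; the ring topology and uniform $1/3$ weights are assumptions chosen for simplicity:

```python
import numpy as np

def ring_gossip_matrix(n):
    """Doubly stochastic gossip matrix for a ring of n agents:
    each agent averages equally with itself and its two neighbours."""
    W = np.zeros((n, n))
    for i in range(n):
        W[i, i] = 1.0 / 3
        W[i, (i - 1) % n] = 1.0 / 3
        W[i, (i + 1) % n] = 1.0 / 3
    return W

def spectral_gap(W):
    """rho = 1 - |lambda_2|: one minus the second-largest
    eigenvalue modulus of the gossip matrix W."""
    mods = np.sort(np.abs(np.linalg.eigvals(W)))[::-1]
    return 1.0 - mods[1]

rho = spectral_gap(ring_gossip_matrix(16))
```

Poorly connected topologies (long rings, paths) have small $\rho$, which inflates the communication term $\rho^{-1/4}\sqrt{T}$ in the bound; well-connected graphs have $\rho$ close to 1.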
To achieve the upper bound, the authors propose a novel black-box reduction that transforms the distributed bandit problem into a standard adversarial bandit problem with delayed feedback. The reduction works in blocks of length $B = \Theta(\rho^{-1/2}\log(KT))$. Within each block, agents keep their action distribution fixed, collect local loss estimates, and run $B$ rounds of accelerated gossip to mix the information from the previous block. At the end of the block, each agent possesses a high-precision approximation of the global average loss of the previous block, which is fed as delayed feedback to any bandit algorithm that handles delays. A tiny uniform exploration probability $\alpha = 1/T$ is added to keep the importance-weighted estimates bounded, ensuring that gossip can achieve the required accuracy without inflating the regret.
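The block structure can be sketched as follows. This is a toy simulation, not the paper's implementation: the block length, exploration level, and losses are illustrative, the gossip matrix is a complete graph for simplicity, and plain gossip averaging stands in for the accelerated gossip the paper uses:

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, B = 8, 5, 20              # agents, actions, block length (toy values)
alpha = 0.01                    # exploration probability (paper: alpha = 1/T)
W = np.full((N, N), 1.0 / N)    # gossip matrix; complete graph for simplicity

# Each agent keeps its action distribution fixed for the whole block.
p = np.full((N, K), 1.0 / K)
p_play = (1 - alpha) * p + alpha / K    # mix in uniform exploration

# Phase 1: play B rounds, accumulating local importance-weighted estimates.
est = np.zeros((N, K))
for _ in range(B):
    local_losses = rng.random((N, K))   # toy local losses in [0, 1]
    for i in range(N):
        a = rng.choice(K, p=p_play[i])
        est[i, a] += local_losses[i, a] / p_play[i, a]
est /= B

# Phase 2: B gossip rounds mix the estimates toward the network average.
mixed = est.copy()
for _ in range(B):
    mixed = W @ mixed

# mixed[i] now approximates the block's global average loss estimate and
# is handed to agent i's base learner as one round of delayed feedback.
```

With $B = \Theta(\rho^{-1/2}\log(KT))$ accelerated-gossip steps, the mixing error is driven below the scale of the (exploration-bounded) estimates, which is what lets the delayed-feedback analysis absorb it.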
Because the reduction is algorithm‑agnostic, plugging in any delayed‑feedback bandit algorithm (e.g., the delayed‑feedback version of FTRL) yields the desired regret bound. Moreover, by choosing appropriate regularizers the framework automatically provides first‑order (small‑loss) guarantees and best‑of‑both‑worlds bounds that interpolate between adversarial and stochastic regimes.
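As a hypothetical illustration of the plug-in step, the sketch below runs an exponential-weights learner (FTRL with the entropy regularizer) on feedback that arrives one block late; the buffer, step size, and loss estimates are illustrative assumptions, not the paper's tuning:

```python
import numpy as np
from collections import deque

def exp_weights_step(p, loss_est, eta):
    """One exponential-weights update on a loss-estimate vector."""
    w = p * np.exp(-eta * loss_est)
    return w / w.sum()

K, eta, delay = 5, 0.1, 1        # toy parameters; delay = 1 block
p = np.full(K, 1.0 / K)          # current action distribution
buffer = deque()                 # feedback still in flight

rng = np.random.default_rng(1)
for block in range(10):
    buffer.append(rng.random(K))           # estimate produced this block
    if len(buffer) > delay:                # feedback from `delay` blocks ago
        p = exp_weights_step(p, buffer.popleft(), eta)
```

Because the reduction is black-box, swapping `exp_weights_step` for a different delay-tolerant regularized leader (e.g. one with a log-barrier or hybrid regularizer) is what yields the first-order and best-of-both-worlds variants mentioned above.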
A matching lower bound is proved via a construction that combines a hard communication instance (forcing a $\rho^{-1/4}\sqrt{T}$ penalty) with a standard bandit lower bound (forcing $\sqrt{KT/N}$). Hence the upper bound is optimal up to logarithmic factors.
The authors further extend the methodology to distributed linear bandits in $\mathbb{R}^d$. Each agent communicates only $O(d)$-dimensional messages (using a volumetric spanner), and the regret becomes $\tilde{O}\big(\sqrt{(\rho^{-1/2}+1/N)dT}\big)$ with a matching $\Omega\big(\sqrt{\rho^{-1/2}dT}\big)$ lower bound.
Related work is surveyed, highlighting that prior results for distributed bandits with gossip (Yi & Vojnovic, 2023) achieved only $\tilde{O}(\rho^{-1/3}(KT)^{2/3})$, far from the optimal $\tilde{O}(\sqrt{T})$ rate known for full-information distributed optimization. The paper shows that the suboptimal rate stemmed from updating policies every round while gossiping insufficiently, leading to large variance. By decoupling learning from communication through block-wise delays, the variance is controlled and the mixing error becomes negligible.
Empirical simulations, where presented, indicate that the proposed algorithm outperforms previous methods across various network topologies (complete, ring, random) and scales well with the number of agents and actions.
In summary, the work delivers a near‑optimal regret characterization for distributed adversarial bandits, introduces a versatile black‑box reduction to delayed feedback, provides adaptive and best‑of‑both‑worlds guarantees, and extends the results to high‑dimensional linear bandits—all while using only gossip communication with minimal per‑round bandwidth. This advances the theoretical understanding of decentralized learning under bandit feedback and offers a practical blueprint for federated or edge‑learning systems where communication is limited.