Distributed Learning in Multi-Armed Bandit with Multiple Players
We formulate and study a decentralized multi-armed bandit (MAB) problem. There are M distributed players competing for N independent arms. Each arm, when played, offers i.i.d. reward according to a distribution with an unknown parameter. At each time, each player chooses one arm to play without exchanging observations or any information with other players. Players choosing the same arm collide, and, depending on the collision model, either no one receives reward or the colliding players share the reward in an arbitrary way. We show that the minimum system regret of the decentralized MAB grows with time at the same logarithmic order as in the centralized counterpart where players act collectively as a single entity by exchanging observations and making decisions jointly. A decentralized policy is constructed to achieve this optimal order while ensuring fairness among players and without assuming any pre-agreement or information exchange among players. Based on a Time Division Fair Sharing (TDFS) of the M best arms, the proposed policy is constructed and its order optimality is proven under a general reward model. Furthermore, the basic structure of the TDFS policy can be used with any order-optimal single-player policy to achieve order optimality in the decentralized setting. We also establish a lower bound on the system regret growth rate for a general class of decentralized polices, to which the proposed policy belongs. This problem finds potential applications in cognitive radio networks, multi-channel communication systems, multi-agent systems, web search and advertising, and social networks.
💡 Research Summary
The paper tackles a decentralized multi‑armed bandit (MAB) problem in which M distributed players compete for N independent arms. Each arm yields i.i.d. rewards drawn from a distribution with an unknown parameter θᵢ. At every round, each player selects a single arm without any communication or observation sharing with the others. When two or more players choose the same arm, a collision occurs; depending on the collision model, the reward is either completely lost (collision‑loss model) or arbitrarily split among the colliding players (collision‑share model).
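To make the two collision models concrete, here is a minimal simulation sketch of a single round. The function name `play_round`, the even split under the share model, and the Bernoulli rewards are our illustrative assumptions, not the paper's definitions (the paper allows an arbitrary sharing rule and a general reward model).

```python
import random

def play_round(choices, means, collision_share=False):
    """One round of the decentralized MAB. `choices[i]` is the arm picked
    by player i; `means` are the (unknown to the players) Bernoulli reward
    means. Returns the reward credited to each player."""
    rewards = [0.0] * len(choices)
    for arm in set(choices):
        colliders = [i for i, a in enumerate(choices) if a == arm]
        payoff = 1.0 if random.random() < means[arm] else 0.0
        if len(colliders) == 1:
            rewards[colliders[0]] = payoff
        elif collision_share:
            # collision-share model: split the realized reward among the
            # colliding players (here: evenly, as one arbitrary choice)
            share = payoff / len(colliders)
            for i in colliders:
                rewards[i] = share
        # collision-loss model: colliding players receive nothing
    return rewards
```

Under the loss model, a round where players 0 and 1 both pick arm 1 credits reward only to the lone player on arm 0.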
The authors define system regret as the difference between the cumulative expected reward obtained by the M players and the reward that would be achieved if the M best arms were perfectly allocated to the players from the start. In the classic centralized MAB setting, the optimal regret grows logarithmically with time, i.e., Θ(log T). The main contribution of the paper is to show that the same logarithmic order is achievable even in the fully decentralized setting, where no pre‑agreement or information exchange is allowed.
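The regret definition above can be sketched as a small helper that compares realized rewards against the ideal benchmark. The function name and interface are ours for illustration; the benchmark term follows the definition in the summary: in every round the ideal allocation earns the sum of the M largest arm means.

```python
def system_regret(means, reward_history, horizon):
    """Empirical system regret after `horizon` rounds. The benchmark
    assigns the M best arms to the M players from the start, so its
    expected total reward is horizon * (sum of the M largest means).
    `reward_history[t][i]` is the reward player i received at round t."""
    M = len(reward_history[0])                      # number of players
    best_sum = sum(sorted(means, reverse=True)[:M]) # M best arm means
    earned = sum(sum(r) for r in reward_history[:horizon])
    return horizon * best_sum - earned
```

For example, two players who always earn the two best means (0.9 and 0.5) incur zero regret, while a round of all-zero rewards costs the full benchmark value 1.4.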
To achieve this, they propose the Time Division Fair Sharing (TDFS) policy. TDFS works in two layers. First, each player runs any order‑optimal single‑player bandit algorithm (e.g., UCB, KL‑UCB, Thompson Sampling) to learn the arm qualities and identify the set of M best arms. Second, the time axis is partitioned into M sub‑slots. In sub‑slot k (k = 0, …, M−1), every player follows a deterministic cyclic order over the identified M best arms: player i selects the arm ranked (k + i) mod M. Because the players' offsets are distinct, no two players target the same arm within a sub‑slot, and because the cyclic shift advances with k, over time each player receives each of the M best arms with equal frequency.
The paper proves three key theoretical results. (1) Order‑optimality: When TDFS is combined with any order‑optimal single‑player algorithm, the resulting system regret is bounded by O(log T), matching the centralized lower bound. (2) Lower bound for decentralized policies: For a broad class of decentralized strategies (including TDFS), the authors establish a regret lower bound of Ω(log T), showing that TDFS is essentially optimal. (3) Fairness: Because the cyclic schedule guarantees that each player accesses each of the M best arms the same number of times, the long‑run average reward per player is identical, providing a strong notion of fairness without any coordination.
The analysis holds under a very general reward model: arms may follow any sub‑Gaussian or bounded distribution, and the collision reward function can be either zero (loss) or any arbitrary sharing rule. Moreover, the TDFS framework is modular; any single‑player algorithm that is regret‑optimal can be plugged in, preserving the overall optimality.
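As one concrete example of a plug‑in single‑player rule, the classic UCB1 index (Auer et al.) is order‑optimal for bounded rewards and could serve as the learning layer inside TDFS; each player would rank arms by this index to form its estimate of the M best arms. The function below is our illustrative sketch, not code from the paper.

```python
import math

def ucb1_index(mean_estimate, pulls, t):
    """UCB1 index for one arm at round t: empirical mean plus an
    exploration bonus that shrinks as the arm is pulled more often.
    An unpulled arm gets an infinite index, forcing initial exploration."""
    if pulls == 0:
        return float("inf")
    return mean_estimate + math.sqrt(2.0 * math.log(t) / pulls)
```

Because TDFS only requires the single‑player layer to be order‑optimal, swapping this index for KL‑UCB or Thompson Sampling leaves the O(log T) system‑regret guarantee intact.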
Simulation experiments on cognitive radio and multi‑channel communication scenarios illustrate that TDFS dramatically outperforms naïve decentralized baselines such as random arm selection or independent UCB without coordination. The results confirm both the logarithmic regret scaling and the equitable distribution of rewards among players across a variety of M/N ratios, collision models, and underlying single‑player policies.
In conclusion, the work demonstrates that even in a completely communication‑free multi‑agent environment, the collective learning performance can be as good as that of a centrally coordinated system. The TDFS policy offers a simple, implementable solution with provable guarantees of optimal regret growth, fairness, and robustness to different reward and collision models. Potential applications span cognitive radio networks, multi‑channel wireless systems, online advertising allocation, web search, and broader multi‑agent coordination problems. Future directions suggested include extensions to non‑stationary environments, limited or intermittent communication, and adaptive sub‑slot scheduling for dynamic arm sets.