Optimal Adaptive Learning in Uncontrolled Restless Bandit Problems
In this paper we consider the problem of learning the optimal policy for uncontrolled restless bandit problems. In an uncontrolled restless bandit problem, there is a finite set of arms, each of which yields a positive reward when pulled. A player sequentially selects one of the arms at each time step, with the goal of maximizing its undiscounted reward over a time horizon T. The reward process of each arm is a finite-state Markov chain whose transition probabilities are unknown to the player, and the state transitions of each arm are independent of the player's selections. We propose a learning algorithm with logarithmic regret uniformly over time with respect to the optimal finite-horizon policy. Our results extend optimal adaptive learning from MDPs to POMDPs.
💡 Research Summary
The paper tackles the learning problem for uncontrolled restless bandit problems (URBP), a class of partially observable Markov decision processes (POMDPs) in which each of the K arms evolves according to its own finite-state Markov chain, independently of the player's actions. The player observes only the state (and reward) of the arm it selects at each discrete time step and aims to maximize the undiscounted cumulative reward over a finite horizon T. Unlike the classic "weak regret" setting that compares against the best static arm, this work adopts a "strong regret" metric: the difference between the cumulative reward of the learning algorithm and that of the optimal dynamic policy that would be chosen if the true transition matrices were known.
The authors first review related literature, noting that most prior work on restless bandits focuses on weak regret and that logarithmic regret lower bounds are known for i.i.d. bandits. They also discuss optimal adaptive learning for fully observable MDPs, where logarithmic strong regret has been achieved, but point out that extending these results to POMDPs is non‑trivial because the information state space is uncountably infinite and irreducibility may not hold.
The problem formulation introduces the notation: each arm k has state space S_k, transition matrix P_k, stationary distribution π_k, and reward equal to the state value (r_k(x)=x) for simplicity. The system state is the Cartesian product of all arm states. Strong regret R_T is defined as the expected difference between the optimal finite‑horizon policy (with full knowledge of P) and the policy generated by the learning algorithm up to time T.
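The formulation above can be made concrete with a minimal simulation sketch. The class and method names below are illustrative, not from the paper; the only fidelity claimed is the uncontrolled dynamics (every arm transitions whether or not it is pulled) and the simplification r_k(x) = x.

```python
import numpy as np

class URBP:
    """Illustrative uncontrolled restless bandit: K arms, each a finite-state
    Markov chain that evolves regardless of the player's choice. The player
    observes only the state of the pulled arm; reward equals that state value,
    matching the paper's simplification r_k(x) = x."""

    def __init__(self, transition_matrices, rng=None):
        self.P = [np.asarray(P) for P in transition_matrices]
        self.rng = rng if rng is not None else np.random.default_rng(0)
        self.states = [0 for _ in self.P]  # start each arm in state 0

    def step(self, arm):
        # Uncontrolled: every arm transitions, pulled or not.
        for k, P in enumerate(self.P):
            self.states[k] = self.rng.choice(len(P), p=P[self.states[k]])
        # Only the pulled arm's new state is revealed (and rewarded).
        x = self.states[arm]
        return x, float(x)

# Two arms with two-state chains (states are 0 and 1).
P1 = [[0.9, 0.1], [0.2, 0.8]]
P2 = [[0.5, 0.5], [0.5, 0.5]]
env = URBP([P1, P2])
obs, reward = env.step(arm=0)
```

Note that the system state is the Cartesian product of the per-arm states, but the player's information is only a partial observation of it, which is what makes this a POMDP.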
The core contribution is a model‑based learning algorithm that alternates between exploration epochs and exploitation epochs. During exploration, the algorithm plays a selected arm for a geometrically increasing number of consecutive steps, thereby collecting enough state transition samples to estimate each P_k. Hoeffding‑type concentration inequalities are used to construct confidence intervals for the estimated transition probabilities; the width of these intervals shrinks as O(√(log t / n_k(t))) where n_k(t) is the number of observed transitions for arm k up to time t.
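The exploration bookkeeping can be sketched as follows. This is a hedged illustration: the empirical row estimates and the O(√(log t / n)) confidence radius follow the shape described above, but the constant L in the radius is a placeholder, not a constant from the paper.

```python
import math

class TransitionEstimator:
    """Per-arm empirical transition estimates with Hoeffding-style
    confidence radii of order sqrt(log t / n). L is an illustrative
    constant standing in for the paper's concentration constants."""

    def __init__(self, num_states, L=2.0):
        self.counts = [[0] * num_states for _ in range(num_states)]
        self.L = L

    def record(self, x, y):
        # Observed one transition x -> y during an exploration epoch.
        self.counts[x][y] += 1

    def estimate(self, t):
        """Return (P_hat, radius): empirical row estimates and the
        per-row confidence radius at time t."""
        S = len(self.counts)
        P_hat, radius = [], []
        for x in range(S):
            n = sum(self.counts[x])
            if n == 0:
                P_hat.append([1.0 / S] * S)     # uninformative row
                radius.append(float("inf"))
            else:
                P_hat.append([c / n for c in self.counts[x]])
                radius.append(math.sqrt(self.L * math.log(t) / n))
        return P_hat, radius

est = TransitionEstimator(num_states=2)
for _ in range(50):          # 50 observed transitions from each state
    est.record(0, 1)
    est.record(1, 0)
P_hat, rad = est.estimate(t=100)
```

Playing an arm for geometrically increasing runs of consecutive steps is what makes n grow fast enough that only O(log T) exploration epochs are needed per arm.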
Exploitation relies on a finite partition of the information‑state space. The authors define a set of partitions {C_1,…,C_M} based on system parameters such as the minimum stationary probability π_min, the maximum reward r_max, and the largest state‑space size S_max. Within each partition the optimal policy computed from the current estimates is guaranteed to differ from the true optimal policy by at most ε, provided the partition granularity is chosen appropriately. This partitioning reduces the otherwise infinite belief space to a manageable finite set, enabling the algorithm to compute a “virtual optimal policy” efficiently.
The regret analysis proceeds in two parts. First, the loss incurred during exploration is bounded by O(K r_max S_max log T), because each arm needs only O(log T) exploration epochs to shrink its confidence interval sufficiently. Second, the approximation error introduced by the partitioning contributes at most a constant times ε T, which can be made sub‑logarithmic by selecting ε = O(1/T). Combining these yields an overall strong regret R_T = O(log T) when the algorithm knows an upper bound on a function of the system parameters (essentially a bound on π_min·r_max·S_max).
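The two-part bound above can be written compactly. The constants C_1, C_2 are illustrative placeholders standing in for the instance-dependent constants of the analysis:

```latex
R_T \;\le\; \underbrace{C_1\, K\, r_{\max}\, S_{\max} \log T}_{\text{exploration loss}}
\;+\; \underbrace{C_2\, \varepsilon\, T}_{\text{partition approximation}},
\qquad \varepsilon = \Theta(1/T) \;\Rightarrow\; R_T = O(\log T).
```

With ε = Θ(1/T) the second term is O(1), so the exploration term dominates and the overall strong regret is logarithmic in T.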
When such a bound is unavailable, the authors propose a variant that simultaneously estimates the bound while exploring. This leads to a “near‑logarithmic” regret of order O(log T · log log T). Additional variants are presented that either relax the need for the bound entirely (by adaptive epoch lengths) or reduce computational overhead by merging observed state sequences into continuous sample paths.
Extensive simulations compare the proposed algorithm against several baselines, including UCB‑type policies for restless bandits and Thompson‑sampling approaches. Results show that the new method consistently achieves lower cumulative regret and exhibits the predicted logarithmic growth, while the variants achieve comparable performance with reduced runtime, especially for larger numbers of arms and larger state spaces.
In conclusion, the paper delivers the first instance‑dependent logarithmic‑regret guarantee for strong regret in uncontrolled restless bandits, effectively extending optimal adaptive learning from fully observable MDPs to a subclass of POMDPs. The key technical tools are confidence‑interval‑driven exploration, finite partitioning of the belief space, and careful analysis of the exploration‑exploitation trade‑off. The work opens avenues for future research on removing the parameter‑bound assumption entirely, handling continuous state spaces, and applying the framework to real‑world problems such as dynamic spectrum access and target tracking where system dynamics are initially unknown.