Risk-Aware Decision Making in Restless Bandits: Theory and Algorithms for Planning and Learning
In restless bandits, a central agent must optimally distribute limited resources across several bandits (arms), each of which is a Markov decision process. In this work, we generalize the traditional restless bandits problem, which has a risk-neutral objective, by incorporating risk-awareness, which is particularly important in real-world applications where the decision maker seeks to mitigate downside risks. We establish indexability conditions for the risk-aware objective and provide, for the first time, a Whittle-index-based solution to the planning problem for both finite-horizon non-stationary and infinite-horizon stationary Markov decision processes. In addition, we address the learning problem when the true transition probabilities are unknown by proposing a Thompson sampling approach, and we show that it achieves bounded regret that scales sublinearly with the number of episodes and quadratically with the number of arms. The efficacy of our method in reducing risk exposure in restless bandits is illustrated through numerical experiments on machine replacement and patient scheduling applications under both planning and learning setups.
💡 Research Summary
This paper extends the classic restless bandit (RB) framework, which traditionally optimizes expected cumulative reward, by incorporating risk‑aware objectives. Each arm is modeled as a finite‑state Markov decision process (MDP) with two actions (passive/active). The authors allow a general, non‑decreasing, Lipschitz‑continuous utility function $U_i(\cdot)$ for each arm, turning the objective into the maximization of the expected utility of the cumulative reward, thereby capturing risk‑averse, risk‑neutral, or risk‑seeking attitudes.
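As a concrete illustration, the expected-utility objective for a single arm can be estimated by Monte Carlo. The sketch below uses a hypothetical two-state arm and an exponential (risk-averse) utility as one instance of the non-decreasing utilities the paper allows; all dynamics and numbers are invented for illustration, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-state arm: P[s, a] is the next-state distribution
# under action a (0 = passive, 1 = active); R[s, a] is the reward.
P = np.array([[[0.9, 0.1], [0.6, 0.4]],   # transitions from state 0
              [[0.3, 0.7], [0.8, 0.2]]])  # transitions from state 1
R = np.array([[0.0, 1.0],
              [0.2, 1.5]])

def exp_utility(total_reward, gamma=1.0):
    """Exponential risk-averse utility U(x) = (1 - exp(-gamma * x)) / gamma."""
    return (1.0 - np.exp(-gamma * total_reward)) / gamma

def expected_utility(policy, horizon=10, n_rollouts=2000, gamma=1.0):
    """Monte Carlo estimate of E[U(sum of rewards)] for a state-indexed policy."""
    totals = np.zeros(n_rollouts)
    for i in range(n_rollouts):
        s, total = 0, 0.0
        for _ in range(horizon):
            a = policy[s]
            total += R[s, a]
            s = rng.choice(2, p=P[s, a])
        totals[i] = total
    return exp_utility(totals, gamma).mean()

always_active = expected_utility([1, 1])
always_passive = expected_utility([0, 0])
```

With `gamma = 0` (in the limit) this recovers the risk-neutral objective; larger `gamma` penalizes variability in the cumulative reward more heavily.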
The first major contribution is the derivation of indexability conditions for this risk‑aware setting. By relaxing the hard constraint “at most M arms active per period” to an average‑activation constraint and introducing a Lagrange multiplier $\lambda$, the global problem decomposes into N independent sub‑problems. The authors prove that, when each utility function is monotone and satisfies a super‑additivity (or “L‑convex”) property, the optimal policy for a single arm switches monotonically from passive to active as $\lambda$ increases. This monotone switching guarantees indexability, enabling the definition of a risk‑aware Whittle index for each state‑time pair.
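The monotone-switching property behind indexability can be checked numerically for a single arm. The sketch below solves the $\lambda$-subsidy subproblem by backward induction, shown in the risk-neutral special case $U(x) = x$ with stationary, made-up dynamics for brevity, and records the optimal action at the initial state as the subsidy grows:

```python
import numpy as np

# Hypothetical single arm, 2 states, horizon T. The passive action earns
# an extra subsidy lam per step (risk-neutral special case U(x) = x).
P = {0: np.array([[0.9, 0.1], [0.3, 0.7]]),   # passive transitions
     1: np.array([[0.4, 0.6], [0.7, 0.3]])}   # active transitions
R = {0: np.array([0.0, 0.2]),                 # passive reward per state
     1: np.array([1.0, 1.5])}                 # active reward per state
T = 8

def optimal_actions(lam):
    """Backward induction; returns the action chosen in state 0 at each t."""
    V = np.zeros(2)
    acts = []
    for _ in range(T):                        # t = T-1 down to 0
        q_passive = R[0] + lam + P[0] @ V
        q_active = R[1] + P[1] @ V
        acts.append(int(q_active[0] > q_passive[0]))
        V = np.maximum(q_passive, q_active)
    return acts[::-1]                         # reorder so index 0 is t = 0

# As lam increases, the policy switches from active (1) to passive (0).
active_at_t0 = [optimal_actions(lam)[0] for lam in np.linspace(0.0, 3.0, 31)]
```

The subsidy value at which the switch occurs is exactly the Whittle index for that state-time pair.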
Two planning regimes are treated:
- **Finite‑horizon non‑stationary RB (FNRB)** – transition matrices and rewards may vary with time. The authors develop a dynamic‑programming recursion that computes, for each time step, the value of being active versus passive under a given $\lambda$. The index is the critical $\lambda$ at which the two actions become equally valuable.
- **Infinite‑horizon discounted RB** – the transition and reward functions are stationary, and a discount factor $\beta$ is applied. Here a fixed‑point Bellman equation is solved iteratively to obtain the discounted value function; the index is again the $\lambda$ that equalizes the action‑value functions.
Both constructions yield closed‑form or efficiently computable indices, preserving the scalability that made Whittle’s original heuristic attractive.
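For the infinite-horizon discounted regime, the index computation can be mimicked by value iteration combined with a bisection search on $\lambda$. Again this is a minimal risk-neutral sketch with hypothetical dynamics, not the authors' implementation:

```python
import numpy as np

# Hypothetical 2-state arm with discount beta (risk-neutral case U(x) = x).
P = {0: np.array([[0.9, 0.1], [0.3, 0.7]]),   # passive transitions
     1: np.array([[0.4, 0.6], [0.7, 0.3]])}   # active transitions
R = {0: np.array([0.0, 0.2]), 1: np.array([1.0, 1.5])}
beta = 0.9

def q_values(lam, iters=500):
    """Value iteration for the lam-subsidy single-arm problem."""
    V = np.zeros(2)
    for _ in range(iters):
        q0 = R[0] + lam + beta * P[0] @ V     # passive (with subsidy)
        q1 = R[1] + beta * P[1] @ V           # active
        V = np.maximum(q0, q1)
    return q0, q1

def whittle_index(state, lo=-5.0, hi=5.0, tol=1e-6):
    """Bisection on lam for the point where both actions are equally valuable.

    Assumes active is preferred at lo and passive at hi (valid here)."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        q0, q1 = q_values(mid)
        if q1[state] > q0[state]:             # active still wins: raise subsidy
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

idx0, idx1 = whittle_index(0), whittle_index(1)
```

At run time, the heuristic then simply activates the M arms whose current states carry the largest indices, which is what keeps the approach scalable in N.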
The second major contribution addresses learning when the transition probabilities are unknown. The authors adopt a Bayesian setting: each unknown transition probability has a prior distribution, and the performance metric is Bayesian regret, i.e., the expected gap between the learning policy and an oracle that knows the true model. They propose a Thompson Sampling (TS) algorithm that, at the beginning of each episode, samples a complete set of transition matrices from the current posterior, computes the corresponding risk‑aware Whittle indices, and selects the top‑M arms. After observing the outcomes, the posterior is updated.
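The episode loop of such a TS scheme might look as follows. This is a schematic with Dirichlet priors on each transition row and a placeholder scoring function standing in for the actual risk-aware Whittle index computation; all parameters are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: N arms, S states, A actions, M activations per step.
N, M, S, A = 4, 2, 2, 2
true_P = rng.dirichlet(np.ones(S), size=(N, S, A))   # unknown to the learner
counts = np.ones((N, S, A, S))                        # Dirichlet(1,...,1) priors
states = np.zeros(N, dtype=int)

def sample_model():
    """Draw one full transition model per arm from the current posterior."""
    return np.array([[[rng.dirichlet(counts[n, s, a])
                       for a in range(A)] for s in range(S)] for n in range(N)])

def indices(model, states):
    """Placeholder score standing in for the risk-aware Whittle index."""
    return np.array([model[n, states[n], 1, 1] for n in range(N)])

for episode in range(50):
    model = sample_model()                            # posterior sample
    top = np.argsort(indices(model, states))[-M:]     # activate top-M arms
    actions = np.zeros(N, dtype=int)
    actions[top] = 1
    for n in range(N):                                # step arms, update posterior
        s_next = rng.choice(S, p=true_P[n, states[n], actions[n]])
        counts[n, states[n], actions[n], s_next] += 1
        states[n] = s_next
```

Because each Dirichlet posterior update is a simple count increment, the per-episode cost is dominated by the index computation itself.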
A regret analysis shows that the TS policy attains a Bayesian regret bound that grows sublinearly with the number of episodes $T$ and quadratically with the number of arms $N$, on the order of $\tilde{\mathcal{O}}(N^2\sqrt{T})$.