Lagrangian Index Policy for Restless Bandits with Average Reward


We study the Lagrangian Index Policy (LIP) for restless multi-armed bandits with long-run average reward. In particular, we compare the performance of LIP with that of the Whittle Index Policy (WIP), both heuristic policies known to be asymptotically optimal under certain natural conditions. Although their performances are very similar in most cases, LIP continues to perform very well in the cases where WIP performs poorly. We then propose reinforcement learning algorithms, both tabular and NN-based, to obtain online learning schemes for LIP in the model-free setting. The proposed reinforcement learning schemes for LIP require significantly less memory than the analogous schemes for WIP. We calculate analytically the Lagrangian index for the restart model, which applies to optimal web crawling and to the minimization of the weighted age of information. We also give a new proof of asymptotic optimality in the case of homogeneous arms as the number of arms goes to infinity, based on exchangeability and de Finetti’s theorem.


💡 Research Summary

The paper addresses the challenging problem of restless multi‑armed bandits (RMAB) under a long‑run average‑reward criterion with a hard per‑stage constraint that exactly M out of N arms must be active at each time step. While the classic Whittle index policy (WIP) has become the de facto heuristic for such problems, it relies on the notion of indexability and requires solving a separate fixed‑point equation for each state to compute the Whittle index. This can be computationally prohibitive and, more importantly, the indexability condition may fail for many practical models, leaving WIP without a theoretical foundation.
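The setup described above can be written compactly as follows (notation ours, following standard RMAB conventions rather than the paper's exact symbols):

```latex
\max_{\pi}\ \liminf_{T\to\infty}\ \frac{1}{T}\,
\mathbb{E}^{\pi}\!\left[\sum_{t=0}^{T-1}\sum_{i=1}^{N}
r_i\bigl(x_i(t),\,a_i(t)\bigr)\right]
\quad\text{s.t.}\quad
\sum_{i=1}^{N} a_i(t) = M \quad \text{for all } t,
```

where aᵢ(t) ∈ {0, 1} indicates whether arm i is active. The Lagrangian relaxation replaces the per‑step constraint with its time average, and the multiplier λ enters as a subsidy λ(1 − aᵢ(t)) paid for keeping arm i passive.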

The authors propose the Lagrangian Index Policy (LIP) as an alternative. Starting from the Lagrangian relaxation of the average‑constraint problem, they introduce a single scalar Lagrange multiplier λ* that enforces the average activation constraint. For each arm and each state x, they define the Lagrangian index γ(x) = Qλ*(x, 1) − Qλ*(x, 0), where Qλ* denotes the action‑value function of the arm when the subsidy λ* is applied to the passive action. Unlike the Whittle index, which is the value of a subsidy that equalises the two actions, γ(x) is obtained by evaluating the Q‑functions at the already computed λ*. Consequently, the entire index vector can be derived after a single scalar optimisation, dramatically reducing computational effort and eliminating the need for indexability.
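Once the Q‑tables at λ* are available, turning them into a policy is a simple ranking. A minimal sketch (function names are ours, not the paper's):

```python
import numpy as np

def lagrangian_indices(Q):
    """Q: array of shape (num_states, 2) holding Q_{lambda*}(x, a).
    Returns gamma(x) = Q(x, 1) - Q(x, 0) for every state x."""
    return Q[:, 1] - Q[:, 0]

def lip_action(states, Q, M):
    """Activate the M arms whose current states carry the largest index."""
    gamma = lagrangian_indices(Q)
    idx = gamma[states]                # index of each arm's current state
    active = np.argsort(-idx)[:M]     # top-M arms by Lagrangian index
    a = np.zeros(len(states), dtype=int)
    a[active] = 1
    return a
```

For example, with three arms in states (0, 2, 1) and a Q‑table whose indices are (1, 3, 2) for states 0, 1, 2, `lip_action` with M = 2 activates the two arms whose states carry the indices 2 and 3.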

Two model‑free reinforcement‑learning (RL) schemes are developed to learn λ* and the associated Q‑functions online. The first scheme is a tabular two‑time‑scale Q‑learning algorithm. The fast time‑scale updates the Q‑values using a relative‑value iteration (including a stabilising offset f(Q)), while the slow time‑scale updates λ by a stochastic gradient step that pushes the empirical average number of active arms toward M. Under standard Robbins‑Monro step‑size conditions (∑aₙ = ∞, ∑aₙ² < ∞, βₙ = o(aₙ)), the authors prove almost‑sure convergence of both Q‑values to the optimal Qλ* and of λₙ to λ*. The second scheme replaces the tabular Q‑function with a deep neural network (DQN). Although a rigorous convergence proof is not provided, empirical results show that the DQN‑based LIP uses far less memory than analogous DQN implementations of the Whittle index, and it scales to large state spaces.
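The coupled fast/slow update structure can be sketched in a few lines of tabular code. This is an illustrative reconstruction, not the paper's exact algorithm: the environment interface `env_step`, the ε‑greedy exploration, and the choice of reference offset f(Q) = Q(0, 0) are all assumptions made for the sketch.

```python
import numpy as np

def two_timescale_lip(env_step, S, N, M, T=2000, eps=0.1, seed=0):
    """Tabular two-time-scale sketch. env_step(states, actions) must
    return (rewards, next_states) for N independent arms with S states.
    Fast scale: relative Q-learning with offset f(Q) = Q[0, 0].
    Slow scale: lambda tracks the violation of the activation budget M."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((S, 2))
    lam = 0.0
    states = rng.integers(0, S, size=N)
    for n in range(1, T + 1):
        a_fast = 1.0 / (1.0 + n)                    # Robbins-Monro steps,
        b_slow = 1.0 / (1.0 + n * np.log(n + 1.0))  # with b_slow = o(a_fast)
        # Per-arm greedy actions w.r.t. current Q, with epsilon-exploration.
        greedy = (Q[states, 1] > Q[states, 0]).astype(int)
        explore = rng.random(N) < eps
        actions = np.where(explore, rng.integers(0, 2, size=N), greedy)
        rewards, nxt = env_step(states, actions)
        for i in range(N):
            x, a = states[i], actions[i]
            # The passive action (a = 0) receives the subsidy lambda.
            target = rewards[i] + lam * (1 - a) + Q[nxt[i]].max() - Q[0, 0]
            Q[x, a] += a_fast * (target - Q[x, a])
        # More active arms than the budget M -> raise the passivity subsidy.
        lam += b_slow * (actions.sum() - M)
        states = nxt
    return Q, lam
```

The slow step sizes shrink faster than the fast ones (b_slow/a_fast → 0), so λ sees a quasi‑converged Q‑table, which is the standard two‑time‑scale argument behind the convergence proof mentioned above.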

A major theoretical contribution is a new proof of asymptotic optimality of LIP when the arms are homogeneous and N → ∞. Traditional proofs of Whittle‑policy optimality rely on a mean‑field (fluid) limit of the relaxed problem and on the existence of a global attractor. Here, the authors invoke de Finetti’s theorem for exchangeable sequences to show that the empirical distribution of arm states converges to a mixture of i.i.d. processes, and that the Lagrangian dual problem admits a unique optimal λ*. Consequently, the LIP, which simply activates the M arms with the largest γ(x), achieves the same average reward as the optimal constrained policy in the infinite‑arm limit, without any indexability assumption.

To illustrate the practical relevance, the paper analyses the “restart” model, a classic RMAB in which each arm evolves as a Markov chain whose state tracks the age of its information: an active arm restarts to a fresh state (e.g., a newly crawled web page), while a passive arm ages. The authors derive a closed‑form expression for the Lagrangian index in this setting, showing that γ(x) depends linearly on the restart probability and the weight assigned to information age. Numerical experiments compare LIP, WIP, and the optimal policy (computed via exhaustive enumeration for small N). Results indicate that LIP matches or exceeds WIP’s performance across a range of parameters, especially in regimes where the Whittle index is ill‑defined or numerically unstable. Moreover, LIP’s computation time is an order of magnitude lower because only a single λ* needs to be estimated.
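To make the restart dynamics concrete, here is a toy simulation of N pages under a top‑M index policy. The index `weight * age` used below is a hypothetical monotone proxy standing in for the paper's closed‑form Lagrangian index, and the deterministic restart is a simplification.

```python
import numpy as np

def simulate_restart_lip(weights, M, T=1000):
    """Toy restart model: each arm's state is its information age.
    Crawling (active) resets the age to 0; passivity increments it.
    Arms are ranked by the proxy index weight * age.
    Returns the time-averaged weighted age (lower is better)."""
    N = len(weights)
    age = np.zeros(N, dtype=int)
    total = 0.0
    for _ in range(T):
        idx = weights * age
        active = np.argsort(-idx)[:M]  # crawl the M "stalest" weighted pages
        age += 1                       # every page ages by one step
        age[active] = 0                # crawled pages restart fresh
        total += float(np.dot(weights, age))
    return total / T
```

With two equally weighted pages and M = 1, the policy alternates between them and the average weighted age settles at 1, matching the intuition that each page is refreshed every other step.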

The experimental section also evaluates the tabular and DQN‑based LIP on a benchmark problem taken from the literature that is believed to be non‑indexable. In this case, LIP consistently outperforms WIP, confirming the practical advantage of dropping the indexability requirement. Memory consumption measurements demonstrate that LIP’s RL algorithms require roughly half the storage of their Whittle counterparts, a significant benefit for embedded or large‑scale systems.

In summary, the paper makes four key contributions:

  1. Conceptual – Introduces the Lagrangian index γ(x) as a computationally cheap, indexability‑free alternative to the Whittle index.
  2. Algorithmic – Provides two RL algorithms (tabular and deep) that learn λ* and Q‑values online, with provable convergence for the tabular case.
  3. Theoretical – Offers a novel asymptotic optimality proof for homogeneous arms using exchangeability and de Finetti’s theorem, independent of indexability.
  4. Empirical – Demonstrates on the restart model and a non‑indexable benchmark that LIP achieves equal or superior average reward while using substantially less computation and memory than Whittle‑based methods.

Overall, the Lagrangian Index Policy broadens the toolbox for restless bandit problems, delivering a scalable, theoretically sound, and practically efficient heuristic that can be deployed in domains such as web crawling, age‑of‑information minimisation, resource allocation in queueing networks, and beyond.

