Adaptive Shortest-Path Routing under Unknown and Stochastically Varying Link States

We consider the adaptive shortest-path routing problem in wireless networks under unknown and stochastically varying link states. In this problem, we aim to optimize the quality of communication between a source and a destination through adaptive path selection. Due to the randomness and uncertainties in the network dynamics, the quality of each link varies over time according to a stochastic process with unknown distributions. After a path is selected for communication, the aggregated quality of all links on this path (e.g., total path delay) is observed. The quality of each individual link is not observable. We formulate this problem as a multi-armed bandit (MAB) with dependent arms. We show that by exploiting arm dependencies, a regret that is polynomial in network size can be achieved while maintaining the optimal logarithmic order in time. This is in sharp contrast with the regret exponential in network size offered by a direct application of the classic MAB policies that ignore arm dependencies. Furthermore, our results are obtained under a general model of link-quality distributions (including heavy-tailed distributions) and find applications in cognitive radio and ad hoc networks with unknown and dynamic communication environments.

💡 Research Summary

The paper tackles the adaptive shortest‑path routing problem in wireless networks where each link’s quality evolves according to an unknown stochastic process. After a path is used, only the aggregate performance (e.g., total delay, total loss) of that path is observed; the individual link qualities remain hidden. This partial‑feedback setting makes the problem fundamentally different from classic routing, and it can be naturally cast as a multi‑armed bandit (MAB) problem with dependent arms: each link is an elementary arm, and a path corresponds to a subset of arms whose rewards are summed.
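The feedback model described above can be illustrated with a minimal sketch. The five-link bridge topology, the link means, and the exponential delay distribution are all assumptions chosen for illustration; none of them come from the paper. The point is only that the learner sees a single summed value per selected path, and that paths sharing links are dependent arms.

```python
import random

# Hypothetical bridge network with nodes s, a, b, d (illustrative only).
# Link indices: 0 = s-a, 1 = s-b, 2 = a-b, 3 = a-d, 4 = b-d.
# Assumed mean delays; in the actual problem these are unknown to the learner.
LINK_MEAN_DELAY = [1.0, 2.0, 0.5, 1.5, 1.0]

# Each path (arm) is the tuple of links it traverses. Paths share links,
# which is what makes the arms dependent.
PATHS = {
    "s-a-d": (0, 3),
    "s-b-d": (1, 4),
    "s-a-b-d": (0, 2, 4),
    "s-b-a-d": (1, 2, 3),
}


def sample_link_delays(rng):
    """Draw one random delay per link (exponential, purely an assumption)."""
    return [rng.expovariate(1.0 / mean) for mean in LINK_MEAN_DELAY]


def path_feedback(path_name, rng):
    """Return the only observation available: the summed delay of the chosen path."""
    delays = sample_link_delays(rng)
    return sum(delays[i] for i in PATHS[path_name])


def expected_path_delay(path_name):
    """Expected path delay is the sum of the link means (linearity of expectation)."""
    return sum(LINK_MEAN_DELAY[i] for i in PATHS[path_name])


if __name__ == "__main__":
    rng = random.Random(0)
    for name in PATHS:
        print(name, "observed:", round(path_feedback(name, rng), 3),
              "expected:", expected_path_delay(name))
```

Note that `path_feedback` never exposes the per-link delays, which mirrors the partial-feedback constraint in the summary.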

The authors first point out that treating every possible path as an independent arm leads to a number of arms that is exponential in the size of the network, which in turn yields a regret bound that is exponential in the number of links. To avoid this curse of dimensionality, they exploit the linear structure of the feedback: the expected path reward is a linear combination (here, a sum) of the unknown link means. By maintaining estimates for each link and updating them from the observed path sums, the algorithm can infer information about many links simultaneously.
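The benefit of the linear structure can be sketched on a toy topology. The example below is an assumption-laden illustration, not the paper's algorithm: it uses a hypothetical series network with `k` stages and two parallel links per stage, so there are `2k` links but `2**k` paths. Because every path mean is a sum of link means, sampling only `k + 1` basis paths is enough to estimate the mean delay of all `2**k` paths. In this topology the individual link means are not separately identifiable from path sums, so the sketch works with differences relative to a base path.

```python
import itertools
import random


def make_network(num_stages, rng):
    """Hypothetical series network: each stage offers two parallel links.

    Returns mean_delay[stage][choice]; a path picks one link per stage.
    """
    return [[rng.uniform(1.0, 3.0) for _ in range(2)] for _ in range(num_stages)]


def observe_path(mean_delay, choices, rng):
    """Bandit feedback: only the summed delay along the chosen path is returned."""
    return sum(rng.expovariate(1.0 / mean_delay[s][c]) for s, c in enumerate(choices))


def true_path_mean(mean_delay, choices):
    """Ground-truth mean delay of a path, used here only to check the estimates."""
    return sum(mean_delay[s][c] for s, c in enumerate(choices))


def estimate_all_paths(mean_delay, samples_per_basis_path, rng):
    """Estimate every path's mean delay while sampling only k + 1 basis paths."""
    k = len(mean_delay)
    base = (0,) * k
    # Basis: the base path plus, per stage, the path that deviates at that stage only.
    basis = [base] + [tuple(1 if s == j else 0 for s in range(k)) for j in range(k)]
    avg = {}
    for path in basis:
        total = sum(observe_path(mean_delay, path, rng)
                    for _ in range(samples_per_basis_path))
        avg[path] = total / samples_per_basis_path
    # Linearity: mean(path) = mean(base) + sum over deviating stages j of
    # (mean(deviation_j) - mean(base)).
    estimates = {}
    for path in itertools.product((0, 1), repeat=k):
        correction = sum(avg[basis[j + 1]] - avg[base] for j in range(k) if path[j] == 1)
        estimates[path] = avg[base] + correction
    return estimates, len(basis)


if __name__ == "__main__":
    rng = random.Random(1)
    net = make_network(3, rng)
    est, num_sampled = estimate_all_paths(net, 20000, rng)
    print("paths:", len(est), "paths actually sampled:", num_sampled)
    for path, value in est.items():
        print(path, round(value, 2), round(true_path_mean(net, path), 2))
```

A policy that treats each path as an independent arm would have to sample all `2**k` paths, whereas the number of basis paths here grows only linearly in the number of stages.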

The core algorithm is a link‑level Upper Confidence Bound (UCB) scheme adapted to the combinatorial setting. For each link \(i\), the algorithm keeps an empirical mean \(\hat\mu_i(t)\) and a confidence radius \(\beta_i(t)\) derived from concentration inequalities that are valid even for heavy‑tailed distributions (using martingale‑based Bernstein‑type bounds). The optimistic estimate of a candidate path \(p\) at time \(t\) is then \