Online Bandits with (Biased) Offline Data: Adaptive Learning under Distribution Mismatch
Traditional online learning models are typically initialized from scratch. By contrast, contemporary real-world applications often have access to historical datasets that can potentially enhance the online learning process. We study how offline data can be leveraged to facilitate online learning in stochastic multi-armed bandits and combinatorial bandits. In our setting, the probability distributions that govern the offline data and the online rewards may differ. We first show that, without a non-trivial upper bound on their difference, no non-anticipatory policy can outperform the classical Upper Confidence Bound (UCB) policy, even with access to offline data. In complement, we propose an online policy, MIN-UCB, for multi-armed bandits that outperforms UCB when such an upper bound is available. MIN-UCB adaptively chooses to utilize the offline data when they are deemed informative, and to ignore them otherwise. We establish that MIN-UCB achieves tight regret bounds, in both the instance-independent and instance-dependent settings. We generalize our approach to the combinatorial bandit setting by introducing MIN-COMB-UCB, and we provide corresponding instance-dependent and instance-independent regret bounds. We illustrate how various factors, such as the bias and the size of the offline dataset, affect the utility of offline data in online learning. We discuss several applications and conduct numerical experiments to validate our findings.
💡 Research Summary
The paper addresses a practical gap in bandit learning: many real-world decision problems start with a historical (offline) dataset that may be biased relative to the environment in which the online learning will take place. The authors formalize a stochastic $K$-armed bandit model in which an offline warm-start phase provides samples drawn from a distribution $P_{\text{off}}$, while the subsequent online phase generates rewards from a potentially different distribution $P_{\text{on}}$. The decision maker's goal is to maximize cumulative reward during the online phase.
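To make the two-phase setup concrete, here is a minimal toy environment in this spirit. It is an illustration, not the paper's code: the function name `make_environment` and the Bernoulli reward model are assumptions, and the offline means are shifted from the online means by a per-arm `bias` term to model the distribution mismatch.

```python
import random

def make_environment(mu_on, bias, n_off, seed=0):
    """Toy two-phase bandit environment (illustrative sketch).

    Offline samples for arm a are Bernoulli with mean mu_on[a] + bias[a]
    (i.e., drawn from P_off); the online phase draws Bernoulli rewards
    with mean mu_on[a] (i.e., from P_on).
    """
    rng = random.Random(seed)
    # Offline warm-start dataset: n_off samples per arm from P_off.
    offline = [
        [1.0 if rng.random() < mu_on[a] + bias[a] else 0.0 for _ in range(n_off)]
        for a in range(len(mu_on))
    ]

    def pull(a):
        # One online reward for arm a, drawn from P_on.
        return 1.0 if rng.random() < mu_on[a] else 0.0

    return offline, pull
```

With, say, `bias = [0.1, -0.1]`, the empirical offline means concentrate around the shifted values, so an algorithm that trusts the offline data naively would misjudge the arms by up to the bias magnitude.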
A first major contribution is an impossibility result: without any prior knowledge about how far apart $P_{\text{off}}$ and $P_{\text{on}}$ are, no non-anticipatory algorithm can achieve regret of smaller order than the classic UCB bound $\sum_{a:\Delta(a)>0}\frac{\log T}{\Delta(a)}$. In other words, offline data alone cannot guarantee improvement over vanilla UCB.
To overcome this limitation the authors assume the availability of a valid bias bound $V$, an arm-wise upper bound on $|\mu^{\text{off}}_a-\mu^{\text{on}}_a|$. With this auxiliary information they design MIN-UCB, an adaptive UCB-type algorithm that decides for each arm whether to rely on the offline empirical mean (plus a confidence term that incorporates $V(a)$ and the number of offline samples $T_S(a)$) or to ignore the offline data and behave like standard UCB. The algorithm automatically "turns on" the offline information when it is informative (small bias, many samples) and "turns off" otherwise.
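The adaptive min-of-two-indices idea can be sketched as follows. This is a simplified reconstruction from the description above, not the authors' implementation: per arm, it takes the minimum of the standard UCB index and an offline-informed index whose confidence width shrinks with the pooled sample count but pays the bias bound `V[a]`. The interface (`offline_samples`, a `pull` callback) is hypothetical.

```python
import math

def min_ucb(K, T, V, offline_samples, pull):
    """MIN-UCB sketch: play the arm maximizing the tighter (minimum) of
    the online-only UCB index and an offline-informed index.

    V[a]               -- valid bias bound for arm a
    offline_samples[a] -- list of offline rewards for arm a
    pull(a)            -- returns one online reward for arm a
    """
    n = [0] * K       # online pull counts
    s = [0.0] * K     # online reward sums
    total_reward = 0.0
    for t in range(1, T + 1):
        indices = []
        for a in range(K):
            if n[a] == 0:
                indices.append(float("inf"))  # force one initial pull per arm
                continue
            mean_on = s[a] / n[a]
            ucb_on = mean_on + math.sqrt(2 * math.log(t) / n[a])
            m = len(offline_samples[a])
            if m > 0:
                # Pooled mean over offline + online data; the width uses the
                # larger sample count but is inflated by the bias bound V[a].
                pooled = (s[a] + sum(offline_samples[a])) / (n[a] + m)
                ucb_off = pooled + math.sqrt(2 * math.log(t) / (n[a] + m)) + V[a]
                indices.append(min(ucb_on, ucb_off))  # tighter valid upper bound
            else:
                indices.append(ucb_on)
        a = max(range(K), key=lambda i: indices[i])
        r = pull(a)
        n[a] += 1
        s[a] += r
        total_reward += r
    return total_reward
```

Because both indices are valid upper confidence bounds on the online mean, taking their minimum preserves optimism while discarding the offline data automatically whenever `V[a]` or the offline sample size makes the offline index looser than the online one.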
Theoretical analysis yields two families of regret guarantees:
- Instance-dependent bound: for each sub-optimal arm $a$, the regret contribution is at most of the order of the minimum of the standard UCB term $\frac{\log T}{\Delta(a)}$ and an offline-informed term that decreases as the bias bound $V(a)$ shrinks and the number of offline samples $T_S(a)$ grows. In particular, MIN-UCB is never order-wise worse than UCB, and strictly improves on it when the offline data are informative.