Multiarmed Bandit Problems with Delayed Feedback


Abstract

In this paper we initiate the study of optimizing bandit-type problems in scenarios where the feedback of a play is not immediately known. This arises naturally in allocation problems, which have been studied extensively in the literature, albeit in the absence of delays in the feedback. We study this problem in the Bayesian setting. In the presence of delays, no solution with provable guarantees and sub-exponential running time is known to exist. We show that bandit problems with delayed feedback that arise in allocation settings can be forced to have significant structure, with a slight loss in optimality. This structure gives us the ability to reason about the relationship of single-arm policies to the entangled optimum policy, and eventually leads to an O(1) approximation for a significantly general class of priors. The structural insights we develop are of independent interest and carry over to the setting where the feedback of an action is available instantaneously, and we improve all previous results in this setting as well.


💡 Research Summary

The paper tackles a version of the stochastic multi‑armed bandit problem in which the reward from pulling an arm is not observed immediately but after a random or deterministic delay. While classical bandit models assume instantaneous feedback, many real‑world allocation tasks—online advertising, cloud resource scheduling, clinical trials—experience delayed outcomes, which fundamentally changes the exploration‑exploitation trade‑off. In a Bayesian setting each arm i is associated with a prior distribution π_i over its unknown parameter θ_i; after a pull the reward is drawn from a distribution conditioned on θ_i, and the observation arrives after a delay Δ_i(θ_i). Because posterior updates are postponed, a decision made at time t must be based on stale information, and the optimal policy becomes an “entangled” object that simultaneously schedules future pulls while anticipating when delayed feedback will become available. No polynomial‑time algorithm with provable guarantees was known for this setting.
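The delayed-observation mechanic described above can be sketched in a toy Beta-Bernoulli model: a pull's outcome sits in a queue and only updates the posterior once its delay has elapsed, so the policy acts on stale information in between. The class and its fields below are illustrative assumptions, not constructs from the paper.

```python
import random
from collections import deque

class DelayedBetaArm:
    """Beta-Bernoulli arm whose pull outcomes are revealed after a delay.

    Hypothetical sketch of the delayed-feedback model: the true parameter
    theta_i is hidden, rewards are Bernoulli(theta_i), and each observation
    becomes available to the policy only `delay` steps after the pull.
    """
    def __init__(self, theta, delay, a=1.0, b=1.0):
        self.theta = theta      # hidden parameter theta_i
        self.delay = delay      # deterministic feedback delay Delta_i
        self.a, self.b = a, b   # Beta prior/posterior parameters
        self.pending = deque()  # (arrival_time, reward) pairs not yet observed

    def pull(self, t):
        reward = 1 if random.random() < self.theta else 0
        self.pending.append((t + self.delay, reward))  # queue for later arrival
        return reward  # the policy does NOT see this until it arrives

    def observe(self, t):
        """Fold into the posterior all feedback that has arrived by time t."""
        while self.pending and self.pending[0][0] <= t:
            _, r = self.pending.popleft()
            self.a += r
            self.b += 1 - r

    def posterior_mean(self):
        return self.a / (self.a + self.b)
```

A pull at time 0 with delay 3 leaves the posterior at its prior mean of 0.5 until `observe(3)` is called, which is exactly the staleness that entangles the scheduling decisions.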

The authors introduce two structural assumptions that are natural in many allocation problems. First, there is a hard limit K on the number of arms that can be active simultaneously (e.g., a fixed number of slots or resources). Second, each arm’s delay is either a known constant or follows a distribution that is independent of other arms. Under these constraints the global decision problem can be decomposed into a collection of single‑arm sub‑problems. For each arm i the authors define a delayed‑feedback Bayesian Markov decision process (MDP) and compute (or approximate) its optimal single‑arm policy μ_i*. The key structural theorem shows that the expected total reward of any feasible global policy OPT satisfies

  (1/c)·∑_i V_i(μ_i*) ≤ OPT ≤ ∑_i V_i(μ_i*)

for a universal constant c (the analysis yields c≈2–3). In words, the sum of the optimal single‑arm values is within a constant factor of the true optimum. This insight enables a simple, polynomial‑time algorithm: (i) solve each arm’s delayed‑feedback MDP independently (using dynamic programming or Thompson sampling adapted to delayed observations), (ii) schedule the arms in the K slots according to a greedy rule that at each time picks the arm with the highest current expected marginal gain, respecting the delay‑induced availability constraints. The algorithm runs in O(poly(N,T,K)) time, where N is the number of arms and T the horizon, and its performance guarantee holds for arbitrary priors (continuous or discrete) and for delays that are constant or grow at most logarithmically with T.
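Step (ii), the greedy slot-filling rule, might look like the following sketch. Here `indices[i]` stands in for arm i's expected marginal gain (in the paper this would come from the single-arm MDP solutions; here it is just a number), and `busy_until[i]` encodes delay-induced unavailability. The function and its signature are illustrative assumptions rather than the paper's algorithm.

```python
import heapq

def greedy_schedule(indices, busy_until, t, K):
    """Pick up to K arms to play at time t, greedily by current index value.

    indices[i]    -- stand-in for arm i's expected marginal gain
    busy_until[i] -- earliest time arm i is available again (e.g., because
                     its pending delayed feedback has not yet arrived)
    """
    available = [i for i in range(len(indices)) if busy_until[i] <= t]
    # fill the K slots with the highest-index available arms
    return heapq.nlargest(K, available, key=lambda i: indices[i])
```

With K = 2, four arms, and arm 0 unavailable until its feedback arrives, the rule fills the slots with the next-best two arms; each round the indices would be refreshed as delayed observations update the posteriors.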

The paper also demonstrates that the same decomposition works when feedback is instantaneous (Δ=0). Prior work on the Bayesian bandit with arbitrary priors typically required restrictive assumptions (e.g., conjugate priors, bounded number of pulls) to obtain constant‑factor approximations. By showing that the entangled optimal policy can always be approximated by a hierarchy of single‑arm policies, the authors improve all known approximation ratios in the instantaneous‑feedback setting as a corollary of their main result.

Experimental evaluation covers synthetic scenarios with varying delay lengths (Δ∈{0,5,10,20}) and a range of priors (Beta‑Bernoulli, Dirichlet‑Multinomial, Gaussian). The proposed algorithm is compared against (a) standard Thompson Sampling or UCB that ignore delays, and (b) a brute‑force dynamic program that attempts to optimize globally but is infeasible for large N. Results show that the delay‑aware algorithm consistently achieves 15–30% higher cumulative reward, with the gap widening as delays increase. Moreover, runtime remains polynomial, confirming the practicality of the approach for real‑time systems.

In summary, the contribution of the paper is threefold: (1) a formal Bayesian model for bandits with delayed feedback, (2) a structural decomposition that reduces the entangled global optimization to a collection of tractable single‑arm problems, yielding an O(1) approximation guarantee for a very broad class of priors, and (3) an algorithmic framework that is both theoretically sound and empirically effective. The insights extend to the classic instantaneous‑feedback case, thereby improving the state of the art across a wide spectrum of stochastic allocation problems where feedback latency cannot be ignored.

