Learning to Allocate Resources with Censored Feedback
We study the online resource allocation problem in which, at each round, a budget $B$ must be allocated across $K$ arms under censored feedback. An arm yields a reward if and only if two conditions are satisfied: (i) the arm is activated according to an arm-specific Bernoulli random variable with unknown parameter, and (ii) the allocated budget exceeds a random threshold drawn from a parametric distribution with unknown parameter. Over $T$ rounds, the learner must jointly estimate the unknown parameters and allocate the budget so as to maximize cumulative reward, facing the exploration–exploitation trade-off. We prove an information-theoretic regret lower bound of $\Omega(T^{1/3})$, demonstrating the intrinsic difficulty of the problem. We then propose RA-UCB, an optimistic algorithm that leverages non-trivial parameter estimation and confidence bounds. When the budget $B$ is known at the beginning of each round, RA-UCB achieves regret of order $\widetilde{\mathcal{O}}(\sqrt{T})$, and even $\mathcal{O}(\mathrm{poly}\text{-}\log T)$ under stronger assumptions. For an unknown, round-dependent budget, we introduce MG-UCB, which allows within-round switching and infinitesimal allocations and matches the regret guarantees of RA-UCB. Finally, we validate our theoretical results through experiments on real-world datasets.
💡 Research Summary
The paper addresses an online resource allocation problem in which a fixed or unknown budget B must be distributed among K arms at each of T rounds. Each arm i has two hidden parameters: an activation probability p_i governing a Bernoulli random variable Y_{t,i} and a threshold parameter λ_i that determines a random resource requirement X_{t,i} drawn from a parametric distribution G(·;λ_i). An arm yields a reward of 1 iff both (i) Y_{t,i}=1 and (ii) the allocated amount x_{t,i} exceeds the threshold X_{t,i}. Crucially, feedback is censored: only when an arm succeeds is the threshold X_{t,i} observed; otherwise the learner receives no information about whether the failure was due to inactivity or insufficient resources. Consequently, the learner must simultaneously estimate p_i and λ_i despite ambiguous failures.
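The censored-feedback mechanism described above can be made concrete with a small simulation. The sketch below assumes, purely for illustration, that the thresholds X_{t,i} are exponentially distributed, i.e. G(x; λ) = 1 − e^{−λx}; the paper only requires a parametric family:

```python
import numpy as np

def pull(x, p, lam, rng):
    """Simulate one censored-feedback pull of a single arm (a sketch).

    x: allocated budget; p: activation probability p_i; lam: threshold
    parameter λ_i, assuming X ~ Exp(lam) for illustration.
    Returns (reward, observed_threshold): the threshold is revealed only
    on success -- on failure the learner cannot tell whether the arm was
    inactive or the allocation was too small.
    """
    y = rng.random() < p          # hidden activation Y_{t,i}
    X = rng.exponential(1 / lam)  # hidden threshold X_{t,i}
    if y and x > X:
        return 1, X               # success: clean observation of X_{t,i}
    return 0, None                # failure: cause is ambiguous (censored)
```

Note that a reward of 0 carries strictly less information than a reward of 1, which is exactly why estimating p_i and λ_i jointly is non-trivial.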
The authors first establish a fundamental information‑theoretic lower bound on regret. By constructing a family of instances in which all arms are identical except for a single arm whose λ is perturbed by a small ε, they show that any algorithm must allocate a non‑negligible fraction of the total budget to exploring the distinguished arm. Because the information gained about an arm scales with the budget allocated to it, identifying the optimal arm requires at least Ω(T^{1/3}) rounds of exploration, yielding a regret lower bound of the same order. This bound grows more slowly than the classic Ω(√T) lower bound for stochastic multi‑armed bandits but faster than the constant regret achievable with full information, reflecting the hybrid nature of the problem.
To match this lower bound, the paper proposes two algorithms:
- RA‑UCB (Known‑Budget Upper Confidence Bound)
  - Structured Exploration: Every K rounds the algorithm deliberately allocates a large portion of the budget to a single arm, guaranteeing a high probability of success and thus obtaining a clean observation of X_{t,i}. This yields high‑quality samples for estimating λ_i.
  - Optimistic Exploitation: Using the current estimates (p̂_i, λ̂_i) and confidence intervals, the algorithm computes an upper confidence bound on the expected success probability p̂_i·G(x; λ̂_i) for each arm. It then solves a deterministic allocation problem that maximizes the sum of these UCB values subject to the budget constraint.
  - Parameter Estimation: λ_i is estimated via the conditional mean μ(λ)=E
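The optimistic allocation step can be sketched as a greedy marginal-gain procedure. The code below assumes exponential thresholds (G(x; λ) = 1 − e^{−λx}) and a generic Hoeffding-style bonus √(2 log t / n) — both illustrative stand-ins, not the paper's exact confidence construction:

```python
import numpy as np

def ucb_allocate(B, p_hat, lam_hat, n_obs, t, n_steps=100):
    """Greedy sketch of the optimistic allocation step of RA-UCB.

    Splits the budget B into n_steps equal increments and repeatedly
    gives the next increment to the arm with the largest marginal gain
    in UCB value.  This greedy rule is valid here because each arm's
    objective p_ucb[i] * (1 - exp(-lam_hat[i] * x)) is concave and
    increasing in x.  Bonus and threshold model are assumptions.
    """
    K = len(p_hat)
    bonus = np.sqrt(2 * np.log(max(t, 2)) / np.maximum(n_obs, 1))
    p_ucb = np.minimum(p_hat + bonus, 1.0)   # optimistic activation prob.
    ucb = lambda i, xi: p_ucb[i] * (1 - np.exp(-lam_hat[i] * xi))
    x = np.zeros(K)
    dx = B / n_steps
    for _ in range(n_steps):
        gains = [ucb(i, x[i] + dx) - ucb(i, x[i]) for i in range(K)]
        x[int(np.argmax(gains))] += dx       # fund the best marginal arm
    return x
```

In this sketch an arm with a large λ̂ (cheap threshold) is funded first but saturates quickly, after which increments flow to harder arms — mirroring the water-filling flavor of the deterministic allocation problem.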