Parametrized Stochastic Multi-armed Bandits with Binary Rewards
In this paper, we consider the problem of multi-armed bandits with a large, possibly infinite number of correlated arms. We assume that the arms have Bernoulli distributed rewards, independent across time, where the probabilities of success are parametrized by known attribute vectors for each arm, as well as an unknown preference vector, each of dimension $n$. For this model, we seek an algorithm with a total regret that is sub-linear in time and independent of the number of arms. We present such an algorithm, which we call the Two-Phase Algorithm, and analyze its performance. We show upper bounds on the total regret which apply uniformly in time, for both the finite and infinite arm cases. The asymptotics of the finite arm bound show that for any $f \in \omega(\log(T))$, the total regret can be made to be $O(n \cdot f(T))$. In the infinite arm case, the total regret is $O(\sqrt{n^3 T})$.
💡 Research Summary
The paper addresses a contextual multi‑armed bandit (MAB) setting in which each arm is described by a known attribute (feature) vector $x_i\in\mathbb{R}^n$ and the reward is binary (Bernoulli). The success probability of arm $i$ is modeled by a logistic function $\mu_i=\sigma(\theta^\top x_i)$, where $\theta\in\mathbb{R}^n$ is an unknown preference vector common to all arms. This formulation captures many real‑world problems—such as personalized recommendation or medical treatment selection—where a large (potentially infinite) set of actions share a low‑dimensional latent structure.
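The reward model above can be written down in a few lines. This is a minimal sketch of the parametrized Bernoulli reward structure; the specific vectors are illustrative values, not taken from the paper.

```python
import numpy as np

def sigmoid(z):
    """Logistic link mapping a real-valued score to a success probability."""
    return 1.0 / (1.0 + np.exp(-z))

def success_prob(theta, x):
    """mu_i = sigma(theta^T x_i): Bernoulli success probability of arm i."""
    return sigmoid(theta @ x)

def pull_arm(rng, theta, x):
    """Draw one binary reward for the arm with attribute vector x."""
    return rng.binomial(1, success_prob(theta, x))

rng = np.random.default_rng(0)
theta = np.array([1.0, -0.5])   # unknown preference vector (illustrative)
x = np.array([0.2, 0.4])        # known attribute vector of one arm (illustrative)
p = success_prob(theta, x)      # sigma(1*0.2 - 0.5*0.4) = sigma(0) = 0.5
```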
The authors’ goal is to design an algorithm whose cumulative regret grows sub‑linearly with the time horizon $T$ and, crucially, does not depend on the number of arms. To this end they propose the Two‑Phase Algorithm, which consists of:
- **Exploration Phase** – A small, fixed set of $n$ linearly independent feature vectors is selected and each is pulled $m$ times. The observations are used to compute a maximum-likelihood estimate $\hat\theta$ of the unknown preference vector. By choosing $m$ on the order of $f(T)$ with $f(T)=\omega(\log T)$, the estimation error shrinks to $O(\sqrt{\log T/m})$, and the regret incurred during this phase is $O(nm)=O(n\cdot f(T))$.
- **Exploitation Phase** – Using $\hat\theta$, the algorithm predicts the expected reward of every arm as $\hat\mu_i=\sigma(\hat\theta^\top x_i)$. It then applies an Upper-Confidence-Bound (UCB) correction $\beta_t\|x_i\|_{V_t^{-1}}$, where $V_t$ is the design matrix accumulated in the exploration phase. At each round the arm with the highest corrected estimate is selected. The regret contributed by this phase can be bounded by $O(n\cdot f(T))$, because the confidence width scales with the estimation error and the dimensionality $n$.
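The two phases above can be sketched as follows. The plain gradient-ascent MLE, the identity basis, and all constants are illustrative assumptions standing in for the paper's exact procedure.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def explore(rng, theta_true, basis, m):
    """Phase 1: pull each of the n linearly independent basis arms m times,
    then fit theta by logistic-regression maximum likelihood
    (fixed-step gradient ascent here, as a stand-in for any MLE solver)."""
    X = np.repeat(basis, m, axis=0)
    y = rng.binomial(1, sigmoid(X @ theta_true))
    theta_hat = np.zeros(basis.shape[1])
    for _ in range(1000):
        grad = X.T @ (y - sigmoid(X @ theta_hat)) / len(y)
        theta_hat += grad
    V = X.T @ X   # design matrix accumulated during exploration
    return theta_hat, V

def exploit_choice(theta_hat, V_inv, arms, beta):
    """Phase 2: select the arm maximizing the UCB-corrected estimate
    sigma(theta_hat^T x) + beta * ||x||_{V^{-1}}."""
    scores = [sigmoid(theta_hat @ x) + beta * np.sqrt(x @ V_inv @ x)
              for x in arms]
    return int(np.argmax(scores))

rng = np.random.default_rng(1)
theta_true = np.array([1.0, -0.5])   # hidden preference vector (illustrative)
basis = np.eye(2)                    # n linearly independent feature vectors
theta_hat, V = explore(rng, theta_true, basis, m=500)
arms = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([0.5, 0.5])]
best = exploit_choice(theta_hat, np.linalg.inv(V), arms, beta=1.0)
```

With enough exploration pulls, `theta_hat` concentrates around `theta_true` and the exploitation rule picks the arm with the highest true success probability.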
Putting the two phases together yields a total regret bound of $O(n\cdot f(T))$ for any $f(T)=\omega(\log T)$. In particular, choosing $f(T)$ to grow only slightly faster than $\log T$ (e.g. $f(T)=\log T\log\log T$) makes the regret nearly logarithmic in $T$ and, crucially, independent of the number of arms.
For the infinite‑arm case, the same exploration phase is performed once, after which new arms may appear arbitrarily. Since the estimate $\hat\theta$ is already available, the algorithm continues with the exploitation rule for any newly observed arm. The authors prove that the cumulative regret grows as $O(\sqrt{n^{3}T})$, a bound that depends only on the feature dimension and not on the cardinality of the arm set. This improves upon prior work on infinite‑arm contextual bandits, which typically yields $O(T^{2/3})$ or $O(T^{3/4})$ rates.
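The key point is that a freshly observed arm needs no per-arm statistics: its score depends only on the shared estimate and the fixed exploration design matrix. A minimal sketch, with all numerical values assumed for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def score_new_arm(theta_hat, V_inv, x_new, beta):
    """A newly arriving arm is scored immediately from the shared estimate
    theta_hat and the inverse design matrix of the one-off exploration phase."""
    return sigmoid(theta_hat @ x_new) + beta * np.sqrt(x_new @ V_inv @ x_new)

theta_hat = np.array([0.9, -0.4])   # estimate from exploration (illustrative)
V_inv = np.eye(2) / 500.0           # inverse design matrix (illustrative)
s = score_new_arm(theta_hat, V_inv, np.array([1.0, 0.0]), beta=1.0)
```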
The theoretical analysis relies on concentration inequalities for logistic regression and on properties of the design matrix built from the $n$ independent feature vectors. The exploration regret is straightforward ($nm$ pulls), while the exploitation regret is derived by bounding the sum of confidence widths over $T$ rounds. The authors also discuss how the bound holds uniformly over time, not just asymptotically.
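The bound on the sum of confidence widths follows the standard linear/GLM-bandit recipe; as a sketch (under the usual assumptions, not necessarily the paper's exact constants),

$$\sum_{t=1}^{T}\beta_t\,\|x_{a_t}\|_{V_t^{-1}} \;\le\; \beta_T\sqrt{T\sum_{t=1}^{T}\|x_{a_t}\|_{V_t^{-1}}^{2}},$$

where the inequality is Cauchy–Schwarz and the remaining sum is controlled by an elliptical-potential-type bound of order $n\log T$. Combined with a confidence radius $\beta_T$ that scales with the estimation error and $\sqrt{n}$, this yields a regret of order $\sqrt{T}$ in the horizon, consistent with the paper's $O(\sqrt{n^{3}T})$ rate.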
Empirical evaluation validates the theory. Experiments are conducted with dimensions $n=5,10,20$, arm counts ranging from $10^2$ to $10^4$, and a simulated infinite‑arm scenario. The Two‑Phase algorithm is compared against LinUCB, Thompson Sampling, and other contextual bandit baselines. Results show that the proposed method consistently achieves lower cumulative regret across all settings, and in the infinite‑arm case the observed regret follows the predicted $O(\sqrt{n^{3}T})$ scaling.
Finally, the paper outlines several extensions: (i) replacing the logistic link with other generalized linear model (GLM) links; (ii) handling high‑dimensional sparse features via regularization; (iii) adapting to non‑stationary environments where $\theta$ drifts over time by periodically re‑executing the exploration phase. Overall, the work demonstrates that exploiting a low‑dimensional parametric structure in the reward model enables regret guarantees that are independent of the potentially massive arm set, opening new avenues for scalable, context‑aware online decision making.