We consider a multi-armed bandit problem in a setting where each arm produces a noisy reward realization which depends on an observable random covariate. As opposed to the traditional static multi-armed bandit problem, this setting allows for dynamically changing rewards that better describe applications where side information is available. We adopt a nonparametric model where the expected rewards are smooth functions of the covariate and where the hardness of the problem is captured by a margin parameter. To maximize the expected cumulative reward, we introduce a policy called Adaptively Binned Successive Elimination (abse) that adaptively decomposes the global problem into suitably "localized" static bandit problems. This policy constructs an adaptive partition using a variant of the Successive Elimination (se) policy. Our results include sharper regret bounds for the se policy in a static bandit problem and minimax optimal regret bounds for the abse policy in the dynamic problem.
1. Introduction. The seminal paper [19] introduced an important class of sequential optimization problems known as multi-armed bandits. These models have since been used extensively in fields such as statistics, operations research, engineering, computer science and economics. The traditional multi-armed bandit problem can be described as follows. Consider K ≥ 2 statistical populations (arms), where at each point in time it is possible to sample from (pull) only one of them and receive a random reward dictated by the properties of the sampled population. The objective is to devise a sampling policy that maximizes the expected cumulative reward over a finite time horizon. The difference between the performance of a given sampling policy and that of an oracle that repeatedly samples from the best arm is called the regret. We introduce three policies; informal sketches of their mechanisms are given after the enumeration below.

(1) Successive Elimination (se) is dedicated to the static bandit case. It is the cornerstone of the other policies, which deal with covariates. During a first phase, this policy explores the different arms, builds estimates of their mean rewards and sequentially eliminates suboptimal arms; when only one arm remains, it is pulled until the horizon is reached. A variant of se was originally introduced in [8]; however, it was not tuned to minimize the regret, as other measures of performance were investigated in that paper. We prove new regret bounds for this policy that improve upon the canonical papers [14] and [4].
(2) Binned Successive Elimination (bse) follows a simple principle to solve the problem with covariates. It consists of grouping similar covariates into bins and then looking only at the average reward over each bin. These bins are viewed as indexing "local" bandit problems, each solved by the aforementioned se policy. We prove optimal regret bounds, polynomial in the horizon, but only for a restricted class of difficult problems. For the remaining class of easy problems, the bse policy is suboptimal.
(3) Adaptively Binned Successive Elimination (abse) overcomes a severe limitation of the naive bse policy. Indeed, if the problem is globally easy (as characterized by the margin condition), the bse policy employs a fixed discretization of the covariate space that is too fine. Instead, the abse policy partitions the space of covariates in a fashion that adapts to the local difficulty of the problem: cells are smaller where different arms are hard to distinguish and bigger where one arm dominates the others. This adaptive partitioning allows us to prove optimal regret bounds for the whole class of problems.
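For concreteness, the following minimal sketch illustrates the Successive Elimination principle of item (1): every surviving arm is sampled once per round, arms whose empirical mean falls below the best lower confidence bound are discarded, and the last surviving arm is pulled until the horizon. The function pull, the parameter delta and the Hoeffding-type confidence radius are illustrative placeholders, not the calibrated choices analyzed in the paper.

```python
import math

def successive_elimination(pull, K, horizon, delta=0.01):
    # Sketch of the se principle: sample every surviving arm once per round,
    # drop any arm whose empirical mean is worse than the best lower confidence
    # bound, and exploit the last surviving arm until the horizon is reached.
    counts = [0] * K                # number of pulls of each arm
    means = [0.0] * K               # running empirical means
    surviving = list(range(K))      # arms not yet eliminated
    t = 0
    while t < horizon:
        if len(surviving) == 1:     # exploitation phase
            pull(surviving[0])
            t += 1
            continue
        for i in list(surviving):   # one exploration round over surviving arms
            if t >= horizon:
                return surviving
            r = pull(i)             # pull(i) returns a noisy reward in [0, 1]
            counts[i] += 1
            means[i] += (r - means[i]) / counts[i]
            t += 1
        def radius(i):              # generic Hoeffding-type confidence radius
            return math.sqrt(math.log(4 * K * counts[i] ** 2 / delta) / (2 * counts[i]))
        best_lcb = max(means[i] - radius(i) for i in surviving)
        surviving = [i for i in surviving if means[i] + radius(i) >= best_lcb]
    return surviving
```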
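The Binned Successive Elimination principle of item (2) reduces, at each round, to locating the bin that contains the observed covariate and delegating the decision to the local static instance attached to that bin. In the sketch below the covariate lies in [0, 1), and local_se[j] stands for any static elimination routine exposing choose and update methods; these names are assumptions made for illustration only.

```python
def bse_round(x, bins, local_se, pull):
    # One round of the bse principle: the covariate space [0, 1) is cut into
    # `bins` equal-width intervals, and each interval runs its own static
    # bandit instance on the rounds whose covariate falls inside it.
    j = min(int(x * bins), bins - 1)   # bin containing the covariate x
    arm = local_se[j].choose()         # the local instance selects an arm
    reward = pull(arm, x)              # reward depends on the covariate
    local_se[j].update(arm, reward)    # only the local instance is updated
    return arm, reward
```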
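Finally, the adaptive partitioning of item (3) can be pictured as a tree of cells: a cell runs a local elimination routine with a fixed exploration budget and, if several arms remain hard to distinguish once that budget is spent, the cell is split into smaller children that inherit only the surviving arms; if a single arm dominates, the cell commits to it and is never refined. The split factor and the budget bookkeeping below are placeholders rather than the calibrated quantities of the abse policy.

```python
class Cell:
    # Schematic cell of an adaptive partition of the covariate interval
    # [low, high): smaller cells are created only where several arms remain
    # hard to distinguish, while cells where one arm dominates are not refined.
    def __init__(self, low, high, arms, budget):
        self.low, self.high = low, high   # covariate interval covered by the cell
        self.arms = list(arms)            # arms still in contention on this cell
        self.budget = budget              # exploration rounds allotted to the cell
        self.children = []                # populated only if the cell is split

    def ambiguous(self):
        # The local problem is still ambiguous when more than one arm
        # survived the cell's exploration budget.
        return len(self.arms) > 1

    def split(self, factor=2):
        # Refine the cell: each child covers a smaller interval and inherits
        # only the arms that were not eliminated on the parent.
        width = (self.high - self.low) / factor
        self.children = [
            Cell(self.low + k * width, self.low + (k + 1) * width,
                 self.arms, self.budget)
            for k in range(factor)
        ]
```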
The optimal polynomial regret bounds that we prove are much larger than the logarithmic bounds proved in the static case. Nevertheless, it is important to keep in mind that they are valid for a much more flexible model that incorporates covariates. In the particular case where K = 2 and the problem is difficult, these bounds improve upon the results of [18] by removing a logarithmic factor that is idiosyncratic to the exploration vs. exploitation dilemma encountered in bandit problems. Moreover, it follows immediately from the minimax lower bounds of [2] and [18] that these bounds are optimal in a minimax sense and thus cannot be improved further. This reveals an interesting and somewhat surprising phenomenon: the price to pay for the partial information inherent in the bandit problem is dominated by the price to pay for nonparametric estimation. Indeed, the bound on the regret that we obtain in the bandit setup for K = 2 is of the same order as the best attainable bound in the full information case, where at each round the operator receives the reward of only one arm but observes the rewards of both arms. An important example of the full information case is sequential binary classification.
Our policies for the problem with covariates fall into the family of "plug-in" policies, as opposed to "minimum contrast" policies; a detailed account of the differences and similarities between these two approaches in the full information case can be found in [2]. Minimum contrast type policies have already received some attention in the bandit literature with side information, also known as contextual bandits, in [15] and [13]. A related problem, online convex optimization with side information, was studied in [11], where the authors use a discretization technique similar to the one employed in this paper. It is worth noting that the cumulative regret in these papers is defined in a weaker form than in the traditional bandit literature, since the cumulative reward of a proposed policy is compared to that of the best policy in a certain restricted class of policies. Therefore, bounds on the regret depend, among other things, on the complexity of said class of policies. Plug-in type policies have received attention in the context of the continuum-armed bandit problem, where, as the name suggests, there are uncountably many arms. Notable entries in that stream of work are [16] and [20], who impose a smoothness condition both on the space of arms and the