In this paper, we consider the problem of multi-armed bandits with a large, possibly infinite number of correlated arms. We assume that the arms have Bernoulli distributed rewards, independent across time, where the probabilities of success are parametrized by known attribute vectors for each arm, as well as an unknown preference vector, each of dimension $n$. For this model, we seek an algorithm with total regret that is sub-linear in time and independent of the number of arms. We present such an algorithm, which we call the Two-Phase Algorithm, and analyze its performance. We show upper bounds on the total regret which apply uniformly in time, for both the finite and infinite arm cases. The asymptotics of the finite arm bound show that for any $f \in \omega(\log(T))$, the total regret can be made to be $O(n \cdot f(T))$. In the infinite arm case, the total regret is $O(\sqrt{n^3 T})$.
The stochastic multi-armed bandit problem is the following: suppose we are allowed to choose to "pull," or play, any one of $m$ slot machines (also known as one-armed bandits) in each of $T$ timesteps, where each slot machine generates a reward according to its own distribution, which is unknown to us. The parameters of the reward distributions are correlated between machines, but the rewards themselves are independent across machines, and independent and identically distributed across timesteps. The choice of which arm to pull may be a function of the sequence of past pulls and the sequence of past rewards. If our goal is to maximize the total reward obtained, taking expectation over the randomness of the outcomes, ideally we would pull the arm with the largest mean at every timestep. However, we do not know in advance which arm has the largest mean, so a certain amount of exploration is required. Too much exploration, though, wastes time that could be spent reaping the reward offered by the best arm. This exemplifies the fundamental trade-off between exploration and exploitation present in a wide class of online machine learning problems.
We consider a model for multi-armed bandit problems in which a large number of arms are present, and the expected rewards of the arms are coupled through an unknown parameter of lower dimension. It is then no longer necessary to investigate each arm individually in order to estimate its expected reward. Instead, we can estimate the underlying parameter; in this way, each pull can yield information about multiple arms. We present a simple algorithm, as well as bounds on the expected total regret as a function of the time horizon when using this algorithm. While possibly sub-optimal, these bounds are independent of the number of arms. (Research supported in part by AFOSR MURI FA 9550-10-1-0573.)
This model is applicable to certain e-commerce applications: suppose an online retailer has a large number of related products and wishes to maximize the revenue or profit coming from a certain set of customers. If the preferences of this set of customers are known, the list of displayed items can be sorted in descending order of expected revenue or profit. However, we may not know a priori what this preference vector is, so we wish to learn it online by sequentially presenting each user with an item, observing whether the user buys the item, and then updating an internal estimate of the preference vector.
As a concrete example, imagine an online camera store with hundreds of different camera models in stock. However, there are perhaps closer to ten features that people compare when deciding which camera, if any, to purchase. There are permanent features of the camera itself, such as megapixel count, brand name, and year of introduction, as well as extrinsic features, such as price, review scores, and item popularity. All of these features might be considered by the customer in deciding whether or not to buy the camera. If the camera is bought, the store gains the profit associated with that item. A key distinction of our model, compared to previous work, is the incorporation of this inherently binary choice customers are faced with: to buy or not to buy.
Our model consists of a multi-armed bandit with a set $U$ consisting of $m$ arms (items) and $n$ underlying parameters (attributes), where $m \geq n$ and potentially $m \gg n$. We will interchangeably also think of $U$ as an $n \times m$ matrix, where each arm $u$ is an $n$-dimensional attribute vector and is one of the columns of $U$. Furthermore, we will assume that $\mathrm{rank}(U) = n$. There is also a constant but unknown preference vector $z^* \in \mathbb{R}^n$. The quality $\beta_u = u^T z^*$ of arm $u$ is a scalar indicating how desirable the item is to a user. We will use the logistic function $f$ to define the expected reward of an arm $u$, assuming a particular $z$, as
$$\alpha_u(z) = f(u^T z) = \frac{1}{1 + e^{-u^T z}}.$$
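As a quick illustration, the sketch below (our own, not part of the paper; the function name and the use of NumPy are assumptions for illustration) evaluates the logistic expected-reward map $\alpha_u(z)$ for all $m$ arms at once, treating $U$ as an $n \times m$ matrix as above.

```python
import numpy as np

def expected_rewards(U, z):
    """Expected reward alpha_u(z) = 1 / (1 + exp(-u^T z)) for every arm.

    U is the n x m attribute matrix (one column per arm) and z is an
    n-dimensional preference vector; returns a length-m vector of rewards.
    """
    qualities = U.T @ z                      # beta_u = u^T z for each arm u
    return 1.0 / (1.0 + np.exp(-qualities))  # logistic function applied elementwise
```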
Thus, the expected rewards of all of the arms are coupled through $z^*$. For notational simplicity, we define $\alpha^*_u = \alpha_u(z^*)$. Let the set of equally best arms be
$$U^* = \arg\max_{u \in U} \alpha^*_u.$$
Define the expected reward of a best arm to be
$$\alpha^* = \max_{u \in U} \alpha^*_u.$$
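Under the same illustrative assumptions (NumPy, hypothetical helper names), the set of equally best arms and the best expected reward can be read off directly from the vector of $\alpha^*_u$ values:

```python
import numpy as np

def best_arms(U, z_star):
    """Return the indices of the equally best arms and the best expected reward."""
    alphas = 1.0 / (1.0 + np.exp(-(U.T @ z_star)))            # alpha*_u for every arm
    alpha_star = alphas.max()                                   # expected reward of a best arm
    best = np.flatnonzero(np.isclose(alphas, alpha_star))      # indices of equally best arms
    return best, alpha_star
```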
At each timestep $t$ up to a finite time horizon $T$, a policy will choose to pull exactly one arm, call this arm $C_t$, and a reward $X_t$ will be obtained, where $X_t \sim \mathrm{Ber}(\alpha^*_{C_t})$. We wish to find policies $g$ which maximize the expected total reward, $\mathbb{E}\left[\sum_{t=1}^{T} X_t\right]$, or equivalently, minimize the expected total regret,
$$R(T) = T\alpha^* - \mathbb{E}\left[\sum_{t=1}^{T} \alpha^*_{C_t}\right] = \sum_{t=1}^{T} \mathbb{E}\left[\alpha^* - \alpha^*_{C_t}\right].$$
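To make the objective concrete, the following sketch (again illustrative; the uniformly random policy and the function names are our own assumptions, not the Two-Phase Algorithm) simulates $T$ pulls with Bernoulli rewards and accumulates the expected total regret $\sum_{t=1}^{T} (\alpha^* - \alpha^*_{C_t})$.

```python
import numpy as np

def simulate_regret(U, z_star, policy, T, rng=None):
    """Simulate T pulls of a policy and return its expected total regret.

    `policy(t, history)` returns the index C_t of the arm to pull at time t;
    `history` is the list of (arm, reward) pairs observed so far.
    """
    rng = np.random.default_rng() if rng is None else rng
    alphas = 1.0 / (1.0 + np.exp(-(U.T @ z_star)))   # alpha*_u for every arm
    alpha_star = alphas.max()                         # expected reward of a best arm
    history, regret = [], 0.0
    for t in range(1, T + 1):
        c_t = policy(t, history)                      # arm chosen at time t
        x_t = rng.binomial(1, alphas[c_t])            # X_t ~ Ber(alpha*_{C_t})
        history.append((c_t, x_t))
        regret += alpha_star - alphas[c_t]            # expected regret incurred by this pull
    return regret

# Example usage: a deliberately naive, uniformly random policy.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, m = 5, 50
    U = rng.standard_normal((n, m))                   # random attribute vectors
    z_star = rng.standard_normal(n)                   # unknown preference vector
    uniform = lambda t, history: rng.integers(m)
    print(simulate_regret(U, z_star, uniform, T=1000, rng=rng))
```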
For an introduction to and survey of classical multi-armed bandit problems and their variations, see Mahajan and Teneketzis [1]. One of the earliest breakthroughs on the classical multi-armed bandit problem came from Gittins and Jones [2], who showed that under geometric discounting, the optimal policy assigns an index to each arm, now known as the Gittins index, and pulls the arm with the largest Gittins index. Other proofs of this optimality were later given by Weber [3] and Tsitsiklis [4]. Whittle [5] proved that a similar index-based result is nearly optimal in the “restless bandit” setting.