Contextual Bandit Learning with Predictable Rewards

Notice: This research summary and analysis were automatically generated using AI technology. For complete accuracy, please refer to the original arXiv source.

Contextual bandit learning is a reinforcement learning problem in which the learner repeatedly receives a set of features (a context), takes an action, and receives a reward that depends on both the action and the context. We consider this problem under a realizability assumption: there exists a function in a known function class that always predicts the expected reward given the action and context. Under this assumption, we show three things. First, we present a new algorithm, Regressor Elimination, whose regret is similar to that of the agnostic setting (i.e., without the realizability assumption). Second, we prove a new lower bound showing that no algorithm can achieve better worst-case performance, even under the realizability assumption. Third, we show that for any set of policies (mappings from contexts to actions), there is a distribution over rewards (given context) under which our new algorithm has constant regret, unlike previous approaches.


💡 Research Summary

Contextual bandits model a sequential decision‑making problem where, at each round, a learner observes a context, selects an action, and receives a reward that depends on both. Traditional analyses often assume an agnostic setting: the learner’s hypothesis class may not contain the true expected‑reward function, and regret bounds are derived under this worst‑case premise. The paper “Contextual Bandit Learning with Predictable Rewards” departs from this tradition by imposing a realizability assumption: there exists a function f* within a known function class 𝔽 that exactly predicts the conditional expectation of the reward for every context‑action pair. Under this stronger assumption the authors develop a new algorithm, Regressor Elimination (RE), prove matching upper and lower regret bounds, and demonstrate that for certain policy families the algorithm enjoys constant regret, outperforming prior methods.
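The summary above describes Regressor Elimination only at a high level. The following is a heavily simplified sketch of the elimination idea, not the paper's algorithm: the tiny regressor class, the noise model, the greedy action choice, and the elimination slack (0.05) are all illustrative assumptions. The actual algorithm instead draws actions from a carefully balanced distribution and uses data-dependent confidence bounds to decide which regressors to discard.

```python
import random

random.seed(1)
K = 2          # number of actions
T = 500        # number of rounds

# Hypothetical finite regressor class F (an illustrative assumption): each f
# maps (context x in [0, 1], action a) to a predicted reward in [0, 1].
def make_f(w):
    return lambda x, a: w * x if a == 0 else w * (1.0 - x)

weights = [0.1, 0.5, 0.9]
F = [make_f(w) for w in weights]
f_star = F[1]            # realizability: the true reward function lies in F

survivors = list(F)
history = []             # observed (context, action, reward) triples

for t in range(T):
    x = random.random()
    # Simplified action choice: act greedily w.r.t. a random surviving
    # regressor (the paper instead computes an action distribution chosen to
    # control the variance of its error estimates).
    f = random.choice(survivors)
    a = max(range(K), key=lambda act: f(x, act))
    r = min(1.0, max(0.0, f_star(x, a) + random.uniform(-0.05, 0.05)))
    history.append((x, a, r))

    # Eliminate regressors whose empirical squared error exceeds the best
    # survivor's by more than an (arbitrary, illustrative) slack.
    def sq_err(g):
        return sum((g(xx, aa) - rr) ** 2 for xx, aa, rr in history) / len(history)
    best = min(sq_err(g) for g in survivors)
    survivors = [g for g in survivors if sq_err(g) <= best + 0.05]

# After enough rounds, the true regressor survives while clearly worse
# regressors are eliminated, so later action choices concentrate on f_star.
```

Because the true regressor's squared error is dominated by noise alone, it is never within the elimination margin of being discarded, which is the intuition behind why realizability makes constant regret possible on favorable reward distributions.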

The problem formulation is precise. The context space 𝔛, action set 𝔄, and reward range [0, 1] are fixed in advance: at each round the learner observes a context x ∈ 𝔛, chooses an action a ∈ 𝔄, and receives a reward whose conditional expectation is f*(x, a) for some f* in the known class 𝔽.
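The interaction protocol and the notion of regret can be sketched concretely as follows. The uniform-random policy, the specific choice of f*, and the noise model are all illustrative assumptions, not taken from the paper; regret is measured against the action with the highest expected reward in each round.

```python
import random

random.seed(0)
K = 3          # finite action set A = {0, ..., K-1}
T = 200        # number of rounds

def f_star(x, a):
    # Assumed true expected-reward function (illustrative choice), in [0, 1].
    return abs(x - a / (K - 1))

regret = 0.0
for t in range(T):
    x = random.random()                          # context for this round
    a = random.randrange(K)                      # learner's (here: random) action
    # Observed reward: expectation plus bounded noise, clipped to [0, 1].
    r = min(1.0, max(0.0, f_star(x, a) + random.uniform(-0.1, 0.1)))
    best = max(f_star(x, b) for b in range(K))   # expected reward of best action
    regret += best - f_star(x, a)                # per-round expected regret
```

A uniformly random policy accumulates regret linear in T; the point of Regressor Elimination is to exploit realizability so that, for favorable reward distributions, the accumulated regret stays bounded.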

