Towards minimax policies for online linear optimization with bandit feedback

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

We address the online linear optimization problem with bandit feedback. Our contribution is twofold. First, we provide an algorithm (based on exponential weights) with a regret of order $\sqrt{d n \log N}$ for any finite action set with $N$ actions, under the assumption that the instantaneous loss is bounded by 1. This shaves off an extraneous $\sqrt{d}$ factor compared to previous works, and gives a regret bound of order $d \sqrt{n \log n}$ for any compact set of actions. Without further assumptions on the action set, this last bound is minimax optimal up to a logarithmic factor. Interestingly, our result also shows that the minimax regret for bandit linear optimization with expert advice in dimension $d$ is the same as for the basic $d$-armed bandit with expert advice. Our second contribution is to show how to use the Mirror Descent algorithm to obtain computationally efficient strategies with minimax optimal regret bounds in specific examples. More precisely we study two canonical action sets: the hypercube and the Euclidean ball. In the former case, we obtain the first computationally efficient algorithm with a $d \sqrt{n}$ regret, thus improving by a factor $\sqrt{d \log n}$ over the best known result for a computationally efficient algorithm. In the latter case, our approach gives the first algorithm with a $\sqrt{d n \log n}$ regret, again shaving off an extraneous $\sqrt{d}$ compared to previous works.


💡 Research Summary

The paper tackles the problem of online linear optimization (OLO) under bandit feedback, where at each round the learner selects an action from a set and only observes the scalar loss incurred by that action, rather than the full loss vector. This limited feedback makes the classic exploration‑exploitation trade‑off considerably more challenging. The authors make two major contributions.

First, they propose a new exponential‑weights‑based algorithm that achieves a regret bound of order √(d n log N) for any finite action set of size N, assuming each instantaneous loss is bounded by one. The key technical idea is to construct an unbiased loss estimator by uniformly sampling each coordinate with probability 1/d and scaling the observed loss by the inverse of this probability. This estimator has variance at most d, which eliminates the extraneous √d factor that appears in earlier works (whose bounds are typically of order d √(n log N), equivalently √(d² n log N)). Plugging the estimator into the exponential‑weights update, the cumulative estimated loss concentrates within O(√(d n log N)), yielding the claimed regret.
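For intuition, the exponential‑weights update under bandit feedback can be sketched as follows. This is a minimal illustration using the classic importance‑weighted estimator on a finite action set; the paper's estimator instead exploits the linear structure of the losses, and all names below are ours, not the paper's.

```python
import numpy as np

def exp_weights_bandit(loss_matrix, eta, seed=0):
    """Exponential weights with an importance-weighted loss estimator.

    loss_matrix: (n, N) array with entries in [0, 1]; row t holds the
    losses of all N actions at round t, of which only the played entry
    is revealed to the learner.  Returns the learner's cumulative loss
    and the cumulative loss of the best fixed action in hindsight.
    """
    rng = np.random.default_rng(seed)
    n, N = loss_matrix.shape
    cum_est = np.zeros(N)               # cumulative estimated losses
    learner_loss = 0.0
    for t in range(n):
        w = np.exp(-eta * (cum_est - cum_est.min()))  # shift for stability
        p = w / w.sum()                 # exponential-weights distribution
        i = rng.choice(N, p=p)          # sample an action to play
        obs = loss_matrix[t, i]         # scalar bandit feedback
        learner_loss += obs
        cum_est[i] += obs / p[i]        # unbiased importance-weighted estimate
    best_fixed = loss_matrix.sum(axis=0).min()
    return learner_loss, best_fixed
```

Dividing the observed loss by the play probability keeps the estimate unbiased, at the cost of variance growing like 1/p[i]; controlling that variance is exactly where the paper's dimension‑dependent analysis enters.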

The authors then extend the result to compact (continuous) action sets. By covering the set with an ε‑net of size roughly (1/ε)^d and choosing ε≈1/√n, they obtain a regret of order d √(n log n). This matches the minimax lower bound Ω(d √n) up to a logarithmic factor, showing that the algorithm is essentially optimal in the worst case.
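The covering step can be made explicit with a short calculation. This is the standard discretization argument written out under the scalings stated above, not a quote from the paper:

```latex
% Cover the compact set with an \epsilon-net of N \approx (1/\epsilon)^d points,
% run the finite-set algorithm on the net, and pay \epsilon per round in approximation:
R_n \;\lesssim\; \sqrt{d\,n \log N} + \epsilon n
    \;=\; \sqrt{d\,n \cdot d \log(1/\epsilon)} + \epsilon n
    \;=\; d\sqrt{n \log(1/\epsilon)} + \epsilon n .
% Taking \epsilon \approx 1/\sqrt{n} balances the two terms:
R_n \;\lesssim\; d\sqrt{n \log n} + \sqrt{n} \;=\; O\!\bigl(d\sqrt{n \log n}\bigr).
```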

Second, the paper addresses computational efficiency. The exponential‑weights algorithm requires updating a weight for every action, which is prohibitive when the action set is large or continuous. To overcome this, the authors employ the Mirror Descent (MD) framework, which leverages the linear structure of the loss and the geometry of the action set to replace the full weight update with a single gradient step and projection per round. They instantiate MD for two canonical action sets.

  • For the hypercube { x∈ℝ^d | ‖x‖_∞≤1 }, they use an ℓ₁‑based mirror map and a learning rate η≈√(log n / (d n)). The resulting algorithm attains a regret of d √n, improving by a factor of √(d log n) over the best previously known bound for a computationally efficient algorithm. The per‑round cost is O(d), making it scalable to high dimensions.

  • For the Euclidean ball { x∈ℝ^d | ‖x‖_2≤1 }, they adopt the standard ℓ₂‑mirror map (quadratic regularizer) with the same learning‑rate scaling. This yields a regret of √(d n log n), shaving off the extra √d term that appeared in earlier bandit‑linear‑optimization algorithms. The projection onto the Euclidean ball reduces to a simple normalization, again leading to O(d) time per round.
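As an illustration of the Euclidean‑ball case, here is a minimal sketch of one Mirror Descent step with the quadratic mirror map, where the update reduces to a gradient step followed by a rescaling projection. Function and variable names are ours, and the paper's bandit version feeds in a one‑point loss estimate rather than the true loss vector:

```python
import numpy as np

def md_ball_step(x, loss_vec, eta):
    """One Mirror Descent step on the unit Euclidean ball.

    With the quadratic regularizer ½‖x‖² the mirror map is the identity,
    so MD reduces to projected gradient descent; projecting onto
    {‖x‖₂ ≤ 1} is a simple rescaling, giving O(d) work per round.
    """
    y = x - eta * loss_vec            # gradient step (the loss is linear)
    norm = np.linalg.norm(y)
    return y / norm if norm > 1.0 else y   # normalize only if outside the ball
```

Iterating this step against a fixed loss vector ℓ drives the play toward −ℓ/‖ℓ‖₂, the point of the ball minimizing the linear loss ⟨x, ℓ⟩.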

The paper also establishes minimax optimality via matching lower bounds: any algorithm must suffer regret at least Ω(d √n) over arbitrary compact action sets, and Ω(√(d n)) over the Euclidean ball, confirming that the proposed upper bounds are tight up to logarithmic factors.


In summary, the work delivers (1) a refined exponential‑weights scheme that eliminates unnecessary dimension dependence in the regret bound for bandit linear optimization, and (2) computationally efficient Mirror‑Descent algorithms that attain near‑optimal regret for important action sets like the hypercube and Euclidean ball. The results close a long‑standing gap between information‑theoretic limits and algorithmic practicality in the bandit linear‑optimization literature, and they open avenues for extending these techniques to non‑linear losses, time‑varying action sets, and distributed learning environments.

