Self-Concordant Perturbations for Linear Bandits
We consider the adversarial linear bandits setting and present a unified algorithmic framework that bridges Follow-the-Regularized-Leader (FTRL) and Follow-the-Perturbed-Leader (FTPL) methods, extending the known connection between them from the full-information setting. Within this framework, we introduce self-concordant perturbations, a family of probability distributions that mirror the role of self-concordant barriers previously employed in the FTRL-based SCRiBLe algorithm. Using this idea, we design a novel FTPL-based algorithm that combines self-concordant regularization with efficient stochastic exploration. Our approach achieves a regret of $\mathcal{O}(d\sqrt{n \ln n})$ on both the $d$-dimensional hypercube and the $\ell_2$ ball. On the $\ell_2$ ball, this matches the rate attained by SCRiBLe. For the hypercube, this represents a $\sqrt{d}$ improvement over these methods and matches the optimal bound up to logarithmic factors.
💡 Research Summary
The paper tackles the adversarial linear bandit problem, where a learner repeatedly selects actions from a convex compact set K ⊂ ℝᵈ, observes only the scalar loss ⟨yₜ, Aₜ⟩, and aims to minimize regret against the best fixed action in hindsight. Two classic algorithmic families—Follow‑the‑Regularized‑Leader (FTRL) and Follow‑the‑Perturbed‑Leader (FTPL)—are known to be unified under the Gradient‑Based Prediction Algorithm (GBPA) in the full‑information setting. However, this unification does not directly extend to bandits because only partial feedback is available, forcing the design of explicit exploration mechanisms.
The authors introduce Bandits‑GBPA, a general template that extends GBPA to the bandit setting by adding (i) a sampling scheme S that randomizes the chosen action based on the current estimate of cumulative loss, and (ii) an estimation scheme E that reconstructs an unbiased estimate of the unknown loss vector from the observed scalar loss. When (S, E) are unbiased, the regret analysis of FTRL can be carried over.
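The template above can be sketched as a single round of play. The concrete (S, E) pair below — uniform random vertices of the hypercube {−1,1}ᵈ with the one‑point estimator ŷ = a·⟨y, a⟩ — is an illustrative assumption, not the paper's self‑concordant construction; it only demonstrates the mechanics of the unbiasedness requirement 𝔼[ŷ] = y.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_S(theta_hat, d):
    """Sampling scheme S: here, a uniform random vertex of {-1,1}^d.
    (A real scheme would bias sampling toward low estimated loss.)"""
    return rng.choice([-1.0, 1.0], size=d)

def estimate_E(a, scalar_loss):
    """Estimation scheme E: for uniform signs E[a a^T] = I, so
    y_hat = a * <y, a> is an unbiased estimate of the loss vector y."""
    return a * scalar_loss

def bandits_gbpa_round(y, theta_hat):
    """One round: sample an action, observe only the scalar loss <y, a>,
    and fold the reconstructed estimate into the cumulative loss estimate."""
    a = sample_S(theta_hat, y.shape[0])
    scalar = float(y @ a)             # bandit feedback: a single scalar
    return theta_hat + estimate_E(a, scalar), scalar
```

Because the estimator is unbiased, averaging `estimate_E` over many rounds recovers the true loss vector, which is exactly the property that lets the FTRL regret analysis carry over.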
The central technical contribution is the notion of self‑concordant perturbations. A probability distribution D on ℝᵈ is called a ϑ‑self‑concordant perturbation for K if there exists a ϑ‑self‑concordant barrier R on K such that the gradient of the Fenchel conjugate R* satisfies the perturbed‑argmax identity

∇R*(θ) = 𝔼_{ξ∼D}[argmax_{a∈K} ⟨θ + ξ, a⟩]  for all θ ∈ ℝᵈ.

In words, the FTPL prediction induced by D coincides in expectation with the FTRL prediction regularized by R, mirroring the role self‑concordant barriers play in SCRiBLe.
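The identity relates the gradient of a smoothed potential to an expected perturbed argmax. A quick Monte‑Carlo sanity check on the one‑dimensional set K = [−1, 1] with a Gaussian perturbation (an illustrative choice, not necessarily a self‑concordant perturbation) confirms that the finite‑difference gradient of Φ(θ) = 𝔼_ξ[max_{a∈K} (θ+ξ)a] matches 𝔼_ξ[argmax]:

```python
import numpy as np

# Check d/dtheta E[max_{a in [-1,1]} (theta+xi)a] = E[argmax] numerically.
# On [-1,1], max_a (theta+xi)*a = |theta+xi| and argmax = sign(theta+xi).
# Gaussian xi is an illustrative stand-in for a perturbation distribution.
rng = np.random.default_rng(1)
xi = rng.standard_normal(500_000)      # shared noise (variance reduction)

def Phi(theta):
    """Smoothed potential: E over xi of the perturbed maximum."""
    return np.mean(np.abs(theta + xi))

def expected_argmax(theta):
    """E over xi of the perturbed maximizer in [-1, 1]."""
    return np.mean(np.sign(theta + xi))

theta, h = 0.4, 1e-4
finite_diff = (Phi(theta + h) - Phi(theta - h)) / (2 * h)
```

With common random numbers the two quantities agree to Monte‑Carlo precision, illustrating why stochastic perturbations can substitute for an explicit barrier-regularized update.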