Single Index Bandits: Generalized Linear Contextual Bandits with Unknown Reward Functions

Generalized linear bandits have been extensively studied due to their broad applicability in real-world online decision-making problems. However, these methods typically assume that the expected reward function is known to the users, an assumption that is often unrealistic in practice. Misspecification of this link function can lead to the failure of all existing algorithms. In this work, we address this critical limitation by introducing a new problem of generalized linear bandits with unknown reward functions, also known as single index bandits. We first consider the case where the unknown reward function is monotonically increasing, and propose two novel and efficient algorithms, STOR and ESTOR, that achieve decent regrets under standard assumptions. Notably, our ESTOR can obtain the nearly optimal regret bound $\tilde{O}_T(\sqrt{T})$ in terms of the time horizon $T$. We then extend our methods to the high-dimensional sparse setting and show that the same regret rate can be attained with the sparsity index. Next, we introduce GSTOR, an algorithm that is agnostic to general reward functions, and establish regret bounds under a Gaussian design assumption. Finally, we validate the efficiency and effectiveness of our algorithms through experiments on both synthetic and real-world datasets.

💡 Research Summary

The paper introduces a fundamentally new formulation of contextual bandits called the Single Index Bandit (SIB), where the expected reward is an unknown, continuously differentiable function $f$ of a linear predictor $x^\top\theta^*$. This setting removes the standard assumption in generalized linear bandits (GLBs) that the link function $f$ is known a priori. The authors argue that misspecifying $f$ can render all existing GLB algorithms ineffective, often leading to linear regret. To address this, they develop a suite of algorithms that operate without any knowledge of $f$.
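In symbols, the setting described above can be sketched as follows. The noise term $\eta_t$, the per-round arm set $\mathcal{X}_t$, and the regret notation $R_T$ are standard bandit conventions assumed here for illustration, not notation quoted from the paper.

$$
y_t = f(x_t^\top \theta^*) + \eta_t, \qquad
R_T = \sum_{t=1}^{T} \Big( \max_{x \in \mathcal{X}_t} f(x^\top \theta^*) - f(x_t^\top \theta^*) \Big)
$$

A GLB algorithm is handed $f$ and only has to estimate $\theta^*$. In the single index setting both $f$ and $\theta^*$ are unknown, so the learner cannot plug a known link into its estimator.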

The core technical contribution is a novel estimator based on Stein’s method. By applying Stein’s identity, they show that $\mathbb{E}$…
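As a rough illustration of why Stein's identity is useful when $f$ is unknown, consider the Gaussian case. For $x \sim N(0, I_d)$ and a differentiable $f$, Stein's lemma gives $\mathbb{E}[f(x^\top\theta^*)\,x] = \mathbb{E}[f'(x^\top\theta^*)]\,\theta^*$, so the sample average of $y_i x_i$ points along $\theta^*$ without ever evaluating $f$. The sketch below checks this numerically. It is not the paper's STOR, ESTOR, or GSTOR procedure. The sigmoid link, the noise level, the sample size, and the function names are all assumptions chosen for the demonstration.

```python
import numpy as np


def stein_direction_estimate(X, y):
    """Average of y_i * x_i over the sample.

    Assumption: rows of X are drawn from N(0, I_d). Under that design,
    Stein's lemma says this average targets E[f'(x^T theta)] * theta,
    i.e. a scaled copy of the true parameter, for any differentiable f.
    """
    return (X * y[:, None]).mean(axis=0)


def demo(n=200_000, d=5, seed=0):
    """Return the cosine similarity between the estimated and true directions."""
    rng = np.random.default_rng(seed)

    # Hypothetical ground-truth parameter, normalized to unit length.
    theta_star = rng.normal(size=d)
    theta_star /= np.linalg.norm(theta_star)

    # Gaussian contexts (the design assumption this sketch relies on).
    X = rng.normal(size=(n, d))

    # Hypothetical monotone link; the estimator never sees this function.
    def f(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Noisy rewards y = f(x^T theta*) + noise.
    y = f(X @ theta_star) + 0.1 * rng.normal(size=n)

    v = stein_direction_estimate(X, y)
    theta_hat = v / np.linalg.norm(v)
    return float(theta_hat @ theta_star)


cosine = demo()
print(f"cosine similarity with the true direction: {cosine:.4f}")
```

Only the direction of $\theta^*$ is recovered here, since the scale factor $\mathbb{E}[f'(x^\top\theta^*)]$ depends on the unknown link. A monotonically increasing $f$ keeps that factor positive, which fixes the sign of the estimate. For a general $f$ the factor can vanish, and this first-order average alone would carry no signal.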
