Efficient Adversarial Attacks on High-dimensional Offline Bandits
Bandit algorithms have recently emerged as a powerful tool for evaluating machine learning models, including generative image models and large language models, by efficiently identifying top-performing candidates without exhaustive comparisons. These methods typically rely on a reward model, often distributed with public weights on platforms such as Hugging Face, to provide feedback to the bandit. While online evaluation is expensive and requires repeated trials, offline evaluation with logged data has become an attractive alternative. However, the adversarial robustness of offline bandit evaluation remains largely unexplored, particularly when an attacker perturbs the reward model (rather than the training data) prior to bandit training. In this work, we fill this gap by investigating, both theoretically and empirically, the vulnerability of offline bandit training to adversarial manipulations of the reward model. We introduce a novel threat model in which an attacker exploits offline data in high-dimensional settings to hijack the bandit’s behavior. Starting with linear reward functions and extending to nonlinear models such as ReLU neural networks, we study attacks on two Hugging Face evaluators used for generative model assessment: one measuring aesthetic quality and the other assessing compositional alignment. Our results show that even small, imperceptible perturbations to the reward model’s weights can drastically alter the bandit’s behavior. From a theoretical perspective, we prove a striking high-dimensional effect: as input dimensionality increases, the perturbation norm required for a successful attack decreases, making modern applications such as image evaluation especially vulnerable. Extensive experiments confirm that naive random perturbations are ineffective, whereas carefully targeted perturbations achieve near-perfect attack success rates …
💡 Research Summary
This paper investigates a critical vulnerability in the offline evaluation of machine learning models using bandit algorithms. Bandits are efficient tools for identifying top-performing generative models without exhaustive comparisons, often relying on publicly available reward models (e.g., from Hugging Face) for feedback. While offline evaluation with logged data is cost-effective, its adversarial robustness remains largely unexplored. The authors introduce a novel threat model where an attacker perturbs the reward model’s weights before bandit training, exploiting access to offline datasets.
The core finding is that even imperceptibly small perturbations to a reward model’s parameters can drastically hijack a bandit’s decision-making process. The research progresses from linear reward functions to nonlinear models like ReLU neural networks, demonstrating attacks on two real-world Hugging Face evaluators for generative images: one measuring aesthetic quality and another assessing compositional alignment.
A key theoretical contribution is the proof of a striking high-dimensional effect: as the input dimensionality (d) increases, the ℓ2-norm of the perturbation required for a successful attack decreases, formally scaling as Õ(d^{-1/2}). This makes modern high-dimensional applications like image evaluation particularly vulnerable. The analysis leverages high-dimensional concentration inequalities and random matrix theory.
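The intuition behind this scaling admits a minimal numpy sketch. For a linear reward r(x) = θᵀx, making a runner-up arm outscore the current top arm is a single half-space constraint on the perturbation Δ, whose minimal-ℓ2 solution is a closed-form projection. With unit-norm features and a unit-scale weight vector (assumptions of this sketch, not details from the paper), the score gap between two arms concentrates at scale d^{-1/2}, so the required ‖Δ‖ shrinks as d grows; the helper name below is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def min_flip_perturbation(theta, x_top, x_runner, margin=1e-3):
    """Smallest l2 perturbation of theta making x_runner outscore x_top.

    For a linear reward r(x) = theta @ x, the attack constraint
    (theta + delta) @ (x_runner - x_top) >= margin is one half-space,
    so the minimal-norm delta is a closed-form projection onto
    a = x_runner - x_top (a one-constraint special case of the QP).
    """
    a = x_runner - x_top
    gap = theta @ a                      # negative while x_top still wins
    if gap >= margin:
        return np.zeros_like(theta)
    return ((margin - gap) / (a @ a)) * a

avg_norms = []
for d in (16, 256, 4096):
    trial_norms = []
    for _ in range(20):
        theta = rng.standard_normal(d) / np.sqrt(d)   # unit-scale weights
        x1, x2 = rng.standard_normal((2, d))
        x1 /= np.linalg.norm(x1)                      # unit-norm arm features
        x2 /= np.linalg.norm(x2)
        top, run = (x1, x2) if theta @ x1 >= theta @ x2 else (x2, x1)
        delta = min_flip_perturbation(theta, top, run)
        assert (theta + delta) @ run > (theta + delta) @ top  # ranking flipped
        trial_norms.append(np.linalg.norm(delta))
    avg_norms.append(float(np.mean(trial_norms)))

print(avg_norms)   # average required norm shrinks roughly like d**-0.5
```

Running this shows the average perturbation norm dropping by roughly 4x each time d grows by 16x, consistent with the Õ(d^{-1/2}) rate.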
The paper formulates the attack as a constrained optimization problem. Three attack designs are proposed: 1) Full-Trajectory Attack, which forces the bandit to follow a predetermined sequence of arm pulls; 2) Trajectory-Free Attack, which only prevents the optimal arm from being selected; and 3) Online Score-Aware (OSA) Attack, an efficient online heuristic that adds constraints only when needed, dramatically reducing computational cost while maintaining near-perfect success rates. For linear reward models, these attacks reduce to tractable convex Quadratic Programs.
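The single-step version of such a QP can be sketched as follows: forcing a chosen target arm to top-rank among K arms means min ‖Δ‖² subject to one linear constraint per competing arm. This sketch solves that convex program with Dykstra's alternating projections onto the half-space constraints; this solver choice, the function name, and the unit-norm feature setup are illustrative assumptions, not the paper's method.

```python
import numpy as np

rng = np.random.default_rng(1)

def force_target_arm(theta, X, target, margin=1e-2, n_sweeps=500):
    """Approximate min-norm weight perturbation making arm `target` top-ranked.

    The QP  min ||delta||^2  s.t.  (theta + delta) @ (X[target] - X[j]) >= margin
    for all j != target  is solved via Dykstra's alternating projections onto
    the half-space constraints (an illustrative solver, not the paper's).
    """
    K, d = X.shape
    A = X[target] - np.delete(X, target, axis=0)   # (K-1, d) constraint normals
    b = margin - A @ theta                         # required score corrections
    delta = np.zeros(d)
    corr = np.zeros_like(A)                        # Dykstra correction terms
    for _ in range(n_sweeps):
        for i in range(K - 1):
            y = delta + corr[i]
            viol = b[i] - A[i] @ y
            proj = y + max(viol, 0.0) / (A[i] @ A[i]) * A[i]
            corr[i] = y - proj
            delta = proj
    return delta

d, K = 512, 8
theta = rng.standard_normal(d) / np.sqrt(d)
X = rng.standard_normal((K, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)      # unit-norm arm features

target = int(np.argmin(X @ theta))                 # hijack: promote the worst arm
delta = force_target_arm(theta, X, target)
scores = X @ (theta + delta)
assert int(np.argmax(scores)) == target            # bandit now prefers the target
print(np.linalg.norm(delta), np.linalg.norm(theta))
```

The full-trajectory attack stacks one such constraint set per forced arm pull, and the OSA heuristic adds a constraint only when the current perturbation would let the optimal arm win a round, which keeps the active set small.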
Extensive experiments validate the attacks on synthetic data and the two real-world Hugging Face evaluators. The attacks are shown to be effective across different bandit algorithms (UCB, ETC, ε-greedy). The study also proposes a simple defense mechanism that can partially mitigate attacks under certain conditions, though a complete defense remains an open challenge, highlighting an important direction for future secure offline evaluation.