Efficient Swap Regret Minimization in Combinatorial Bandits


This paper addresses the problem of designing efficient no-swap-regret algorithms for combinatorial bandits, where the number of actions $N$ is exponentially large in the dimension of the problem. In this setting, efficiency means achieving swap regret that is sublinear in the horizon $T$ with only polylogarithmic dependence on $N$. In contrast to the weaker notion of external regret minimization, a problem that is fairly well understood in the literature, achieving no-swap regret with polylogarithmic dependence on $N$ has remained elusive in combinatorial bandits. Our paper resolves this challenge by introducing a no-swap-regret learning algorithm whose regret scales polylogarithmically in $N$ and is tight for the class of combinatorial bandits. To ground our results, we also demonstrate how to implement the proposed algorithm efficiently, that is, with per-iteration complexity that also scales polylogarithmically in $N$, across a wide range of well-studied applications.


💡 Research Summary

The paper tackles the long‑standing open problem of achieving efficient no‑swap‑regret learning in adversarial combinatorial bandits, where the action space size N grows exponentially with the problem dimension (N = O(d^m)). While external regret can be minimized with regret bounds that depend only polylogarithmically on N, prior work on swap regret either incurred polynomial dependence on N or required exponential per‑round computation, making them impractical for large‑scale combinatorial settings.

The authors introduce a novel algorithmic framework called Swap‑ComBCP that attains swap regret O(T·log(d·log T)/log T) while maintaining per‑iteration time O(poly(d,m)). The key ideas are:

  1. Master‑ScaleLearner Architecture – A master learner maintains a distribution over actions and coordinates K parallel “ScaleLearners”. Each ScaleLearner k operates on a different laziness scale: it updates its policy only every H_k = ⌈T/2^k⌉ rounds, keeping the policy fixed in between. By mixing these learners uniformly, the master’s swap regret can be decomposed into a sum of external regrets of the individual ScaleLearners (Lemma 3.2).

  2. Lazy‑CombAlg / Lazy‑CombBCP – Each ScaleLearner runs an abstract combinatorial bandit algorithm called Lazy‑CombAlg. The concrete implementation Lazy‑CombBCP (Algorithm 2) uses barycentric spanners to obtain a low‑dimensional basis for the action polytope and Carathéodory decomposition to represent any mixed strategy with O(d) support. This yields an update rule that runs in O(poly(d,m)) time, avoiding the exponential blow‑up typical of naïve combinatorial bandit methods.

  3. Unbiased Reward Estimation under Partial Feedback – The master observes only the scalar reward r_t = R_t·M_t. It constructs an unbiased estimator of the full reward vector and broadcasts it to all ScaleLearners. Although unbiased with respect to the master’s policy, the estimator is biased for a given ScaleLearner’s policy; handling this bias is the main technical hurdle.

  4. Variance Control via Co‑occurrence Matrix – Standard external‑regret analyses bound second‑moment terms of the reward estimator; here, a co‑occurrence matrix associated with the master's action distribution plays this role, controlling the estimator's variance simultaneously for all ScaleLearners even though the estimator is unbiased only with respect to the master's policy.
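The mixing-and-laziness scheme in items 1–3 can be sketched as follows. This is a minimal, illustrative Python sketch, not the paper's Swap‑ComBCP: the class names, the exponential-weights update, the learning rate, and the toy reward sequence are all assumptions introduced here for readability. In particular, the sketch enumerates actions explicitly, whereas the actual Lazy‑CombBCP avoids exactly that by working over the action polytope with barycentric spanners and Carathéodory decompositions. What the sketch does show faithfully is the structure: $K$ learners on laziness scales $H_k = \lceil T/2^k \rceil$, a master that mixes them uniformly, and an importance-weighted reward estimate that is unbiased with respect to the master's own mixture and is broadcast to every ScaleLearner.

```python
import math
import random

class ScaleLearner:
    """Lazy exponential-weights learner (illustrative stand-in for the
    paper's abstract Lazy-CombAlg): it buffers feedback and applies an
    update only once every H_k rounds, keeping its policy fixed in between."""

    def __init__(self, num_actions, update_interval):
        self.H = update_interval              # laziness scale H_k = ceil(T / 2^k)
        self.weights = [1.0] * num_actions
        self.pending = [0.0] * num_actions    # reward estimates buffered between updates
        self.rounds_since_update = 0

    def distribution(self):
        z = sum(self.weights)
        return [w / z for w in self.weights]

    def feed(self, reward_estimate, lr=0.05):
        # Buffer the broadcast estimate; apply it only every H rounds (the lazy update).
        for i, r in enumerate(reward_estimate):
            self.pending[i] += r
        self.rounds_since_update += 1
        if self.rounds_since_update >= self.H:
            self.weights = [w * math.exp(lr * p)
                            for w, p in zip(self.weights, self.pending)]
            z = sum(self.weights)             # normalize to keep weights bounded
            self.weights = [w / z for w in self.weights]
            self.pending = [0.0] * len(self.weights)
            self.rounds_since_update = 0

def master_play(learners, rng):
    """Master policy: pick one of the K ScaleLearners uniformly at random,
    then sample an action from that learner's current distribution."""
    dist = learners[rng.randrange(len(learners))].distribution()
    x, acc = rng.random(), 0.0
    for a, p in enumerate(dist):
        acc += p
        if x <= acc:
            return a
    return len(dist) - 1

# Toy run: K ScaleLearners on geometrically spaced laziness scales.
T, K, num_actions = 64, 4, 5
rng = random.Random(0)
learners = [ScaleLearner(num_actions, math.ceil(T / 2 ** k)) for k in range(1, K + 1)]

for t in range(T):
    # Master's mixture = uniform average of the ScaleLearners' distributions.
    mix = [sum(l.distribution()[a] for l in learners) / K for a in range(num_actions)]
    a = master_play(learners, rng)
    reward = 1.0 if a == t % num_actions else 0.0   # toy reward sequence
    # Importance-weighted estimate, unbiased w.r.t. the master's mixture
    # (but biased for any individual ScaleLearner, as item 3 notes).
    est = [0.0] * num_actions
    est[a] = reward / max(mix[a], 1e-12)
    for l in learners:                              # broadcast to all ScaleLearners
        l.feed(est)
```

Note the per-round cost here is linear in the number of actions only because the sketch enumerates them; the point of Lazy‑CombBCP is that the same update can be carried out in O(poly(d, m)) time via a low-dimensional representation of the mixed strategy.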

