Efficient Learning in Large-Scale Combinatorial Semi-Bandits
A stochastic combinatorial semi-bandit is an online learning problem where at each step a learning agent chooses a subset of ground items subject to combinatorial constraints, and then observes stochastic weights of these items and receives their sum as a payoff. In this paper, we consider efficient learning in large-scale combinatorial semi-bandits with linear generalization, and as a solution, propose two learning algorithms called Combinatorial Linear Thompson Sampling (CombLinTS) and Combinatorial Linear UCB (CombLinUCB). Both algorithms are computationally efficient as long as the offline version of the combinatorial problem can be solved efficiently. We establish that CombLinTS and CombLinUCB are also provably statistically efficient under reasonable assumptions, by developing regret bounds that are independent of the problem scale (number of items) and sublinear in time. We also evaluate CombLinTS on a variety of problems with thousands of items. Our experiment results demonstrate that CombLinTS is scalable, robust to the choice of algorithm parameters, and significantly outperforms the best of our baselines.
💡 Research Summary
The paper addresses the challenge of learning in stochastic combinatorial semi‑bandit problems when the number of ground items L is extremely large. Traditional algorithms estimate each item’s mean reward separately, leading to regret that scales at least as O(√L), which is prohibitive for modern applications such as large‑scale recommendation, advertising, or network routing. To overcome this limitation, the authors assume that each item is described by a d‑dimensional feature vector and that the expected weight vector w̄ lies in (or close to) the linear subspace spanned by the columns of a known feature matrix Φ ∈ ℝ^{L×d}. In the coherent case, w̄ = Φθ^* for some unknown parameter vector θ^* ∈ ℝ^d. This linear generalization reduces the effective number of parameters from L to d, enabling statistically efficient learning even when L is infinite.
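The parameter reduction can be illustrated with a minimal sketch (variable names and sizes are our own illustration, not from the paper): with linear generalization, all L item means are determined by just d numbers.

```python
import numpy as np

# Illustrative sketch of the coherent linear-generalization model:
# w_bar = Phi @ theta_star, so L mean weights come from d parameters.
rng = np.random.default_rng(0)
L, d = 10_000, 5                   # many ground items, few parameters
Phi = rng.normal(size=(L, d))      # known feature matrix, one row per item
theta_star = rng.normal(size=d)    # unknown parameter vector theta*
w_bar = Phi @ theta_star           # coherent case: mean weights lie in span(Phi)

assert w_bar.shape == (L,)         # L mean weights from just d parameters
```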
Two algorithms are proposed: Combinatorial Linear Thompson Sampling (CombLinTS) and Combinatorial Linear Upper Confidence Bound (CombLinUCB). Both maintain a posterior mean θ̄_t and covariance Σ_t over θ^* using Kalman‑filter‑style updates after each round’s semi‑bandit feedback. CombLinTS draws a sample θ_t from the current Gaussian posterior and feeds the induced weight vector Φθ_t to an offline combinatorial optimization oracle, which returns the action A_t that maximizes the sampled linear objective. CombLinUCB instead constructs an optimistic estimate for each item e, ŵ_t(e) = φ_eᵀθ̄_t + c·√(φ_eᵀ Σ_t φ_e), and calls the oracle on these upper‑confidence values. The oracle can be any exact or approximate algorithm for the underlying combinatorial problem (e.g., matching, shortest path, knapsack), and the learning algorithms inherit its computational complexity.
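The two action-selection rules can be sketched as follows. This is a simplified illustration under the summary's notation, not the authors' implementation; the `oracle` argument stands in for any offline combinatorial solver (e.g., a top-K selector), and function names are ours.

```python
import numpy as np

def comblin_ts_round(Phi, theta_bar, Sigma, oracle, rng):
    """One CombLinTS-style round (sketch): sample from the posterior,
    score items under the sampled parameters, and let the oracle choose."""
    theta_t = rng.multivariate_normal(theta_bar, Sigma)  # posterior sample
    w_t = Phi @ theta_t                                  # induced item weights
    return oracle(w_t)                                   # offline combinatorial solver

def comblin_ucb_scores(Phi, theta_bar, Sigma, c):
    """CombLinUCB-style optimistic item estimates (sketch):
    w_hat(e) = phi_e^T theta_bar + c * sqrt(phi_e^T Sigma phi_e)."""
    mean = Phi @ theta_bar
    # diag(Phi Sigma Phi^T) without forming the full L x L matrix
    width = np.sqrt(np.einsum('ld,dk,lk->l', Phi, Sigma, Phi))
    return mean + c * width
```

A hypothetical oracle for a cardinality constraint could simply pick the K highest-scoring items, e.g. `lambda w: np.argsort(w)[-K:]`; any exact or approximate solver with the same interface would do.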
Theoretical analysis focuses on the “coherent Gaussian case”: (1) the true mean satisfies w̄ = Φθ^*; (2) the prior on θ^* is N(0, λ²I); (3) the observation noise is i.i.d. N(0, σ²); and (4) λ ≥ σ. Under these conditions, the authors prove a Bayes regret bound for CombLinTS:
R_Bayes(n) = O(d·√(n log(1 + nσ²/λ²))).
A similar expected regret bound holds for CombLinUCB. Crucially, the bounds are independent of L and grow sub‑linearly in the horizon n, matching the best known rates for linear bandits but now applied to a combinatorial action space. The proofs adapt standard linear‑bandit concentration arguments to the semi‑bandit feedback setting and exploit the fact that the oracle’s solution is optimal for the sampled (or optimistic) weight vector, allowing the regret to be expressed in terms of the estimation error of θ^*.
From a computational standpoint, each round requires only one oracle call and O(d²) matrix operations for updating Σ_t and (\bar θ_t). Hence the algorithms are scalable to thousands of items as long as the underlying combinatorial problem can be solved efficiently (or approximated with guarantees).
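The O(d²) per-round update can be written as a standard rank-one Kalman (linear-Gaussian Bayesian) update, applied once per item whose weight is observed in the chosen action. This is a generic sketch of that update rule, not code from the paper:

```python
import numpy as np

def posterior_update(theta_bar, Sigma, phi_e, w_obs, sigma2):
    """Rank-one Kalman-style update after observing the weight of one item
    with feature vector phi_e. Costs O(d^2); applied once per item in A_t."""
    s = float(phi_e @ Sigma @ phi_e) + sigma2        # predictive variance
    k = (Sigma @ phi_e) / s                          # Kalman gain
    theta_bar = theta_bar + k * (w_obs - phi_e @ theta_bar)
    Sigma = Sigma - np.outer(k, phi_e @ Sigma)       # posterior covariance
    return theta_bar, Sigma
```

With a prior Σ₀ = λ²I and noise variance σ², repeatedly applying this update shrinks the posterior variance along the observed feature directions, which is exactly what drives the confidence widths in CombLinUCB and the posterior samples in CombLinTS toward θ^*.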
Empirical evaluation tests CombLinTS on synthetic problems with up to 10,000 items and on two real‑world datasets (movie recommendation and online ad matching). The experiments compare against several baselines, including LinUCB, CUCB, and other combinatorial bandit methods that do not exploit feature generalization. Results show that CombLinTS consistently achieves lower cumulative regret, converges faster, and is robust to the choice of hyper‑parameters λ and σ. Sensitivity analysis indicates that while overly small λ hampers exploration, a moderate λ combined with a reasonable σ yields stable performance across problem sizes.
In summary, the paper makes three major contributions: (1) it introduces a linear generalization framework for combinatorial semi‑bandits that eliminates dependence on the number of items; (2) it provides two practical, oracle‑based algorithms with provable L‑independent regret guarantees; and (3) it validates the approach empirically on large‑scale problems, demonstrating both statistical efficiency and computational practicality. The work opens avenues for extending the methodology to contextual settings, non‑linear feature mappings, and dynamic environments where the feature matrix may evolve over time.