Combinatorial Network Optimization with Unknown Variables: Multi-Armed Bandits with Linear Rewards


In the classic multi-armed bandits problem, the goal is to have a policy for dynamically operating arms that each yield stochastic rewards with unknown means. The key metric of interest is regret, defined as the gap between the expected total reward accumulated by an omniscient player that knows the reward means for each arm, and the expected total reward accumulated by the given policy. The policies presented in prior work have storage, computation and regret all growing linearly with the number of arms, which is not scalable when the number of arms is large. We consider in this work a broad class of multi-armed bandits with dependent arms that yield rewards as a linear combination of a set of unknown parameters. For this general framework, we present efficient policies that are shown to achieve regret that grows logarithmically with time, and polynomially in the number of unknown parameters (even though the number of dependent arms may grow exponentially). Furthermore, these policies only require storage that grows linearly in the number of unknown parameters. We show that this generalization is broadly applicable and useful for many interesting tasks in networks that can be formulated as tractable combinatorial optimization problems with linear objective functions, such as maximum weight matching, shortest path, and minimum spanning tree computations.


💡 Research Summary

The paper tackles a fundamental scalability limitation of classic multi‑armed bandit (MAB) formulations, where each arm is treated as an independent stochastic source. In many networked systems the number of feasible actions (paths, matchings, spanning trees, etc.) grows combinatorially, yet the underlying uncertainty can be captured by a modest set of unknown parameters – for example, link delays, channel qualities, or edge costs. The authors formalize this situation as a linear‑reward combinatorial bandit: each arm a is associated with a known feature vector x(a)∈ℝ^d, and the stochastic reward is r_t(a)=θ·x(a)+η_t, where θ∈ℝ^d is an unknown parameter vector and η_t is sub‑Gaussian noise. The key insight is that regret can be bounded in terms of the dimension d of the parameter space rather than the total number of arms |A|, which may be exponential in d.
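The reward model above can be made concrete with a small sketch. All numbers, names (`expected_reward`, `pull`), and the choice of Gaussian noise below are invented for illustration; they are not from the paper.

```python
import random

# Toy instance of the linear-reward model: d unknown parameters
# (e.g. link delays) and arms described by known binary feature vectors.
d = 4
theta = [0.8, 0.1, 0.5, 0.3]      # the unknown parameter vector

def expected_reward(x, theta):
    """Mean reward of arm a: the inner product theta . x(a)."""
    return sum(t_i * x_i for t_i, x_i in zip(theta, x))

def pull(x, theta, sigma=0.1):
    """One stochastic reward: theta . x(a) plus Gaussian noise eta_t."""
    return expected_reward(x, theta) + random.gauss(0.0, sigma)

# Two example arms, e.g. two paths selecting different subsets of links.
x_a = [1, 1, 0, 0]
x_b = [0, 0, 1, 1]
gap = expected_reward(x_a, theta) - expected_reward(x_b, theta)
```

Even when the number of arms (subsets of the d components) is exponential in d, the learner only ever needs to estimate the d entries of theta.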

Two algorithms are proposed. The first, UCB‑L (Upper Confidence Bound for Linear rewards), maintains a regularized least‑squares estimate θ̂_t and a covariance matrix V_t. For each arm it computes an optimistic upper confidence bound
U_t(a) = θ̂_t·x(a) + α √(x(a)ᵀ V_t⁻¹ x(a))
and selects the arm with the largest bound. The second, TS‑L (Thompson Sampling for Linear rewards), adopts a Bayesian view: it places a Gaussian prior on θ, updates the posterior after each observation, draws a sample θ̃_t from the posterior, and chooses the arm maximizing θ̃_t·x(a). Both algorithms reduce the per‑round decision to solving a standard combinatorial optimization problem (e.g., maximum‑weight matching, shortest path, minimum spanning tree) with the current optimistic or sampled weight vector. Consequently, the computational overhead beyond the underlying optimization routine is only O(d²) for matrix updates and O(d) for evaluating the confidence term.
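A minimal, dependency-free sketch of this optimistic-index computation for d = 2. The constants α and λ are illustrative choices, not the paper's tuned values, and a real implementation would use a linear-algebra library rather than a hand-rolled 2×2 inverse.

```python
import math

alpha, lam = 1.0, 1.0                    # illustrative constants
V = [[lam, 0.0], [0.0, lam]]             # V_t = lam*I + sum_s x_s x_s^T
b = [0.0, 0.0]                           # sum_s r_s x_s

def inv2(M):
    """Inverse of a 2x2 matrix."""
    (m00, m01), (m10, m11) = M
    det = m00 * m11 - m01 * m10
    return [[m11 / det, -m01 / det], [-m10 / det, m00 / det]]

def matvec(M, v):
    return [M[0][0] * v[0] + M[0][1] * v[1],
            M[1][0] * v[0] + M[1][1] * v[1]]

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

def ucb_index(x):
    """Optimistic index: theta_hat . x + alpha * sqrt(x^T V^{-1} x)."""
    Vinv = inv2(V)
    theta_hat = matvec(Vinv, b)          # regularized least squares
    return dot(theta_hat, x) + alpha * math.sqrt(dot(x, matvec(Vinv, x)))

def update(x, r):
    """Incorporate an observed reward r for feature vector x."""
    for i in range(2):
        for j in range(2):
            V[i][j] += x[i] * x[j]
        b[i] += r * x[i]

# One round: pick the arm with the largest optimistic index, then update.
arms = [[1.0, 0.0], [0.0, 1.0]]
chosen = max(arms, key=ucb_index)
update(chosen, 0.7)                      # pretend reward 0.7 was observed
```

For a combinatorial arm set, the explicit `max` over enumerated arms would be replaced by the oracle call (matching, shortest path, spanning tree) run on the optimistic weights.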

The theoretical contribution is a regret analysis that yields logarithmic growth in time and polynomial dependence on d. Under standard assumptions (bounded feature vectors, sub‑Gaussian noise, and a regularization parameter λ>0), the authors prove that with high probability the true θ lies inside the confidence ellipsoid defined by V_t. This “optimism in the face of uncertainty” principle guarantees that the instantaneous regret is bounded by the width of the ellipsoid along the chosen arm’s feature direction. Summing over T rounds and applying the log‑determinant (elliptical‑potential) argument, they obtain a regret bound of order O(d log T) for UCB‑L; a similar bound holds for TS‑L via standard Bayesian regret techniques. Importantly, the bound does not depend on |A|, demonstrating that the exponential explosion of arms does not affect performance as long as the linear structure holds.
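Schematically, this follows the generic optimism-based analysis for linear bandits; the sketch below uses generic constants (L bounds the feature norms ‖x(a)‖ ≤ L) and may differ from the paper's exact statement.

```latex
% High-probability optimism bounds each round's regret by the
% confidence width along the chosen arm's feature direction:
\mathrm{regret}_t
  = \theta^{\top}x(a^{*}) - \theta^{\top}x(a_t)
  \le U_t(a_t) - \theta^{\top}x(a_t)
  \le 2\alpha\sqrt{x(a_t)^{\top} V_t^{-1} x(a_t)}.
% Summing the widths uses the log-determinant (elliptical potential)
% inequality, with \|x(a)\| \le L:
\sum_{t=1}^{T}\min\!\bigl(1,\; x(a_t)^{\top} V_t^{-1} x(a_t)\bigr)
  \le 2\log\frac{\det V_T}{\det(\lambda I)}
  \le 2d\,\log\!\Bigl(1+\frac{T L^{2}}{\lambda d}\Bigr).
```

The second inequality is where the dimension d enters in place of the arm count |A|.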

The paper then illustrates the framework on three canonical network problems:

  1. Maximum‑weight bipartite matching – each edge’s weight is an unknown parameter; a matching’s total reward is the sum of selected edges. Feature vectors are binary indicators of edge inclusion. Experiments show that both UCB‑L and TS‑L achieve regret reductions of an order of magnitude compared with naïve K‑armed bandit approaches that treat each matching as an independent arm.

  2. Shortest‑path routing – link delays are unknown; the delay of a path is the linear sum of its constituent links. By embedding Dijkstra’s algorithm as the oracle that solves the optimistic linear program, the policies quickly learn the delay parameters and converge to near‑optimal routing with regret scaling as O(log T).

  3. Minimum‑spanning‑tree construction – edge costs are unknown; the tree cost is linear in the selected edges. Using Kruskal’s algorithm as the underlying optimizer, the policies again exhibit logarithmic regret and maintain linear storage in d (the number of distinct edges).
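For instance, the shortest-path oracle step is just an ordinary Dijkstra run on whatever per-link weight estimates the learner currently feeds it. The tiny network and delay values below are invented for illustration.

```python
import heapq

def dijkstra(graph, src, dst):
    """Return (total delay, path) for the min-delay src->dst path."""
    pq = [(0.0, src, [src])]       # priority queue of (cost, node, path)
    seen = set()
    while pq:
        cost, node, path = heapq.heappop(pq)
        if node == dst:
            return cost, path
        if node in seen:
            continue
        seen.add(node)
        for nbr, w in graph.get(node, []):
            if nbr not in seen:
                heapq.heappush(pq, (cost + w, nbr, path + [nbr]))
    return float("inf"), []

# Current (optimistic or sampled) delay estimates for a 4-node network.
est = {"A": [("B", 1.0), ("C", 4.0)],
       "B": [("C", 2.0), ("D", 6.0)],
       "C": [("D", 3.0)]}
best_cost, best_path = dijkstra(est, "A", "D")   # 6.0 via A-B-C-D
```

The learning layer only changes the edge weights between rounds; the oracle itself is the unmodified deterministic algorithm.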

Across all experiments, storage requirements are limited to the d‑dimensional estimate θ̂_t and the d×d covariance matrix, i.e., O(d²) memory, regardless of the exponential number of feasible arms. The per‑round computational cost is dominated by the combinatorial oracle, which is unchanged from the deterministic version of the problem, making the approach practical for large‑scale networks.
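The O(d²) per-round matrix-update cost rests on the fact that V grows by a rank-one term each round, so V⁻¹ can be maintained directly with the Sherman–Morrison identity instead of being re-inverted. A plain-list sketch (assuming a symmetric Vinv, as holds for this covariance structure):

```python
def sherman_morrison(Vinv, x):
    """Return (V + x x^T)^{-1} from Vinv = V^{-1} in O(d^2) time.

    Uses (V + x x^T)^{-1} = Vinv - (Vinv x)(Vinv x)^T / (1 + x^T Vinv x),
    which is valid here because Vinv is symmetric.
    """
    d = len(x)
    Vx = [sum(Vinv[i][j] * x[j] for j in range(d)) for i in range(d)]
    denom = 1.0 + sum(x[i] * Vx[i] for i in range(d))
    return [[Vinv[i][j] - Vx[i] * Vx[j] / denom for j in range(d)]
            for i in range(d)]

# Example: starting from V = I (lam = 1), add the observation x = e_1.
Vinv1 = sherman_morrison([[1.0, 0.0], [0.0, 1.0]], [1.0, 0.0])
# Vinv1 == [[0.5, 0.0], [0.0, 1.0]], the inverse of [[2, 0], [0, 1]].
```

This avoids the O(d³) cost of a fresh inversion each round.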

The authors discuss extensions such as handling non‑linear reward functions via local linearization, adapting to time‑varying feature vectors (dynamic topologies), and developing distributed versions where each node maintains a local estimate of θ and exchanges minimal summary statistics. They also outline potential applications beyond networking, including online ad allocation, recommendation systems with combinatorial constraints, and resource management in cloud environments.

In summary, the paper presents a general, scalable solution to combinatorial bandits with linear rewards, achieving logarithmic regret, storage polynomial in the number d of unknown parameters, and computational complexity that matches existing deterministic combinatorial optimization algorithms. By decoupling the exponential arm space from the learning difficulty, it opens the door to real‑time, data‑driven decision making in large‑scale networked systems and other domains where actions can be expressed as linear functions of a small set of unknown parameters.

