Bandits with Single-Peaked Preferences and Limited Resources
We study an online stochastic matching problem in which an algorithm sequentially matches $U$ users to $K$ arms, aiming to maximize cumulative reward over $T$ rounds under budget constraints. Without structural assumptions, computing the optimal matching is NP-hard, making online learning computationally infeasible. To overcome this barrier, we focus on single-peaked preferences – a well-established structure in social choice theory in which users’ preferences are unimodal with respect to a common order over arms. We devise an efficient algorithm for the offline budgeted matching problem and leverage it to obtain an efficient online algorithm with a regret of $\tilde O(UKT^{2/3})$. Our approach relies on a novel PQ tree-based order approximation method. If the single-peaked structure is known, we develop an efficient UCB-like algorithm that achieves a regret bound of $\tilde O(U\sqrt{TK})$.
💡 Research Summary
The paper tackles an online stochastic matching problem with per‑round budget constraints, where a learner must repeatedly assign U users to K arms over T rounds while respecting a total cost B per round. In the general setting, computing the optimal matching is NP‑hard (Theorem 1), making sublinear‑regret online learning infeasible without an exponential‑time oracle. To overcome this computational barrier, the authors impose a structural assumption on the expected reward matrix Θ: the preferences of all users are single‑peaked with respect to a common ordering of the arms. Formally, there exists a total order ≺ such that after permuting the columns of Θ according to ≺, each row becomes unimodal (a perfectly single‑peaked, PSP, matrix). This notion originates from social‑choice theory and is known to simplify many combinatorial problems.
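The PSP condition can be checked directly from its definition: permute the columns of Θ according to the candidate order and test each row for unimodality. A minimal sketch (the function names and NumPy representation are illustrative, not from the paper):

```python
import numpy as np

def is_unimodal(row):
    # A row is unimodal if it is non-decreasing up to its peak
    # and non-increasing afterwards.
    peak = int(np.argmax(row))
    return (np.all(np.diff(row[:peak + 1]) >= 0)
            and np.all(np.diff(row[peak:]) <= 0))

def is_psp(theta, order):
    # Check whether the reward matrix theta is perfectly single-peaked
    # under the given total order over arms (a column permutation).
    permuted = theta[:, order]
    return all(is_unimodal(row) for row in permuted)

# Example: two users, three arms; under the order (0, 1, 2) every row is unimodal.
theta = np.array([[0.2, 0.9, 0.4],
                  [0.8, 0.5, 0.1]])
print(is_psp(theta, [0, 1, 2]))                        # True
print(is_psp(np.array([[0.9, 0.1, 0.8]]), [0, 1, 2]))  # False
```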
The first technical contribution is an offline algorithm, SP‑Matching, that solves the budgeted matching problem exactly in polynomial time for any PSP matrix. The key insight (Lemma 4) is that, given a fixed set S of selected arms, the optimal assignment for each user is the arm in S closest to that user’s peak. Using this “closest‑to‑peak” property, the authors design a dynamic‑programming procedure that runs in O(K²B + K²U) time, far faster than the exponential time required for arbitrary matrices.
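The “closest-to-peak” property from Lemma 4 can be sketched as follows, assuming arms are identified with indices and the order and peaks are given explicitly (all names here are illustrative, not the paper's code):

```python
def closest_to_peak_assignment(peaks, selected, order):
    # Sketch of the closest-to-peak property: given a fixed set of
    # selected arms, each user is matched to the selected arm whose
    # position in the single-peaked order is closest to the position
    # of that user's peak arm (ties broken by arm index here).
    pos = {arm: i for i, arm in enumerate(order)}  # arm -> position in order
    assignment = {}
    for user, peak in enumerate(peaks):
        p = pos[peak]
        _, best = min((abs(pos[a] - p), a) for a in selected)
        assignment[user] = best
    return assignment

# Demo: arms ordered as 2 ≺ 0 ≺ 1 ≺ 3; user 0 peaks at arm 0, user 1 at arm 3;
# only arms {2, 3} are selected.
print(closest_to_peak_assignment([0, 3], {2, 3}, [2, 0, 1, 3]))  # {0: 2, 1: 3}
```

This reduces the matching step to a one-dimensional nearest-neighbor lookup per user, which is what makes the dynamic program over selected-arm sets tractable.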
Building on this offline solver, the paper presents two online algorithms:
- MvM (Match‑via‑Maximal) – assumes the single‑peaked order and each user’s peak index are known in advance, though the exact utilities are not. At each round the algorithm constructs a confidence set of plausible reward matrices, extracts a maximal matrix that element‑wise dominates all matrices in the set, and feeds this optimistic matrix to SP‑Matching. This UCB‑style approach achieves a regret of $\tilde O(U\sqrt{TK})$, matching the lower bound for general preferences (Theorem 3) while remaining computationally efficient because SP‑Matching replaces the NP‑hard oracle.
- EMC (Explore‑then‑Commit) – handles the fully unknown case where neither the order nor the peaks are given. EMC first spends an exploration phase uniformly sampling arms to obtain empirical reward estimates. It then runs Extract‑Order, a novel procedure based on PQ‑trees, to recover an approximate single‑peaked order from these estimates. The estimated rewards are projected onto the nearest PSP matrix, and the offline SP‑Matching algorithm is invoked to obtain a matching that is committed to for the remaining rounds. Careful concentration and approximation analysis yields a regret bound of $\tilde O(UKT^{2/3})$. All steps run in time polynomial in U, K, and B.
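MvM's optimistic step, building an element-wise upper-confidence matrix that dominates every plausible reward matrix in the confidence set, can be sketched roughly as below. The Hoeffding-style radii and the [0, 1] reward range are assumptions made for illustration; the paper's exact confidence set may differ:

```python
import numpy as np

def optimistic_matrix(sums, counts, t, delta=0.05):
    # Sketch of an element-wise upper-confidence (maximal) matrix:
    # empirical means plus Hoeffding-style confidence radii, clipped to
    # [0, 1] under the assumption that rewards are bounded in [0, 1].
    counts = np.maximum(counts, 1)  # avoid division by zero before first pull
    means = sums / counts
    radius = np.sqrt(np.log(2 * t / delta) / (2 * counts))
    return np.minimum(means + radius, 1.0)

# Demo: per-(user, arm) reward sums and pull counts after some exploration.
sums = np.array([[1.0, 0.0], [2.0, 1.0]])
counts = np.array([[4, 0], [5, 2]])
print(optimistic_matrix(sums, counts, t=10))
```

The resulting optimistic matrix is then passed to SP‑Matching in place of the unknown Θ, which is what keeps the per-round computation polynomial.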
Statistically, the authors prove that the single‑peaked assumption does not simplify learning: even when the order and peaks are known, any algorithm suffers a regret lower bound of $\Omega(\max\{U\sqrt{T},\sqrt{TK}\})$; when the peaks are unknown the bound becomes $\Omega(U\sqrt{TK})$ (Theorem 3). Thus the advantage of the SP structure is purely computational.
The paper situates its contributions within related work on combinatorial bandits, budgeted matching, and single‑peaked preferences. Prior approaches to computationally hard bandit problems resort to α‑regret, comparing against efficiently computable approximations rather than the true optimum. By exploiting single‑peakedness, this work achieves standard regret guarantees without sacrificing optimality, and introduces a PQ‑tree based order extraction technique that extends classic single‑peaked recognition algorithms to the stochastic, noisy setting.
In summary, the authors deliver:
- a polynomial‑time exact offline solver for budgeted matching under single‑peaked preferences;
- an efficient UCB‑type online algorithm (MvM) with optimal $\tilde O(U\sqrt{TK})$ regret when the SP order is known;
- an explore‑then‑commit algorithm (EMC) that learns the order and attains $\tilde O(UKT^{2/3})$ regret when the structure is unknown.
These results bridge the gap between theoretical optimality and practical tractability for large‑scale recommendation, advertising, or resource‑allocation systems where users exhibit unimodal preferences along a common attribute spectrum.