Competing Bandits in Matching Markets
Stable matching, a classical model for two-sided markets, has long been studied with little consideration for how each side’s preferences are learned. With the advent of massive online markets powered by data-driven matching platforms, it has become necessary to better understand the interplay between learning and market objectives. We propose a statistical learning model in which one side of the market does not have a priori knowledge about its preferences for the other side and is required to learn these from stochastic rewards. Our model extends the standard multi-armed bandits framework to multiple players, with the added feature that arms have preferences over players. We study both centralized and decentralized approaches to this problem and show surprising exploration-exploitation trade-offs compared to the single player multi-armed bandits setting.
💡 Research Summary
The paper introduces a novel framework that merges multi‑armed bandits (MAB) with two‑sided stable matching, targeting markets where one side (the “agents”) does not know its preferences for the other side (the “arms”) and must learn them through stochastic rewards. Each arm has a known, fixed ranking over agents, and when multiple agents pull the same arm only the highest‑ranked agent receives the reward, reflecting competition for scarce resources. The authors formalize two notions of regret based on stable matchings: agent‑optimal regret (the loss relative to the best stable matching for the agents) and agent‑pessimal regret (the loss relative to the worst stable matching for the agents). These regret definitions capture the interplay between learning, competition, and the preferences of the arms, which is absent in traditional single‑player or multi‑player bandit models.
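In symbols (notation chosen here for concreteness; the paper's own symbols may differ): let $\mu_{i,j}$ be agent $i$'s mean reward for arm $j$, let $\overline{m}(i)$ and $\underline{m}(i)$ be agent $i$'s partners under the agent-optimal and agent-pessimal stable matchings, and let $X_{i,t}$ be agent $i$'s reward at round $t$ (zero when it loses a conflict). The two regret notions can then be written as

\[
\overline{R}_i(T) = T\,\mu_{i,\overline{m}(i)} - \mathbb{E}\Big[\sum_{t=1}^{T} X_{i,t}\Big],
\qquad
\underline{R}_i(T) = T\,\mu_{i,\underline{m}(i)} - \mathbb{E}\Big[\sum_{t=1}^{T} X_{i,t}\Big].
\]

Since the agent-pessimal matching is a weaker baseline, $\underline{R}_i(T) \le \overline{R}_i(T)$ always holds.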
Two algorithmic settings are examined. In the centralized setting, agents report a ranking of arms to a platform at each round; the platform then computes a conflict‑free matching using the Gale‑Shapley (GS) algorithm. The authors propose two concrete strategies:
- Explore‑then‑Commit (ETC) – During an initial exploration phase of length hK, each agent samples every arm exactly h times, guaranteeing unbiased empirical mean estimates. After exploration, agents rank arms by their empirical means and the platform runs GS to produce the agent‑optimal stable matching with respect to the reported rankings, which is then fixed for the remainder of the horizon. By bounding the probability that an agent submits an incorrect ranking (via Hoeffding's inequality) and taking a union bound over agents and arms, the paper shows an expected agent‑optimal regret bound of the Hoeffding‑style form, up to constants,

\[
\overline{R}_i(T) \;\lesssim\; hK \;+\; T\,NK\,\exp\!\big(-h\Delta^2/2\big),
\]

where $\Delta$ is the minimum gap between an agent's mean rewards across arms; choosing $h \propto \log T / \Delta^2$ yields regret logarithmic in $T$.
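As an illustration, here is a minimal Python sketch of the centralized ETC strategy. The round-robin exploration schedule, the Gaussian noise model, and all function names are illustrative assumptions, not the paper's implementation; it assumes equal numbers of agents and arms.

```python
import random

def gale_shapley(agent_prefs, arm_prefs):
    """Agent-proposing Gale-Shapley; returns dict agent -> arm.

    agent_prefs[i]: agent i's ranking of arms, best first.
    arm_prefs[j][i]: arm j's rank of agent i (lower = preferred).
    Assumes equal numbers of agents and arms.
    """
    n = len(agent_prefs)
    next_choice = [0] * n          # index of each agent's next proposal
    match_of_arm = {}              # arm -> agent it currently holds
    free = list(range(n))
    while free:
        i = free.pop()
        j = agent_prefs[i][next_choice[i]]
        next_choice[i] += 1
        if j not in match_of_arm:
            match_of_arm[j] = i    # arm was unmatched: accept
        else:
            incumbent = match_of_arm[j]
            if arm_prefs[j][i] < arm_prefs[j][incumbent]:
                match_of_arm[j] = i          # arm trades up
                free.append(incumbent)
            else:
                free.append(i)               # proposal rejected
    return {agent: arm for arm, agent in match_of_arm.items()}

def explore_then_commit(true_means, arm_prefs, h, rng):
    """Centralized ETC sketch: explore h times per (agent, arm), then
    rank arms by empirical means and commit to the GS matching.

    true_means[i][j]: agent i's mean reward for arm j (unknown to agents).
    Rewards are simulated as mean + standard Gaussian noise (an assumption).
    """
    n, k = len(true_means), len(true_means[0])
    est = [[0.0] * k for _ in range(n)]
    # Round-robin schedule: in each sub-round, agent i pulls arm (i+shift)%k,
    # so assignments are conflict-free and every agent sees every arm h times.
    for _ in range(h):
        for shift in range(k):
            for i in range(n):
                j = (i + shift) % k
                est[i][j] += (true_means[i][j] + rng.gauss(0, 1)) / h
    # Commit: report rankings by empirical mean; play this matching forever after.
    agent_prefs = [sorted(range(k), key=lambda j, i=i: -est[i][j])
                   for i in range(n)]
    return gale_shapley(agent_prefs, arm_prefs)
```

For example, with two agents whose true favorite arms differ, a long enough exploration phase (large `h`) makes the committed matching coincide with the agent-optimal stable matching with high probability, mirroring the regret bound above.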