Ranked bandits in metric spaces: learning optimally diverse rankings over large document collections
Most learning-to-rank research has assumed that the utility of different documents is independent, which results in learned ranking functions that return redundant results. The few approaches that avoid this have rather unsatisfyingly lacked theoretical foundations, or do not scale. We present a learning-to-rank formulation that optimizes the fraction of satisfied users, with several scalable algorithms that explicitly take document similarity and ranking context into account. Our formulation is a non-trivial common generalization of two multi-armed bandit models from the literature: “ranked bandits” (Radlinski et al., ICML 2008) and “Lipschitz bandits” (Kleinberg et al., STOC 2008). We present theoretical justifications for this approach, as well as a near-optimal algorithm. Our evaluation adds optimizations that improve empirical performance, and shows that our algorithms learn orders of magnitude more quickly than previous approaches.
💡 Research Summary
The paper tackles a fundamental shortcoming of most learning‑to‑rank (LTR) approaches: the assumption that the utility of each document is independent of the others. This assumption leads to rankings that are often redundant, presenting users with multiple items that convey essentially the same information. To overcome this, the authors propose a novel LTR formulation that explicitly incorporates document similarity and ranking context, aiming to maximize the fraction of satisfied users (SUF).
The technical contribution is a unifying model that simultaneously generalizes two well‑studied bandit frameworks. The first, “ranked bandits” (Radlinski et al., ICML 2008), models user clicks on the top‑k positions but treats each document as an independent arm. The second, “Lipschitz bandits” (Kleinberg et al., STOC 2008), assumes a metric space over arms and exploits Lipschitz continuity of the reward function. By embedding documents into a metric space (e.g., using embeddings or feature‑based distances) and defining a ranking as a set of k arms, the authors obtain a model that reduces to ranked bandits when distances are ignored and to Lipschitz bandits when the ranking structure is ignored.
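To make the Lipschitz side of this combination concrete, here is a minimal sketch of the smoothness assumption: documents live in a metric space (e.g., an embedding space), and a payoff function is Lipschitz if nearby documents have similar payoffs. All names and the toy embeddings below are illustrative, not the paper's notation.

```python
import math

def distance(x, y):
    """Euclidean distance between two (toy) document embeddings."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def is_lipschitz(payoff, docs, L=1.0):
    """Check |payoff(x) - payoff(y)| <= L * d(x, y) for all doc pairs."""
    return all(
        abs(payoff(x) - payoff(y)) <= L * distance(x, y)
        for x in docs for y in docs
    )

docs = [(0.0, 0.0), (0.3, 0.4), (1.0, 1.0)]
payoff = lambda x: 0.5 * (x[0] + x[1])   # 1-Lipschitz w.r.t. Euclidean d
print(is_lipschitz(payoff, docs))        # True
```

Under this assumption, one observed payoff constrains the payoffs of all nearby documents, which is what lets a bandit algorithm cover a large collection without pulling every arm.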
The objective is the expected SUF: each user yields a binary reward indicating whether they find at least one attractive document in the presented list. To encourage diversity, the reward is dampened whenever multiple documents from the same similarity cluster appear in the same ranking, penalizing redundancy. This design preserves the click-through model of ranked bandits while inheriting the smoothness assumptions of Lipschitz bandits.
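A small sketch of such a reward function, assuming documents carry cluster labels; the particular damping scheme and factor here are illustrative assumptions, not the paper's exact model.

```python
def list_reward(ranking, clicked, damping=0.5):
    """Binary-style reward: 1 if a satisfying doc appears, dampened
    each time its similarity cluster was already shown higher up."""
    seen_clusters = {}
    for doc_id, cluster in ranking:
        repeats = seen_clusters.get(cluster, 0)
        if doc_id in clicked:
            return damping ** repeats  # penalized if cluster repeats
        seen_clusters[cluster] = repeats + 1
    return 0.0

# A user satisfied by doc 7, which shares cluster "sports" with doc 3:
ranking = [(3, "sports"), (7, "sports"), (9, "news")]
print(list_reward(ranking, clicked={7}))  # 0.5: one earlier same-cluster doc
```

The damping makes a redundant list strictly less rewarding than a diverse one covering the same user intents, which is exactly the incentive the formulation is after.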
On the theoretical side, the paper proves that the proposed model admits a near-optimal regret bound of $\tilde O(\sqrt{T \log N})$, where $T$ is the time horizon and $N$ is the number of documents. The bound depends only logarithmically on the intrinsic dimension of the metric space, showing that the algorithm scales gracefully to high-dimensional, large-scale collections.
Algorithmically, the authors introduce a hierarchical partitioning of the metric space combined with an adaptive zooming strategy. The space is recursively split into a tree (e.g., a kd-tree); each node maintains an Upper Confidence Bound (UCB) estimate for the best ranking that can be formed from its descendants. At each round the algorithm selects the node with the highest UCB, refines it if it has not been explored sufficiently, and prunes sub-trees whose UCB falls below a threshold. This yields an $O(\log N)$ per-round computational cost, making the method suitable for real-time search engines. Additional engineering optimizations—such as lazy updates, batch UCB recomputation, and parallel tree traversal—further improve empirical speed.
The experimental evaluation uses two large-scale datasets: a TREC web collection with hundreds of thousands of pages and real click logs, and a news-article corpus containing over a million items and session data. The proposed algorithms are compared against the original ranked bandit method, diversity-aware LTR baselines (e.g., xQuAD), and strong pointwise LTR models such as LambdaMART. Metrics include click-through rate (CTR), NDCG@k, and α-NDCG, which explicitly measures diversity. Results show that the new method reaches near-optimal performance after only a few hundred rounds, achieving 30–40% higher CTR and α-NDCG than the baselines, and learning an order of magnitude faster in the cold-start regime.
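For readers unfamiliar with α-NDCG, a compact sketch follows: each document covers a set of subtopics, a subtopic's gain is discounted by $(1-\alpha)$ for every earlier document that already covered it, and the score is normalized by the best achievable ordering. The inputs are toy values; brute-force normalization over permutations is only sensible for short lists.

```python
import math
from itertools import permutations

def alpha_dcg(ranking, alpha=0.5):
    """ranking: list of sets of subtopics covered by each document."""
    seen, score = {}, 0.0
    for rank, subtopics in enumerate(ranking, start=1):
        gain = sum((1 - alpha) ** seen.get(s, 0) for s in subtopics)
        score += gain / math.log2(rank + 1)    # positional discount
        for s in subtopics:
            seen[s] = seen.get(s, 0) + 1
    return score

def alpha_ndcg(ranking, alpha=0.5):
    """Normalize by the best permutation (fine for short lists)."""
    ideal = max(alpha_dcg(list(p), alpha) for p in permutations(ranking))
    return alpha_dcg(ranking, alpha) / ideal if ideal else 0.0

redundant = [{"a"}, {"a"}, {"b"}]   # second doc repeats subtopic "a"
diverse   = [{"a"}, {"b"}, {"a"}]   # covers subtopic "b" earlier
print(alpha_ndcg(diverse) > alpha_ndcg(redundant))  # True
```

Unlike plain NDCG@k, this metric rewards covering a new subtopic more than repeating one already shown, which is why it is the natural yardstick for diversity-aware rankers.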
Finally, the authors discuss practical deployment considerations. The hierarchical tree can be stored in memory with modest overhead, and the per‑round update requires only logarithmic time. Hyper‑parameters such as the diversity damping factor and UCB confidence width are tuned online via simple validation on a hold‑out stream, eliminating the need for costly offline grid searches.
In summary, the paper delivers a theoretically sound, computationally efficient, and empirically validated solution to the problem of redundant rankings. By marrying the contextual richness of metric‑based similarity with the exploration‑exploitation machinery of bandits, it opens a clear path toward scalable, diversity‑aware ranking systems for modern information retrieval and recommendation platforms.