Online Learning with Preference Feedback

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

We propose a new online learning model for learning with preference feedback. The model is especially suited for applications like web search and recommender systems, where preference data is readily available from implicit user feedback (e.g. clicks). In particular, at each time step a potentially structured object (e.g. a ranking) is presented to the user in response to a context (e.g. query), providing him or her with some unobserved amount of utility. As feedback the algorithm receives an improved object that would have provided higher utility. We propose a learning algorithm with provable regret bounds for this online learning setting and demonstrate its effectiveness on a web-search application. The new learning model also applies to many other interactive learning problems and admits several interesting extensions.


💡 Research Summary

The paper introduces a novel online learning framework tailored to settings where feedback comes in the form of user preferences rather than explicit numerical rewards. This situation is common in web search and recommender systems, where users implicitly indicate their preferences through clicks, selections, or other interactions. The authors formalize the “online preference learning” model: at each round $t$ a context $x_t$ is observed, the algorithm proposes a structured object $y_t$ (e.g., a ranking), and the user returns an improved object $\bar y_t$ that would have yielded higher utility. The true utility function $U(x, y)$ is assumed linear, $U(x, y) = w^* \cdot \phi(x, y)$, with an unknown parameter vector $w^*$ and a bounded joint feature map $\phi$.

A key assumption is α‑informative feedback: the utility gain of the returned object over the algorithm’s proposal equals a fraction $\alpha$ ($0 < \alpha \le 1$) of the maximal possible gain, minus a non‑negative slack $\xi_t$ that captures noise. Under this condition the authors propose the Preference Perceptron algorithm, a simple perceptron‑style update: initialize $w_1 = 0$, predict $y_t = \arg\max_{y} w_t \cdot \phi(x_t, y)$, receive $\bar y_t$, then set $w_{t+1} = w_t + \phi(x_t, \bar y_t) - \phi(x_t, y_t)$. This update pushes the weight vector toward features of the better object while pulling it away from the inferior one.
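The perceptron-style update above can be sketched in a few lines. This is a minimal illustration, not the authors' reference implementation: `phi`, `candidates`, and `user_feedback` are hypothetical stand-ins, and the candidate set is assumed small enough to enumerate for the argmax.

```python
# Sketch of the Preference Perceptron (illustrative names, pure Python).

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def preference_perceptron(contexts, phi, candidates, user_feedback, dim):
    """phi(x, y) -> feature list; candidates(x) -> iterable of objects;
    user_feedback(x, y) -> improved object with higher (unobserved) utility."""
    w = [0.0] * dim                                    # w_1 = 0
    for x in contexts:
        # Predict y_t = argmax_y w_t . phi(x_t, y) over the candidate set.
        y = max(candidates(x), key=lambda c: dot(w, phi(x, c)))
        ybar = user_feedback(x, y)                     # improved object
        # Update: w_{t+1} = w_t + phi(x_t, ybar_t) - phi(x_t, y_t)
        w = [wi + fb - fy for wi, fb, fy in zip(w, phi(x, ybar), phi(x, y))]
        yield y, w

# Toy run: two candidate objects; the user always prefers the second.
demo = list(preference_perceptron(
    contexts=[None] * 5,
    phi=lambda x, y: y,                    # features are the object itself
    candidates=lambda x: [[1, 0], [0, 1]],
    user_feedback=lambda x, y: [0, 1],
    dim=2,
))
```

After the first round's correction the weight vector favors the preferred object, so all later predictions match the feedback and subsequent updates are zero.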

Theoretical analysis yields a regret bound (Theorem 1): if $\|\phi(x, y)\| \le R$ for all contexts and objects, and the feedback is α‑informative with slacks $\xi_t$, then the average regret of the Preference Perceptron satisfies

$$\mathrm{REG}_T = \frac{1}{T}\sum_{t=1}^{T}\bigl[U(x_t, y_t^*) - U(x_t, y_t)\bigr] \;\le\; \frac{1}{\alpha T}\sum_{t=1}^{T}\xi_t + \frac{2R\,\|w^*\|}{\alpha\sqrt{T}},$$

where $y_t^* = \arg\max_y U(x_t, y)$ is the optimal object for context $x_t$. Up to the average slack term, the regret vanishes at rate $O(1/\sqrt{T})$.

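Theorem 1 implies that with strictly informative feedback (no slack) the average regret shrinks on the order of $1/\sqrt{T}$. The toy simulation below sketches this behavior; everything in it (the feature map, the hidden utility vector, the candidate set) is invented for illustration and is not from the paper.

```python
# Toy simulation: average regret of the Preference Perceptron under
# strictly informative feedback (alpha = 1, zero slack). All specifics
# (w_star, phi, candidates) are invented for this example.
import random

random.seed(0)
d = 5
w_star = [1.0, -2.0, 0.5, 3.0, -1.0]   # hidden true utility (assumption)
basis = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def phi(x, y):
    # Joint feature map: elementwise product of context and object.
    return [a * b for a, b in zip(x, y)]

w = [0.0] * d
cum_regret = 0.0
T = 2000
for t in range(T):
    x = [random.uniform(0.1, 1.0) for _ in range(d)]
    y = max(basis, key=lambda c: dot(w, phi(x, c)))            # predict
    y_best = max(basis, key=lambda c: dot(w_star, phi(x, c)))  # optimal
    ybar = y_best  # strictly 1-informative feedback: user returns the best
    w = [wi + fb - fy for wi, fb, fy in zip(w, phi(x, ybar), phi(x, y))]
    cum_regret += dot(w_star, phi(x, y_best)) - dot(w_star, phi(x, y))

avg_regret = cum_regret / T
print(avg_regret)
```

Here $R \le 1$ and $\|w^*\| \approx 3.9$, so the bound guarantees an average regret below $2 \cdot 3.9 / \sqrt{2000} \approx 0.17$ after 2000 rounds, regardless of the random draws.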
