Expanding the Capabilities of Reinforcement Learning via Text Feedback
The success of RL for LLM post-training stems from an unreasonably uninformative source: a single bit of information per rollout, delivered as a binary reward or preference label. At the other extreme, distillation offers dense supervision but requires demonstrations, which are costly and difficult to scale. We study text feedback as an intermediate signal: richer than scalar rewards, yet cheaper than complete demonstrations. Textual feedback is a natural mode of human interaction and is already abundant in many real-world settings, where users, annotators, and automated judges routinely critique LLM outputs. To leverage text feedback at scale, we formalize a multi-turn RL setup, RL from Text Feedback (RLTF), in which text feedback is available during training but not at inference. Models must therefore internalize the feedback in order to improve their test-time single-turn performance. To this end, we propose two methods: Self Distillation (RLTF-SD), which trains the single-turn policy to match its own feedback-conditioned second-turn generations; and Feedback Modeling (RLTF-FM), which predicts the feedback as an auxiliary objective. We provide theoretical analysis of both methods and evaluate them empirically on reasoning puzzles, competition math, and creative writing tasks. Our results show that both methods consistently outperform strong baselines across benchmarks, highlighting the potential of RL with an additional source of rich supervision at scale.
💡 Research Summary
The paper addresses a fundamental limitation of current large language model (LLM) post‑training pipelines, which typically rely on extremely sparse supervision such as binary rewards or preference labels. While such signals enable reinforcement learning at scale, they convey almost no information about why a model’s output failed or how to fix it. At the opposite extreme, distillation or imitation learning provides dense supervision in the form of demonstrations, but collecting high‑quality demonstrations for frontier models is prohibitively expensive. The authors propose to exploit textual feedback—natural‑language critiques, error explanations, or suggested corrections—as an intermediate source of supervision that is richer than scalar rewards yet cheaper than full demonstrations.
RL from Text Feedback (RLTF) Framework
The authors formalize a multi-turn interaction: given an initial prompt $x_0$, a policy $\pi$ generates a first-turn output $y_0$. A feedback provider $M$ (human annotator, automated judge, etc.) returns a textual critique $c_0$. The critique is concatenated to the dialogue to form an augmented prompt $x_1 = f(x_0, y_0, c_0)$. The same policy then produces a revised output $y_1$. The reward function $R$ is evaluated only against the original prompt $x_0$ (e.g., correctness of the answer). The standard multi-turn RL objective would maximize the sum of rewards across turns, but this does not guarantee improvement of the single-turn performance that matters at inference time, when feedback is unavailable.
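The two-turn interaction above can be sketched as a short rollout loop. This is a minimal illustration with toy callable stand-ins for the policy, feedback provider, and reward function; the dialogue-concatenation format and function names are assumptions for illustration, not the paper's implementation.

```python
# Hedged sketch of one RLTF trajectory tau = (x0, y0, c0, x1, y1),
# assuming the policy, feedback provider M, and reward R are plain callables.

def collect_rltf_trajectory(x0, policy, feedback_provider, reward_fn):
    """Roll out one two-turn RLTF interaction and record both rewards."""
    y0 = policy(x0)                 # first-turn output from the prompt alone
    c0 = feedback_provider(x0, y0)  # textual critique of the first turn
    # Augmented prompt x1 = f(x0, y0, c0); the template here is illustrative.
    x1 = f"{x0}\n[assistant]: {y0}\n[feedback]: {c0}"
    y1 = policy(x1)                 # revised output conditioned on feedback
    # The reward is evaluated against the ORIGINAL prompt only.
    return {"x0": x0, "y0": y0, "c0": c0, "x1": x1, "y1": y1,
            "r0": reward_fn(x0, y0), "r1": reward_fn(x0, y1)}
```

With toy components, the second-turn reward improves exactly when the policy successfully incorporates the critique, which is the quantity the multi-turn objective optimizes but the single-turn objective does not see.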
Thus the central research question becomes: how can the feedback-augmented trajectories $\tau = (x_0, y_0, c_0, x_1, y_1)$ collected during training be used to improve the single-turn expected reward $J(\pi) = \mathbb{E}_{y_0 \sim \pi(\cdot \mid x_0)}\left[R(x_0, y_0)\right]$?
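One way the Self Distillation idea from the abstract could be instantiated is to treat the feedback-conditioned second-turn output $y_1$ as a pseudo-demonstration and minimize its negative log-likelihood under the single-turn (no-feedback) policy. The sketch below is an assumed toy formulation with a token-level log-probability callable, not the paper's training code.

```python
import math

# Hedged sketch of an RLTF-SD-style objective: push the single-turn policy
# pi(. | x0) toward its own feedback-conditioned second-turn generation y1
# by minimizing -log pi(y1 | x0), summed over tokens. The `policy_logprob`
# interface is a hypothetical stand-in for a real model's token scores.

def self_distillation_loss(policy_logprob, x0, y1_tokens):
    """Total negative log-likelihood of the second-turn tokens y1
    under the single-turn policy conditioned only on x0."""
    return -sum(policy_logprob(x0, tok) for tok in y1_tokens)
```

Minimizing this loss moves probability mass onto the revised answer without the critique ever appearing in the conditioning context, which is exactly the internalization the setup demands at inference time.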