Expanding the Capabilities of Reinforcement Learning via Text Feedback
The success of RL for LLM post-training stems from an unreasonably uninformative source: a single bit of information per rollout, delivered as a binary reward or preference label. At the other extreme, distillation offers dense supervision but requires demonstrations, which are costly and difficult to scale. We study text feedback as an intermediate signal: richer than scalar rewards, yet cheaper than complete demonstrations. Textual feedback is a natural mode of human interaction and is already abundant in many real-world settings, where users, annotators, and automated judges routinely critique LLM outputs. To leverage text feedback at scale, we formalize a multi-turn RL setup, RL from Text Feedback (RLTF), in which text feedback is available during training but not at inference. Models must therefore internalize the feedback in order to improve their test-time single-turn performance. To this end, we propose two methods: Self Distillation (RLTF-SD), which trains the single-turn policy to match its own feedback-conditioned second-turn generations; and Feedback Modeling (RLTF-FM), which predicts the feedback as an auxiliary objective. We provide theoretical analysis of both methods and evaluate them empirically on reasoning puzzles, competition math, and creative writing tasks. Our results show that both methods consistently outperform strong baselines across benchmarks, highlighting the potential of RL with an additional source of rich supervision at scale.
💡 Research Summary
The paper addresses a fundamental limitation of current large language model (LLM) post‑training pipelines, which typically rely on extremely sparse supervision such as binary rewards or preference labels. While such signals enable reinforcement learning at scale, they convey almost no information about why a model’s output failed or how to fix it. At the opposite extreme, distillation or imitation learning provides dense supervision in the form of demonstrations, but collecting high‑quality demonstrations for frontier models is prohibitively expensive. The authors propose to exploit textual feedback—natural‑language critiques, error explanations, or suggested corrections—as an intermediate source of supervision that is richer than scalar rewards yet cheaper than full demonstrations.
RL from Text Feedback (RLTF) Framework
The authors formalize a multi-turn interaction: given an initial prompt $x_0$, a policy $\pi$ generates a first-turn output $y_0$. A feedback provider $M$ (human annotator, automated judge, etc.) returns a textual critique $c_0$. The critique is concatenated to the dialogue to form an augmented prompt $x_1 = f(x_0, y_0, c_0)$. The same policy then produces a revised output $y_1$. The reward function $R$ is evaluated only against the original prompt $x_0$ (e.g., correctness of the answer). The standard multi-turn RL objective would maximize the sum of rewards across turns, but this does not guarantee improvement of the single-turn performance that matters at inference time, when feedback is unavailable.
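The two-turn interaction above can be sketched as a short rollout loop. This is a minimal illustration with toy callable stand-ins for the policy, feedback provider, and reward function; the dialogue-concatenation format and function names are assumptions for illustration, not the paper's implementation.

```python
# Hedged sketch of one RLTF trajectory tau = (x0, y0, c0, x1, y1),
# assuming the policy, feedback provider M, and reward R are plain callables.

def collect_rltf_trajectory(x0, policy, feedback_provider, reward_fn):
    """Roll out one two-turn RLTF interaction and record both rewards."""
    y0 = policy(x0)                 # first-turn output from the prompt alone
    c0 = feedback_provider(x0, y0)  # textual critique of the first turn
    # Augmented prompt x1 = f(x0, y0, c0); the template here is illustrative.
    x1 = f"{x0}\n[assistant]: {y0}\n[feedback]: {c0}"
    y1 = policy(x1)                 # revised output conditioned on feedback
    # The reward is evaluated against the ORIGINAL prompt only.
    return {"x0": x0, "y0": y0, "c0": c0, "x1": x1, "y1": y1,
            "r0": reward_fn(x0, y0), "r1": reward_fn(x0, y1)}
```

With toy components, the second-turn reward improves exactly when the policy successfully incorporates the critique, which is the quantity the multi-turn objective optimizes but the single-turn objective does not see.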
Thus the central research question becomes: how can the feedback-augmented trajectories $\tau = (x_0, y_0, c_0, x_1, y_1)$ collected during training be used to improve the single-turn expected reward $J(\pi) = \mathbb{E}_{y_0 \sim \pi(\cdot \mid x_0)}\left[R(x_0, y_0)\right]$?
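One way the Self Distillation idea from the abstract could be instantiated is to treat the feedback-conditioned second-turn output $y_1$ as a pseudo-demonstration and minimize its negative log-likelihood under the single-turn (no-feedback) policy. The sketch below is an assumed toy formulation with a token-level log-probability callable, not the paper's training code.

```python
import math

# Hedged sketch of an RLTF-SD-style objective: push the single-turn policy
# pi(. | x0) toward its own feedback-conditioned second-turn generation y1
# by minimizing -log pi(y1 | x0), summed over tokens. The `policy_logprob`
# interface is a hypothetical stand-in for a real model's token scores.

def self_distillation_loss(policy_logprob, x0, y1_tokens):
    """Total negative log-likelihood of the second-turn tokens y1
    under the single-turn policy conditioned only on x0."""
    return -sum(policy_logprob(x0, tok) for tok in y1_tokens)
```

Minimizing this loss moves probability mass onto the revised answer without the critique ever appearing in the conditioning context, which is exactly the internalization the setup demands at inference time.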