Features as Rewards: Scalable Supervision for Open-Ended Tasks via Interpretability
Language models trained on large-scale datasets have been shown to learn features that encode abstract concepts such as factuality or intent. Such features are traditionally used for test-time monitoring or steering. We present an alternative affordance: features as scalable supervision for open-ended tasks. We consider the case of hallucination reduction as a desirable, yet open-ended, behavior and design a reinforcement learning (RL) pipeline, titled RLFR (Reinforcement Learning from Feature Rewards), that uses features as reward functions. Grounded in a novel probing framework that identifies candidate hallucinated claims, our pipeline teaches a model to intervene and correct its completions when it is uncertain of their factuality. Furthermore, the pipeline enables scalable test-time compute, guided once more by our reward features. This end-to-end process, operationalized on Gemma-3-12B-IT, results in a policy that is 58% less likely to hallucinate compared to the original model (when run in tandem with our probing harness), while preserving performance on standard benchmarks. Taken together, by grounding supervision in the language of features, this paper introduces a novel paradigm in the use of interpretability for learning open-ended tasks.
💡 Research Summary
The paper introduces a novel reinforcement‑learning framework called RLFR (Reinforcement Learning from Feature Rewards) that turns internal latent features of large language models (LLMs) into dense, low‑cost reward signals for open‑ended tasks. The authors focus on hallucination reduction—a behavior that is costly to verify because it often requires external fact‑checking or LLM judges. Building on a growing body of interpretability work that shows LLMs encode abstract concepts such as factuality, intent, or harmfulness in their hidden representations, the authors propose to read out these concepts with lightweight probes and treat the probe outputs as calibrated uncertainty estimates.
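To make the "lightweight probe" idea concrete: such a probe is typically little more than a linear readout on a layer's hidden activations, squashed to a probability. The sketch below is a minimal illustration of that readout, not the paper's implementation; the function name, dimensions, and parameter values are all hypothetical toy choices.

```python
import math

def logistic_probe(activation, weights, bias):
    """Linear readout over a hidden-activation vector, squashed to [0, 1].

    A score near 1.0 is read as "this span looks supported"; near 0.0
    as "likely hallucinated". In practice the weights would be fit on
    grader-labeled activations; here they are toy values.
    """
    logit = sum(a * w for a, w in zip(activation, weights)) + bias
    return 1.0 / (1.0 + math.exp(-logit))

# Toy 4-dimensional "activation" and hypothetical probe parameters.
act = [0.2, -1.3, 0.7, 0.05]
w = [1.5, -0.4, 0.9, 0.0]
score = logistic_probe(act, w, bias=-0.1)
```

Because the readout is this simple, evaluating it costs one dot product per span, which is what makes probe outputs cheap enough to use as dense reward signals during training.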
The RLFR pipeline consists of four stages. First, a localization probe identifies spans (called “entities”) in a generated text that contain factual claims. Second, a classification probe labels each span as either a hallucination or a supported claim, using labels generated by an expensive external grader (Gemini 2.5 Pro with web search). Third, the policy (initialized from a pretrained model, Gemma‑3‑12B‑IT) is asked to intervene on each flagged span, choosing among “maintain”, “retract”, or “correct” and then generating a revised snippet. Fourth, the revised snippet is graded by the same external grader; the resulting scalar reward is then approximated by a feature‑based reward model trained to predict the grader’s output from the frozen base model’s activations. This amortizes the high cost of the grader: during training the cheap probe‑based reward replaces the expensive grader, while at test time the same features can be used for Best‑of‑N sampling to further improve output quality.
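The test-time use mentioned above, Best-of-N sampling guided by the feature reward, amounts to scoring N candidate completions with the cheap reward and keeping the best. This is a generic sketch of that selection loop, assuming nothing about the paper's actual sampler or reward model; the stand-in functions here are deliberately trivial.

```python
def best_of_n(prompt, sample_fn, reward_fn, n=8):
    """Draw n candidate completions and keep the one the cheap
    feature-based reward scores highest. sample_fn and reward_fn
    are placeholders for the policy and the probe-based reward."""
    candidates = [sample_fn(prompt) for _ in range(n)]
    return max(candidates, key=reward_fn)

# Toy stand-ins: a deterministic "sampler" cycling over canned
# completions, and a "reward" that prefers candidates naming Paris.
completions = iter(["Paris.", "Paris, the capital of France.", "Lyon."])
pick = best_of_n(
    "Capital of France?",
    sample_fn=lambda p: next(completions),
    reward_fn=lambda c: ("Paris" in c) + 0.01 * len(c),
    n=3,
)
```

The point of the design is that the expensive external grader is needed only to label training data for the reward probes; at inference, reranking N samples costs N cheap forward passes plus N probe evaluations.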
Experiments are conducted on the LongFact++ dataset (≈20K long-form factual questions). The authors train the two probes using attention-based lightweight networks, showing that a two-step pipeline (localization → classification) yields high true-positive rates for hallucination detection. After RL training, the resulting policy reduces the probability of hallucination by 58% compared with the original Gemma-3-12B-IT when evaluated with the probing harness, while preserving performance on standard benchmarks such as MMLU, ARC, and TruthfulQA. Moreover, the feature-based reward is roughly 90× cheaper to compute than the gold-standard grader, demonstrating substantial efficiency gains.
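The two-step detection pipeline (localization probe → classification probe) can be wired together as below. The span finder and classifier here are trivial stand-ins, not the paper's attention-based probes; the function names and the toy blocklist are illustrative assumptions.

```python
def detect_hallucinations(text, localize, classify, threshold=0.5):
    """Run localization to find candidate factual spans, then
    classify each span; return the spans flagged as hallucinated.

    localize: text -> list of (start, end) character spans
    classify: span text -> probability the claim is hallucinated
    """
    flagged = []
    for start, end in localize(text):
        span = text[start:end]
        p = classify(span)
        if p >= threshold:
            flagged.append((span, p))
    return flagged

# Toy stand-ins: "localize" capitalized words as candidate entities,
# "classify" by membership in a tiny blocklist of known-false claims.
text = "Einstein was born in Paris in 1879."
localize = lambda t: [(i, i + len(w)) for i, w in
                      ((t.index(w), w) for w in t.split())
                      if w[0].isupper()]
classify = lambda s: 0.9 if s.strip(".") == "Paris" else 0.1
flags = detect_hallucinations(text, localize, classify)
```

Splitting detection into these two stages matters for the reported true-positive rates: the localizer narrows the search to spans that make factual claims at all, so the classifier only has to separate supported from unsupported claims rather than scan raw text.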
The paper’s key contribution is the reconceptualization of internal model features as scalable supervision signals rather than merely monitoring tools. By leveraging the fact that pretrained models already encode latent variables relevant to the target behavior, the authors achieve sample‑efficient RL: the policy learns from dense signals that are already present in the model, avoiding the need for large amounts of human‑annotated preference data. The approach also mitigates classic reward‑model pitfalls such as non‑identifiability and reward hacking, because the low‑expressivity probes constrain the space of possible reward functions.
Limitations include the reliance on well-calibrated probes; if the probes misestimate uncertainty, the policy could be misguided. The current work focuses exclusively on hallucination reduction, and extending the method to other open-ended objectives (e.g., helpfulness, fairness, creativity) will require new probing definitions and validation. Additionally, the experiments are limited to a single 12B-parameter model; scalability to larger or multimodal models remains an open question.
In summary, the authors demonstrate that interpretability tools can be repurposed to provide inexpensive, dense reward signals for reinforcement learning, opening a new paradigm for aligning LLMs on behaviors that are otherwise too costly to verify directly. This bridges the gap between interpretability research and alignment, suggesting a promising direction for future work on scalable, open‑ended model supervision.