ETS: Energy-Guided Test-Time Scaling for Training-Free RL Alignment
Reinforcement Learning (RL) post-training alignment for language models is effective, but also costly and unstable in practice, owing to its complicated training process. To address this, we propose a training-free inference method that samples directly from the optimal RL policy. Under the Masked Language Modeling (MLM) framework, the optimal transition probability factorizes into a reference policy term and an energy term. Based on this, our algorithm, Energy-Guided Test-Time Scaling (ETS), estimates the key energy term via online Monte Carlo, with a provable convergence rate. Moreover, to ensure practical efficiency, ETS leverages modern acceleration frameworks alongside tailored importance sampling estimators, substantially reducing inference latency while provably preserving sampling quality. Experiments on MLMs (including autoregressive models and diffusion language models) across reasoning, coding, and science benchmarks show that ETS consistently improves generation quality, validating its effectiveness and design.
💡 Research Summary
The paper introduces Energy‑Guided Test‑Time Scaling (ETS), a training‑free inference technique that samples directly from the optimal reinforcement‑learning (RL) policy for large language models (LLMs). Traditional RL‑from‑human‑feedback (RLHF) pipelines require costly post‑training, large preference datasets, and suffer from unstable dynamics and hyper‑parameter sensitivity. The authors observe that the KL‑regularized RL objective admits a closed‑form optimal policy:
$$p^{*}(x \mid y) \propto p_{\text{ref}}(x \mid y)\exp\!\big(r(y,x)/\lambda\big),$$
where $p_{\text{ref}}$ is a fixed reference model and $r$ is a reward function. Rather than approximating this distribution via gradient‑based training, ETS directly samples from it at inference time.
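As a minimal sketch of what "sampling from the optimal policy" means, the following draws candidates from a stand-in reference model and resamples them with self-normalized importance weights $\exp(r/\lambda)$. Note that `toy_ref_sample` and `toy_reward` are illustrative stand-ins, not the paper's models or reward.

```python
import math
import random

def toy_ref_sample(y, rng):
    # Stand-in for a draw from p_ref(x|y): a "completion" encoded as a float in [0, 1].
    return rng.random()

def toy_reward(y, x):
    # Stand-in reward r(y, x): prefers completions near 0.8.
    return -abs(x - 0.8)

def sample_from_optimal_policy(y, lam=0.1, n_candidates=64, seed=0):
    # Approximate a draw from p*(x|y) ∝ p_ref(x|y) exp(r(y,x)/λ):
    # sample candidates from p_ref, weight each by exp(r/λ), then resample.
    rng = random.Random(seed)
    candidates = [toy_ref_sample(y, rng) for _ in range(n_candidates)]
    weights = [math.exp(toy_reward(y, x) / lam) for x in candidates]
    total = sum(weights)
    probs = [w / total for w in weights]
    return rng.choices(candidates, weights=probs, k=1)[0]

x_star = sample_from_optimal_policy("prompt")
```

Smaller $\lambda$ concentrates the weights on high-reward candidates, recovering the usual temperature-like trade-off between reward maximization and staying close to the reference model.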
To make this feasible, the authors work within a unified Masked Language Modeling (MLM) framework that subsumes both autoregressive models (ARMs) and diffusion language models (DLMs). In MLM, generation proceeds as a backward Markov chain from a fully masked sequence to the final output, with a mask set $M_t$ determining the decoding order. The optimal backward transition kernel can be factorized into (1) the reference model’s transition $p_{\text{ref}}(x_s \mid x_t, y)$ and (2) an “energy” term
$$E(y,x_s)=\mathbb{E}_{p_{\text{ref}}(x_0 \mid y,\,x_s)}\!\big[\exp\!\big(r(y,x_0)/\lambda\big)\big],$$
which ETS estimates with online Monte Carlo rollouts from the reference model.
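The Monte Carlo estimate of the energy term can be sketched as follows: complete the partially masked state $x_s$ to a full sequence $x_0$ under a stand-in reference model several times, and average $\exp(r/\lambda)$ over the rollouts. Here `toy_complete` and `toy_reward` are hypothetical stand-ins for $p_{\text{ref}}(x_0 \mid y, x_s)$ and $r(y, x_0)$, not the paper's components.

```python
import math
import random

def toy_complete(y, x_s, rng):
    # Stand-in rollout from p_ref(x_0|y, x_s): fill masked positions (None)
    # with random tokens from a 10-token vocabulary.
    return [tok if tok is not None else rng.randrange(10) for tok in x_s]

def toy_reward(y, x0):
    # Stand-in reward: fraction of tokens equal to 7.
    return sum(1 for t in x0 if t == 7) / len(x0)

def estimate_energy(y, x_s, lam=1.0, n_rollouts=128, seed=0):
    # Monte Carlo estimate of E(y, x_s) = E_{p_ref(x_0|y,x_s)}[exp(r(y,x_0)/λ)].
    rng = random.Random(seed)
    vals = [math.exp(toy_reward(y, toy_complete(y, x_s, rng)) / lam)
            for _ in range(n_rollouts)]
    return sum(vals) / n_rollouts

x_s = [3, None, None, 7]  # partially decoded sequence with two masked positions
energy = estimate_energy("prompt", x_s)
```

In the actual method, this estimate reweights the reference transition $p_{\text{ref}}(x_s \mid x_t, y)$ at each decoding step; the rollout cost is what the paper's acceleration frameworks and importance sampling estimators are designed to reduce.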