REAL: Regression-Aware Reinforcement Learning for LLM-as-a-Judge



Yasi Zhang 1*, Tianyu Chen 2*, Mingyuan Zhou 2, Oscar Leong 1, Ying Nian Wu 1, Michal Lukasik 3†

Abstract

Large language models (LLMs) are increasingly deployed as automated evaluators that assign numeric scores to model outputs, a paradigm known as LLM-as-a-Judge. However, standard Reinforcement Learning (RL) methods typically rely on binary rewards (e.g., 0-1 accuracy), thereby ignoring the ordinal structure inherent in regression tasks; for instance, they fail to recognize that predicting 4 is significantly better than predicting 1 when the ground truth is 5. Conversely, existing regression-aware approaches are often confined to Supervised Fine-Tuning (SFT), limiting their ability to explore optimal reasoning paths. To bridge this gap, we propose REAL (REgression-Aware Reinforcement Learning), a principled RL framework designed to optimize regression rewards, which is also proven to be optimal for correlation metrics. A key technical challenge is that the regression objective is explicitly policy-dependent, invalidating standard policy gradient methods. To address this, we employ a generalized policy gradient estimator, which naturally decomposes optimization into two complementary components: (1) exploration over the Chain-of-Thought (CoT) trajectory, and (2) regression-aware refinement of the final score prediction. Extensive experiments across model scales (8B to 32B) demonstrate that REAL consistently outperforms both regression-aware SFT baselines and standard RL methods, exhibiting significantly better generalization on out-of-domain benchmarks. On Qwen3-32B specifically, we achieve gains of +8.40 Pearson and +7.20 Spearman correlation over the SFT baseline, and +18.30/+11.20 over the base model.
These findings highlight the critical value of integrating regression objectives into RL exploration for accurate LLM evaluation.

* Equal contribution. 1 University of California, Los Angeles; 2 The University of Texas at Austin; 3 Google Research. † Now at Google DeepMind. Correspondence to: Yasi Zhang <yasminzhang@ucla.edu>, Tianyu Chen <tianyuchen@utexas.edu>.

Preprint. March 19, 2026.

Figure 1. Overview of the REAL framework. REAL addresses the limitations of standard RL in LLM-as-a-Judge tasks by optimizing a policy-dependent regression reward. The framework employs a generalized policy gradient whose update decomposes into two terms: (1) Exploration Over Reasoning Trajectory; and (2) Regression-Aware Prediction Refinement. This enables principled optimization of the ordinal structure that standard RL with binary rewards typically ignores. The full algorithm is in Alg. 2.

1. Introduction

Large Language Models (LLMs) (Brown et al., 2020; Achiam et al., 2023) have evolved beyond mere content generators to become sophisticated evaluators, a paradigm known as LLM-as-a-Judge (Gu et al., 2024; Chiang & Lee, 2023). This role is now central to diverse applications, from assessing text quality (Ouyang et al., 2022) and instruction following (Zhu et al., 2025) to safety alignment (Zhu et al., 2024) and preference modeling (Kim et al., 2023). In these contexts, the model must produce a numeric score that accurately reflects quality, correctness, or preference intensity. Despite the regression nature of these tasks, standard practices, exemplified by Prometheus 1 and 2 (Kim et al., 2023; 2024), rely on traditional SFT via cross-entropy loss.
By treating numeric scores as categorical tokens, these methods ignore the inherent ordinal structure of the data. Recent advances, such as Regression-Aware Fine-Tuning (RAFT) (Lukasik et al., 2025; 2024), have begun to bridge this gap by optimizing models with regression losses over expected numerical predictions. Extensions like TRACT (Chiang et al., 2025) further incorporate Chain-of-Thought (CoT) supervision to improve reasoning faithfulness. However, these approaches remain confined to the SFT regime: they depend on static ground-truth trajectories and lack a principled mechanism for exploring model-generated reasoning pathways.

Transitioning from regression-aware SFT to Reinforcement Learning (RL) is therefore a natural progression to enable exploration over CoT pathways. This shift allows the model to search for optimal reasoning trajectories guided by downstream numerical objectives. However, standard RL post-training frameworks (Schulman et al., 2017; Guo et al., 2025; Ahmadian et al., 2024) typically rely on rule-based verifiers that produce binary rewards (Guo et al., 2025) (e.g., 0-1 accuracy), creating a mismatch between the regression nature of the LLM-as-a-Judge task and categorical RL optimization. For instance, standard RL fails to recognize that predicting a 4 is significantly better than predicting a 1 when the ground truth is 5. Empirically, we also find that standard RL leads to suboptimal learning behavior, specifically a collapse in correlation metrics (see Fig. 2 and Tab. 2).

To address this challenge, we introduce Regression-Aware Reinforcement Learning (REAL), a principled framework designed specifically for LLM-as-a-Judge and pointwise evaluation tasks, as presented in Fig. 1.
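To make the mismatch concrete, the following minimal sketch (our illustration, not code from the paper) contrasts a binary verifier reward with a squared-error reward when the ground truth is 5:

```python
def binary_reward(y_pred: float, y_true: float) -> float:
    # Standard rule-based verifier: reward 1 only on an exact match.
    return 1.0 if y_pred == y_true else 0.0

def regression_reward(y_pred: float, y_true: float) -> float:
    # Negative squared error: closer predictions earn a higher reward.
    return -(y_pred - y_true) ** 2

y_true = 5
for y_pred in (4, 1):
    print(y_pred, binary_reward(y_pred, y_true), regression_reward(y_pred, y_true))
# The binary reward scores both misses identically (0.0), while the
# regression reward prefers 4 (reward -1.0) over 1 (reward -16.0).
```

This is exactly the ordinal signal that a 0-1 accuracy reward discards.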
However, integrating regression-aware objectives into RL introduces a significant theoretical challenge: the resulting reward functions depend explicitly on the policy parameters themselves. To resolve this policy-dependent reward issue, we employ the generalized policy gradient (Schulman et al., 2015), which explicitly accounts for parameter-dependent rewards. The resulting derivation decomposes the learning process into two complementary components: (1) a CoT policy-gradient term that encourages exploration over reasoning trajectories based on downstream regression-aware rewards; and (2) a prediction refinement term that provides regression-aware supervision to the final numerical prediction via standard backpropagation. This decomposition clarifies how reasoning exploration and numerical accuracy can be jointly optimized within a unified framework.

Theoretically, Sec. 3 proves that minimizing the regression loss is optimal with respect to correlation metrics. Empirically, REAL consistently outperforms prior regression-aware SFT methods (Lukasik et al., 2025; Chiang et al., 2025) across multiple benchmarks (see Fig. 3). Notably, REAL exhibits superior generalization to out-of-domain tasks, highlighting the importance of modeling regression structure directly within RL. While concurrent works like J1 (Whitehouse et al., 2025) also apply RL to the LLM-as-a-Judge setting, they fail to exploit the ordinal nature of numeric scoring. To the best of our knowledge, we are the first to successfully integrate a regression-aware objective directly into the RL training pipeline for LLM evaluators.

Figure 2. Evaluation performance during RL training. Both standard RL with binary reward (i.e., $r_{\text{acc}} = \mathbf{1}(y = y^*)$) and REAL with regression-aware reward (i.e., Eq. 5) were initialized from the SOTA SFT checkpoint, i.e., TRACT (Chiang et al., 2025).
Standard RL results in suboptimal performance on correlation metrics compared to our proposed approach, REAL.

Figure 3. Performance gains on Qwen3-32B. Our method achieves an average increase of +8.40 Pearson and +7.20 Spearman correlation over the SFT baseline, and +18.30/+11.20 over the base model.

In summary, our contributions are threefold:

1. We propose REAL, a principled regression-aware RL framework that bridges the gap between standard RL and regression-based tasks by introducing a policy-dependent regression loss, as presented in Fig. 1. This formulation explicitly models the ordinal structure of evaluation tasks, addressing the inherent limitations of binary rewards (see Fig. 2). Furthermore, we provide a theoretical proof demonstrating that minimizing this regression loss also optimizes correlation metrics, as presented in Sec. 3.

2. We employ a generalized policy gradient estimator that enables mathematically sound optimization of policy-dependent objectives. This derivation reveals a functional decomposition that decouples the exploration of reasoning trajectories (CoT) from the optimization of final numerical predictions.

3. Experiments across model scales (8B to 32B) demonstrate that REAL consistently outperforms both regression-aware SFT baselines and standard RL methods. REAL exhibits superior generalization on out-of-domain benchmarks; specifically, on Qwen3-32B, we achieve gains of +8.40 Pearson and +7.20 Spearman correlation over the SFT baseline, and +18.30/+11.20 over the base model, as shown in Fig. 3.

2. Background

2.1. LLMs and Reinforcement Learning

LLMs are typically trained in phases, starting from pre-training (Kaplan et al., 2020; Brown et al., 2020), followed by multiple stages of post-training (Ouyang et al.
, 2022), most notably an alignment phase where an LLM is trained to improve its reasoning capabilities via Reinforcement Learning (RL) (Guo et al., 2025).

Let $x$ denote the input prompt, $c$ the chain-of-thought (CoT), and $y$ the final answer. We define $\mathcal{D}$ as the training set, which, depending on the context, may consist of prompts $\{x\}$, prompt-target pairs $\{(x, y^*)\}$, or triplets $\{(x, c^*, y^*)\}$. Here, the superscript $*$ denotes ground truth, as opposed to a value predicted by the model. Furthermore, let $\pi_\theta(\cdot \mid x)$ represent the distribution over token sequences, typically the CoT reasoning $c$ followed by the final answer $y$, as generated by a language model with learnable parameters $\theta$. Fundamentally, the objective of reinforcement learning (Sutton et al., 1999) is to maximize a reward:

$$\mathcal{L}(\theta) = \mathbb{E}_{x \sim \mathcal{D},\, (c, y) \sim \pi_\theta(\cdot \mid x)}[r(x, y)], \quad (1)$$

where the reward function $r$ is commonly an accuracy-based reward of the form $\mathbf{1}(y = y^*)$ when the ground-truth label $y^*$ is available in the training dataset.

Since the reward $r(x, y)$ in LLM applications is typically only available after the entire sequence is generated, it is common to treat the full completion as a single action, resulting in a bandit-style formulation. This enables the use of the REINFORCE gradient estimator (Williams, 1992; Nguyen et al., 2017). Under this formulation, the policy gradient can be written as:

$$\nabla_\theta \mathcal{L}(\theta) = \mathbb{E}_{x \sim \mathcal{D},\, (c, y) \sim \pi_\theta(\cdot \mid x)}[r(x, y)\, \nabla_\theta \log \pi_\theta(y, c \mid x)]. \quad (2)$$

In this formulation, the reward $r(x, y)$ serves as a scalar weight on the policy gradient, amplifying updates for high-reward generations while attenuating those associated with low rewards. We note that this derivation of the gradient relies on the assumption that the reward does not explicitly depend on the policy parameters, i.e., $\nabla_\theta r(x, y) = 0$.
2.2. Regression-Aware Inference and Finetuning in LLMs

Standard supervised fine-tuning (SFT) for LLMs typically relies on cross-entropy loss (Brown et al., 2020). This objective is suboptimal for regression tasks, as it treats all incorrect numerical tokens equally, failing to account for their metric distance from the target value. To address this, several methods have been proposed to align LLM outputs with the inherent structure of numerical targets.

Regression-Aware Fine-Tuning (RAFT) (Lukasik et al., 2025) replaces token-level classification with a squared-error loss. This loss is computed between the ground-truth score and an expected-value predictor, which applies a Bayes-optimal decision rule to the model's output distribution. Specifically, given an input $x$ and ground-truth reasoning $c^*$ and label $y^*$ from a dataset $\mathcal{D}$, the RAFT objective to minimize is defined as:

$$\mathcal{L}_{\text{RAFT}}(\theta) = \mathbb{E}_{(x, c^*, y^*) \sim \mathcal{D}}\big[\underbrace{(\hat{y}_\theta(x, c^*) - y^*)^2}_{\text{regression loss}} - \underbrace{\lambda \log \pi_\theta(c^*, y^* \mid x)}_{\text{next-token prediction loss}}\big], \quad (3)$$

where $\hat{y}_\theta(x, c) = \sum_{k \in \mathcal{K}} k \cdot \pi_\theta(k \mid x, c)$ denotes the expected value of the numerical prediction, originally proposed as the RAIL predictor (see Alg. 1) for regression-aware inference (Lukasik et al., 2024). Here, $\mathcal{K} = \{0, 1, \ldots, 9\}$ represents the set of single-digit numerical tokens, and $\lambda$ is a hyperparameter balancing the regression objective with the standard next-token prediction (NTP) loss. The RAIL predictor has been shown to offer "free-lunch" improvements in Pearson and Spearman correlation (Lukasik et al., 2025; Chiang et al., 2025), a finding we corroborate in Tab. 2.

Algorithm 1: RAIL predictor (Lukasik et al., 2024)
Require: input prompt $x$, LLM $\pi_\theta$
1: Generate chain of thought $c \sim \pi_\theta(\cdot \mid x)$
2: Compute the expected value over the digit token set $\mathcal{K}$: $\hat{y}_\theta \leftarrow \sum_{k \in \mathcal{K}} k \cdot \pi_\theta(k \mid x, c)$
3: return the RAIL prediction $\hat{y}_\theta$

Building on this foundation, TRACT (Chiang et al.
, 2025) extends the RAFT framework by incorporating self-generated reasoning paths. TRACT uses a two-stage SFT pipeline: initially, a temporary model $\pi_{\text{temp}}$ is trained via the RAFT objective (Eq. 3). Subsequently, a refined dataset is synthesized by pairing the original inputs $x$ with reasoning trajectories $c$ sampled from the temporary policy, $c \sim \pi_{\text{temp}}(\cdot \mid x)$, while retaining the ground-truth labels $y^*$. The final model is then trained on this augmented triplet dataset $\{(x, c, y^*)\}$ using the same RAFT objective. While this self-generation strategy significantly improves performance, it remains fundamentally a static SFT procedure: it lacks a principled mechanism for continuous, active exploration of the reasoning space, a limitation we overcome by transitioning to a reinforcement learning framework.

3. Optimal Decision Rule for LLM-as-a-Judge Evaluation Metrics

Standard approaches for LLM post-training often frame reinforcement learning using a binary accuracy reward (i.e., the indicator function $\mathbf{1}\{\hat{y} = y^*\}$), strictly optimizing for prediction correctness while performing inference via standard stochastic decoding. We argue that this paradigm is theoretically suboptimal for benchmarks primarily evaluated on correlation metrics, such as LLM-as-a-Judge (Gu et al., 2024; Chiang & Lee, 2023; Zheng et al., 2023; Liu et al., 2023a; Zhu et al., 2024; Ouyang et al., 2022; Zhu et al., 2025). Prior works on regression-aware LLMs showed that other decision rules are optimal for the regression loss (Lukasik et al., 2025; Chiang et al., 2025), but an optimal rule for correlation has not been discussed in these works, despite their evaluating on correlation.

A fundamental challenge in optimizing for correlation is that it is inherently a population statistic defined over a set of predictions, rather than a single data point. Consequently, correlation cannot be directly used as a standard per-sample reward signal in reinforcement learning. We formally demonstrate that this limitation can be bypassed by identifying a shared optimal estimator: the squared-error loss, a tractable per-sample objective, serves as a natural proxy for maximizing Pearson correlation.

Lemma 3.1 (Optimality of Squared Error for Pearson Correlation). Consider the following distributions: the distribution over input prompts $x \sim P_{\mathcal{D}}(\cdot)$, the distribution over chains of thought (CoTs) $c \sim \pi_\theta(\cdot \mid x)$ generated by the LLM policy $\pi_\theta$, and the distribution over targets conditioned on the inputs, $y^* \sim P(\cdot \mid x)$. We assume the conditional independence $c \perp y^* \mid x$, yielding the joint distribution $P_\theta(x, c, y^*) = P_{\mathcal{D}}(x)\, \pi_\theta(c \mid x)\, P(y^* \mid x)$. All expectations $\mathbb{E}[\cdot]$ below are taken with respect to $P_\theta(x, c, y^*)$. Define the posterior mean $\mu(x, c) \triangleq \mathbb{E}[y^* \mid x, c]$ and the squared-error risk $R(\hat{y}) \triangleq \mathbb{E}\big[(\hat{y}(x, c) - y^*)^2\big]$. Assume $\mathrm{Var}(y^*) > 0$ and $\mathrm{Var}(\mu(x, c)) > 0$. The risk $R(\hat{y})$ is minimized (a.s.) by $\hat{y}^*(x, c) = \mu(x, c)$, and the Pearson correlation $\rho(\hat{y}(x, c), y^*)$ is maximized by any positive affine transform of $\mu$: $\hat{y}(x, c) = a\,\mu(x, c) + b$, with $a > 0$, $b \in \mathbb{R}$. In particular, the squared-error minimizer $\hat{y}^*(x, c) = \mathbb{E}[y^* \mid x, c]$ is also optimal for maximizing Pearson correlation. In other words, by minimizing the squared error, we implicitly train the predictor to be the optimal estimator for the Pearson correlation metric.

The proof, provided in Appendix B, establishes that the Pearson correlation is maximized by any positive linear transformation of the conditional expectation (Lemma B.1), and that the squared-error loss is uniquely minimized by that same conditional expectation (Lemma B.2).
By minimizing the squared loss, we implicitly train the predictor to be the optimal estimator for the Pearson correlation metric. This result aligns with empirical findings in RAFT (Lukasik et al., 2025) and TRACT (Chiang et al., 2025), which demonstrate that regression-aware objectives are effective for both squared-error and correlation-based evaluation metrics.

4. REAL: Regression-Aware Reinforcement Learning

In this section, we introduce Regression-Aware Reinforcement Learning (REAL), a framework that leverages regression-based rewards to optimize LLM evaluators. As theoretically demonstrated in Sec. 3, this formulation is optimal for maximizing correlation metrics.

We first note that REAL extends the principles of regression-aware supervised fine-tuning, specifically RAFT (Lukasik et al., 2025) and TRACT (Chiang et al., 2025), into the reinforcement learning regime. By transitioning to RL, the model is no longer confined to static, ground-truth reasoning but is instead empowered to explore and refine its own self-generated reasoning trajectories. A key distinction in our framework is that the resulting regression reward is policy-dependent, which necessitates a generalized policy gradient to ensure valid optimization. This estimator naturally decomposes the training process into two distinct components: the exploration of reasoning pathways and the refinement of numerical predictions.

4.1. REAL Objective

We define the REAL objective $\mathcal{L}_{\text{REAL}}(\theta)$ to minimize as:

$$\mathcal{L}_{\text{REAL}}(\theta) = \mathbb{E}_{(x, y^*) \sim \mathcal{D},\, c \sim \pi_\theta(\cdot \mid x)}\big[(\hat{y}_\theta(x, c) - y^*)^2 - \lambda \log \pi_\theta(y^* \mid x, c)\big]. \quad (4)$$

In other words, we augment the regression loss with an auxiliary log-likelihood term on the final answer tokens. Setting $\lambda = 0$ recovers the exact regression loss, which is of key importance for endowing the model with better numerical prediction abilities.
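Given per-digit probabilities extracted from the policy, the expected-value (RAIL) prediction and the per-sample REAL reward can be sketched as follows. This is a minimal illustration: the function names are ours, and we assume the digit distribution and answer log-probability have already been read off the model.

```python
import math

def rail_prediction(digit_probs):
    # Expected value over the digit token set K = {0, ..., 9}
    # (the RAIL predictor of Lukasik et al., 2024).
    assert abs(sum(digit_probs) - 1.0) < 1e-6
    return sum(k * p for k, p in enumerate(digit_probs))

def real_reward(digit_probs, logp_answer, y_star, lam=1.0):
    # Policy-dependent REAL reward (Eq. 5): negative squared error of the
    # expected prediction plus a log-likelihood bonus on the true answer.
    y_hat = rail_prediction(digit_probs)
    return -(y_hat - y_star) ** 2 + lam * logp_answer

# Toy digit distribution peaked near 4 when the ground truth is 5:
probs = [0.0, 0.0, 0.0, 0.05, 0.7, 0.2, 0.05, 0.0, 0.0, 0.0]
y_hat = rail_prediction(probs)   # 0.05*3 + 0.7*4 + 0.2*5 + 0.05*6 = 4.25
print(round(y_hat, 2))           # 4.25
print(round(real_reward(probs, math.log(probs[5]), y_star=5), 3))
```

Because the reward is computed from $\pi_\theta$'s own probabilities rather than from an external verifier, it moves whenever $\theta$ moves, which is exactly the policy-dependence addressed in the next subsection.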
Augmenting with the log-likelihood term is also in line with the SFT objective design in TRACT (Chiang et al., 2025). Consequently, the implicit reward function in our framework is defined as:

$$r_{\text{REAL}}(\theta, x, c) = -(\hat{y}_\theta(x, c) - y^*)^2 + \lambda \log \pi_\theta(y^* \mid x, c). \quad (5)$$

Crucially, unlike standard RL post-training, where the reward is provided by a fixed external model, our reward function $r_{\text{REAL}}$ explicitly depends on the policy parameters $\theta$ (via the estimator $\hat{y}_\theta$ and the probability $\pi_\theta$).

4.2. Generalized Policy Gradient with Policy-Dependent Reward Functions

In standard LLM post-training methodologies, the reward function is typically modeled as an external signal derived from the environment or a fixed preference model, independent of the current policy $\pi_\theta$ (i.e., $\nabla_\theta r = 0$). However, our framework introduces a reward function $r_{\text{REAL}}(\theta, x, c)$ (Eq. 5) that explicitly depends on the policy parameters $\theta$. Consequently, the standard policy gradient formulation is insufficient. In this section, we formalize the generalized policy gradient (Schulman et al., 2015) for this class of policy-dependent reward functions.

Lemma 4.1 (Generalized Policy Gradient with Policy-Dependent Rewards for Regression). Let the objective be $\mathcal{L}(\theta) = \mathbb{E}_{x \sim \mathcal{D}}\, \mathbb{E}_{c \sim \pi_\theta(\cdot \mid x)}[r(\theta, x, c)]$. The gradient $\nabla_\theta \mathcal{L}(\theta)$ is given by:

$$\nabla_\theta \mathcal{L}(\theta) = \mathbb{E}_{(x, y^*) \sim \mathcal{D},\, c \sim \pi_\theta(\cdot \mid x)}\big[\underbrace{r(\theta, x, c)\, \nabla_\theta \log \pi_\theta(c \mid x)}_{\text{Term 1: CoT Update}} + \underbrace{\nabla_\theta r(\theta, x, c)}_{\text{Term 2: Prediction Update}}\big]. \quad (6)$$

The first term optimizes the CoT generation, weighted by the reward; the second term provides supervision for final-answer prediction through standard backpropagation. Substituting the REAL reward function (Eq. 5) into the general gradient expression (Eq. 6) yields the gradient of the REAL objective (Eq. 4).
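To see that Eq. 6 really is the gradient when the reward depends on $\theta$, the following self-contained toy check (our illustration; a tabular softmax policy with parameters `theta` for CoT selection and `phi` for digit prediction stands in for the LLM, with $\lambda = 0$) compares the two-term analytic gradient against finite differences of the objective:

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

# Tiny model: theta selects one of two CoTs; phi[c] gives digit logits per CoT.
theta = [0.3, -0.2]
phi = [[0.1 * k for k in range(10)], [0.5 - 0.1 * k for k in range(10)]]
Y_STAR = 5.0

def y_hat(c):
    p = softmax(phi[c])
    return sum(k * pk for k, pk in enumerate(p))

def reward(c):
    # Policy-dependent regression reward (the lambda = 0 case of Eq. 5).
    return -(y_hat(c) - Y_STAR) ** 2

def objective():
    pi = softmax(theta)
    return sum(pi[c] * reward(c) for c in range(2))

pi = softmax(theta)
# Term 1 (CoT update): E[r * grad log pi(c)] with respect to theta.
grad_theta = [sum(pi[c] * reward(c) * ((j == c) - pi[j]) for c in range(2))
              for j in range(2)]
# Term 2 (prediction update): E[grad_phi r], using
# d y_hat / d phi[c][k] = p_k * (k - y_hat) for a softmax head.
grad_phi = [[0.0] * 10 for _ in range(2)]
for c in range(2):
    p, yh = softmax(phi[c]), y_hat(c)
    for k in range(10):
        grad_phi[c][k] = pi[c] * (-2.0) * (yh - Y_STAR) * p[k] * (k - yh)

# Central finite-difference check of both terms.
EPS = 1e-6
for j in range(2):
    theta[j] += EPS; up = objective()
    theta[j] -= 2 * EPS; down = objective()
    theta[j] += EPS
    assert abs((up - down) / (2 * EPS) - grad_theta[j]) < 1e-5
for c in range(2):
    for k in range(10):
        phi[c][k] += EPS; up = objective()
        phi[c][k] -= 2 * EPS; down = objective()
        phi[c][k] += EPS
        assert abs((up - down) / (2 * EPS) - grad_phi[c][k]) < 1e-5
print("generalized policy gradient matches finite differences")
```

In this toy, the reward depends only on the prediction parameters, so Term 1 alone recovers the gradient with respect to `theta` and Term 2 alone the gradient with respect to `phi`; for the full REAL reward with $\lambda > 0$, both terms touch the same parameters.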
Specifically, Term 2 can be expanded as:

$$\text{Term 2} = -2(\hat{y}_\theta(x, c) - y^*)\, \nabla_\theta \hat{y}_\theta(x, c) + \lambda\, \nabla_\theta \log \pi_\theta(y^* \mid x, c), \quad (7)$$

where the gradient of the predicted value is $\nabla_\theta \hat{y}_\theta(x, c) = \sum_{k \in \mathcal{K}} k \cdot \nabla_\theta \pi_\theta(k \mid x, c)$.

Distinguishing Reasoning Exploration from Prediction Refinement. The REAL gradient (Eq. 6) explicitly decomposes optimization into two distinct parts: exploring reasoning trajectories (Term 1) and refining the score prediction (Term 2). This separation is critical because standard RL approaches like GRPO (Shao et al., 2024) treat the reasoning chain $c$ and the final answer $y$ as a homogeneous sequence, applying a uniform update rule. In contrast, our framework addresses their structural differences: the CoT update enables exploration via policy gradients, whereas the final score targets a known ground truth, allowing direct correction via standard backpropagation.

4.3. Stabilization and Implementation

To enhance the stability of the REAL objective during training, we use the RLOO estimator (Kool et al., 2019) for the first term, which governs reasoning exploration. Other stabilization approaches (Schulman et al., 2017; Guo et al., 2025) can be explored in future work; we note that this choice is not central to the core contribution of our paper, i.e., the regression-aware RL objective and algorithm.

For a given input $(x, y^*)$, we sample $K$ independent CoT trajectories $\{c_1, c_2, \ldots, c_K\}$ from the current policy $\pi_\theta(\cdot \mid x)$. Let $r^{(i)} = r_{\text{REAL}}(\theta, x, c_i)$ denote the reward associated with the $i$-th trajectory. The advantage $A^{(i)}$ is computed as $A^{(i)} = r^{(i)} - b^{(i)}$, with $b^{(i)} = \frac{1}{K-1} \sum_{j \neq i} r^{(j)}$. We further compute the standardized advantage $\tilde{A}^{(i)} = \mathrm{clip}\big(\frac{A^{(i)}}{\sigma(A) + \epsilon}, -1, 1\big)$ for better training stability, where $\sigma(A)$ denotes the standard deviation of the advantages within the sampled group and $\epsilon = 10^{-8}$.
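The leave-one-out baseline and clipped standardization just described can be sketched as follows (our illustration with made-up group rewards):

```python
def rloo_advantages(rewards, eps=1e-8):
    # Leave-one-out baseline: each sample is judged against the mean
    # reward of the other K-1 samples in its group (Kool et al., 2019).
    K = len(rewards)
    total = sum(rewards)
    adv = [r - (total - r) / (K - 1) for r in rewards]
    # Standardize by the group standard deviation, then clip to [-1, 1].
    mean = sum(adv) / K
    std = (sum((a - mean) ** 2 for a in adv) / K) ** 0.5
    return [max(-1.0, min(1.0, a / (std + eps))) for a in adv]

print(rloo_advantages([1.0, 2.0, 3.0, 6.0]))
# RLOO advantages sum to zero before clipping; outliers saturate at +/-1.
```

A useful property visible here is that the raw leave-one-out advantages always sum to zero within a group, so the CoT update cannot uniformly inflate or deflate all trajectories.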
Then, from Eq. 6, replacing the reward by the RLOO advantage in Term 1 and adding a weight $\beta$ on Term 2 to control the strength of the prediction updates, we obtain the stabilized gradient estimator:

$$\nabla_\theta \mathcal{L}^{\text{RLOO}}_{\text{REAL}}(\theta) \approx \frac{1}{K} \sum_{i=1}^{K} \big[\underbrace{\tilde{A}^{(i)}_{\text{REAL}}\, \nabla_\theta \log \pi_\theta(c_i \mid x)}_{\text{CoT Update}} + \underbrace{\beta\, \nabla_\theta r_{\text{REAL}}(\theta, x, c_i)}_{\text{Prediction Update}}\big]. \quad (8)$$

Following (Tang et al., 2025), we introduce a prediction weight $\beta$ to balance Term 1 and Term 2. Note that Eq. 6 implies an exact weighting of $\beta = 1.0$, which we find to already bring good results, as shown in Tab. 8.

4.4. Relations with Prior Methods

We provide a complete conceptual comparison with prior methods in Tab. 1.

Table 1. Conceptual comparison of selected SFT and RL training methods on non-ordinal and ordinal targets. $r(\theta, x, c) = -(\hat{y}_\theta(x, c) - y^*)^2 + \lambda \log \pi_\theta(y^* \mid x, c)$ denotes the regression-aware reward. REAL is the first framework to combine a regression-aware reward with reinforcement learning for reasoning exploration. Note that JEPO further approximates the gradient with the multi-sample Jensen's bound.

Method | Paradigm | CoT Exploration | Ordinal Awareness | Gradient
SFT | SFT | ✗ | ✗ | $\mathbb{E}_{(x, y^*) \sim \mathcal{D}}[\nabla_\theta \log \pi_\theta(y^* \mid x)]$
RAFT/TRACT | SFT | ✗ | ✓ | $\mathbb{E}_{(x, c, y^*) \sim \mathcal{D}}[\nabla_\theta r(\theta, x, c)]$
Standard RL | RL | ✓ | ✗ | $\mathbb{E}_{x \sim \mathcal{D},\, (c, y) \sim \pi_\theta(\cdot \mid x)}[r(x, y)\, \nabla_\theta \log \pi_\theta(y, c \mid x)]$
JEPO (Tang et al., 2025) | RL | ✓ | ✗ | $\mathbb{E}_{x \sim \mathcal{D},\, c \sim \pi_\theta(\cdot \mid x)}[\log \pi_\theta(y^* \mid x, c)\, \nabla_\theta \log \pi_\theta(c \mid x) + \nabla_\theta \log \pi_\theta(y^* \mid x, c)]$
REAL (ours) | RL | ✓ | ✓ | $\mathbb{E}_{(x, y^*) \sim \mathcal{D},\, c \sim \pi_\theta(\cdot \mid x)}[r(\theta, x, c)\, \nabla_\theta \log \pi_\theta(c \mid x) + \nabla_\theta r(\theta, x, c)]$

Comparison to TRACT: Active vs. Static Reasoning. We identify both a conceptual connection and a critical limitation in TRACT (Chiang et al., 2025). While TRACT introduces reasoning traces to the regression setting, it remains confined to the SFT paradigm. Specifically, TRACT treats all self-generated Chain-of-Thought (CoT) trajectories as ground truth, failing to evaluate their intermediate quality. In contrast, REAL leverages a regression-aware reward to actively explore and rank CoT trajectories during RL. Mathematically, TRACT's update is equivalent to the prediction refinement term (Term 2) of our objective, but it lacks the trajectory exploration component (Term 1) necessary for true reinforcement learning, as can be seen by comparing the respective rows in Tab. 1. Thus, TRACT can be viewed as an offline, supervised subset of the more general REAL framework.

Regression-Aware vs. Standard RL. Standard RL typically relies on coarse binary rewards (e.g., 0/1 accuracy), which ignore the ordinal structure inherent in evaluation tasks. As illustrated in Fig. 2, REAL utilizes the policy's full probability distribution to shape a fine-grained, regression-aware reward. This allows the model to distinguish "near-miss" predictions from total failures, leading to significantly higher alignment with human judgment, as demonstrated by the correlation metrics in Tab. 2.

Comparison to JEPO. JEPO (Tang et al., 2025) optimizes the marginal log-likelihood $\log p(y^* \mid x)$ using Jensen's bound to handle unverifiable data. While effective for tasks like mathematical proofs, JEPO lacks the ordinal awareness required for numeric scoring in LLM-as-a-Judge settings. Our REAL objective (Eq. 4) is explicitly regression-aware, providing a more principled approach for tasks where the distance between scores matters. We contrast the JEPO and REAL objectives in Tab. 1. Since JEPO does not release an official codebase, we reproduced their results in our setting. The empirical comparisons in Tab. 10 in the Appendix confirm that REAL consistently outperforms JEPO across all regression metrics.

5. Experiments

We mainly follow the experimental setup in (Chiang et al., 2025; Kim et al.
, 2023; 2024) for evaluation, and implement our RL algorithm using the verl codebase (Sheng et al., 2024).

Models. We implement REAL on three LLMs with different architectures, tokenizers, and sizes to demonstrate robustness across backbones: 1) Mistral2-7B (Jiang et al., 2023), for a fair comparison with TRACT and Prometheus-2, which both use it as their base model; 2) Qwen3-8B (Yang et al., 2025), which uses a different tokenizer than Mistral and has recently shown strong reasoning and instruction-following abilities; and 3) Qwen3-32B (Yang et al., 2025), with which we scale up to demonstrate the scalability of REAL.

Datasets. We train on Feedback Collection (Kim et al., 2023), which consists of approximately 100K pointwise samples together with 1K fine-grained score rubrics and 20K instructions. Following (Chiang et al., 2025; Kim et al., 2023; 2024), we evaluate pointwise LLM-as-a-Judge performance on four widely used benchmarks: 1) Feedback Bench, the official test set for Feedback Collection, which includes 1K non-overlapping score rubrics and 200 instructions; 2) FLASK, a fine-grained evaluation benchmark, consisting of 200 prompts, 12 score rubrics, and 2K responses generated by Alpaca-7B (Taori et al., 2023), Vicuna-13B (Chiang et al., 2023), Bard (Google, 2023), and GPT-3.5-Turbo-0613 (OpenAI, 2023); 3) Vicuna Bench, a single-turn dialogue dataset, which includes 80 test prompts, 80 hand-crafted score rubrics, and 320 responses generated by WizardLM-13B (Xu et al., 2023), Vicuna-13B (Chiang et al., 2023), Llama-2-Chat-13B (Touvron et al., 2023), and GPT-3.5-Turbo-0613 (OpenAI, 2023); and 4) MT Bench, a multi-turn chat benchmark, which contains 80 test prompts, 80 hand-crafted score rubrics, and 320 responses generated by the same models as above. Together, these datasets cover a diverse range of prompts, score rubrics, response styles, and difficulty levels.
In addition, we note that Feedback Bench can be regarded as more of an in-domain test set, while the other three are out-of-distribution.

Evaluation Metrics. We report the Pearson correlation coefficient $r$ (Pearson, 1895), Spearman's rank correlation $\rho$ (Spearman, 1961), and Kendall's $\tau$ (Kendall, 1938) with respect to the ground-truth scores. Additionally, we provide RMSE and MAE in the ablation studies for reference.

Baselines. We compare REAL against zero-shot, SFT, and RL baselines: (i) zero-shot inference using the base model; (ii) RAFT (Lukasik et al., 2025), which applies regression-aware supervised fine-tuning with Eq. 3; (iii) TRACT (Chiang et al., 2025), which performs two-stage regression-aware supervised fine-tuning; note that TRACT simply treats all self-generated CoTs as ground truth for the second stage, without any CoT update, whereas our method performs CoT policy-gradient updates weighted by the REAL reward; (iv) Standard RL (binary reward) (Kool et al., 2019), a baseline that uses accuracy as the reward, where $r_{\text{acc}} = \mathbf{1}(y = y^*)$, and updates via $A_{\text{acc}}\, \nabla_\theta \log \pi_\theta(c, y \mid x)$, where $A_{\text{acc}}$ is the RLOO advantage for $r_{\text{acc}}$; and (v) other baseline reward models: 1) zero-shot GPT-3.5-Turbo (OpenAI, 2023); 2) Prometheus-1-7B (Kim et al., 2023); 3) Prometheus-1-13B (Kim et al., 2023); and 4) Prometheus-2-7B (Kim et al., 2024).

Table 2. Evaluation results (×100) across benchmarks using Pearson (r), Spearman (ρ), and Kendall (τ). All baseline reward model scores are taken from TRACT (Chiang et al., 2025) and Prometheus (Kim et al., 2023). Results for the Mistral and Qwen models are obtained from our own runs. Underline indicates the initialization checkpoint used by REAL. We find that TRACT saturates on the Qwen3-series models, so we choose RAFT as the initialization checkpoint. Mistral2-7B's base model failed to follow instructions and respond in the correct format, so we applied a 100-step warmup. Due to computational constraints, we run only RAFT and REAL for Qwen3-32B; we include all baselines for the other models. Columns per benchmark are r / ρ / τ.

Method | Training | Inference | FB Bench | FLASK | Vic. Bench | MT Bench | Average
Baseline Reward Models
GPT-3.5-Turbo-0613 | None | Standard | 56.3 / 52.1 / 45.3 | 27.0 / 23.2 / 18.7 | 27.5 / 26.7 / 20.2 | 42.2 / 37.1 / 29.9 | 38.3 / 34.8 / 28.5
Prometheus-1-7B | SFT | Standard | 84.7 / 84.9 / 76.7 | 45.7 / 45.7 / 36.5 | 29.3 / 29.5 / 21.6 | 36.7 / 37.1 / 28.5 | 49.1 / 49.3 / 40.8
Prometheus-1-13B | SFT | Standard | 86.0 / 85.8 / 77.1 | 46.6 / 42.9 / 34.6 | 47.3 / 45.1 / 34.1 | 46.7 / 45.5 / 34.5 | 56.7 / 54.8 / 45.1
Prometheus-2-7B | SFT | Standard | 84.5 / 84.7 / 76.5 | 51.2 / 49.3 / 40.5 | 48.8 / 48.0 / 41.1 | 51.9 / 48.3 / 39.2 | 59.1 / 57.6 / 49.3
Prometheus-2-7B | SFT | RAIL | 85.3 / 85.3 / 72.9 | 52.5 / 51.4 / 39.2 | 51.0 / 51.3 / 40.3 | 53.8 / 51.1 / 38.9 | 60.7 / 59.8 / 47.8
Mistral2-7B-Variants
Base (w/ warmup) | None | Standard | 83.1 / 83.3 / 74.8 | 41.5 / 41.9 / 34.1 | 49.2 / 42.4 / 36.1 | 30.9 / 31.8 / 25.3 | 51.2 / 49.8 / 42.6
Base (w/ warmup) | None | RAIL | 83.7 / 84.3 / 70.2 | 42.5 / 43.7 / 32.3 | 50.3 / 45.4 / 34.0 | 32.0 / 29.9 / 21.4 | 52.1 / 50.8 / 39.5
RAFT | SFT | RAIL | 87.9 / 88.0 / 76.3 | 41.8 / 41.9 / 31.5 | 52.8 / 51.3 / 40.9 | 39.9 / 41.8 / 30.7 | 55.6 / 55.8 / 44.8
TRACT | SFT | RAIL | 93.9 / 93.7 / 82.9 | 50.7 / 50.0 / 37.2 | 56.2 / 54.8 / 42.6 | 52.1 / 50.1 / 36.6 | 63.2 / 62.2 / 49.8
Standard RL | RL | RAIL | 93.7 / 93.7 / 82.8 | 51.6 / 50.5 / 37.9 | 58.0 / 56.0 / 43.4 | 52.9 / 50.7 / 37.1 | 64.1 / 62.7 / 50.3
REAL (ours) | RL | RAIL | 93.2 / 93.4 / 82.5 | 56.0 / 54.1 / 41.1 | 63.3 / 60.2 / 46.3 | 59.3 / 56.9 / 42.2 | 67.9 / 66.2 / 53.0
Qwen3-8B-Variants
Base | None | Standard | 56.6 / 62.7 / 53.9 | 44.8 / 48.0 / 40.1 | 37.3 / 46.5 / 40.8 | 35.5 / 32.7 / 26.5 | 43.6 / 47.4 / 40.3
Base | None | RAIL | 56.7 / 65.4 / 54.1 | 45.0 / 48.3 / 38.5 | 37.3 / 46.1 / 38.6 | 35.9 / 35.5 / 27.0 | 43.7 / 48.8 / 39.5
RAFT | SFT | RAIL | 84.3 / 85.5 / 73.0 | 49.2 / 50.1 / 37.5 | 59.9 / 57.2 / 44.0 | 54.1 / 51.7 / 38.5 | 61.9 / 61.1 / 48.3
TRACT | SFT | RAIL | 94.9 / 94.7 / 84.3 | 51.4 / 51.3 / 38.5 | 50.5 / 50.4 / 38.3 | 55.8 / 58.6 / 43.6 | 63.1 / 63.8 / 51.2
Standard RL | RL | RAIL | 94.0 / 94.2 / 83.5 | 52.3 / 52.4 / 39.5 | 57.1 / 56.1 / 43.0 | 49.5 / 53.5 / 39.7 | 63.2 / 64.0 / 51.4
REAL (ours) | RL | RAIL | 92.0 / 92.1 / 85.7 | 53.8 / 53.9 / 43.1 | 60.5 / 57.8 / 48.2 | 61.7 / 60.8 / 47.1 | 67.0 / 66.0 / 56.1
Qwen3-32B-Variants
Base | None | RAIL | 63.4 / 70.8 / 56.7 | 54.3 / 60.4 / 47.2 | 50.8 / 57.4 / 45.0 | 42.5 / 46.8 / 35.3 | 52.7 / 58.8 / 46.1
RAFT | SFT | RAIL | 85.4 / 86.5 / 72.9 | 52.1 / 52.9 / 39.9 | 51.9 / 52.0 / 39.9 | 61.1 / 59.6 / 43.9 | 62.6 / 62.8 / 49.2
REAL (ours) | RL | RAIL | 91.1 / 91.7 / 85.9 | 58.9 / 58.6 / 47.4 | 65.1 / 60.7 / 51.2 | 68.9 / 69.1 / 55.2 | 71.0 / 70.0 / 60.0

Table 3. Average-of-N results at inference. Averaging the results of N generations slightly improves performance, although REAL already enjoys efficiency and strong performance at N = 1.

Setting | r ↑ | ρ ↑ | τ ↑ | RMSE ↓ | MAE ↓
N = 1 | 67.9 | 66.2 | 53.0 | 0.968 | 0.697
N = 4 | 68.1 | 66.2 | 53.0 | 0.965 | 0.695
N = 10 | 68.1 | 66.2 | 53.0 | 0.964 | 0.695

Table 4. Ablation on group size K. Larger values of K generally yield better correlation performance.

Setting | r ↑ | ρ ↑ | τ ↑ | RMSE ↓ | MAE ↓
K = 4 | 63.7 | 62.6 | 50.2 | 1.063 | 0.775
K = 8 | 67.2 | 65.0 | 52.0 | 0.943 | 0.679
K = 16 | 67.9 | 66.2 | 53.0 | 0.968 | 0.697

Table 5. Ablation on log-likelihood weight λ. The regression-aware reward alone (λ = 0) already yields excellent results, and the extra log-likelihood reward improves them further.

Setting | r ↑ | ρ ↑ | τ ↑ | RMSE ↓ | MAE ↓
TRACT | 63.2 | 62.2 | 49.8 | 1.103 | 0.813
λ = 0 | 66.0 | 64.6 | 51.8 | 1.027 | 0.736
λ = 1 | 67.9 | 66.2 | 53.0 | 0.968 | 0.697
λ = 5 | 68.1 | 65.8 | 52.7 | 0.955 | 0.719
λ = 10 | 68.1 | 66.1 | 52.9 | 0.928 | 0.699

Following the findings in (Chiang et al., 2025; Lukasik et al., 2024), we mainly report RAIL inference results, as RAIL has been shown to offer free-lunch improvements in Pearson and Spearman correlation at inference.

Configuration. We conduct experiments on 8 NVIDIA A100 GPUs for Mistral2-7B and Qwen3-8B, and 2×8 A100 GPUs for Qwen3-32B. All experiments use full-parameter fine-tuning. The learning rate is set to $5 \times 10^{-8}$ for the Mistral models and $1 \times 10^{-6}$ for the Qwen series.
The maximum response length is set to 1024. Unless otherwise specified, other hyperparameters are shared across all three evaluated LLMs. Ablation studies are carried out to analyze the impact of different hyperparameter settings on model performance.

Table 6. Ablation on dynamic sampling when group size K = 8. Partially correct prompts provide the best learning signal.

| Setting | r ↑ | ρ ↑ | τ ↑ | RMSE ↓ | MAE ↓ |
|---|---|---|---|---|---|
| 0 < acc < 1 | 67.2 | 65.0 | 52.0 | 0.943 | 0.679 |
| 0 < acc ≤ 1 | 65.6 | 64.1 | 51.3 | 1.000 | 0.711 |
| 0 ≤ acc < 1 | 65.3 | 63.4 | 50.9 | 1.001 | 0.728 |
| 0 ≤ acc ≤ 1 | 65.5 | 63.9 | 51.1 | 0.990 | 0.715 |

Table 7. Initialization from different checkpoints. Initializing from a high-quality SFT checkpoint can lead to better SOTA performance. Our algorithm works with different checkpoint types.

| Setting | r ↑ | ρ ↑ | τ ↑ | RMSE ↓ | MAE ↓ |
|---|---|---|---|---|---|
| base | 52.1 | 50.8 | 39.5 | 1.225 | 0.901 |
| init. from base | 58.8 | 56.8 | 45.1 | 1.155 | 0.793 |
| ∆ | +6.7 | +6.0 | +5.6 | -0.070 | -0.108 |
| SFT | 63.2 | 62.2 | 49.8 | 1.103 | 0.813 |
| init. from SFT | 67.9 | 66.2 | 53.0 | 0.968 | 0.697 |
| ∆ | +4.7 | +4.0 | +3.2 | -0.135 | -0.116 |

5.1. Main Results

See Tab. 2 for all the results. Across all benchmarks and model backbones, REAL consistently improves correlation with ground-truth scores compared to zero-shot inference, regression-aware supervised fine-tuning, and standard RL methods. We note that REAL performs significantly better on out-of-domain benchmarks such as Vicuna, MT, and Flask. We provide a qualitative example in Appendix D.

Average-of-N results at inference. We average the RAIL prediction $\hat{y}_\theta$ over N generations for each test prompt as the final output and then calculate the metrics shown in Tab. 3. Scaling N to 4 or 10 slightly improves performance, which indicates that REAL already achieves strong results at N = 1. This demonstrates that the model learns a robust policy capable of efficiently identifying high-reward outputs.
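For concreteness, the correlation and error metrics reported in these tables can be computed in a few lines. The sketch below is our own plain-Python illustration (the paper does not specify tooling); it uses a simplified, untied-pair form of Kendall's τ, whereas published numbers typically use the tie-corrected τ-b:

```python
from itertools import combinations
from math import sqrt

def _pearson(a, b):
    # Pearson correlation of two equal-length sequences.
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    return cov / sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))

def _ranks(a):
    # Average ranks (1-based), with ties sharing their mean rank.
    order = sorted(range(len(a)), key=lambda i: a[i])
    ranks = [0.0] * len(a)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and a[order[j + 1]] == a[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def judge_metrics(pred, gold):
    """Pearson r, Spearman rho, Kendall tau (x100, as in the tables), plus RMSE/MAE."""
    n = len(pred)
    r = _pearson(pred, gold)
    rho = _pearson(_ranks(pred), _ranks(gold))  # Spearman = Pearson on ranks
    # Simplified tau: concordant minus discordant pairs over all n(n-1)/2 pairs,
    # skipping tied pairs in the numerator (tau-a style, no tie correction).
    conc = sum(
        1 if (pred[i] - pred[j]) * (gold[i] - gold[j]) > 0 else -1
        for i, j in combinations(range(n), 2)
        if (pred[i] - pred[j]) * (gold[i] - gold[j]) != 0
    )
    tau = conc / (n * (n - 1) / 2)
    rmse = sqrt(sum((p - g) ** 2 for p, g in zip(pred, gold)) / n)
    mae = sum(abs(p - g) for p, g in zip(pred, gold)) / n
    return {"r": 100 * r, "rho": 100 * rho, "tau": 100 * tau, "rmse": rmse, "mae": mae}
```

In practice one would call library routines (e.g., scipy.stats) rather than hand-rolling these; the sketch only makes explicit what each reported column measures.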
This single-generation capability ensures peak sampling efficiency, making REAL highly suitable for low-latency, real-world deployment by eliminating the computational overhead of multi-sample averaging.

5.2. Ablation Studies

Rows shaded in gray indicate the default setting used in our experiments.

Group size K: RLOO variance reduction results in superior performance. Tab. 4 indicates a clear positive correlation between the number of sampled responses and the quality of the learned model. Increasing the group size from K = 4 to K = 16 yields consistent improvements across all correlation metrics, with the Pearson correlation r rising from 63.7 to 67.9, indicating the effectiveness of RLOO variance reduction.

Table 8. Ablation on prediction weight β. We choose β = 0.01 (see Eq. 8) for initialization from both base and SFT checkpoints.

| Setting | r ↑ | ρ ↑ | τ ↑ | RMSE ↓ | MAE ↓ |
|---|---|---|---|---|---|
| Init. from Base | | | | | |
| β = 0 | 57.7 | 56.9 | 45.1 | 1.200 | 0.831 |
| β = 0.01 | 58.8 | 56.8 | 45.1 | 1.155 | 0.793 |
| β = 1 | 57.9 | 56.4 | 44.4 | 1.184 | 0.822 |
| Init. from SFT | | | | | |
| β = 0 | 67.3 | 65.9 | 52.5 | 0.980 | 0.701 |
| β = 0.01 | 67.9 | 66.2 | 53.0 | 0.968 | 0.697 |
| β = 1 | 66.7 | 64.5 | 51.6 | 0.935 | 0.677 |

Table 9. Ablation on temperature T. T = 1 is chosen by default. The temperature parameter is applied in both the training and inference stages, for both CoT generation and score prediction.

| Setting | r ↑ | ρ ↑ | τ ↑ | RMSE ↓ | MAE ↓ |
|---|---|---|---|---|---|
| T = 0.8 | 66.8 | 64.7 | 51.9 | 0.973 | 0.725 |
| T = 1 | 67.9 | 66.2 | 53.0 | 0.968 | 0.697 |

Log-likelihood weight λ: the regression-aware reward alone excels in performance; an additional log-likelihood term gives further benefits. Tab. 5 shows that using the regression-aware reward alone (λ = 0) already significantly outperforms the TRACT baseline. Moreover, introducing the log-likelihood objective (λ = 1/5/10) provides further enhancement. This suggests that the log-likelihood component serves as a beneficial complement to the core regression objective. λ = 1 is chosen by default.
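The reward composition studied in the λ ablation can be sketched schematically. The exact functional form is given by the paper's equations (not reproduced in this section); the version below, a negative squared error term plus a λ-weighted log-likelihood bonus on the ground-truth score, is purely our illustrative assumption:

```python
def real_reward(y_hat, y_star, logp_y_star, lam=1.0):
    """Illustrative regression-aware reward (assumed form, not the paper's exact
    definition): a negative squared-error term on the predicted score, plus a
    lambda-weighted log-likelihood bonus for the ground-truth score token."""
    return -(y_hat - y_star) ** 2 + lam * logp_y_star
```

Unlike a binary accuracy reward, this form preserves the ordinal structure of scoring: predicting 4 when the ground truth is 5 is penalized far less than predicting 1.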
Dynamic sampling strategy: partially correct prompts provide the most informative learning signal. As shown in Tab. 6, we evaluate four training settings based on different dynamic sampling (Yu et al., 2025; Xu et al., 2025) strategies using group accuracy as the primary metric. Let $\{y^{(k)}\}_{k=1}^{K}$ denote a group of K generated responses for a given prompt x, and let $y^*$ be the ground-truth label. We define the group accuracy $\mathrm{acc}(x)$ as the proportion of correct responses within the group: $\mathrm{acc}(x) = \frac{1}{K}\sum_{k=1}^{K} \mathbf{1}\{y^{(k)} = y^*\}$. Aligned with (Yu et al., 2025), our results demonstrate that training exclusively on the partially correct set ($0 < \mathrm{acc} < 1$) consistently yields the best performance. This indicates that these prompts provide a rich learning signal while avoiding the gradient noise of entirely incorrect examples ($\mathrm{acc} = 0$) and the vanishing supervision of already mastered ones ($\mathrm{acc} = 1$).

Initialization checkpoints, prediction weight β, and temperature T. Tab. 7 demonstrates that REAL is robust to different starting configurations, yielding consistent improvements whether initialized from a base model or a high-quality SFT checkpoint. Tab. 8 indicates that β = 0.01 brings the best performance regardless of initialization checkpoint type. T = 1 is chosen by default, as supported by Tab. 9.

6. Related Work

Recently, employing language models as judges has gained attention as a promising paradigm to mimic the depth and granularity of human evaluation (Gu et al., 2024; Chiang & Lee, 2023; Zheng et al., 2023; Liu et al., 2023a; Zhu et al., 2024; Ouyang et al., 2022; Zhu et al., 2025). In terms of output format, we classify LLM-as-a-Judge into two types: 1. Pointwise, where the output is a score usually ranging from 1 to 5 (Chiang & Lee, 2023; Liu et al., 2023b); 2. Pairwise, where the LLM is asked to compare two responses and select the better one (Zheng et al., 2023; Wang et al., 2023; Li et al., 2023). While recent advancements such as Prometheus 1 and 2 (Kim et al., 2023; 2024) leverage proprietary models (e.g., GPT-4 (Achiam et al., 2023)) to synthesize training data for standard SFT, they primarily focus on replicating existing evaluator behaviors. In contrast, our approach, REAL, targets pointwise evaluation and formulates it as a regression problem, directly integrating the continuous nature of scoring objectives into the RL algorithm. An extended discussion of related work can be found in Sec. A.

7. Limitations

First, our formulation focuses on pointwise regression settings and does not directly address pairwise preference learning. In addition, the approach relies on self-generated Chain-of-Thought reasoning, which may inherit biases or systematic errors from base models.

8. Conclusion

We introduced REAL, a regression-aware reinforcement learning framework that employs generalized policy gradient methods to solve regression-aware objectives. Our practical implementation combines RLOO stabilization with regression-aware prediction updates. Empirically, REAL achieves significant and consistent improvements across multiple LLM-as-a-Judge benchmarks over zero-shot and SFT baselines, specifically for out-of-domain evaluation. We believe this work provides a foundation for future research on RL algorithms for LLM-as-a-Judge.

Impact Statement

This work contributes to the development of more faithful regression-based LLM-as-a-Judge systems, which can improve the reliability of automatic evaluation for large-scale language model research.
By better respecting the ordinal and numerical structure of evaluation scores, REAL has the potential to reduce noise, instability, and miscalibration in benchmark-driven model comparison, thereby supporting more reproducible and efficient research workflows. However, the use of automated evaluators raises important considerations regarding appropriate deployment. Although REAL improves numerical consistency and robustness, its predictions remain inherently model-dependent and reflect biases present in the underlying language model and training data. Consequently, such systems should not be interpreted as objective or normative arbiters of quality, particularly in high-stakes or value-laden decision-making contexts. Over-reliance on automated judgments may create a false sense of precision or neutrality when evaluation criteria are underspecified or subjective.

References

Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

Agirre, E., Cer, D., Diab, M., Gonzalez-Agirre, A., and Guo, W. *SEM 2013 shared task: Semantic textual similarity. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity, pp. 32–43, 2013.

Ahmadian, A., Cremer, C., Gallé, M., Fadaee, M., Kreutzer, J., Pietquin, O., Üstün, A., and Hooker, S. Back to basics: Revisiting REINFORCE-style optimization for learning from human feedback in LLMs. In Ku, L.-W., Martins, A., and Srikumar, V. (eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 12248–12267, Bangkok, Thailand, August 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-long.662. URL https://aclanthology.org/2024.acl-long.662/.
Bishop, C. M. and Nasrabadi, N. M. Pattern Recognition and Machine Learning, volume 4. Springer, 2006.

Bottai, M., Kim, T., Lieberman, B., Luta, G., and Peña, E. On optimal correlation-based prediction. The American Statistician, 76(4):313–321, 2022.

Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. Language models are few-shot learners, 2020. URL https://arxiv.org/abs/2005.14165.

Chiang, C.-H. and Lee, H.-Y. Can large language models be an alternative to human evaluations? In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 15607–15631, 2023.

Chiang, C.-H., Lee, H.-y., and Lukasik, M. TRACT: Regression-aware fine-tuning meets chain-of-thought reasoning for LLM-as-a-judge. In Che, W., Nabende, J., Shutova, E., and Pilehvar, M. T. (eds.), Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2934–2952, Vienna, Austria, July 2025. Association for Computational Linguistics. ISBN 979-8-89176-251-0. doi: 10.18653/v1/2025.acl-long.147. URL https://aclanthology.org/2025.acl-long.147/.

Chiang, W.-L., Li, Z., Lin, Z., Sheng, Y., Wu, Z., Zhang, H., Zheng, L., Zhuang, S., Zhuang, Y., Gonzalez, J. E., Stoica, I., and Xing, E. P. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality, March 2023. URL https://lmsys.org/blog/2023-03-30-vicuna/.
Deshpande, A., Jimenez, C., Chen, H., Murahari, V., Graf, V., Rajpurohit, T., Kalyan, A., Chen, D., and Narasimhan, K. C-STS: Conditional semantic textual similarity. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 5669–5690, 2023.

Dong, H., Xiong, W., Goyal, D., Zhang, Y., Chow, W., Pan, R., Diao, S., Zhang, J., Shum, K., and Zhang, T. RAFT: Reward ranked finetuning for generative foundation model alignment. Transactions on Machine Learning Research, 2023. ISSN 2835-8856. URL https://openreview.net/forum?id=m7p5O7zblY.

Google. Bard: An experimental AI chatbot. https://blog.google/technology/ai/bard-google-ai-search-updates/, 2023.

Gu, J., Jiang, X., Shi, Z., Tan, H., Zhai, X., Xu, C., Li, W., Shen, Y., Ma, S., Liu, H., et al. A survey on LLM-as-a-judge. arXiv preprint arXiv:2411.15594, 2024.

Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.

Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., de las Casas, D., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., Lavaud, L. R., Lachaux, M.-A., Stock, P., Scao, T. L., Lavril, T., Wang, T., Lacroix, T., and Sayed, W. E. Mistral 7B, 2023. URL https://arxiv.org/abs/2310.06825.

Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.

Kendall, M. G. A new measure of rank correlation. Biometrika, 30(1-2):81–93, 1938. ISSN 0006-3444. doi: 10.1093/biomet/30.1-2.81. URL https://doi.org/10.1093/biomet/30.1-2.81.

Kim, S., Shin, J., Cho, Y., Jang, J., Longpre, S., Lee, H., Yun, S., Shin, S., Kim, S., Thorne, J., et al. Prometheus: Inducing fine-grained evaluation capability in language models. In The Twelfth International Conference on Learning Representations, 2023.

Kim, S., Suk, J., Longpre, S., Lin, B. Y., Shin, J., Welleck, S., Neubig, G., Lee, M., Lee, K., and Seo, M. Prometheus 2: An open source language model specialized in evaluating other language models. In Al-Onaizan, Y., Bansal, M., and Chen, Y.-N. (eds.), Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 4334–4353, Miami, Florida, USA, November 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.emnlp-main.248. URL https://aclanthology.org/2024.emnlp-main.248/.

Kool, W., van Hoof, H., and Welling, M. Buy 4 REINFORCE samples, get a baseline for free!, 2019. URL https://openreview.net/forum?id=r1lgTGL5DE.

Li, J., Sun, S., Yuan, W., Fan, R.-Z., Zhao, H., and Liu, P. Generative judge for evaluating alignment. arXiv preprint arXiv:2310.05470, 2023.

Liu, Y., Iter, D., Xu, Y., Wang, S., Xu, R., and Zhu, C. G-Eval: NLG evaluation using GPT-4 with better human alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2023a.

Liu, Y., Iter, D., Xu, Y., Wang, S., Xu, R., and Zhu, C. G-Eval: NLG evaluation using GPT-4 with better human alignment. arXiv preprint arXiv:2303.16634, 2023b.

Lukasik, M., Narasimhan, H., Menon, A. K., Yu, F., and Kumar, S. Regression aware inference with LLMs. In Al-Onaizan, Y., Bansal, M., and Chen, Y.-N. (eds.), Findings of the Association for Computational Linguistics: EMNLP 2024, pp. 13667–13678, Miami, Florida, USA, November 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.findings-emnlp.799. URL https://aclanthology.org/2024.findings-emnlp.799/.
Lukasik, M., Meng, Z., Narasimhan, H., Chang, Y.-W., Menon, A. K., Yu, F., and Kumar, S. Better autoregressive regression with LLMs via regression-aware fine-tuning. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=xGs7Ch3Vyo.

Nguyen, K., Daumé III, H., and Boyd-Graber, J. Reinforcement learning for bandit neural machine translation with simulated human feedback. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 1464–1474, 2017.

OpenAI. GPT-3.5-Turbo API documentation. https://platform.openai.com/docs/models/gpt-3.5-turbo, 2023. Version: gpt-3.5-turbo-0613.

Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.

Pearson, K. VII. Note on regression and inheritance in the case of two parents. Proceedings of the Royal Society of London, 58(347-352):240–242, 1895.

Schulman, J., Heess, N., Weber, T., and Abbeel, P. Gradient estimation using stochastic computation graphs. Advances in Neural Information Processing Systems, 28, 2015.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.

Sheng, G., Zhang, C., Ye, Z., Wu, X., Zhang, W., Zhang, R., Peng, Y., Lin, H., and Wu, C. HybridFlow: A flexible and efficient RLHF framework. arXiv preprint arXiv:2409.19256, 2024.

Song, Z., Zhang, B., Zhang, Q.-W., Yin, D., Sun, X., and Li, C. PoLi-RL: A point-to-list reinforcement learning framework for conditional semantic textual similarity. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=sLcRCH1U68.

Spearman, C. The proof and measurement of association between two things. 1961.

Sutton, R. S., Barto, A. G., et al. Reinforcement learning. Journal of Cognitive Neuroscience, 11(1):126–134, 1999.

Tang, Y., Wang, S., Madaan, L., and Munos, R. Beyond verifiable rewards: Scaling reinforcement learning in language models to unverifiable data. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=pc6M9h3T9m.

Taori, R., Gulrajani, I., Zhang, T., Dubois, Y., Li, X., Guestrin, C., Liang, P., and Hashimoto, T. B. Stanford Alpaca: An instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca, 2023.

Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint, 2023.

Wang, Y., Yu, Z., Zeng, Z., Yang, L., Wang, C., Chen, H., Jiang, C., Xie, R., Wang, J., Xie, X., et al. PandaLM: An automatic evaluation benchmark for LLM instruction tuning optimization. arXiv preprint arXiv:2306.05087, 2023.

Whitehouse, C., Wang, T., Yu, P., Li, X., Weston, J., Kulikov, I., and Saha, S. J1: Incentivizing thinking in LLM-as-a-judge via reinforcement learning. arXiv preprint arXiv:2505.10320, 2025.

Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3):229–256, 1992.

Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., and Jiang, D. WizardLM: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244, 2023.

Xu, Y. E., Savani, Y., Fang, F., and Kolter, J. Z. Not all rollouts are useful: Down-sampling rollouts in LLM reinforcement learning. arXiv preprint, 2025.

Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.

Yu, Q., Zhang, Z., Zhu, R., Yuan, Y., Zuo, X., Yue, Y., Dai, W., Fan, T., Liu, G., Liu, L., et al. DAPO: An open-source LLM reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025.

Zausinger, J., Pennig, L., Chlodny, K., Limbach, V., Ketteler, A., Prein, T., Singh, V. M., Danziger, M. M., and Born, J. Regress, don't guess: A regression-like loss on number tokens for language models. In Proceedings of the 42nd International Conference on Machine Learning (ICML), 2025. arXiv:2411.02083.

Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E., et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems, 36:46595–46623, 2023.

Zhu, K., Zhao, Q., Chen, H., Wang, J., and Xie, X. PromptBench: A unified library for evaluation of large language models. Journal of Machine Learning Research, 25(254):1–22, 2024.

Zhu, L., Wang, X., and Wang, X. JudgeLM: Fine-tuned large language models are scalable judges. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=xsELpEPn4A.

A. Related Work

Reinforcement Learning in LLMs. Reinforcement learning (Sutton et al., 1999) has emerged as a key technique for fine-tuning large language models (LLMs), initially aimed at aligning model outputs with human preferences (Ouyang et al., 2022).
In this setting, the RL formulation typically involves a single-generation, single-action structure (Nguyen et al., 2017). Popular approaches include REINFORCE (Williams, 1992) and Proximal Policy Optimization (PPO) (Schulman et al., 2017). RLOO (Kool et al., 2019) is an unbiased policy gradient estimator that reduces variance without requiring a learned value function. It improves upon standard REINFORCE by using a leave-one-out baseline, where each sample's advantage is computed relative to the average reward of the other samples in the batch. Beyond RLOO, actor-critic methods such as PPO leverage intermediate state value functions to reduce gradient variance, at the cost of introducing bias into the estimator. Unlike RLOO, PPO typically requires an additional critic model, which increases computational overhead; during training, updates of the generative policy and the critic are interleaved. Group Relative Policy Optimization (GRPO) (Guo et al., 2025) further removes the need for an explicit critic by normalizing rewards within a group of samples generated for the same prompt. While early RLHF methods focused on preference alignment, recent efforts have used reinforcement learning to explicitly incentivize desirable capabilities, such as step-by-step reasoning (Guo et al., 2025). This shift reflects a broader trend toward leveraging reward signals not only for alignment but also for improving the utility and robustness of LLMs in complex tasks. Recent studies (Ahmadian et al., 2024) have shown that RLOO can outperform biased actor-critic methods such as PPO on key baselines in large-scale settings. While we mainly use RLOO variance reduction techniques, other stabilization approaches can be explored in future work; we note that this choice is not central to the core contribution of our paper, i.e., the regression-aware RL objective and algorithm.
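The leave-one-out baseline described above is simple to make concrete; a minimal sketch (function name ours) for a group of rewards sampled for the same prompt:

```python
def rloo_advantages(rewards):
    """RLOO (Kool et al., 2019): each sample's advantage is its own reward minus
    the mean reward of the other K-1 samples drawn for the same prompt."""
    K = len(rewards)
    if K < 2:
        raise ValueError("RLOO needs at least two samples per prompt")
    total = sum(rewards)
    # baseline for sample i excludes reward i itself, keeping the estimator unbiased
    return [r - (total - r) / (K - 1) for r in rewards]
```

Because each sample's baseline is independent of that sample's own reward, the gradient estimator stays unbiased; the advantages also sum to zero within each group, which is one way to see the variance-reduction effect.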
Regression-Aware LLMs. Standard LLM fine-tuning uses a cross-entropy loss (Brown et al., 2020), which is suboptimal for regression tasks as it treats all incorrect numerical tokens equally, regardless of their numeric distance from the target. Several methods have been proposed to make LLMs more regression-aware. Lukasik et al. (2024) introduced Regression-Aware Inference for Language models (RAIL), an inference-time technique that applies a Bayes-optimal decision rule to the model's output probabilities; for the squared error loss, this rule simplifies to calculating the expected value of the numerical prediction. A follow-up work, Regression-Aware Fine-Tuning (RAFT) by Lukasik et al. (2025), incorporated this logic directly into the training process: RAFT fine-tunes the model using a squared error loss between the ground-truth score and the expected-value predictor from RAIL. Zausinger et al. (2025) also applied a regression loss over numerical tokens for regression tasks. More recently, TRACT (Chiang et al., 2025) builds on this by combining the numerical scoring ability of RAFT with CoT reasoning. TRACT is a two-stage SFT approach, which 1) first applies the RAFT loss on the training dataset, and then 2) uses self-generated CoT from the first-stage model, appended with the ground-truth labels, to train a base model with the RAFT loss again. The first stage acts as a CoT rewriter, addressing the distribution gap between the training dataset's CoT and the model's own outputs. We note both a flaw in TRACT and a connection between their method and ours: 1. TRACT does not evaluate the quality of the self-generated CoT and treats all generated CoTs as ground truth for the second stage; in contrast, our method evaluates CoT quality using the REAL reward. 2. In the second stage, TRACT appends the ground truth and then applies the RAFT loss.
This gradient on the prediction is identical to our prediction update term. The TRACT method can thus be viewed as a remedy for SFT and an intermediate step between SFT and RL. Our method, REAL, is the first to formalize RL training with regression-aware objectives for the LLM-as-a-Judge setting. Concurrently, PoLi-RL (Song et al., 2026) explores a heuristic listwise reward in the Conditional Semantic Textual Similarity (C-STS) setting (Deshpande et al., 2023; Agirre et al., 2013), which focuses on semantic similarity between texts in specific contexts. While related in spirit, C-STS presents a different and generally more constrained setup compared to LLM-as-a-Judge, which involves broader evaluations based on alignment with task instructions and overall response quality. Additionally, it is important to note that Regression-Aware Fine-Tuning (RAFT) is distinct from Reward-Ranked Fine-Tuning (Dong et al., 2023), an alignment method that iteratively fine-tunes models on a filtered subset of high-reward samples. Our work further extends the regression-aware paradigm to reinforcement learning and achieves superior performance in terms of correlation metrics.

LLM-as-a-Judge. Recently, employing language models as judges has gained attention as a promising paradigm to mimic the depth and granularity of human evaluation (Gu et al., 2024; Chiang & Lee, 2023; Zheng et al., 2023; Liu et al., 2023a; Zhu et al., 2024; Ouyang et al., 2022; Zhu et al., 2025). In terms of output format, we classify LLM-as-a-Judge into two types: 1. Pointwise, where the output is a score usually ranging from 1 to 5 (Chiang & Lee, 2023; Liu et al., 2023b); 2. Pairwise, where the LLM is asked to compare two responses and select the better one (Zheng et al., 2023; Wang et al., 2023; Li et al., 2023).
While recent advancements such as Prometheus 1 and 2 (Kim et al., 2023; 2024) leverage proprietary models (e.g., GPT-4 (Achiam et al., 2023)) to synthesize training data for standard SFT, they primarily focus on replicating existing evaluator behaviors. Similarly, J1 (Whitehouse et al., 2025) advances the LLM-as-a-Judge paradigm by applying standard RL and transforming existing datasets into prompt-response pairs with verifiable rewards. In contrast, our approach, REAL, targets pointwise evaluation and formulates it as a regression problem, directly integrating the continuous nature of scoring objectives into the RL algorithm.

B. Proofs of the Theoretical Results

To prove Lemma 3.1, we restate and utilize the following two results from statistical theory: 1) we show that the conditional expectation is the optimal estimator for maximizing Pearson correlation (Lemma B.1); 2) we establish that minimizing the regression-aware loss (MSE) recovers the conditional expectation (Lemma B.2).

B.1. Additional Lemmas

Lemma B.1 (Optimality of the Posterior Mean for Pearson Correlation).

Measure space and generative graph. Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space on the tuple $(x, c, y^*)$ with the following data-generating process:

$$x \sim P_{\mathcal{D}}, \qquad c \sim \pi_\theta(\cdot \mid x), \qquad y^* \sim P(\cdot \mid x), \quad \text{i.e., } c \perp y^* \mid x.$$

This induces the joint distribution

$$P_\theta(x, c, y^*) = P_{\mathcal{D}}(x)\, \pi_\theta(c \mid x)\, P(y^* \mid x), \tag{9}$$

and expectations $\mathbb{E}[\cdot]$ are taken w.r.t. $P_\theta(x, c, y^*)$ unless stated otherwise.

Conditional mean. Define the conditional mean (posterior mean) of the label given $(x, c)$ as

$$\mu(x, c) \triangleq \mathbb{E}[y^* \mid x, c]. \tag{10}$$

Under the conditional independence $c \perp y^* \mid x$, we have $P(y^* \mid x, c) = P(y^* \mid x)$ and thus

$$\mu(x, c) = \mathbb{E}[y^* \mid x, c] = \mathbb{E}[y^* \mid x]. \tag{11}$$

Pearson correlation objective. Let $\hat{y}(x, c) \in \mathbb{R}$ be any measurable estimator with finite second moment.
Assume the Pearson correlation is well-defined and non-degenerate:

$$\mathrm{Var}(y^*) > 0, \quad \mathrm{Var}(\hat{y}(x, c)) > 0, \quad \mathrm{Var}(\mu(x, c)) > 0. \tag{12}$$

Define the Pearson correlation as

$$\rho(U, V) \triangleq \frac{\mathrm{Cov}(U, V)}{\sqrt{\mathrm{Var}(U)\, \mathrm{Var}(V)}}. \tag{13}$$

Claim. The set of estimators that maximize the Pearson correlation $\rho(\hat{y}(x, c), y^*)$ (over all such $\hat{y}$) is exactly the set of positive affine transformations of $\mu(x, c)$:

$$\hat{y}^*(x, c) \in \arg\max_{\hat{y}} \rho\big(\hat{y}(x, c), y^*\big) \iff \hat{y}^*(x, c) = a\, \mu(x, c) + b \ \text{ for some } a > 0,\ b \in \mathbb{R}. \tag{14}$$

Remark (invariance and conditional independence). Pearson correlation is invariant to positive scaling and shifting: $\rho(\hat{y}, y^*) = \rho(a\hat{y} + b, y^*)$ for any $a > 0$ and $b \in \mathbb{R}$; hence the maximizer is not unique. Moreover, since $c \perp y^* \mid x$ in (9), $\mu(x, c)$ reduces to $\mathbb{E}[y^* \mid x]$ as in (11); we retain the $(x, c)$ notation to match the policy-generated reasoning variable used in post-training. This result follows from Cauchy–Schwarz arguments; see, e.g., (Bottai et al., 2022).

Lemma B.2 (Optimality of the Regression Objective). The unique global minimizer of the squared error loss $R(\theta) = \mathbb{E}[(\hat{y}_\theta(x, c) - y^*)^2]$ is the conditional expectation:

$$\hat{y}_\theta(x, c) = \mathbb{E}[y^* \mid x, c]. \tag{15}$$

This is a standard result in Bayesian decision theory (see Bishop & Nasrabadi, 2006).

B.2. Proof of Lemma 3.1

By Lemma B.2, the estimator that minimizes the squared error loss is $\hat{y}^* = \mathbb{E}[y^* \mid x, c]$. By setting $a = 1$ and $b = 0$ in Lemma B.1, it follows that this specific estimator is also a maximizer of the Pearson correlation. Thus, the regression objective is a sufficient condition for achieving the optimal correlation estimator. □

B.3. Proof of Lemma B.1

Setup and computation graph.
Let (X, C, Y*) be random variables on a probability space (Ω, F, P) with the following data-generating process:

X ∼ P_D,  C ∼ π_θ(· | X),  Y* ∼ P(· | X).

Equivalently, the conditional independence relation is

C ⊥ Y* | X, (16)

which corresponds to the computation graph C ← X → Y*. This induces the joint distribution

P_θ(x, c, y*) = P_D(x) π_θ(c | x) P(y* | x). (17)

All expectations E[·] below are with respect to P_θ(x, c, y*) unless stated otherwise.

Conditional mean. Define the conditional mean (posterior mean)

μ(x, c) ≜ E[Y* | X = x, C = c]. (18)

Under (16), we have P(y* | x, c) = P(y* | x) and hence

μ(x, c) = E[Y* | X = x, C = c] = E[Y* | X = x]. (19)

We keep the notation μ(x, c) because our estimator is explicitly a function of both (x, c).

Remark (role of the reasoning variable c). In the supervised dataset, only (x, y*) is observed; the reasoning chain c is not a supervised variable and is instead generated by the model/policy via π_θ(c | x) during training and inference. Under the statistical model (16), conditioning on c does not reveal additional information about the label beyond x, so the Bayes-optimal conditional mean reduces as in (19). Nevertheless, in our post-training procedure the predictor we optimize is of the form ŷ_θ(x, c), i.e., it is evaluated on realizations of (x, c) sampled from P_D(x) π_θ(c | x). The lemma therefore characterizes the correlation-optimal form within the space of functions of (x, c) (with the correlation taken over the induced joint distribution over (x, c, y*)).

Goal. Let ŷ(x, c) be any measurable real-valued estimator with finite second moment. Assume Pearson correlation is well-defined and non-degenerate:

Var(Y*) > 0,  Var(ŷ(X, C)) > 0,  Var(μ(X, C)) > 0.
(20)

We show that any maximizer of ρ(ŷ(X, C), Y*) must be of the form ŷ*(x, c) = a μ(x, c) + b with a > 0.

Step 1: Covariance with Y* reduces to covariance with μ(X, C). Recall the Pearson correlation definition

ρ(U, V) ≜ Cov(U, V) / √(Var(U) Var(V)). (21)

We start from the covariance term:

Cov(ŷ(X, C), Y*) = E[ŷ(X, C) Y*] − E[ŷ(X, C)] E[Y*]. (22)

Using the law of iterated expectations and conditioning on (X, C),

E[ŷ(X, C) Y*] = E[ E[ŷ(X, C) Y* | X, C] ]. (23)

At this point, it is important to note that ŷ(X, C) is a (measurable) function of (X, C); hence it is (X, C)-measurable and acts as a constant inside the conditional expectation given (X, C). Therefore,

E[ŷ(X, C) Y* | X, C] = ŷ(X, C) E[Y* | X, C] = ŷ(X, C) μ(X, C), (24)

and substituting back into (23) yields

E[ŷ(X, C) Y*] = E[ŷ(X, C) μ(X, C)]. (25)

Similarly,

E[Y*] = E[ E[Y* | X, C] ] = E[μ(X, C)]. (26)

Plugging (25)–(26) into (22) gives

Cov(ŷ(X, C), Y*) = E[ŷ(X, C) μ(X, C)] − E[ŷ(X, C)] E[μ(X, C)] = Cov(ŷ(X, C), μ(X, C)). (27)

This identity holds irrespective of whether Y* depends on C; in our setting, μ(X, C) further reduces to E[Y* | X] by (19).

Step 2: Maximizing correlation with Y* is equivalent to maximizing correlation with μ(X, C). Using (27),

ρ(ŷ(X, C), Y*) = Cov(ŷ(X, C), μ(X, C)) / (σ_ŷ σ_{Y*})
              = [Cov(ŷ(X, C), μ(X, C)) / (σ_ŷ σ_μ)] · (σ_μ / σ_{Y*})
              = ρ(ŷ(X, C), μ(X, C)) · σ_μ / σ_{Y*}, (28)

where σ_ŷ = √Var(ŷ(X, C)) and σ_μ = √Var(μ(X, C)). By (20), the factor σ_μ / σ_{Y*} is a positive constant that does not depend on the choice of ŷ.
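Before drawing the conclusion of Step 2, the identities above can be checked numerically on a small synthetic model. The sketch below (illustrative distributions and variable names, not part of the paper's experiments) draws (x, c, y*) with c ⊥ y* | x and confirms that Cov(ŷ, y*) ≈ Cov(ŷ, μ) as in (27), and that the correlation factorizes as in (28):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Data-generating process with c independent of y* given x (Eq. 16):
x = rng.normal(size=n)                    # x ~ P_D
c = x + rng.normal(size=n)                # c ~ pi_theta(. | x)
y_star = 2.0 * x + rng.normal(size=n)     # y* ~ P(. | x)

mu = 2.0 * x                              # posterior mean E[y* | x, c] = E[y* | x] = 2x
y_hat = np.tanh(x) + 0.3 * c              # an arbitrary estimator y_hat(x, c)

def cov(u, v):
    return float(np.mean(u * v) - np.mean(u) * np.mean(v))

def rho(u, v):
    return cov(u, v) / np.sqrt(cov(u, u) * cov(v, v))

# Step 1 (Eq. 27): Cov(y_hat, y*) equals Cov(y_hat, mu) up to Monte Carlo error.
assert abs(cov(y_hat, y_star) - cov(y_hat, mu)) < 1e-2

# Step 2 (Eq. 28): rho(y_hat, y*) = rho(y_hat, mu) * sigma_mu / sigma_{y*}.
lhs = rho(y_hat, y_star)
rhs = rho(y_hat, mu) * np.sqrt(cov(mu, mu) / cov(y_star, y_star))
assert abs(lhs - rhs) < 5e-3
```

Both checks hold up to Monte Carlo error for any choice of ŷ, which is exactly why the argument can restrict attention to correlation with μ.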
Hence, maximizing ρ(ŷ(X, C), Y*) is equivalent to maximizing ρ(ŷ(X, C), μ(X, C)).

Step 3: Correlation is maximized by a positive affine transformation of μ(X, C). For a random variable A, let à ≜ A − E[A] denote its centered version. Then

Cov(ŷ(X, C), μ(X, C)) = E[(ŷ − E[ŷ])(μ − E[μ])]. (29)

By the Cauchy–Schwarz inequality in L²,

E[(ŷ − E[ŷ])(μ − E[μ])] ≤ √(E[(ŷ − E[ŷ])²]) √(E[(μ − E[μ])²]) = σ_ŷ σ_μ. (30)

Dividing both sides by σ_ŷ σ_μ shows

ρ(ŷ(X, C), μ(X, C)) ≤ 1, (31)

with equality if and only if ŷ − E[ŷ] = a (μ − E[μ]) almost surely for some a ≥ 0. Equivalently,

ŷ(X, C) = a μ(X, C) + b  a.s. (32)

for some a ≥ 0 and b ∈ ℝ. If a = 0 then Var(ŷ(X, C)) = 0, which is excluded by (20); thus a > 0. This proves Lemma B.1.

Connection to regression objectives. Since the Bayes-optimal minimizer of expected squared loss is μ(x, c) = E[Y* | x, c] (which reduces to E[Y* | x] under (16)), a regression-aware objective that targets the conditional mean yields an estimator that is also optimal for Pearson correlation, up to a positive affine transformation. A closely related formulation appears in Bottai et al. (2022).

B.4. Proof of Lemma B.2

Optimality of MSE for Conditional Expectation. Let the risk function for the regression-aware loss be defined over the joint distribution of inputs x, generated chains c ∼ π_θ(· | x), and ground-truth labels y* ∼ P(· | x):

R(ŷ) = E_{x,c,y*}[(ŷ(x, c) − y*)²]. (33)

Let μ(x, c) = E[y* | x, c] denote the conditional expectation of the label given the input and reasoning chain.
We can decompose the squared error term by adding and subtracting μ(x, c):

E[(ŷ − y*)² | x, c] = E[(ŷ − μ(x, c) + μ(x, c) − y*)² | x, c]
                   = E[(ŷ − μ(x, c))² | x, c] + E[(μ(x, c) − y*)² | x, c] + 2 (ŷ − μ(x, c)) E[μ(x, c) − y* | x, c], (34)

where the last conditional expectation equals zero. The cross-term vanishes because E[y* | x, c] = μ(x, c) by definition. The total risk is thus:

R(ŷ) = E_{x,c}[(ŷ(x, c) − μ(x, c))²] + E_{x,c}[Var(y* | x, c)]. (35)

The second term, E_{x,c}[Var(y* | x, c)], represents irreducible noise and is independent of the estimator ŷ. Therefore, minimizing the risk R(ŷ) is equivalent to minimizing the first term, which is non-negative and equals zero if and only if:

ŷ(x, c) = μ(x, c) = E[y* | x, c]. (36)

Thus, the global minimizer of the regression-aware loss is the conditional expectation.

B.5. Proof of Lemma 4.1

To derive the gradient, we expand the inner expectation as an integral over the reasoning trajectories c ∼ π_θ(c | x):

L(θ) = E_{(x,y*)∼D} ∫_c π_θ(c | x) r(θ, x, c) dc.

Applying the gradient operator ∇_θ and using the product rule for differentiation:

∇_θ L(θ) = E_{(x,y*)∼D} ∫_c ∇_θ [π_θ(c | x) r(θ, x, c)] dc
        = E_{(x,y*)∼D} ∫_c [(∇_θ π_θ(c | x)) r(θ, x, c) + π_θ(c | x)(∇_θ r(θ, x, c))] dc.

Term 1: CoT Policy Update. Using the log-derivative trick, ∇_θ π_θ(c | x) = π_θ(c | x) ∇_θ log π_θ(c | x), the first part of the sum becomes:

E_{(x,y*)∼D} ∫_c π_θ(c | x) [r(θ, x, c) ∇_θ log π_θ(c | x)] dc = E_{(x,y*)∼D, c∼π_θ}[r(θ, x, c) ∇_θ log π_θ(c | x)].

This term performs reinforcement learning on the reasoning process, shifting the policy's mass toward trajectories that yield higher rewards.

Term 2: Prediction Update. The second part accounts for the explicit dependence of the reward on θ.
Within this integral, the trajectory c is a sample from the current policy (effectively a stopped-gradient c):

E_{(x,y*)∼D} ∫_c π_θ(c | x) [∇_θ r(θ, x, c)] dc = E_{(x,y*)∼D, c∼π_θ}[∇_θ r(θ, x, c)].

This term enables direct backpropagation through the differentiable components of the reward, mapping the predicted ŷ_θ(x, c) to the target value y*.

C. Algorithm and Experiments

C.1. Response Length and Entropy

We note that the response length increases and the entropy decreases throughout REAL training, as shown in Fig. 4.

Figure 4. Response length and entropy during REAL training. Response length increases, and the per-token entropy of the policy model decreases steadily.

C.2. Comparison with JEPO

We conduct a direct comparison with JEPO, as shown in Tab. 10. JEPO does not perform as effectively as our proposed approach on regression-aware tasks.

Table 10. Comparison with JEPO with Mistral2-7B.

Method | FB Bench (r / ρ / τ) | FLASK (r / ρ / τ) | Vic. Bench (r / ρ / τ) | MT Bench (r / ρ / τ) | Average (r / ρ / τ)
JEPO   | 92.5 / 92.9 / 81.6   | 54.3 / 52.6 / 39.6 | 61.7 / 59.7 / 45.9     | 55.7 / 54.9 / 39.8   | 66.3 / 64.8 / 51.8
REAL   | 93.2 / 93.4 / 82.5   | 56.0 / 54.1 / 41.1 | 63.3 / 60.2 / 46.3     | 59.3 / 56.9 / 42.2   | 67.9 / 66.2 / 53.0

C.3. REAL Algorithm

See Algo. 2 for detailed implementation.

Algorithm 2 REAL: Regression-Aware Reinforcement Learning
Require: dataset D, policy π_θ, log-likelihood weight λ, prediction weight β, group size K, learning rate η.
1: while not converged do
2:   Sample batch B = {(x, y*)} ∼ D.
3:   /* Step 1: Sample K generations with CoTs */
4:   Sample K reasoning chains {c_1, ..., c_K} ∼ π_θ(· | x).
5:   Compute rewards r_REAL(θ, x, c_i) via Eq. 5.
6:   /* Step 2: Compute CoT Update term and Prediction Update term */
7:   Compute advantage Ã^(i)_REAL.
8:   Forward the generation again to acquire the gradient for both CoT and prediction tokens.
9:   Aggregate terms to estimate ∇_θ L^RLOO_REAL (Eq. 8).
10:   /* Step 3: Optimization */
11:   Update parameters: θ ← θ + η · (1/|B|) Σ ∇_θ L^RLOO_REAL.
12: end while

D. Qualitative Results

See below for an example test prompt and REAL's output.

Sample Question (Vicuna Bench)

Task Description: An instruction (might include an Input inside it), a response to evaluate, a reference answer that gets a score of 5, and a score rubric representing a evaluation criteria are given.
1. Write a detailed feedback that assess the quality of the response strictly based on the given score rubric, not evaluating in general.
2. After writing a feedback, write a score that is an integer between 1 and 5. You should refer to the score rubric.
3. The output format should look as follows: "Feedback: (write a feedback for criteria) [RESULT] (an integer number between 1 and 5)"
4. Please do not generate any other opening, closing, and explanations.

The instruction to evaluate: How can I improve my time management skills?

Response to evaluate: Improving your time management skills can help you to be more productive, reduce stress, and achieve your goals. Here are some tips to help you improve your time management skills:
1. Prioritize your tasks: Make a to-do list at the beginning of each day and prioritize the most important tasks. This will help you to focus on what's most important and avoid wasting time on low-priority tasks.
2. Use a calendar or planner: Use a calendar or planner to schedule your tasks and appointments. This will help you to stay organized and ensure that you have enough time for everything you need to do.
3. Set goals: Set realistic goals for yourself and break them down into smaller, manageable tasks. This will help you to stay focused and motivated.
4. Eliminate distractions: Identify the things that distract you the most, such as social media or email notifications, and find ways to eliminate them.
This could be as simple as turning off your phone or logging out of social media during work hours.
5. Take breaks: Taking regular breaks can help you to stay focused and avoid burnout. Take a few minutes every hour to stretch, go for a walk, or do something that you enjoy.
6. Learn to say no: Don't overcommit yourself by taking on too many tasks or responsibilities. Learn to say no to requests that don't align with your goals or values.
7. Review and adjust: Regularly review your schedule and priorities to ensure that you are making the most of your time. Adjust your plans as needed to accommodate changes in your workload or personal life.
By implementing these tips, you can improve your time management skills and become more productive and efficient.

Reference Answer (Score 5): To improve your time management skills, start by setting clear goals. Understand what you want to achieve and by when. This will give you a roadmap to follow. Next, prioritize your tasks based on their importance and urgency. Use tools like the Eisenhower Box to help you with this. Break down larger tasks into smaller, manageable parts to prevent feeling overwhelmed. Establish a routine and stick to it. This could be a daily, weekly, or monthly routine depending on your goals. Use tools like calendars, planners, or apps like Google Calendar to schedule your tasks and set reminders. Avoid multitasking as it can lead to mistakes and inefficiency. Instead, focus on one task at a time until it's complete. Finally, take regular breaks to avoid burnout. Techniques like the Pomodoro Technique, where you work for 25 minutes then take a 5-minute break, can be very effective. Remember, managing your time effectively also involves taking care of your mental health.

Score Rubrics: Does the response provide detailed and actionable steps grounded on real-world scenarios?
Score 1: The response lacks detail and provides no actionable steps or real-world context.
Score 2: The response provides limited detail and few actionable steps, with minimal real-world grounding.
Score 3: The response has some detailed and actionable steps, though not all are grounded in real-world scenarios.
Score 4: The response is detailed with mostly actionable steps, and the majority are grounded in real-world scenarios.
Score 5: The response is comprehensive, offering detailed and actionable steps, all of which are firmly grounded on real-world scenarios.
Feedback:

Table 11. Example generation by REAL. GPT4 output serves as the reference ground-truth answer in the dataset.

GPT4 Output: The response is comprehensive and offers a series of detailed and actionable steps for improving time management skills. Steps such as prioritizing tasks, using a calendar or planner, setting goals, eliminating distractions, taking breaks, learning to say no, and reviewing and adjusting schedules can be directly implemented by anyone. These steps are firmly grounded on real-world scenarios, making them practical and beneficial for individuals looking to enhance their time management skills.

REAL Output: The response provided is comprehensive and offers detailed and actionable steps to improve time management skills. It covers various aspects such as setting goals, prioritizing tasks, creating a schedule, breaking down tasks, eliminating distractions, delegating, using productivity tools, time-blocking, learning to say no, and taking breaks. The response also emphasizes the importance of self-care and mental health, which is essential in real-world scenarios. Score: 5

Greedy decoding: 5, RAIL inference: 4.99
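The non-integer "RAIL inference" value above suggests a regression-aware read-out: rather than taking the greedy score token, one reports the expectation of the score under the model's distribution over score tokens. The sketch below is an illustrative reconstruction of this style of inference, not the paper's implementation; the probability values and the function interface are hypothetical:

```python
def regression_aware_score(token_probs: dict[str, float]) -> float:
    """Expected score under the model's distribution over score tokens.

    token_probs maps score tokens ("1".."5") to their probabilities at the
    position where the judge emits its score; probabilities are renormalized
    over the candidate score tokens.
    """
    total = sum(token_probs.values())
    return sum(int(tok) * p for tok, p in token_probs.items()) / total

# Hypothetical distribution with most mass on "5" and a little on "4":
probs = {"1": 0.001, "2": 0.001, "3": 0.001, "4": 0.003, "5": 0.994}
print(round(regression_aware_score(probs), 2))  # prints 4.99
```

Under such a read-out, a near-certain "5" yields a score slightly below 5 (here 4.99), matching the fractional value reported in Tab. 11.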
