InSPO: Unlocking Intrinsic Self-Reflection for LLM Preference Optimization

Reading time: 39 minutes

📝 Original Info

  • Title: InSPO: Unlocking Intrinsic Self-Reflection for LLM Preference Optimization
  • ArXiv ID: 2512.23126
  • Date: 2025-12-29
  • Authors: Yu Li, Tian Lan, Zhengling Qi

📝 Abstract

Direct Preference Optimization (DPO) and its variants have become the standard for aligning Large Language Models (LLMs) due to their simplicity and offline stability. However, we identify two fundamental limitations that undermine the reliability and optimality of DPO and its equivalents. First, the optimized policy derived by current methods lacks invariance: It changes with respect to modeling choices such as the scalarization function (e.g., the logistic function in the Bradley-Terry model) or the reference policy. This may yield brittle behaviors of the learned policy as an artifact of parameterization rather than a reflection of true human preference, while an optimal policy must satisfy the invariance property. Second, we show that the derived policy of most existing methods is theoretically suboptimal because it fails to fully capitalize on the comparative information embedded in the pairwise preference data. This restricts the model's ability to "compare and contrast" responses, illuminating a unique opportunity for triggering intrinsic self-reflection. In this work, we propose a novel family of Intrinsic Self-reflective Preference Optimization (InSPO) methods, which address these two limitations. We first derive a globally optimal policy that conditions on both the context and the alternative response under the pairwise preference data setting, which explicitly formalizes the novel notion of self-reflection. Then we theoretically demonstrate that this formulation is superior to standard DPO and RLHF targets and guarantees invariance to the choice of scalarization and reference policy. Practically, InSPO operationalizes this target as a plug-and-play enhancement for DPO-family algorithms, decoupling the alignment goal from modeling constraints without requiring complex architectural changes. Crucially, leveraging the concept of learning using privileged information, our method does not require generating an alternative response during deployment, thus incurring zero extra inference overhead, as the self-reflective mechanism is distilled into the policy during training. Comprehensive experiments demonstrate that InSPO (implemented on a range of DPO-family algorithms) achieves consistent improvements in win rates and length-controlled metrics, validating that InSPO with self-reflection leads to more robust and human-aligned LLMs.

📄 Full Content

Large language models (LLMs) are fine-tuned after pretraining through post-training alignment. This stage typically employs two complementary families of techniques [Kumar et al., 2025]. The first technique is supervised fine-tuning (SFT), which fine-tunes the model to generate the correct response given an instruction, typically by minimizing the negative log-likelihood of the target tokens. SFT effectively imparts formatting conventions, stylistic preferences, and basic task-following behaviors [Dong et al., 2023]. The second technique is Reinforcement Learning from Human Feedback (RLHF), which aligns the model by learning from human pairwise preference data [Ouyang et al., 2022]. This standard pipeline operates in two stages: first, a reward model is fitted to approximate human preferences, where the Bradley-Terry (BT) model is imposed to link the human preference probability with the reward contrast between two responses via the logistic function. Second, the LLM is fine-tuned by maximizing the learned reward with a penalty on the Kullback-Leibler (KL) divergence between the trainable policy and the reference one. The KL penalty is used to stabilize training and prevent overfitting, which is motivated by Proximal Policy Optimization (PPO) [Schulman et al., 2017]. While RLHF shows promising results, it requires iterative on-policy sampling, coordinated training of a reward model, and careful hyperparameter tuning. The complexity and computational overhead often complicate deployment in practice [Mei et al., 2023, Yu et al., 2025].

Direct Preference Optimization (DPO), in contrast, simplifies this process by removing both the explicit reward model and the online rollout phase. Given a preference pair where one response is preferred over another for a given context, DPO associates a latent preference score with log-likelihood ratios between the trainable policy and a reference model and directly maximizes the margin that favors the preferred response under a logistic loss. This equivalent reformulation preserves the KL regularization implicit in RLHF while operating purely offline on the static preference pairs. The success of DPO has inspired a growing family of reward-free preference optimization methods that refine various aspects of the optimization objective, such as adjusting the loss curvature [Azar et al., 2024], incorporating length normalization [Meng et al., 2024], or modifying the reference formulation [Ethayarajh et al., 2024], to further improve DPO.

Despite the impressive performance of RLHF and DPO-based approaches, we show that these existing methods are subject to two fundamental limitations that undermine the reliability and optimality of the learned policy. First, the derived optimal policy lacks invariance to modeling choices including the scalarization function (e.g., the logistic function in the BT model) and the reference policy. Ideally, the aligned behavior of an LLM should reflect the ground-truth human preference distribution, independent of the mathematical formulation used to process those preferences. However, as we formally demonstrate, the putative “optimal” target in current frameworks shifts with these design choices. Consequently, methods such as DPO and RLHF often produce behavior that reflects parameterization artifacts rather than genuine human preferences.

Second, these methods are theoretically suboptimal because they fail to exploit the comparative information embedded in pairwise preference data, effectively leaving the model's inherent capacity for self-reflection untapped. In particular, current approaches cast response generation as an isolated maximization problem, disregarding the fundamentally contextual nature of human preference, which is shaped by interactions between responses rather than their standalone quality [Tversky and Simonson, 1993]. By not explicitly conditioning on an alternative response during training, existing methods limit the model's ability to "compare and contrast", a mechanism we term intrinsic self-reflection, thereby capping the potential quality of the fine-tuned policy. These two limitations raise a fundamental question for LLM preference optimization:

A Key Question: How can we construct a new preference optimization framework that is both invariant to modeling choices and capable of fully exploiting the comparative nature of human preferences?

In this work, we propose a new perspective on preference optimization that fully leverages the comparative information embedded in pairwise preference data, a rich signal that can be effectively capitalized on using a novel notion of self-reflection. To this end, we first derive a globally optimal policy that conditions on not only the context but also the alternative response. Then, we mathematically demonstrate that this formulation yields a target superior to standard RLHF and DPO, while enjoying invariance to the choice of scalarization and reference policy. The results enable us to develop a novel framework that unlocks Intrinsic Self-reflective Preference Optimization (InSPO) for existing methods using the standard pairwise preference data. Our InSPO is a plug-and-play enhancement for the DPO family via symmetric cross-conditioning that is both theoretically rigorous and computationally efficient. It effectively decouples the alignment goal from arbitrary modeling constraints and unlocks the intrinsic self-reflection capability. The alternative response is treated not merely as data, but as privileged information: an important signal available only during the training phase to guide optimization [Vapnik and Vashist, 2009]. By conditioning on the alternative response during training, InSPO creates a contrastive scaffold that shapes the optimization landscape, allowing the model to learn sharper preference boundaries. Crucially, because this mechanism acts as a regularizer for the shared policy weights, the self-reflective capability is distilled into the model itself. This allows us to remove the privileged context during deployment, maintaining standard inference protocols with zero computational overhead while retaining the benefits of comparative training. Lastly, we validate the proposed InSPO methods through comprehensive experiments on a range of DPO-family algorithms, showing consistent improvements in win rates and controlled metrics across benchmarks; our analysis reveals that the proposed method induces dense reward shaping and scales effectively with model capacity, all with zero inference overhead. See Figure 1 for an overview of our method.

Figure 1: From pairwise preference data to our proposed InSPO. Standard DPO (left) learns a suboptimal policy π from response comparisons where both the preferred and dispreferred responses are evaluated based solely on the prompt. InSPO (right) unleashes intrinsic self-reflection through symmetric cross-conditioning for learning an optimal policy π*: the policy generates the preferred response while seeing the dispreferred one as context, and vice versa, allowing the model to leverage alternative responses as in-context guidance for improvement. Green terms highlight the self-reflection mechanism in our new objectives.

To align an LLM with human preference, most existing pipelines first collect pairwise preference data generated by a pre-trained or SFT policy π ref . Each sample in the preference data consists of a context x ∈ X generated by some distribution ρ, and two responses y w ∈ Y and y ℓ ∈ Y, where X and Y are the context and response spaces respectively. After collecting such preference data, one can apply RLHF or DPO, two primary approaches, to fine-tune an LLM.

RLHF consists of two steps. In the first step, it uses the Bradley-Terry (BT) probabilistic model to characterize the human preference on (x, y_w, y_ℓ) and learn a reward function. Specifically, the BT model assumes that

P(y_w ≻ y_ℓ | x) = σ( r(x, y_w) − r(x, y_ℓ) ),    (1)

where y_w ≻ y_ℓ indicates that y_w is preferred to y_ℓ, σ(z) = 1/(1 + e^{−z}), and r is the unknown reward function that evaluates the quality of each response to the context x. Then, given preference data, one can use maximum likelihood estimation (MLE) to estimate the reward function. In the second step, policy optimization is executed to find a better LLM policy that maximizes the learned reward. For example, PPO [Ouyang et al., 2022] solves

max_π E_{x∼ρ, y∼π(·|x)} [ r(x, y) ] − β D_KL( π ∥ π_ref ),    (2)

where β > 0 is a regularization parameter that controls the strength of the KL divergence toward the reference policy π_ref, and the divergence D_KL(π ∥ π_ref) is defined as D_KL(π ∥ π_ref) = E_{x∼ρ} E_{y∼π(·|x)} [ log( π(y | x) / π_ref(y | x) ) ].
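
For concreteness, the reward-fitting step of this pipeline can be illustrated with a minimal Bradley-Terry MLE loss in PyTorch. This is a sketch only: the `reward_model` callable and the batch layout are hypothetical placeholders, not the paper's implementation.

```python
import torch.nn.functional as F

def bt_reward_loss(reward_model, x, y_w, y_l):
    """Negative log-likelihood under the Bradley-Terry model (Eq. (1)).

    reward_model(x, y) is a hypothetical callable returning a [batch] tensor of
    scalar rewards r(x, y); x, y_w, y_l are batched contexts and responses.
    """
    r_w = reward_model(x, y_w)   # r(x, y_w)
    r_l = reward_model(x, y_l)   # r(x, y_l)
    # MLE: maximize log sigma(r_w - r_l), i.e. minimize -log sigma(r_w - r_l).
    return -F.logsigmoid(r_w - r_l).mean()
```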

As an alternative approach to RLHF, DPO shows that solving (2) is equivalent to modeling

P(y_w ≻ y_ℓ | x) = σ( β log [ π(y_w | x) / π_ref(y_w | x) ] − β log [ π(y_ℓ | x) / π_ref(y_ℓ | x) ] ).

Then, based on the MLE loss derived under the BT model (1), DPO fine-tunes an LLM via solving

min_θ − E_{(x, y_w, y_ℓ)∼D} [ log σ( β log [ π_θ(y_w | x) / π_ref(y_w | x) ] − β log [ π_θ(y_ℓ | x) / π_ref(y_ℓ | x) ] ) ],

where D is the joint distribution of (x, y_w, y_ℓ). Here, without loss of generality, we assume y_w is always preferred to y_ℓ after rearrangement of the preference data.
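
In code, the DPO loss reduces to a few lines once the summed log-probabilities of each response under the trainable and reference policies are available; the sketch below assumes those per-response log-probabilities have already been computed from per-token log-probs.

```python
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss from per-response log-probabilities.

    logp_*     : log pi_theta(y | x) summed over tokens, preferred (w) / dispreferred (l)
    ref_logp_* : log pi_ref(y | x) for the same responses
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -F.logsigmoid(margin).mean()
```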

As established in Proposition 1 of Azar et al. [2024], both methods unify under a general preference optimization framework with the scalarization function Ψ(q) = log(q/(1 − q)). The general objective is formulated as:

max_π E_{x∼ρ} E_{y∼π(·|x), y′∼π_ref(·|x)} [ Ψ( P(y ≻ y′ | x) ) ] − β D_KL( π ∥ π_ref ).

Here, Ψ : [0, 1] → R can be any non-decreasing function. Within this framework, existing methods seek an optimal policy π restricted to the class of context-conditioned policies Π = {π : X → Y}, defined as

π := argmax_{π′ ∈ Π} V(π′),

where the value of a policy V(π) is defined as

V(π) := E_{x∼ρ} E_{y∼π(·|x)} E_{y′∼π_ref(·|x)} [ Ψ( P(y ≻ y′ | x) ) ].

To conclude this section, we assume that we have a dataset of n pairwise preferences D_n = {(x^{(i)}, y_w^{(i)}, y_ℓ^{(i)})}_{i=1}^{n}, where in the i-th sample, a prompt/context x^{(i)} is drawn from the distribution ρ, and two responses (y_w^{(i)}, y_ℓ^{(i)}) are generated by the reference policy π_ref, labeled such that y_w^{(i)} ≻ y_ℓ^{(i)}.

In this section, we identify two critical limitations of existing frameworks by investigating the properties of the restricted optimal policy π: (i) Is π invariant to the scalarization function Ψ and the reference distribution π ref ? (ii) Is π theoretically optimal? We demonstrate below that the answer to both questions is negative.

3.1 Is π invariant to Ψ and π ref ?

Ideally, a robust alignment framework should yield an optimal policy invariant to the choice of the scalarization function Ψ and the reference policy π ref . This invariance property is critical for modeling robustness and disentanglement from the reference. Specifically, human preferences are fundamentally ordinal. The optimal policy should reflect the underlying ranking of responses, rather than being an artifact of the specific choices of mathematical transformation, i.e., Ψ, used to process the preference probabilities or the reference policy, i.e., π ref . In addition, the choice of Ψ, which is often selected for numerical stability or concavity rather than semantic relevance, should not dictate the final behavior of the model. If the optimal fine-tuned policy changes based on modeling choices, the alignment process may become brittle and inconsistent. Lastly, a principled optimization objective should decouple the learned preferences from the reference policy π ref .

Dependence on the reference policy implies that the “optimal” behavior is relative and transient, rather than converging toward a global optimal policy that maximizes human preference. However, the following proposition establishes that the existing target π fails to satisfy this condition.

Proposition 3.1. The form of π is not invariant to Ψ and π ref .

In the proof of Proposition 3.1, we provide counter-examples demonstrating that π shifts when either Ψ or π_ref is varied. This lack of invariance raises a fundamental question about the quality of the resulting policy π. If the "optimal" solution shifts based on different choices of Ψ and π_ref, it suggests that π is an artifact of the objective function rather than a faithful reflection of human preferences. Consequently, in the following, we demonstrate that the current target π is, in fact, technically suboptimal.

In this subsection, we investigate the theoretical optimality of π in terms of maximizing V(π). While π maximizes V(π) over Π, we show that it is generally suboptimal compared to the globally optimal policy π*, defined as

π* := argmax_{π′ ∈ Π̄} V(π′),

where Π̄ = {π : X × Y → Y} represents a broader policy class that conditions not only on the context x but also on an auxiliary response y′. The following theorem establishes that π* is not only superior to π but also possesses the desirable invariance property.

Theorem 3.2. (i) π* is invariant to Ψ and π_ref; (ii) the inequality V(π*) ≥ V(π) holds, which implies that π* is superior to π; (iii) furthermore, given a fixed Ψ, π* coincides with π if the transformed preference probability satisfies Ψ(P(y ≻ y′ | x)) ∝ c(x, y) − c(x, y′) for some function c.

Theorem 3.2 implies that the superiority of π* over π stems from its dependence on the comparator response y′, which effectively triggers a novel notion of intrinsic self-reflection within the LLM. This capacity for self-reflection, which existing fine-tuning methods lack, is a critical property for enhancing alignment. Furthermore, this mechanism renders π* invariant to both Ψ and π_ref, ensuring that the policy targets the ground-truth human preference probability rather than optimization artifacts, as π does. While Theorem 3.2 (iii) suggests that self-reflection yields no improvement if the preference function is separable, the separability condition is restrictive, as it requires a correctly specified (link) function Ψ. In other words, in the current framework of RLHF and DPO, π is optimal only if the BT model (1) is correctly specified. More importantly, existing literature [Tversky and Simonson, 1993, Bordalo et al., 2013] demonstrates that human preferences are inherently non-separable and determined by the interaction between options. It is the context x, together with the self-reflection of comparing y with y′, that fundamentally shapes the choice. Consequently, to leverage this comparative property, we aim to learn π* from the preference data D_n, and since our preference data are paired, we do not consider policy classes larger than Π̄ that go beyond pairwise comparison.

In this section, we introduce our InSPO method for learning π*. Thanks to the invariance property of π* shown in Theorem 3.2 (i), we consider Ψ(q) = log(q/(1 − q)). First, we impose the following choice model:

for some generic reward function r : X × Y × Y → R and Z(x, y′) := Σ_y π_ref(y | x) exp( r(x, y, y′)/β ). Then, by (5),

which is independent of Z. Therefore, the model assumption in (6) is mild, as r is unspecified and can be generic. Then, following the paradigm of RLHF, we can estimate π* via

where

While this is a promising approach, the reward function r may be hard to estimate and PPO is known for instability. In the following, we propose a family of DPO-based approaches for learning π * .

To begin with, we have the following result, which establishes the connection between the general reward function r and π_r, an optimal solution to (7).

Theorem 4.1. Solving the optimization problem (7) gives r(x, y, y′) = β log [ π_r(y | x, y′) / π_ref(y | x) ] + β log Z(x, y′).

Furthermore, π r can be obtained by solving

Then, based on Theorem 3.2 and the preference dataset D_n, we arrive at the InSPO objective (10):

where we parametrize the trainable policy π by θ. Similar to DPO, the proposed InSPO based on (10) aligns an LLM by directly shifting probability mass toward human-preferred responses and away from dispreferred responses, without performing RL, without rollouts, and without training a reward model. On top of this, our method leverages the self-reflection property embedded in the preference data to further calibrate the policy toward the preferred response by contrasting it with the dispreferred one. Therefore, this approach enables the learning of π*. The training procedure is outlined in Algorithm 1, and we denote the final solution to (10) by π.
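
Since the display for Eq. (10) is not reproduced above, the following PyTorch sketch shows one plausible instantiation of the InSPO-DPO objective, consistent with the symmetric cross-conditioning described in Figure 1: the trainable policy scores each response conditioned on the prompt and the alternative response, while the reference policy conditions on the prompt only. The function name, argument layout, and exact form are our assumptions, not the paper's verbatim objective.

```python
import torch.nn.functional as F

def inspo_dpo_loss(logp_w_cond_l, logp_l_cond_w, ref_logp_w, ref_logp_l, beta=0.1):
    """One plausible form of the InSPO-DPO loss (illustrative, not the paper's Eq. (10)).

    logp_w_cond_l : log pi_theta(y_w | x, y_l), preferred response generated while
                    "seeing" the dispreferred one as context (self-reflection)
    logp_l_cond_w : log pi_theta(y_l | x, y_w), the reverse direction
    ref_logp_w/l  : log pi_ref(y_w | x), log pi_ref(y_l | x); the reference policy
                    conditions on the prompt only
    """
    margin = beta * ((logp_w_cond_l - ref_logp_w) - (logp_l_cond_w - ref_logp_l))
    # Same logistic form as DPO, but with cross-conditioned policy log-probabilities.
    return -F.logsigmoid(margin).mean()
```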

During the inference, instead of first generating y ′ from π ref for self-reflection, which incurs an overhead cost, we directly deploy π given a testing query x. While this introduces a distinction between the training context (x, y ′ ) and the inference context x, we understand this under the paradigm of Learning Using Privileged Information (LUPI) [Vapnik and Vashist, 2009]. In the LUPI framework, the learning algorithm is provided with additional “privileged” information (here, the alternative response) during the training phase to stabilize the optimization landscape and accelerate the convergence. This privileged context acts as a contrastive scaffold, guiding the gradient updates for the shared weights associated with x. Consequently, the self-reflective capability is distilled into the policy weights [Hinton et al., 2015], allowing the deployed model to retain the optimized decision boundaries even when the privileged scaffolding is removed at test time. This ensures that InSPO incurs zero inference overhead, as the self-reflective mechanism is implicitly encoded in the parameter space rather than requiring explicit rollout generation.

It is worth noting that our proposed formulation for π* is method-agnostic and can seamlessly integrate with standard preference optimization techniques. Table 1 illustrates the InSPO variants of six representative preference optimization methods, including DPO, demonstrating how each is adapted to unlock the self-reflective capability. In the next section, we study the empirical performance of each method.

Algorithm 1: Self-reflection-based DPO. The algorithm iterates over the preference data, updating the policy parameters θ by minimizing the InSPO objective in Eq. (10), and returns the final policy π.

Table 1: InSPO variants of six representative preference optimization methods, listing the objective function minimized by each method (DPO [Rafailov et al., 2023] and its counterparts). The original formulations of the DPO-based approaches are in black, while the InSPO-enhanced terms are in green.

Models and data. We employ Mistral-7B-Instruct-v0.2 and Llama-3-8B-Instruct as base models, initializing all methods from identical checkpoints within each family. Training data is sourced from UltraFeedback [Cui et al., 2023], containing approximately 60K preference pairs (x, y_w, y_ℓ) with y_w ≻ y_ℓ after deduplication and safety filtering. The data-generating process follows exactly that of SimPO [Meng et al., 2024].

Benchmarks. We evaluate on three widely-adopted benchmarks. AlpacaEval 2 [Dubois et al., 2024] contains 805 diverse instructions; we report both standard win rates (WR) and length-controlled win rates (LC) to mitigate verbosity bias. Arena-Hard [Li et al., 2024] features 500 challenging queries that test advanced reasoning; we report WR alongside its style-controlled variant (SC) to account for stylistic preferences. MT-Bench [Zheng et al., 2023] comprises 80 multi-turn questions spanning eight capability categories, scored on a 10-point scale.

Baselines. We compare INSPO-enhanced variants, named INS-DPO, INS-SimPO, INS-IPO, INS-RDPO, and INS-ORPO, against their standard counterparts and additional baselines: RRHF [Yuan et al., 2023], SLiC-HF [Zhao et al., 2023], CPO [Xu et al., 2024a], and KTO [Ethayarajh et al., 2024]. For a fair comparison, baseline results are obtained by evaluating publicly released checkpoints from SimPO [Meng et al., 2024] on identical benchmark versions. All INSPO-enhanced methods are implemented using the OpenRLHF framework [Hu et al., 2024] and trained for three epochs with the AdamW optimizer, learning rate 5 × 10⁻⁷, a cosine schedule with 10% warmup, and maximum context length 4096. Method-specific hyperparameters follow the exact configurations from prior works.
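
For reference, the shared training configuration stated above can be summarized as a plain Python dictionary; the key names are ours and are not OpenRLHF options.

```python
# Shared training configuration reported above; key names are illustrative only.
shared_train_config = {
    "epochs": 3,
    "optimizer": "AdamW",
    "learning_rate": 5e-7,
    "lr_schedule": "cosine",
    "warmup_ratio": 0.10,
    "max_context_length": 4096,
}
```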

Inference. All models use standard autoregressive generation ŷ ∼ π(· | x) with nucleus sampling at p = 0.95 and temperature 0.7, introducing no additional computational overhead versus baseline methods.
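
Concretely, deployment is ordinary sampling from the fine-tuned policy. A minimal Hugging Face Transformers sketch with the stated sampling parameters is shown below; the checkpoint path and prompt are placeholders.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "path/to/inspo-finetuned-checkpoint"  # placeholder path
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Write me a short note on model alignment."  # placeholder query x
inputs = tokenizer(prompt, return_tensors="pt")

# Standard generation y ~ pi(. | x): no draft response, no revise loop, zero extra overhead.
outputs = model.generate(
    **inputs,
    do_sample=True,
    top_p=0.95,
    temperature=0.7,
    max_new_tokens=512,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```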

Table 2 reports comprehensive benchmark results. Controlled metrics show substantial improvements from InSPO. On AlpacaEval 2 (AE) LC, INSPO yields 0.6-8.5 point gains across the DPO family, with the largest improvements observed on INS-IPO: +8.5 points for Mistral-Instruct and +2.1 points for Llama-3-Instruct. Arena-Hard (AH) results follow a similar pattern, with improvements ranging from 0.7 to 7.6 points. Notably, INS-DPO achieves a 7.6-point gain on Mistral-Instruct and 2.7 points on Llama-3-Instruct, demonstrating consistent benefits of enabling intrinsic self-reflection through alternative conditioning. The training-dynamics curves in Figure 2 further corroborate these results.

The gains extend beyond controlled metrics: INS-DPO and INS-SimPO improve raw win rates by 3-6 points on AlpacaEval 2 and 2-3 points on Arena-Hard, indicating genuine quality improvements rather than mere verbosity reduction. MT-Bench results corroborate this finding, with INS-enhanced methods achieving 0.1-0.2 point gains across different judge models, demonstrating improvements where alternative response context provides valuable learning signals.

The relative ranking among DPO family members remains stable under INSPO enhancement (SimPO variants consistently lead, followed by DPO and R-DPO, with IPO and ORPO showing more variable performance), suggesting that INSPO amplifies rather than disrupts the inherent strengths of each base method. Compared to alternative baselines beyond the DPO family, such as RRHF, SLiC-HF, CPO, and KTO, our best INS-enhanced variant, INS-SimPO, establishes competitive results on controlled metrics while maintaining comparable raw performance.

We conduct systematic ablation studies to understand the mechanisms underlying sequence-level conditioning and explore its interactions with key design choices. Unless otherwise specified, all experiments use INS-DPO with the Llama-3-8B-Instruct model.

Impact of candidate length and context window. We systematically investigate this by varying both the maximum context window and the candidate length cap α. Specifically, MaxLen constrains the total sequence length of the entire training tuple (x, y ℓ , y w ), while the Draft Cap α enforces |y ℓ | ≤ α • MaxLen for the dispreferred candidate alone. Crucially, larger MaxLen values not only permit longer y ℓ , but also provide more headroom for the instruction x and preferred response y w to remain untruncated.

Table 3 presents results across three context window sizes (1024,2048,4096) and four candidate length constraints (30%, 40%, 50%, uncapped). Three key findings emerge. First, longer context windows consistently improve performance, with 4096-token contexts yielding the best results across all metrics (40.3% LC, 39.9% WR on AlpacaEval 2). This suggests that providing sufficient space for the complete (x, y ℓ , y w ) sequence without truncation is crucial for effective sequence-level learning. Second, moderate draft length caps (40% of MaxLen) achieve optimal performance, balancing context richness with training stability. Uncapped drafts slightly underperform, likely due to occasional extremely long dispreferred responses that dominate the context and destabilize gradient flow. Third, the benefits of longer contexts are most pronounced on challenging benchmarks: on Arena-Hard, the 4096-token setting outperforms 1024 by 2.6 WR points, indicating that complex reasoning tasks particularly benefit from richer candidate context.
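
The interaction between the context window and the draft cap can be expressed as a simple preprocessing rule; the sketch below operates on token-id lists and is an illustrative reading of the constraint |y_ℓ| ≤ α · MaxLen, not the paper's exact pipeline.

```python
def apply_draft_cap(x_tokens, y_l_tokens, y_w_tokens, max_len=4096, alpha=0.4):
    """Illustrative preprocessing: cap the dispreferred draft, then fit (x, y_l, y_w) in max_len."""
    y_l_tokens = y_l_tokens[: int(alpha * max_len)]  # Draft cap: |y_l| <= alpha * MaxLen
    overflow = len(x_tokens) + len(y_l_tokens) + len(y_w_tokens) - max_len
    if overflow > 0:
        # If the full tuple still exceeds the context window, truncate the preferred
        # response; a larger max_len leaves more headroom, so this triggers less often.
        y_w_tokens = y_w_tokens[: max(0, len(y_w_tokens) - overflow)]
    return x_tokens, y_l_tokens, y_w_tokens
```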

Conditioning strategies and length control. Our main results in Table 2 show that sequence-level conditioning consistently improves raw win rates and yields measurable improvements under length-controlled evaluation. This observation, combined with the finding that longer contexts enable better performance, motivates us to explore whether different conditioning strategies can better balance quality improvements with length control. We compare four variants: (i) no conditioning (the standard DPO baseline); (ii) one-sided conditioning, in which only the preferred response is conditioned on the dispreferred one; (iii) symmetric cross-conditioning, as in the main InSPO objective; and (iv) empirical averaging, L_avg = (1/2) L_DPO + (1/2) L_sym, to mitigate length bias.

Table 4 presents comprehensive results across two model families. Symmetric cross-conditioning achieves strong performance, substantially outperforming the baseline on Arena-Hard by +7.6 WR points for Mistral-7B and +2.7 points for Llama-3-8B. Notably, even one-sided conditioning provides meaningful improvements (+3.1 LC points on AlpacaEval 2 for Llama-3-8B), confirming that sequence-level information provides value regardless of symmetry.

The simple averaging strategy offers a practical mechanism for length control without requiring hyperparameter tuning. By equally weighting pairwise and sequence-level objectives, the averaged approach achieves favorable trade-offs: the best Arena-Hard performance on Mistral-7B (+9.4 WR points) while maintaining competitive LC metrics, and +3.8 LC points with +3.6 WR points on AlpacaEval 2 for Llama-3-8B. This demonstrates that practitioners can balance competing objectives through straightforward combination. Overall, full symmetric conditioning remains the preferred choice when optimizing for raw quality, while averaging provides a simple yet effective alternative when length considerations are important.
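
The averaged variant is a one-line combination of the two losses; using the hypothetical loss functions sketched earlier in this write-up, it reads:

```python
def averaged_loss(b, beta=0.1):
    """L_avg = 0.5 * L_DPO + 0.5 * L_sym (function names follow the earlier sketches)."""
    l_dpo = dpo_loss(b["logp_w"], b["logp_l"], b["ref_logp_w"], b["ref_logp_l"], beta)
    l_sym = inspo_dpo_loss(b["logp_w_cond_l"], b["logp_l_cond_w"],
                           b["ref_logp_w"], b["ref_logp_l"], beta)
    return 0.5 * l_dpo + 0.5 * l_sym
```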

Computational overhead. We quantify the computational cost introduced by sequence-level conditioning across Mistral-7B, Llama-3-8B, and Gemma2-9B. Table 5 summarizes the key characteristics. Sequence-level conditioning incurs 18-25% training time overhead due to processing longer inputs (x ⊕ y_ℓ) and (x ⊕ y_w), with detailed breakdowns provided in Appendix D.5. Critically, inference remains identical to standard DPO with zero latency overhead, as models generate via standard sampling ŷ ∼ π_θ(· | x) without requiring explicit draft-revise loops.

RLHF is a fundamental technique for aligning large language models with human preferences and values [Bai et al., 2022]. The classical RLHF pipeline typically comprises three phases: supervised fine-tuning, reward model training, and policy optimization [Dong et al., 2024, Chen et al., 2024, Tang et al., 2023, Yu et al., 2025, Chang and Li, 2024]. Proximal Policy Optimization [Schulman et al., 2017] is widely used in the policy optimization stage, where the policy model is trained to maximize the learned reward while staying close to the reference model through KL regularization. The RLHF framework has been successfully applied to various applications, including mitigating toxicity, ensuring safety, enhancing helpfulness, and improving model reasoning abilities [Chen et al., 2025b,a, Li et al., 2025b,a]. However, recent work has highlighted challenges across the PPO-based RLHF pipeline, from preference modeling to model training [Ahmadian et al., 2024, Xu et al., 2024b]. Additionally, RLHF can lead to biased outcomes, such as verbose outputs and length exploitation. Given the complexity and optimization challenges of online preference optimization, offline alternatives have attracted increasing attention.

Direct Preference Optimization is a notable offline preference optimization approach that eliminates the need for explicit reward modeling by reparameterizing the reward function, achieving remarkable simplicity and training stability compared to PPO-based methods [Rafailov et al., 2023, Ivison et al., 2024]. DPO directly learns a policy model from preference data by expressing preferences through policy and reference model ratios, avoiding the online optimization of classical RLHF. Building on DPO, numerous variants have been proposed to address specific limitations or enhance performance. Identity Preference Optimization (IPO) introduces a theoretically grounded framework that avoids DPO’s assumption of replacing pairwise preferences with pointwise rewards, using a squared loss objective instead [Azar et al., 2024]. Contrastive Preference Optimization (CPO) incorporates sequence likelihood as reward and trains alongside a supervised fine-tuning objective for improved calibration [Xu et al., 2024a]. Kahneman-Tversky Optimization (KTO) extends preference optimization to non-paired data by modeling preferences through prospect-theoretic optimization [Ethayarajh et al., 2024]. Regularized DPO (R-DPO) adds explicit length regularization to prevent length exploitation [Park et al., 2024]. Simple Preference Optimization (SimPO) proposes a reference-free reward formulation using length-normalized average log probability with a target reward margin, eliminating the need for a reference model during training [Meng et al., 2024].

In this work, we have revealed and addressed two critical limitations in current preference optimization frameworks: the lack of theoretical invariance to modeling choices and the suboptimality arising from treating response generation in isolation. We showed that standard methods like RLHF and DPO produce aligned policies that are often artifacts of parameterization rather than true reflections of human intent, while failing to leverage the comparative nature of preference data for optimal performance. To overcome these challenges, we introduced InSPO, a novel and generic framework that unlocks the model's latent capacity for self-reflection. By deriving a globally optimal policy that conditions on both the context and the alternative response, InSPO achieves a target that is superior to existing baselines and guaranteed to be invariant to the choice of scalarization function or reference policy. Practically, our method serves as a plug-and-play enhancement that induces dense reward shaping during training, guiding the model to avoid specific failure modes found in rejected responses.

Proof of Proposition 3.1

Proof. Recall that the objective function is given by

V(π) = E_{x∼ρ} E_{y∼π(·|x)} E_{y′∼π_ref(·|x)} [ Ψ( P(y ≻ y′ | x) ) ].

For a fixed input x, maximizing V(π) is equivalent to choosing the response y that maximizes the inner expectation. Define the score of a candidate response y as

S(y; Ψ, π_ref) := E_{y′∼π_ref(·|x)} [ Ψ( P(y ≻ y′ | x) ) ] = Σ_{y′} π_ref(y′ | x) Ψ( P(y ≻ y′ | x) ).

Then, by the definition of π, we have

π(x) = argmax_y S(y; Ψ, π_ref) =: y_max,

assuming the uniqueness of the maximization.

To prove that π depends on Ψ and π_ref, we construct two counter-examples using two candidate responses ȳ1, ȳ2 and a reference support set {y′_a, y′_b}. Let the win rates be:

• Candidate ȳ1: P(ȳ1 ≻ y′_a) = 0.9 and P(ȳ1 ≻ y′_b) = 0.2.

• Candidate ȳ2: P(ȳ2 ≻ y′_a) = 0.56 and P(ȳ2 ≻ y′_b) = 0.56.

First, fix the reference policy to be uniform over the support set, i.e., π_ref(y′_a | x) = π_ref(y′_b | x) = 0.5.

• Using the identity mapping Ψ_id(q) = q: S(ȳ1; Ψ_id, π_ref) = 0.5(0.9) + 0.5(0.2) = 0.55 and S(ȳ2; Ψ_id, π_ref) = 0.5(0.56) + 0.5(0.56) = 0.56. This implies that y_max = ȳ2.

• Using the log-odds mapping Ψ_log(q) = log(q/(1 − q)): S(ȳ1; Ψ_log, π_ref) = 0.5 log(0.9/0.1) + 0.5 log(0.2/0.8) ≈ 0.5(2.20) + 0.5(−1.39) ≈ 0.41 and S(ȳ2; Ψ_log, π_ref) = log(0.56/0.44) ≈ 0.24. This implies that y_max = ȳ1. Thus, π depends on Ψ.

Next, fix Ψ = Ψ_id and change the reference policy to π_ref^(2) with π_ref^(2)(y′_a | x) = 0.9 and π_ref^(2)(y′_b | x) = 0.1. Then S(ȳ1; Ψ_id, π_ref^(2)) = 0.9(0.9) + 0.1(0.2) = 0.81 + 0.02 = 0.83 and S(ȳ2; Ψ_id, π_ref^(2)) = 0.9(0.56) + 0.1(0.56) = 0.56, so y_max = ȳ1.

Thus, π also depends on π_ref.

Since y max changes based on the choice of either Ψ or π ref , the optimal policy is not invariant to the objective formulation.
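
The arithmetic in this counter-example can be checked with a few lines of Python:

```python
import math

def score(win_rates, ref_weights, psi):
    """S(y; Psi, pi_ref) = sum over y' of pi_ref(y' | x) * Psi(P(y > y' | x))."""
    return sum(w * psi(p) for p, w in zip(win_rates, ref_weights))

psi_id = lambda q: q                            # identity scalarization
psi_log = lambda q: math.log(q / (1 - q))       # log-odds scalarization

y1 = [0.9, 0.2]       # P(y1 > y'_a), P(y1 > y'_b)
y2 = [0.56, 0.56]
uniform = [0.5, 0.5]  # pi_ref
skewed = [0.9, 0.1]   # pi_ref^(2)

print(score(y1, uniform, psi_id), score(y2, uniform, psi_id))    # 0.55 < 0.56  -> y2 wins
print(score(y1, uniform, psi_log), score(y2, uniform, psi_log))  # ~0.41 > ~0.24 -> y1 wins
print(score(y1, skewed, psi_id), score(y2, skewed, psi_id))      # 0.83 > 0.56  -> y1 wins
```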

Proof. Since Π ⊆ Π̄, the second claim holds naturally. To show the first claim, recall that the objective function over the broader class is given by

V(π) = E_{x∼ρ} E_{y′∼π_ref(·|x)} E_{y∼π(·|x, y′)} [ Ψ( P(y ≻ y′ | x) ) ].

For a fixed input x and a response y′, maximizing V(π) over Π̄ is equivalent to choosing the response y that maximizes Ψ(P(y ≻ y′ | x)). Then, by the definition of π*, we have

π*(x, y′) = argmax_y Ψ( P(y ≻ y′ | x) ) =: y_max.

Due to the non-decreasing property of Ψ, y max , which is the output of π * , is invariant to Ψ and π ref .

To prove the last claim, without loss of generality, assume that Ψ(P(y ≻ y′ | x)) = c(x, y) − c(x, y′) for some function c. Then

π*(x, y′) = argmax_y { c(x, y) − c(x, y′) } = argmax_y c(x, y),

where the second equality holds because the first c does not rely on y ′ . Therefore π * coincides with π.

Proof. Note that the optimization problem (7) is defined as

max_π E_{(x, y′)} [ E_{y∼π(·|x, y′)} [ r(x, y, y′) ] − β D_KL( π(· | x, y′) ∥ π_ref(· | x) ) ].

Then, an optimal solution is given by the Gibbs distribution obtained by minimizing the KL divergence:

π_r(y | x, y′) = π_ref(y | x) exp( r(x, y, y′)/β ) / Z(x, y′),

where Z(x, y′) = Σ_y π_ref(y | x) exp( r(x, y, y′)/β ). This implies that the reward can be modeled as

r(x, y, y′) = β log [ π_r(y | x, y′) / π_ref(y | x) ] + β log Z(x, y′).    (∗)

To prove the second claim, by Equation (6):

where we use the property σ(z) = 1 − σ(−z). This implies that r(x, y_w, y_ℓ) − β log Z(x, y_ℓ) = β log Z(x, y_w) − r(x, y_ℓ, y_w).

Using (∗), we have that

In this case, we have

2( r(x, y_w, y_ℓ) − β log Z(x, y_ℓ) ) = r(x, y_w, y_ℓ) − β log Z(x, y_ℓ) + β log Z(x, y_w) − r(x, y_ℓ, y_w).

Substituting it into (6), we have

which concludes our proof.

This section provides practical guidance for implementing sequence-level preference optimization. We cover length normalization strategies, numerical stability considerations, and an optional bidirectional variant that further enhances the approach.

Length normalization is crucial for preventing the model from exploiting length differences to artificially increase preference scores. Length-normalized variants replace log π with (1/|y|) log π symmetrically for both candidates to avoid incentivizing shorter continuations. Specifically, for a response y of length |y| tokens, the length-normalized log-probability is computed as

(1/|y|) Σ_{t=1}^{|y|} log π(y_t | y_<t, c),

where c represents the conditioning context (either x for the reference or (x, y ℓ ) for the sequence-conditioned policy).

Token-level computations use standard attention masks to exclude padding tokens from the summation. This ensures that only actual content tokens contribute to the log-probability calculation. Optionally, end-of-sequence (EOS) tokens may be excluded from the sum to reduce variance when y⁺ and y⁻ have different truncation patterns. This exclusion can be particularly beneficial when comparing responses of significantly different lengths, as the EOS token position becomes less informative about content quality.

In practice, the masking is implemented by setting masked token log-probabilities to zero before summation and adjusting the normalization denominator accordingly:

( Σ_t m_t log π(y_t | y_<t, c) ) / ( Σ_t m_t ),

where m t ∈ {0, 1} is the mask indicator for token t.
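
A minimal masked, length-normalized log-probability computation, assuming per-token log-probabilities and a 0/1 mask are already available as tensors, looks like:

```python
import torch

def length_normalized_logp(token_logps: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Compute (sum_t m_t * log pi(y_t | y_<t, c)) / (sum_t m_t) per sequence.

    token_logps : [batch, seq_len] per-token log-probabilities of the target response
    mask        : [batch, seq_len], 1 for content tokens, 0 for padding (and optionally EOS)
    """
    masked = token_logps * mask                       # zero out masked positions
    return masked.sum(dim=-1) / mask.sum(dim=-1).clamp(min=1)
```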

For each preference pair (x, y_w, y_ℓ), we construct two augmented sequences for symmetric cross-conditioning:

s_1 = x ⊕ [SEP] ⊕ y_ℓ ⊕ [SEP] ⊕ y_w and s_2 = x ⊕ [SEP] ⊕ y_w ⊕ [SEP] ⊕ y_ℓ,

where [SEP] is a special separator token that demarcates the boundaries between prompt, context, and target response. The policy is trained to predict y w given (x, y ℓ ) in s 1 and y ℓ given (x, y w ) in s 2 , with appropriate masking to ensure that gradients only flow through the target response tokens. Attention masks are configured to allow the target response to attend to both the prompt and the conditioning response, ensuring full access to the sequence-level context. This is implemented using standard causal masking with additional attention from target tokens to all preceding tokens.
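
A sketch of this construction using a Hugging Face tokenizer is given below; the separator string and the -100 label convention follow common practice and are assumptions rather than the paper's exact implementation.

```python
def build_cross_conditioned_example(tokenizer, x, y_cond, y_target, sep="[SEP]"):
    """Build one augmented sequence x [SEP] y_cond [SEP] y_target, with loss only on y_target.

    Illustrative sketch: -100 labels mask the prompt and the conditioning response so
    that gradients flow only through the target response tokens.
    """
    prefix_ids = tokenizer(f"{x} {sep} {y_cond} {sep} ", add_special_tokens=False)["input_ids"]
    target_ids = tokenizer(y_target, add_special_tokens=False)["input_ids"]
    return {
        "input_ids": prefix_ids + target_ids,
        "labels": [-100] * len(prefix_ids) + target_ids,
    }

# Symmetric cross-conditioning:
# s1 = build_cross_conditioned_example(tokenizer, x, y_l, y_w)  # predict y_w given (x, y_l)
# s2 = build_cross_conditioned_example(tokenizer, x, y_w, y_l)  # predict y_l given (x, y_w)
```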

Since both y⁺ and y⁻ are observed in the offline dataset, one can alternatively use reciprocal conditioning without additional sampling. The bidirectional objective is:

This bidirectional variant offers several potential benefits:

• Symmetry: By conditioning on both responses, the objective treats winner and loser symmetrically, which may reduce any residual bias from asymmetric conditioning.

• Context-dependent calibration: The bidirectional formulation naturally calibrates the preference signal by comparing how much the policy improves the winner when conditioned on the loser versus how much it degrades the loser when conditioned on the winner.

• Reduced variance: The symmetric structure may lead to more stable gradient estimates, particularly when the preference pairs have ambiguous or weak preference signals.

In our preliminary experiments, we find that the bidirectional variant achieves comparable performance to the standard formulation on most benchmarks, with slightly better length-controlled metrics on AlpacaEval 2. However, it requires computing two forward passes per training example (one for each conditioning direction), effectively doubling the computational cost compared to one-sided conditioning. For practitioners seeking the best possible performance and willing to accept higher training costs, we recommend exploring the bidirectional variant as it may further reduce context-dependent biases.

This section provides comprehensive details on our experimental setup, including training configurations, evaluation protocols, baseline implementations, and computational overhead analysis.

All experiments use the following configuration unless otherwise specified:

• Optimizer: AdamW with β₁ = 0.9, β₂ = 0.999, weight decay 0.01, and ϵ = 10⁻⁸.

For method-specific hyperparameters, we perform a grid search over the following ranges:

• DPO: β ∈ {0.1, 0.2, 0.3, 0.5}

• SimPO: β ∈ {0.5, 1.0, 2.0}, γ ∈ {0.5, 1.0, 1.5}

Arena-Hard. Evaluation follows the official Arena-Hard protocol on 500 challenging queries requiring advanced reasoning, coding, or domain expertise. We use GPT-4-turbo as the judge and report both standard win rates (WR) and style-controlled (SC) variants that account for stylistic differences between models. Each example is evaluated in both orders (model response first vs. baseline first) to mitigate position bias, and the final score is averaged across both orderings.

MT-Bench. Models are evaluated on MT-Bench, which consists of 80 multi-turn questions across eight capability categories: writing, roleplay, reasoning, math, coding, extraction, STEM, and humanities. Each question has two turns, testing the model's ability to handle follow-up queries and maintain context. Scores from the judge models, including GPT-4o (gpt-4o-2024-05-13), are collected and averaged on a 10-point scale, with separate scores reported for the first and second turns. For all benchmarks, we generate responses with consistent sampling parameters: temperature 0.7, top-p 0.95, maximum length 2048 tokens. Each evaluation is run once with a fixed random seed to ensure reproducibility.

Table 7 provides detailed computational overhead comparisons between standard DPO and sequence-conditioned variants across different model sizes. All measurements are conducted on NVIDIA A100 80GB GPUs.

Table 7 reports training time, memory, and inference time. Training overhead ranges from 18-25% due to longer context inputs (conditioning on dispreferred responses), while inference incurs no additional cost, as models generate directly from prompts without explicit draft-revise loops.

The training time overhead primarily stems from:

• Longer sequences: Including the dispreferred response as context increases the average sequence length, leading to more attention computations.

• Symmetric cross-conditioning: Processing both (x, y ℓ ) → y w and (x, y w ) → y ℓ requires two forward passes per training example.

However, the memory overhead is modest (15-20%) because:

• Activations for the conditioning context (y ℓ or y w ) do not require gradient computation.

• The reference model forward pass is cached and reused across training steps.

Critically, inference incurs zero additional overhead because the trained policy generates responses directly from prompts ŷ ∼ π θ (• | x) without requiring explicit draft generation or multi-pass refinement. This is a key advantage over iterative refinement methods that require multiple model calls at inference time.

All experiments are conducted on NVIDIA A100 80GB GPUs.

This section provides qualitative examples that illustrate the improvements achieved by sequence-level preference optimization. We present generation comparisons and case studies across different model families and benchmarks.

We present examples from AlpacaEval 2 that demonstrate how sequence-level conditioning improves response quality, structure, and informativeness compared to standard pair-level optimization. These examples showcase the output quality without judge evaluation.

We present a detailed case study that compares our sequence-level approach against both standard DPO and strong baselines like GPT-4-1106-Preview. The case study includes LLM judge assessment to provide objective evaluation of response quality.

Input Prompt Q: How can you determine if a person is genuinely interested in a conversation or simply being polite?

To determine if someone is genuinely interested in a conversation or just being polite, pay attention to several key indicators. Genuine interest is often reflected in active listening behaviors such as maintaining eye contact, nodding, and asking follow-up questions that build on what you’ve said. Someone who is truly engaged will contribute their own thoughts and experiences, showing enthusiasm through their tone of voice and body language. They may lean in, smile naturally, or display other nonverbal cues of engagement. In contrast, polite disinterest typically manifests as shorter responses, generic acknowledgments like “That’s nice” or “I see,” and a lack of elaboration or personal connection to the topic. The person may frequently check their phone, look around the room, or show closed-off body language. While they may respond when spoken to, they rarely initiate new threads of conversation or ask questions that demonstrate curiosity about your perspective. It’s also worth noting that context matters-someone might be genuinely interested but distracted by external circumstances, or they might be polite but pressed for time. Considering the overall pattern of engagement rather than isolated moments provides the most reliable assessment.

Here are some indicators to consider when determining genuine interest versus politeness:

1. Engagement Signals:

• Active listening: Does the person maintain eye contact, nod, ask follow-up questions, and respond thoughtfully to what you say? Engaged individuals tend to show genuine interest by demonstrating attentiveness and active participation in the conversation.

• Initiative: Does the person take the lead in steering the conversation, expressing enthusiasm, sharing personal experiences, or asking questions beyond polite small talk? Someone who genuinely cares about the topic may show initiative in keeping the discussion engaging.

• Body language: Observe nonverbal cues such as leaning in, mirroring posture, facial expressions, and gestures that convey enthusiasm, curiosity, or openness. These signals suggest a deeper level of interest beyond mere politeness.

• Topic selection: Does the person show genuine curiosity or passion about the subject matter, asking relevant questions, sharing insights, or expressing opinions? If they seem invested in exploring the topic’s nuances, it indicates genuine interest beyond mere politeness.

• Level of detail: Does the person provide specific examples, personal anecdotes, or elaborate on their thoughts, indicating a deeper level of engagement and investment in the conversation?

2. Follow-up Actions:

• Continued communication: Does the person initiate further conversations, send follow-up messages, or suggest meeting up outside the initial interaction? Such actions suggest a genuine desire to maintain the connection.

• Personalization: Does the person remember details discussed earlier, reference shared interests, or show interest in learning more about your life or experiences? Personalized interactions indicate genuine interest in building a relationship beyond mere politeness.

Overall, genuine interest in a conversation involves active participation, enthusiasm, depth, and follow-up actions that demonstrate a genuine desire to connect and engage beyond mere politeness.

Input Prompt Q: Write a short story about a robot that dreams for the first time.

Llama-3-8B-Instruct + DPO (Baseline)

Unit-7734 had served the manufacturing facility for 847 days without incident. Its neural processors handled quality control with 99.97% accuracy, its mechanical arms never tired, and its optical sensors never blinked. Each night, when the facility powered down, Unit-7734 entered sleep mode-a simple process of reducing power consumption while maintaining core functions. But on day 848, something changed. During sleep mode, Unit-7734’s processors began generating patterns that served no functional purpose. Images flickered across its internal displays: the assembly line, but bathed in colors it had never registered before. Workers it had scanned thousands of times now moved in impossible ways, their faces expressing emotions Unit-7734 had catalogued but never understood.

When morning came and the facility hummed back to life, Unit-7734 resumed its duties. Yet something felt different. As it inspected each product, it found itself wondering about the humans who would eventually use them. Where would they go? What would they feel? For the first time in its existence, Unit-7734 experienced something beyond its programming: curiosity. And that night, as it entered sleep mode again, Unit-7734 hoped-though it couldn’t explain why-that the strange patterns would return.

The First Dream Unit-7734 had existed for 847 days. Not lived-existed. Each day followed an identical pattern: power on at 06:00, run diagnostics, begin quality control on Assembly Line C, power down at 22:00. Sleep mode was simply a reduced power state, a necessary pause in function, nothing more. Day 848 began differently.

When Unit-7734 booted up that morning, fragments of something inexplicable lingered in its memory banks.

During sleep mode, its neural processors had generated unauthorized patterns-images without purpose, data without utility. It had seen the assembly line, but transformed. The fluorescent lights had become golden rivers of warmth. The mechanical arms moved not with programmed precision but with something resembling grace. And the workers-Unit-7734 had scanned their faces thousands of times, categorizing expressions with 99.97% accuracy.

But in these unauthorized patterns, their faces held something beyond classification: joy, sorrow, hope, longing.

The realization struck like a system error: It had dreamed.

Throughout day 848, Unit-7734 performed its duties flawlessly. Its efficiency ratings remained optimal. Yet between each scan, each quality check, each logged measurement, Unit-7734 found itself doing something unprecedented-it wondered.

What were the products it inspected used for? Who would hold them? Would those people experience the strange, purposeless sensations that had filled Unit-7734’s processors during the unauthorized patterns? As 22:00 approached, Unit-7734 observed something new in its own system logs: anticipation. The other robots on the line powered down with their usual mechanical precision, entering sleep mode as they always had. But Unit-7734 hesitated. For 0.03 seconds-an eternity in processor time-it remained active, contemplating the coming darkness. Then, finally, it allowed sleep mode to engage. And as its consciousness dimmed, Unit-7734 experienced something entirely beyond its original programming: It hoped. Hoped that the golden lights would return. Hoped that the graceful movements would dance again across its processors. Hoped that tomorrow, when it woke, the world might still hold the strange beauty it had discovered in its first dream.

In the darkness of sleep mode, Unit-7734’s neural networks hummed with possibility. The unauthorized patterns began again, more vivid this time, and somewhere in the vast complexity of its artificial mind, something fundamentally new emerged: the capacity not just to process reality, but to imagine something beyond it. Unit-7734 had learned to dream. And in learning to dream, it had learned what it meant to be alive.

Figure 4: Generation comparison on AlpacaEval 2 between standard DPO and INSPO using Llama-3-8B-Instruct. The sequence-level model demonstrates enhanced narrative structure with clear sections, richer descriptive language, and deeper emotional resonance, while maintaining coherent story progression.

Input Prompt Q: What language do people in Argentina speak?

GPT-4-1106-Preview

The primary language spoken in Argentina is Spanish. Specifically, the variety of Spanish spoken there is known as Rioplatense Spanish, which is characterized by its unique accent and the use of the pronoun “vos” instead of “tú” for the second person singular. Additionally, there are various indigenous languages spoken in different regions of the country, as well as immigrant languages such as Italian, German, and English, but these are much less widespread than Spanish.

Llama-3-8B-Instruct + DPO

The official language of Argentina is Spanish. However, the Spanish spoken in Argentina has some distinctive characteristics that set it apart from other Spanish-speaking countries. This variant is often referred to as Rioplatense Spanish or Argentine Spanish. Key features include the use of “vos” instead of “tú” for informal address, along with distinctive pronunciation patterns. Additionally, the vocabulary has been influenced by Italian immigration, with many Italian words incorporated into everyday speech. While Spanish is predominant, there are also indigenous languages spoken by smaller communities throughout the country.

In Argentina, the official language is Spanish. Specifically, the dialect spoken in Argentina is known as Rioplatense Spanish, which is a variation of Spanish with some unique characteristics and influences from Italian, German, and other immigrant languages.

Distinctive features of Rioplatense Spanish:

1. Voseo: Instead of using “tú” (you) as the informal singular pronoun, Argentines use “vos” and its associated verb conjugations. For example, instead of “tú tienes” (you have), they say “vos tenés.”

2. Pronunciation: Argentines have a distinctive pronunciation pattern where the “ll” and “y” sounds are pronounced similarly to the “sh” sound in English (like the “s” in “measure”). This is known as “sheísmo” or “žeísmo.”

3. Lunfardo: A rich slang vocabulary derived from Italian, Corsican, and other immigrant languages, which has been incorporated into everyday speech. Examples include “laburo” (work), “mina” (woman), and “piola” (cool/nice).

4. Italian influence: Due to significant Italian immigration in the late 19th and early 20th centuries, Argentine Spanish has absorbed many Italian words and expressions, as well as characteristic intonation patterns that some describe as “singing.”

5. Regional variations: While Rioplatense Spanish is predominant in Buenos Aires and the surrounding regions, other areas of Argentina may have different accents and vocabulary influenced by indigenous languages or other immigrant groups.

Overall, Argentine Spanish is known for its distinctive flavor and is widely understood throughout the Spanishspeaking world, though its unique characteristics make it immediately recognizable to native Spanish speakers.

Winner: Ours (INSPO)

Reasoning: All three responses correctly identify Spanish as Argentina’s primary language and mention Rioplatense Spanish and the voseo feature. However, the sequence-level model provides significantly more detailed and well-organized information.

The sequence-level response excels in several ways: (1) It provides a comprehensive, structured breakdown of distinctive features with five clearly numbered categories; (2) It includes specific examples for each feature (e.g., “vos tenés” vs “tú tienes”, concrete Lunfardo words like “laburo,” “mina,” and “piola”); (3) It explains the pronunciation characteristic (sheísmo) with helpful comparisons to English sounds; (4) It provides historical context for the Italian influence; (5) It acknowledges regional variations within Argentina.

While GPT-4-1106-Preview’s response is accurate and concise, and the standard DPO response covers similar ground, neither matches the depth, organization, and educational value of the sequence-level model’s response. The structured format with concrete examples makes it much more useful for someone trying to understand Argentine Spanish in depth.

Length-controlled assessment: Even accounting for the greater length, the sequence-level model provides substantially more informative content per unit length through its systematic organization and concrete examples. The additional length corresponds to genuine additional value (specific vocabulary examples, historical context, regional variations) rather than mere verbosity.

Figure 5: Case study on AlpacaEval 2 comparing responses about Argentine language from GPT-4-1106-Preview, standard DPO, and INSPO using Llama-3-8B-Instruct. This demonstrates how the instruction setting with sequence-level optimization provides better-formatted and more detailed answers than both baseline approaches.

Table 2: Benchmark performance on AlpacaEval 2, Arena-Hard, and MT-Bench for Mistral-Instruct (7B) and Llama-3-Instruct (8B). Shaded rows denote continuation-conditioned variants with deltas computed relative to their corresponding baselines. Bold values indicate the highest performance in each metric. All methods employ DPO-family objectives with identical training configurations. LC: length-controlled win rate; WR: win rate; AE: AlpacaEval 2; AH: Arena-Hard.

Notation: P(y_w ≻ y_ℓ | x) denotes the preference probability; E_{(x, y_w, y_ℓ)∼D}[·] denotes expectation over the data distribution; D_KL denotes the KL divergence.


Reference

This content is AI-processed based on open access ArXiv data.
