A Note on Hybrid Online Reinforcement and Imitation Learning for LLMs: Formulations and Algorithms

Reading time: 10 minutes

📝 Original Info

  • Title: A Note on Hybrid Online Reinforcement and Imitation Learning for LLMs: Formulations and Algorithms
  • ArXiv ID: 2512.23097
  • Date: 2025-12-28
  • Authors: Yingru Li, Ziniu Li, Jiacai Liu

📝 Abstract

We present a unified framework for Large Language Model (LLM) fine-tuning that integrates Imitation Learning and Reinforcement Learning. By analyzing the gradient of a composite objective combining trajectory-level KL divergence with task rewards, we derive a natural decomposition into two components: (1) an analytically computable Dense Gradient for token-level imitation, and (2) a Monte Carlo estimated Sparse Gradient for long-horizon reward optimization. The Dense Gradient admits a closed-form logit-level formula, enabling efficient GPU implementation.

Introduction

Knowledge Distillation (KD) [1] and Reinforcement Learning (RL) are two fundamental approaches to training Large Language Models. Standard KD minimizes divergence on static offline data via teacher forcing, leading to train-inference distribution mismatch (also known as exposure bias or covariate shift) [2]: during training, the student conditions on ground-truth history, while during inference it conditions on its own generations. This mismatch compounds over long sequences, causing errors to accumulate, a phenomenon well studied in online imitation learning [3, 4]. The DAgger algorithm [3] addresses distribution mismatch by iteratively collecting data under the learner's own policy while querying the expert for labels. Recent theoretical work [4] establishes that interactive access to an expert provides provable statistical advantages over passive behavior cloning: under µ-recoverability assumptions, interactive learners achieve suboptimality O(µ|S|H/N), while non-interactive learners suffer Ω(|S|H^2/N), a quadratic-to-linear improvement in horizon dependence. RL addresses the distribution mismatch by training on self-generated data, but introduces new challenges: high variance from sparse, trajectory-level rewards and risks of reward hacking. Recent work on on-policy distillation [5, 6] attempts to bridge these approaches. In this work, we analyze the gradient structure of a unified objective that combines on-policy KL minimization with reward maximization, revealing a natural decomposition that enables efficient implementation.

Contributions.

  1. We derive the exact gradient of trajectory-level KL + reward objectives, proving it decomposes into Dense (analytic) and Sparse (sampled) terms (Theorem 1).
  2. We provide an efficient logit-level gradient formula amenable to GPU implementation (Proposition 5).
  3. We establish mathematical equivalence to KL-regularized RLHF while clarifying interpretational differences (Section 5.2).
  4. We discuss training curriculum implications for the reward weight λ (Section 5.1).

📄 Full Content

Definition 1 (Autoregressive Policy). An LLM policy π generates response y = (y_1, ..., y_T) given prompt x with probability:

π(y | x) = ∏_{t=1}^{T} π(y_t | x, y_{<t}),

where y_{<t} = (y_1, ..., y_{t-1}) and each π(· | x, y_{<t}) is a distribution over the vocabulary V.

Definition 2 (Instantaneous Divergence Cost). At step t, the log-ratio between student π_θ and reference π_ref is:

c_t = log π_θ(y_t | x, y_{<t}) - log π_ref(y_t | x, y_{<t}).

We minimize a cost combining distribution matching with reward maximization:

J(θ) = E_{x~D} [ D_KL( π_θ(·|x) || π_ref(·|x) ) - λ E_{y~π_θ(·|x)} [ r(x, y) ] ],

where λ ≥ 0 controls the trade-off and r(x, y) is a black-box reward.

Using the autoregressive factorization, the trajectory KL decomposes as:

D_KL( π_θ(·|x) || π_ref(·|x) ) = E_{y~π_θ(·|x)} [ Σ_{t=1}^{T} c_t ].

Thus the objective becomes:

J(θ) = E_{x~D, y~π_θ(·|x)} [ Σ_{t=1}^{T} c_t - λ r(x, y) ].

Remark 1 (Connection to Online Imitation Learning). Our framework can be viewed as a continuous relaxation of DAgger [3]. While DAgger iteratively collects data under the learner’s distribution and trains via supervised learning, we directly optimize a trajectory-level objective under the learner’s distribution.

The Dense Term provides the imitation signal (analogous to DAgger’s supervised loss), while on-policy sampling addresses the distribution mismatch.

We now state our main theoretical results. All proofs are deferred to Appendix A.

Definition 3 (Future Return). The future return from step t + 1 is:

G_{t+1} = Σ_{k=t+1}^{T} c_k - λ r(x, y).

This captures future divergence costs plus the (negative) terminal reward.

Theorem 1 (Gradient Decomposition). The gradient of objective (3) decomposes, step by step, as:

∇_θ J(θ) = E [ Σ_{t=1}^{T} c_t ∇_θ log π_θ(y_t | x, y_{<t}) ]          Dense Term
         + E [ Σ_{t=1}^{T} G_{t+1} ∇_θ log π_θ(y_t | x, y_{<t}) ],     Sparse Term (7)

where the expectation is over x ~ D and y ~ π_θ(·|x).

The proof relies on two key lemmas:

Lemma 2 (Vanishing Score Function). For any context (x, y_{<t}):

E_{y_t ~ π_θ(·|x, y_{<t})} [ ∇_θ log π_θ(y_t | x, y_{<t}) ] = 0.

Lemma 3 (Causality). For any k < t, the past cost c_k contributes nothing to the score at step t:

E_{y ~ π_θ(·|x)} [ c_k ∇_θ log π_θ(y_t | x, y_{<t}) ] = 0.

The Dense Term has a remarkable closed-form property:

Proposition 4 (Dense Term Equals Token-Level KL Gradient). For every context (x, y_{<t}):

E_{y_t ~ π_θ(·|x, y_{<t})} [ c_t ∇_θ log π_θ(y_t | x, y_{<t}) ] = ∇_θ D_KL( π_θ(·|x, y_{<t}) || π_ref(·|x, y_{<t}) ).

This means the Dense Term can be computed analytically by summing over the finite vocabulary V, without Monte Carlo sampling.
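For intuition, here is a minimal NumPy sketch (the toy vocabulary size, sample count, and helper names are illustrative assumptions, not part of the paper) contrasting the exact vocabulary sum with a naive Monte Carlo estimate of the same token-level KL: the sum over V is exact, while the sampled estimate is noisy.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
V = 32                                   # toy vocabulary size
p = softmax(rng.normal(size=V))          # student distribution pi_theta(.|x, y_<t)
q = softmax(rng.normal(size=V))          # teacher distribution pi_ref(.|x, y_<t)

# Dense signal: exact sum over the vocabulary, no sampling.
kl_exact = np.sum(p * (np.log(p) - np.log(q)))

# Monte Carlo estimate of the same quantity: E_{y_t ~ p}[c_t] with c_t = log p - log q.
samples = rng.choice(V, size=1000, p=p)
kl_mc = np.mean(np.log(p[samples]) - np.log(q[samples]))

print(f"exact KL = {kl_exact:.4f}")
print(f"MC KL    = {kl_mc:.4f}  (noisy; converges only as the sample count grows)")
```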

Proposition 5 (Logit-Level Gradient). Let p = softmax(z) be the student distribution and q the teacher distribution. The gradient of KL with respect to the logits z is:

∇_z D_KL(p || q) = p ⊙ ( log p - log q - D_KL(p || q) · 1 ),    (11)

where ⊙ denotes element-wise multiplication.

This formula enables efficient GPU implementation for both full-vocabulary and Top-K computation.
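As a quick sanity check of Equation (11), the following NumPy sketch (illustrative only; the finite-difference check and variable names are ours) compares the closed-form gradient against central finite differences of D_KL with respect to the logits.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def kl(z, q):
    """D_KL(softmax(z) || q) as a function of the student logits z."""
    p = softmax(z)
    return np.sum(p * (np.log(p) - np.log(q)))

rng = np.random.default_rng(1)
V = 16
z = rng.normal(size=V)                 # student logits
q = softmax(rng.normal(size=V))        # teacher distribution

# Closed-form gradient from Proposition 5.
p = softmax(z)
grad_analytic = p * (np.log(p) - np.log(q) - kl(z, q))

# Central finite differences as an independent check.
eps = 1e-6
grad_fd = np.array([
    (kl(z + eps * np.eye(V)[i], q) - kl(z - eps * np.eye(V)[i], q)) / (2 * eps)
    for i in range(V)
])

print("max abs difference:", np.max(np.abs(grad_analytic - grad_fd)))  # should be ~1e-9 or smaller
```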

To control the variance-bias trade-off, we introduce a discount factor γ ∈ [0, 1] in the future return:

G^γ_{t+1} = Σ_{k=t+1}^{T} γ^{k-t} c_k - λ r(x, y).

Special cases:

  • γ = 0: G^γ_{t+1} = -λ r(x, y); the Sparse Term carries only the terminal reward, and each step's divergence cost is handled entirely by the Dense Term.
  • γ = 1: G^γ_{t+1} = G_{t+1}; full, undiscounted credit assignment over all future divergence costs plus the reward.

Remark 2 (Discounting Convention). The exponent γ^{k-t} (rather than γ^{k-t-1}) means the immediate next step c_{t+1} is discounted by γ relative to the current cost c_t. This provides a natural interpolation: γ = 0 isolates each step from all future KL costs, while γ = 1 gives full trajectory credit assignment.
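A minimal sketch of the discounted future return under the γ^{k-t} convention (function and variable names are illustrative; an actual implementation may differ):

```python
import numpy as np

def discounted_future_returns(costs, reward, gamma, lam):
    """
    costs:  per-token log-ratios c_1..c_T (c_t = log pi_theta - log pi_ref)
    reward: scalar terminal reward r(x, y)
    Returns G^gamma_{t+1} for t = 1..T, i.e. the coefficient multiplying
    grad log pi_theta(y_t | x, y_<t) in the Sparse Term.
    Uses the gamma^{k-t} convention of Remark 2: the next cost c_{t+1}
    already carries one factor of gamma.
    """
    T = len(costs)
    G = np.zeros(T)
    future = 0.0                        # sum_{k>t} gamma^{k-t} c_k, built right to left
    for t in reversed(range(T)):
        G[t] = future - lam * reward    # the reward enters undiscounted at every step
        if t > 0:
            future = gamma * (costs[t] + future)
    return G

# Example: three tokens; gamma = 0 keeps only the reward in the Sparse signal.
print(discounted_future_returns(np.array([0.2, -0.1, 0.4]), reward=1.0, gamma=0.0, lam=0.5))
# -> [-0.5 -0.5 -0.5]
```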

Algorithm (Hybrid Dense + Sparse Gradient Update).

  for each training iteration do
      Sample a prompt x ~ D and a group of K responses y^(1), ..., y^(K) ~ π_θ(·|x)
      Initialize gradient accumulator g ← 0
      for each response i, step t do
          c_t^(i) ← log π_θ(y_t^(i) | x, y^(i)_{<t}) - log π_ref(y_t^(i) | x, y^(i)_{<t})
      end for
      Compute rewards: R^(i) ← r(x, y^(i))
      for each response i, step t do
          G^(i)_{t+1} ← Σ_{k>t} γ^{k-t} c_k^(i) - λ R^(i)
      end for
      for each response i, step t do
          Dense:  g_dense ← ∇_θ D_KL( π_θ(·|x, y^(i)_{<t}) || π_ref(·|x, y^(i)_{<t}) )   (via Proposition 5)
          Sparse: g_sparse ← G^(i)_{t+1} · ∇_θ log π_θ(y_t^(i) | x, y^(i)_{<t})
          g ← g + g_dense + g_sparse   // Sum over all (i, t)
      end for
      θ ← θ - η · g / K   // Average over group
  end for

Remark 3 (Compatibility with Group Rollout Infrastructure). The algorithm is fully compatible with standard group rollout RL infrastructure. The group size K corresponds to the number of responses sampled per prompt in existing implementations (e.g., GRPO, RLOO). The only additions are: (1) computing token-level log-ratios c_t against the teacher, and (2) accumulating the Dense gradient alongside the standard policy gradient. Both operations integrate naturally into existing training pipelines.
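A minimal PyTorch-style sketch of how one response's Dense and Sparse contributions can be folded into a single surrogate loss (the function name, tensor shapes, and the surrogate-loss formulation are illustrative assumptions, not a reference implementation):

```python
import torch
import torch.nn.functional as F

def hybrid_loss(student_logits, teacher_logits, token_ids, G):
    """
    One response's contribution to the hybrid objective.
      student_logits: [T, V] logits of pi_theta at each step (requires grad)
      teacher_logits: [T, V] logits of pi_ref
      token_ids:      [T]    sampled tokens y_t (long dtype)
      G:              [T]    discounted future returns G^gamma_{t+1}
    Backprop through the first term yields the Dense gradient (Proposition 5);
    the second term is a REINFORCE-style surrogate whose gradient is the Sparse Term.
    """
    log_p = F.log_softmax(student_logits, dim=-1)           # [T, V]
    log_q = F.log_softmax(teacher_logits, dim=-1).detach()  # [T, V]

    # Dense: token-level KL(pi_theta || pi_ref), exact sum over the vocabulary.
    dense = (log_p.exp() * (log_p - log_q)).sum(dim=-1).sum()

    # Sparse: G_{t+1} * log pi_theta(y_t | x, y_<t), with G treated as a constant coefficient.
    chosen_log_p = log_p.gather(-1, token_ids.unsqueeze(-1)).squeeze(-1)  # [T]
    sparse = (G.detach() * chosen_log_p).sum()

    return dense + sparse
```

Summing this loss over the K responses of a group and scaling by 1/K before backpropagation matches the g/K averaging in the update above.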

The dual interpretation of our framework suggests a natural training curriculum:

• Early training (λ small, equivalently β large): Prioritize imitation. Dense Term dominates, providing stable gradients toward teacher behavior (“cold start”).

• Late training (λ large, equivalently β small): Prioritize reward. Sparse Term becomes influential, enabling discovery of reward-maximizing behaviors beyond the teacher.

A simple schedule is λ(t) = λ_0 (1 + α t), where t is the training step.

In the limit λ → ∞, the framework approaches pure RL, but the Dense Term provides ongoing regularization toward the expert.
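A small numeric illustration of the linear λ schedule and the implied RLHF-style β = 1/λ (the values of λ_0 and α below are placeholders, not recommendations):

```python
def lambda_schedule(step, lam0=0.1, alpha=1e-4):
    """Linear ramp lambda(t) = lambda_0 * (1 + alpha * t) from Section 5.1."""
    return lam0 * (1.0 + alpha * step)

def beta_from_lambda(lam):
    """Equivalent RLHF regularization strength beta = 1 / lambda (Section 5.2)."""
    return 1.0 / lam

# Early in training beta is large (imitation-dominated); it shrinks as lambda grows.
for step in (0, 10_000, 100_000):
    lam = lambda_schedule(step)
    print(step, round(lam, 4), round(beta_from_lambda(lam), 2))
```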

Our objective (3) is mathematically equivalent to KL-regularized RLHF:

max_θ E_{x~D, y~π_θ(·|x)} [ r(x, y) ] - β E_{x~D} [ D_KL( π_θ(·|x) || π_ref(·|x) ) ],

with β = 1/λ. This follows by negating and rearranging: maximizing E[r] - β · KL is the same as minimizing KL - λ · E[r], up to the positive scaling factor 1/λ. While mathematically identical, the two frameworks differ in interpretation. The theoretical results of [4] on the value of interaction suggest that our on-policy approach should provide statistical advantages over offline distillation, particularly for long-horizon tasks where distribution mismatch is severe.

Divergence-Augmented Policy Optimization (DAPO) [8] also incorporates divergence terms into policy optimization. DAPO adds a Bregman divergence between the behavior policy (which generated the training data) and the current policy to stabilize off-policy learning: the divergence controls the degree of “off-policyness” to ensure safe policy updates when reusing old data.

Common ground. Both frameworks operate on divergences over state-action occupancy measures. In the autoregressive LLM setting, sequence-level KL is equivalent to KL over the state-action occupancy: the “state” is the context (x, y_{<t}) and the “action” is the next token y_t. The trajectory distribution π(y | x) = ∏_t π(y_t | x, y_{<t}) induces a state-action occupancy, and our trajectory-level KL decomposes accordingly. DAPO similarly computes its Bregman divergence over state-action distributions rather than action probabilities alone.

Key differences. Despite operating on the same mathematical objects, the frameworks differ in purpose:

|                     | Our Framework                 | DAPO                             |
|---------------------|-------------------------------|----------------------------------|
| Reference policy    | Expert/teacher π_ref          | Behavior policy π_b              |
| Divergence purpose  | Imitation (attract to expert) | Stability (bound off-policyness) |
| Data regime         | On-policy                     | Off-policy                       |

Our KL term drives the policy toward a reference expert, while DAPO's divergence term prevents the policy from drifting too far from the data-generating behavior policy. Our on-policy setting eliminates the need for the importance weights that DAPO requires for off-policy correction.

Different choices of γ and λ recover existing approaches:

| Setting                         | γ | λ   | Gradient terms            |
|---------------------------------|---|-----|---------------------------|
| Standard SFT/KD                 | 0 | 0   | Dense only                |
| On-policy distillation + reward | 0 | > 0 | Dense + reward via Sparse |
| Full trajectory KL + reward     | 1 | > 0 | Dense + full Sparse       |
| Pure RL (no KL)                 | - | > 0 | Sparse only               |

The framework naturally extends to multiple teachers and rewards.

Multiple Teachers. Given teachers {π_ref^(m)}_{m=1}^{M} with weights α_m ≥ 0, the per-step divergence cost becomes a weighted sum of log-ratios:

c_t = Σ_{m=1}^{M} α_m [ log π_θ(y_t | x, y_{<t}) - log π_ref^(m)(y_t | x, y_{<t}) ].

Multiple Rewards. Given rewards {r_n}_{n=1}^{N} with weights λ_n, the reward term becomes Σ_{n=1}^{N} λ_n r_n(x, y).
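A minimal sketch of the multi-teacher Dense signal at a single position, assuming the weighted-sum form above (function names and array shapes are illustrative):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multi_teacher_dense_kl(student_logits, teacher_logits_list, alphas):
    """
    Weighted Dense signal for one position:
        sum_m alpha_m * KL( pi_theta(.|x, y_<t) || pi_ref^(m)(.|x, y_<t) ).
    student_logits:      [V]
    teacher_logits_list: list of [V] arrays, one per teacher
    alphas:              non-negative weights alpha_m
    """
    p = softmax(student_logits)
    total = 0.0
    for alpha, t_logits in zip(alphas, teacher_logits_list):
        q = softmax(t_logits)
        total += alpha * np.sum(p * (np.log(p) - np.log(q)))
    return total
```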

We derived a principled gradient decomposition for hybrid imitation-reinforcement learning in LLMs:

  1. The gradient splits into Dense (analytic, zero sampling variance) and Sparse (Monte Carlo, captures long-horizon effects) terms.

  2. The logit-level formula (11) enables efficient GPU implementation.

  3. Mathematical equivalence to RLHF (β = 1/λ) is established, connecting to the broader literature on KL-regularized policy optimization.

  4. The framework suggests natural training curricula via λ scheduling, transitioning from imitation-focused to reward-focused learning.

Future work includes empirical validation and analysis of variance-bias trade-offs for different γ values.

A.1 Proof of Lemma 2 (Vanishing Score Function)

Proof. For any distribution π_θ(·|x, y_{<t}) over the finite vocabulary V:

E_{y_t ~ π_θ(·|x, y_{<t})} [ ∇_θ log π_θ(y_t | x, y_{<t}) ] = Σ_{y_t ∈ V} π_θ(y_t | x, y_{<t}) ∇_θ log π_θ(y_t | x, y_{<t})
    = Σ_{y_t ∈ V} ∇_θ π_θ(y_t | x, y_{<t})
    = ∇_θ Σ_{y_t ∈ V} π_θ(y_t | x, y_{<t}) = ∇_θ 1 = 0.

The key step uses that probabilities sum to 1, which is constant with respect to θ.

A.2 Proof of Lemma 3 (Causality)

Proof. For k < t, the cost c_k depends only on tokens y_1, ..., y_k. We decompose the expectation using the tower property:

E_y [ c_k ∇_θ log π_θ(y_t | x, y_{<t}) ] = E_{y_{<t}} [ c_k E_{y_t ~ π_θ(·|x, y_{<t})} [ ∇_θ log π_θ(y_t | x, y_{<t}) ] ] = 0.

Since c_k is measurable with respect to y_{<t} for k < t, it factors out of the inner expectation, and the inner expectation vanishes by Lemma 2.

A.3 Proof of Theorem 1 (Gradient Decomposition)

Proof. Define the total cost C(y) = Σ_{t=1}^{T} c_t - λ r(x, y).

Step 1: Apply the REINFORCE identity [7]. For any function f(y):

∇_θ E_{y ~ π_θ(·|x)} [ f(y) ] = E_y [ f(y) ∇_θ log π_θ(y | x) ] + E_y [ ∇_θ f(y) ].

Applying this to C(y):

∇_θ J(θ) = E_y [ C(y) ∇_θ log π_θ(y | x) ] + E_y [ ∇_θ C(y) ].

Step 2: Show E_y[∇_θ C(y)] = 0. Since r and π_ref do not depend on θ:

∇_θ C(y) = Σ_{t=1}^{T} ∇_θ c_t = Σ_{t=1}^{T} ∇_θ log π_θ(y_t | x, y_{<t}).

Taking expectations and using iterated conditioning:

E_y [ ∇_θ log π_θ(y_t | x, y_{<t}) ] = E_{y_{<t}} [ E_{y_t ~ π_θ(·|x, y_{<t})} [ ∇_θ log π_θ(y_t | x, y_{<t}) ] ] = 0

by Lemma 2. Thus E_y[∇_θ C(y)] = 0.

Step 3: Expand ∇_θ log π_θ(y | x).

By the chain rule on the autoregressive factorization:

∇_θ log π_θ(y | x) = Σ_{t=1}^{T} ∇_θ log π_θ(y_t | x, y_{<t}).

Step 4: Apply causality to isolate per-step contributions.

The gradient contribution at step t is:

E_y [ C(y) ∇_θ log π_θ(y_t | x, y_{<t}) ].

Decompose C(y) = Σ_{k<t} c_k + c_t + G_{t+1}, where G_{t+1} = Σ_{k>t} c_k - λ r(x, y). By Lemma 3, the past costs (k < t) contribute zero:

E_y [ ( Σ_{k<t} c_k ) ∇_θ log π_θ(y_t | x, y_{<t}) ] = 0.

Thus:

E_y [ C(y) ∇_θ log π_θ(y_t | x, y_{<t}) ] = E_y [ ( c_t + G_{t+1} ) ∇_θ log π_θ(y_t | x, y_{<t}) ].

Linearity of expectation gives the Dense (c_t) + Sparse (G_{t+1}) decomposition.

A.5 Proof of Proposition 5 (Logit-Level Gradient)

Proof. Let p = softmax(z), where p_i = e^{z_i} / Σ_j e^{z_j}.

Step 1: Compute the softmax Jacobian. Taking the derivative:

∂p_i / ∂z_j = p_i ( δ_ij - p_j ),

where δ_ij is the Kronecker delta.

Step 2: Apply the chain rule for KL.

Since D_KL = Σ_i p_i ( log p_i - log q_i ):

∂D_KL / ∂p_i = log p_i - log q_i + 1.

Step 3: Combine using the Jacobian.

∂D_KL / ∂z_j = Σ_i ( ∂D_KL / ∂p_i ) ( ∂p_i / ∂z_j )
             = Σ_i ( log p_i - log q_i + 1 ) p_i ( δ_ij - p_j )
             = p_j ( log p_j - log q_j + 1 ) - p_j Σ_i p_i ( log p_i - log q_i + 1 )
             = p_j ( log p_j - log q_j + 1 ) - p_j ( D_KL + 1 )
             = p_j ( log p_j - log q_j - D_KL ).

In vector form: ∇_z D_KL = p ⊙ ( log p - log q - D_KL · 1 ).

Using Equation (11), the dense gradient can be computed as follows (pseudocode):

    def dense_gradient_full(student_logits, teacher_logits):
        p = softmax(student_logits)              # student distribution
        q = softmax(teacher_logits)              # teacher distribution
        kl = sum(p * (log(p) - log(q)))          # token-level KL(p || q)
        grad_z = p * (log(p) - log(q) - kl)      # Equation (11)
        return grad_z

Complexity: O(|V|) per position. The element-wise operations are amenable to GPU parallelization (e.g., via custom CUDA or Triton kernels). Note: Both methods require O(|V|) for the initial softmax computation. The savings apply to the gradient tensor: Top-K produces a sparse gradient with only K non-zero entries. For |V| ≈ 128,000 and K = 32, this reduces gradient storage and downstream computation by ∼4000×.
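One possible Top-K variant of the pseudocode above (a sketch under the assumption that only the K largest-probability student entries receive non-zero gradient; the exact truncation rule is not specified in the text):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def dense_gradient_topk(student_logits, teacher_logits, k=32):
    """
    Sparse approximation of Equation (11): keep only the K entries with the
    largest student probability and zero out the rest of the gradient.
    Returns (indices, values) describing a K-sparse gradient over the vocabulary.
    """
    p = softmax(student_logits)                 # the O(|V|) softmax is still required
    q = softmax(teacher_logits)
    kl = np.sum(p * (np.log(p) - np.log(q)))    # full KL, computed once
    top = np.argpartition(p, -k)[-k:]           # indices of the K largest p_i
    values = p[top] * (np.log(p[top]) - np.log(q[top]) - kl)
    return top, values
```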
