We present a unified framework for Large Language Model (LLM) fine-tuning that integrates Imitation Learning and Reinforcement Learning. By analyzing the gradient of a composite objective combining trajectory-level KL divergence with task rewards, we derive a natural decomposition into two components: (1) an analytically computable Dense Gradient for token-level imitation, and (2) a Monte Carlo estimated Sparse Gradient for long-horizon reward optimization. The Dense Gradient admits a closed-form logit-level formula, enabling efficient GPU implementation.
Introduction
Knowledge Distillation (KD) [1] and Reinforcement Learning (RL) are two fundamental approaches to training Large Language Models. Standard KD minimizes divergence on static offline data via teacher forcing, leading to train-inference distribution mismatch (also known as exposure bias or covariate shift) [2]: during training, the student conditions on ground-truth history, while during inference it conditions on its own generations. This mismatch compounds over long sequences, causing errors to accumulate, a phenomenon well studied in online imitation learning [3, 4]. The DAgger algorithm [3] addresses distribution mismatch by iteratively collecting data under the learner's own policy while querying the expert for labels. Recent theoretical work [4] establishes that interactive access to an expert provides provable statistical advantages over passive behavior cloning: under µ-recoverability assumptions, interactive learners achieve suboptimality O(µ|S|H/N), while non-interactive learners suffer Ω(|S|H^2/N), a quadratic-to-linear improvement in horizon dependence. RL addresses the distribution mismatch by training on self-generated data, but introduces new challenges: high variance from sparse, trajectory-level rewards and risks of reward hacking. Recent work on on-policy distillation [5, 6] attempts to bridge these approaches. In this work, we analyze the gradient structure of a unified objective that combines on-policy KL minimization with reward maximization, revealing a natural decomposition that enables efficient implementation.
Contributions.
1. We derive the exact gradient of trajectory-level KL + reward objectives, proving it decomposes into a Dense (analytic) and a Sparse (sampled) term (Theorem 1).
2. We provide an efficient logit-level gradient formula amenable to GPU implementation (Proposition 5).
3. We establish mathematical equivalence to KL-regularized RLHF while clarifying interpretational differences (Section 5.2).
4. We discuss training-curriculum implications for the reward weight λ (Section 5.1).
Definition 1 (Autoregressive Policy). An LLM policy π generates response y = (y_1, ..., y_T) given prompt x with probability:

π(y | x) = ∏_{t=1}^{T} π(y_t | x, y_{<t}),

where y_{<t} = (y_1, ..., y_{t-1}) and each π(· | x, y_{<t}) is a distribution over the vocabulary V.
Definition 2 (Instantaneous Divergence Cost). At step t, the log-ratio between student π_θ and reference π_ref is:

c_t = log π_θ(y_t | x, y_{<t}) - log π_ref(y_t | x, y_{<t}).
We minimize a cost combining distribution matching with reward maximization:

J(θ) = E_{x∼D}[ D_KL(π_θ(· | x) ‖ π_ref(· | x)) ] - λ E_{x∼D} E_{y∼π_θ(·|x)}[ r(x, y) ],
where λ ≥ 0 controls the trade-off and r(x, y) is a black-box reward.
Using the autoregressive factorization, the trajectory KL decomposes as:

D_KL(π_θ(· | x) ‖ π_ref(· | x)) = E_{y∼π_θ(·|x)}[ log π_θ(y | x) - log π_ref(y | x) ] = E_{y∼π_θ(·|x)}[ ∑_{t=1}^{T} c_t ].
Thus the objective becomes:

J(θ) = E_{x∼D} E_{y∼π_θ(·|x)}[ ∑_{t=1}^{T} c_t - λ r(x, y) ].
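For concreteness, the following minimal Python sketch (our own illustration; function and variable names are not from any released implementation) shows how a Monte Carlo estimate of this objective could be assembled from per-token log-probabilities and rewards of sampled responses.

import numpy as np

def objective_estimate(student_logps, ref_logps, rewards, lam):
    """Monte Carlo estimate of J(theta) = E[ sum_t c_t - lam * r ].

    student_logps, ref_logps: lists of 1-D arrays, one per sampled response,
        holding log pi_theta(y_t | x, y_<t) and log pi_ref(y_t | x, y_<t).
    rewards: list of scalar rewards r(x, y), one per response.
    lam: reward weight lambda >= 0.
    """
    costs = []
    for s_lp, r_lp, rew in zip(student_logps, ref_logps, rewards):
        c = s_lp - r_lp                     # per-token divergence costs c_t
        costs.append(c.sum() - lam * rew)   # trajectory cost C(y)
    return float(np.mean(costs))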
Remark 1 (Connection to Online Imitation Learning). Our framework can be viewed as a continuous relaxation of DAgger [3]. While DAgger iteratively collects data under the learner’s distribution and trains via supervised learning, we directly optimize a trajectory-level objective under the learner’s distribution.
The Dense Term provides the imitation signal (analogous to DAgger’s supervised loss), while on-policy sampling addresses the distribution mismatch.
We now state our main theoretical results. All proofs are deferred to Appendix A.
Definition 3 (Future Return). The future return from step t + 1 is:

G_{t+1} = ∑_{k=t+1}^{T} c_k - λ r(x, y).
This captures future divergence costs plus the (negative) terminal reward.
Theorem 1 (Gradient Decomposition). The gradient of objective (3) at step t decomposes as:

E[ c_t ∇_θ log π_θ(y_t | x, y_{<t}) ]          (Dense Term)
+ E[ G_{t+1} ∇_θ log π_θ(y_t | x, y_{<t}) ]    (Sparse Term)

where the expectation is over x ∼ D and y ∼ π_θ(·|x), and the full gradient ∇_θ J(θ) is the sum of these per-step contributions over t = 1, ..., T.
The proof relies on two key lemmas:
Lemma 2 (Vanishing Score Function). For any context (x, y_{<t}):

E_{y_t∼π_θ(·|x, y_{<t})}[ ∇_θ log π_θ(y_t | x, y_{<t}) ] = 0.

Lemma 3 (Causality of Past Costs). For any k < t:

E_{y∼π_θ(·|x)}[ c_k ∇_θ log π_θ(y_t | x, y_{<t}) ] = 0.
The Dense Term has a remarkable closed-form property:
Proposition 4 (Dense Term Equals Token-Level KL Gradient).

E_{y_t∼π_θ(·|x, y_{<t})}[ c_t ∇_θ log π_θ(y_t | x, y_{<t}) ] = ∇_θ D_KL( π_θ(· | x, y_{<t}) ‖ π_ref(· | x, y_{<t}) ).
This means the Dense Term can be computed analytically by summing over the finite vocabulary V, without Monte Carlo sampling.
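As a sanity check of this property at the logit level, the toy Python snippet below (our own illustration; the vocabulary size and sample count are arbitrary) compares the analytic vocabulary sum with a Monte Carlo estimate of E[c_t ∇_z log π_θ(y_t)]; the two agree up to sampling noise.

import numpy as np

rng = np.random.default_rng(0)
V = 8  # toy vocabulary size

p = np.exp(rng.normal(size=V)); p /= p.sum()   # student distribution
q = np.exp(rng.normal(size=V)); q /= q.sum()   # teacher distribution
c = np.log(p) - np.log(q)                      # per-token costs c_t(v)

eye = np.eye(V)
# Analytic Dense term: explicit sum over the vocabulary, using d/dz log p_v = e_v - p.
dense_analytic = sum(p[v] * c[v] * (eye[v] - p) for v in range(V))

# Monte Carlo estimate: sample tokens from the student, average c * score.
samples = rng.choice(V, size=200_000, p=p)
dense_mc = (c[samples, None] * (eye[samples] - p)).mean(axis=0)

print(np.max(np.abs(dense_analytic - dense_mc)))  # small; shrinks with more samples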
Proposition 5 (Logit-Level Gradient). Let p = softmax(z) be the student distribution and q the teacher distribution. The gradient of D_KL(p ‖ q) with respect to the logits z is:

∇_z D_KL(p ‖ q) = p ⊙ (log p - log q - D_KL(p ‖ q) · 1),

where ⊙ denotes element-wise multiplication and 1 is the all-ones vector.
This formula enables efficient GPU implementation for both full-vocabulary and Top-K computation.
To control the variance-bias trade-off, we introduce a discount factor γ ∈ [0, 1] in the future return:

G_{t+1}^γ = ∑_{k=t+1}^{T} γ^{k-t} c_k - λ r(x, y).

Special cases: γ = 0 removes all future divergence costs from the Sparse Term, leaving only the reward signal, while γ = 1 recovers the undiscounted return G_{t+1} of Definition 3.
Remark 2 (Discounting Convention). The exponent γ^{k-t} (rather than γ^{k-t-1}) means the immediate next step c_{t+1} is discounted by γ relative to the current cost c_t. This provides a natural interpolation: γ = 0 isolates each step from all future KL costs, while γ = 1 gives full trajectory credit assignment.
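A small backward-pass sketch (our notation; costs[i] holds c_{i+1} for a single response) of how the discounted future returns under this convention could be computed:

import numpy as np

def discounted_future_returns(costs, reward, lam, gamma):
    """Compute G_{t+1}^gamma = sum_{k>t} gamma^(k-t) * c_k - lam * reward
    for every step t of one response. Returns G with G[t-1] = G_{t+1}^gamma."""
    T = len(costs)
    G = np.zeros(T)
    future = 0.0  # running sum_{k>t} gamma^(k-t) * c_k
    for i in reversed(range(T)):          # i corresponds to step t = i + 1
        G[i] = future - lam * reward
        future = gamma * (costs[i] + future)
    return G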
Algorithm (one training iteration). Sample a prompt x ∼ D and K responses y^(1), ..., y^(K) ∼ π_θ(·|x), then:

1. Initialize the gradient accumulator g ← 0.
2. For each response i and step t: compute the token-level log-ratio c_t^(i) = log π_θ(y_t^(i) | x, y_{<t}^(i)) - log π_ref(y_t^(i) | x, y_{<t}^(i)).
3. Compute rewards: R^(i) ← r(x, y^(i)).
4. For each response i and step t: compute the discounted future return G_{t+1}^(i) = ∑_{k>t} γ^{k-t} c_k^(i) - λ R^(i).
5. For each response i and step t:
   Dense: g_dense ← ∇_θ D_KL(π_θ(· | x, y_{<t}^(i)) ‖ π_ref(· | x, y_{<t}^(i))), computed analytically via Proposition 5.
   Sparse: g_sparse ← G_{t+1}^(i) ∇_θ log π_θ(y_t^(i) | x, y_{<t}^(i)).
   Accumulate: g ← g + g_dense + g_sparse (summed over all (i, t)).
6. Update: θ ← θ - η · g/K (average over the group).

Remark 3 (Compatibility with Group Rollout Infrastructure). The algorithm is fully compatible with standard group rollout RL infrastructure. The group size K corresponds to the number of responses sampled per prompt in existing implementations (e.g., GRPO, RLOO). The only additions are: (1) computing token-level log-ratios c_t against the teacher, and (2) accumulating the Dense gradient alongside the standard policy gradient. Both operations integrate naturally into existing training pipelines.
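For readers who prefer autodiff-style implementations, one way the per-response update could be expressed is as a surrogate loss whose gradient reproduces the Dense + Sparse decomposition. The PyTorch-style sketch below is our own illustration (function and argument names are assumptions, not the paper's code).

import torch
import torch.nn.functional as F

def hybrid_surrogate_loss(student_logits, teacher_logits, tokens, reward, lam, gamma):
    """Surrogate loss for one sampled response; minimizing it performs gradient
    descent on J, with the gradient matching the Dense + Sparse decomposition.

    student_logits: (T, V) logits of pi_theta along the sampled response.
    teacher_logits: (T, V) logits of pi_ref (treated as constant).
    tokens: (T,) LongTensor of sampled token ids y_t.
    """
    log_p = F.log_softmax(student_logits, dim=-1)            # log pi_theta(. | x, y_<t)
    log_q = F.log_softmax(teacher_logits, dim=-1).detach()   # log pi_ref, no gradient

    # Dense part: token-level KL, computed analytically over the vocabulary.
    dense = (log_p.exp() * (log_p - log_q)).sum(dim=-1)      # (T,)

    # Per-token costs and discounted future returns (no gradient through them).
    with torch.no_grad():
        c = (log_p - log_q).gather(-1, tokens[:, None]).squeeze(-1)  # (T,)
        G = torch.zeros_like(c)
        future = 0.0
        for t in reversed(range(c.shape[0])):
            G[t] = future - lam * reward
            future = gamma * (c[t] + future)

    # Sparse part: REINFORCE-style surrogate with stop-gradient coefficients.
    logp_taken = log_p.gather(-1, tokens[:, None]).squeeze(-1)       # (T,)
    sparse = G * logp_taken

    return (dense + sparse).sum()

The stop-gradient around c_t and G_{t+1} ensures that only the analytic KL path and the score-function path carry gradient, matching Theorem 1.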
The dual interpretation of our framework suggests a natural training curriculum:
• Early training (λ small, equivalently β large): Prioritize imitation. Dense Term dominates, providing stable gradients toward teacher behavior (“cold start”).
• Late training (λ large, equivalently β small): Prioritize reward. Sparse Term becomes influential, enabling discovery of reward-maximizing behaviors beyond the teacher.
A simple schedule: λ(t) = λ_0 (1 + αt), where t is the training step.
In the limit λ → ∞, the framework approaches pure RL, but the Dense Term provides ongoing regularization toward the expert.
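A minimal sketch of such a schedule (the constants below are placeholders, not recommended values):

def lambda_schedule(step, lam0=0.1, alpha=1e-4):
    """Linear reward-weight schedule lambda(t) = lambda_0 * (1 + alpha * t).

    Early steps keep lambda small (imitation-dominated "cold start");
    later steps increase lambda so the Sparse reward term gains influence.
    """
    return lam0 * (1.0 + alpha * step)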
Our objective (3) is mathematically equivalent to KL-regularized RLHF:

max_θ E_{x∼D} E_{y∼π_θ(·|x)}[ r(x, y) ] - β D_KL(π_θ(· | x) ‖ π_ref(· | x)),

with β = 1/λ. This follows by negating and rearranging: minimizing E[ ∑_t c_t - λ r(x, y) ] is equivalent to maximizing E[ λ r(x, y) ] - D_KL(π_θ(· | x) ‖ π_ref(· | x)), and dividing by λ gives max_θ E[ r(x, y) ] - (1/λ) D_KL(π_θ(· | x) ‖ π_ref(· | x)). While mathematically identical, the frameworks differ in interpretation: our formulation treats the KL term as the primary imitation signal toward a teacher, with the reward as a weighted correction, whereas KL-regularized RLHF treats reward maximization as primary and the KL term as a regularizer. The theoretical results of [4] on the value of interaction suggest that our on-policy approach should provide statistical advantages over offline distillation, particularly for long-horizon tasks where distribution mismatch is severe.
Divergence-Augmented Policy Optimization (DAPO) [8] also incorporates divergence terms into policy optimization. DAPO adds a Bregman divergence between the behavior policy (which generated the training data) and the current policy to stabilize off-policy learning: the divergence controls the degree of "off-policyness" to ensure safe policy updates when reusing old data.
Common ground. Both frameworks operate on divergence over state-action occupancy measures. In the autoregressive LLM setting, sequence-level KL is equivalent to KL over state-action occupancy: the "state" is the context (x, y_{<t}) and the "action" is the next token y_t. The trajectory distribution π(y | x) = ∏_t π(y_t | x, y_{<t}) induces a state-action occupancy, and our trajectory-level KL decomposes accordingly. DAPO similarly computes Bregman divergence over state-action distributions rather than action probabilities alone.
Key differences. Despite operating on the same mathematical objects, the frameworks differ in purpose:
                      Our Framework                   DAPO
Reference policy      Expert/teacher π_ref            Behavior policy π_b
Divergence purpose    Imitation (attract to expert)   Stability (bound off-policyness)
Data regime           On-policy                       Off-policy

Our KL term drives the policy toward a reference expert, while DAPO's divergence term prevents the policy from drifting too far from the data-generating behavior policy. Our on-policy setting eliminates the need for importance weights that DAPO requires for off-policy correction.
Different choices of γ and λ recover existing approaches:
Method                            γ     λ      Gradient terms
Standard SFT/KD                   0     0      Dense only
On-policy distillation + reward   0     > 0    Dense + reward via Sparse
Full trajectory KL + reward       1     > 0    Dense + full Sparse
Pure RL (no KL)                   –     > 0    Sparse only
The framework naturally extends to multiple teachers and rewards:

Multiple Teachers. Given teachers {π_ref^(m)}_{m=1}^{M} with weights α_m ≥ 0, the divergence cost becomes a weighted sum:

c_t = ∑_{m=1}^{M} α_m [ log π_θ(y_t | x, y_{<t}) - log π_ref^(m)(y_t | x, y_{<t}) ],

and the Dense Term becomes the corresponding weighted sum of token-level KL gradients.

Multiple Rewards. Given rewards {r_n}_{n=1}^{N} with weights λ_n, the reward term becomes:

λ r(x, y) → ∑_{n=1}^{N} λ_n r_n(x, y).
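A minimal sketch of the multi-teacher Dense gradient at the logit level, obtained by linearity from Proposition 5 (the NumPy formulation and names are ours):

import numpy as np

def softmax(z):
    z = z - np.max(z)               # numerical stability
    e = np.exp(z)
    return e / e.sum()

def multi_teacher_dense_grad(student_logits, teacher_logits_list, alphas):
    """Logit-level Dense gradient with multiple teachers: a weighted sum of
    the single-teacher gradients from Proposition 5."""
    p = softmax(student_logits)
    grad = np.zeros_like(p)
    for q_logits, alpha in zip(teacher_logits_list, alphas):
        q = softmax(q_logits)
        kl = np.sum(p * (np.log(p) - np.log(q)))
        grad += alpha * p * (np.log(p) - np.log(q) - kl)
    return grad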
We derived a principled gradient decomposition for hybrid imitation-reinforcement learning in LLMs:
- The gradient splits into Dense (analytic, zero sampling variance) and Sparse (Monte Carlo, captures long-horizon effects) terms.
- The logit-level formula (11) enables efficient GPU implementation.
- Mathematical equivalence to RLHF (β = 1/λ) is established, connecting to the broader literature on KL-regularized policy optimization.
- The framework suggests natural training curricula via λ scheduling, transitioning from imitation-focused to reward-focused learning.
Future work includes empirical validation and analysis of variance-bias trade-offs for different γ values.
A.1 Proof of Lemma 2 (Vanishing Score Function)
Proof. For any distribution π_θ(· | x, y_{<t}) over the finite vocabulary V:

E_{y_t∼π_θ(·|x, y_{<t})}[ ∇_θ log π_θ(y_t | x, y_{<t}) ] = ∑_{v∈V} π_θ(v | x, y_{<t}) ∇_θ log π_θ(v | x, y_{<t}) = ∑_{v∈V} ∇_θ π_θ(v | x, y_{<t}) = ∇_θ ∑_{v∈V} π_θ(v | x, y_{<t}) = ∇_θ 1 = 0.
The key step uses that probabilities sum to 1, which is constant with respect to θ.
A.2 Proof of Lemma 3 (Causality of Past Costs)

Proof. For k < t, the cost c_k depends only on tokens y_1, ..., y_k. We decompose the expectation using the tower property:

E_y[ c_k ∇_θ log π_θ(y_t | x, y_{<t}) ] = E_{y_{<t}}[ c_k · E_{y_t∼π_θ(·|x, y_{<t})}[ ∇_θ log π_θ(y_t | x, y_{<t}) ] ] = 0,

where the inner expectation vanishes by Lemma 2.
Since c k is measurable with respect to y <t for k < t, it factors out of the inner expectation.
A.3 Proof of Theorem 1 (Gradient Decomposition)

Proof. Define the total cost C(y) = ∑_{t=1}^{T} c_t - λ r(x, y).

Step 1: Apply the REINFORCE identity [7]. For any (possibly θ-dependent) function f(y):

∇_θ E_{y∼π_θ(·|x)}[ f(y) ] = E_{y∼π_θ(·|x)}[ f(y) ∇_θ log π_θ(y | x) + ∇_θ f(y) ].

Applying this to C(y):

∇_θ E_y[ C(y) ] = E_y[ C(y) ∇_θ log π_θ(y | x) ] + E_y[ ∇_θ C(y) ].

Step 2: Show E_y[ ∇_θ C(y) ] = 0. Since r and π_ref do not depend on θ:

∇_θ C(y) = ∑_{t=1}^{T} ∇_θ c_t = ∑_{t=1}^{T} ∇_θ log π_θ(y_t | x, y_{<t}).

Taking expectation and using iterated conditioning:

E_y[ ∇_θ log π_θ(y_t | x, y_{<t}) ] = E_{y_{<t}}[ E_{y_t∼π_θ(·|x, y_{<t})}[ ∇_θ log π_θ(y_t | x, y_{<t}) ] ] = 0

by Lemma 2. Thus E_y[ ∇_θ C(y) ] = 0.

Step 3: Expand ∇_θ log π_θ(y | x). By the chain rule on the autoregressive factorization:

∇_θ log π_θ(y | x) = ∑_{t=1}^{T} ∇_θ log π_θ(y_t | x, y_{<t}).
Step 4: Apply causality to isolate per-step contributions.
The gradient contribution at step t is:

E_y[ C(y) ∇_θ log π_θ(y_t | x, y_{<t}) ].

Decompose C(y) = ∑_{k<t} c_k + c_t + G_{t+1}, where G_{t+1} = ∑_{k>t} c_k - λ r(x, y). By Lemma 3, past costs (k < t) contribute zero:

E_y[ ( ∑_{k<t} c_k ) ∇_θ log π_θ(y_t | x, y_{<t}) ] = 0.

Thus:

E_y[ C(y) ∇_θ log π_θ(y_t | x, y_{<t}) ] = E_y[ (c_t + G_{t+1}) ∇_θ log π_θ(y_t | x, y_{<t}) ].
Linearity of expectation gives the Dense (c t ) + Sparse (G t+1 ) decomposition.
A.5 Proof of Proposition 5 (Logit-Level Gradient)
Proof. Let p = softmax(z), where p_i = e^{z_i} / ∑_j e^{z_j}.

Step 1: Compute the softmax Jacobian. Taking the derivative:

∂p_i / ∂z_j = p_i (δ_{ij} - p_j),

where δ_{ij} is the Kronecker delta.

Step 2: Apply the chain rule for the KL divergence. Since D_KL = ∑_i p_i (log p_i - log q_i):

∂D_KL / ∂p_i = log p_i - log q_i + 1.

Step 3: Combine using the Jacobian.

∂D_KL / ∂z_j = ∑_i (log p_i - log q_i + 1) p_i (δ_{ij} - p_j)
             = p_j (log p_j - log q_j + 1) - p_j ∑_i p_i (log p_i - log q_i + 1)
             = p_j (log p_j - log q_j + 1) - p_j (D_KL + 1)
             = p_j (log p_j - log q_j - D_KL).

In vector form: ∇_z D_KL = p ⊙ (log p - log q - D_KL · 1).
Using Equation (11), the dense gradient can be computed as follows (pseudocode):
import numpy as np

def softmax(z):
    z = z - np.max(z)               # numerical stability
    e = np.exp(z)
    return e / e.sum()

def dense_gradient_full(student_logits, teacher_logits):
    p = softmax(student_logits)     # student distribution over the vocabulary
    q = softmax(teacher_logits)     # teacher distribution over the vocabulary
    kl = np.sum(p * (np.log(p) - np.log(q)))
    grad_z = p * (np.log(p) - np.log(q) - kl)   # Equation (11)
    return grad_z
Complexity: O(|V|) per position. The element-wise operations are amenable to GPU parallelization (e.g., via custom CUDA or Triton kernels). Note: Both methods require O(|V|) for the initial softmax computation. The savings apply to the gradient tensor: Top-K produces a sparse gradient with only K non-zero entries. For |V| ≈ 128,000 and K = 32, this reduces gradient storage and downstream computation by ∼4000×.
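A possible Top-K variant is sketched below; the rule for selecting which K entries to keep (here, the student's K most probable tokens) and the sparse (index, value) return format are our assumptions rather than details specified in the text.

import numpy as np

def dense_gradient_topk(student_logits, teacher_logits, k=32):
    """Top-K variant: full O(|V|) softmax, but the returned gradient is kept
    only on the K highest-probability student tokens (a sparse approximation)."""
    z = student_logits - student_logits.max()
    p = np.exp(z); p /= p.sum()
    zt = teacher_logits - teacher_logits.max()
    q = np.exp(zt); q /= q.sum()

    kl = np.sum(p * (np.log(p) - np.log(q)))        # full-vocabulary KL
    idx = np.argpartition(p, -k)[-k:]               # indices of top-K student tokens
    vals = p[idx] * (np.log(p[idx]) - np.log(q[idx]) - kl)
    return idx, vals                                # sparse (index, value) gradient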