Gradients Must Earn Their Influence: Unifying SFT with Generalized Entropic Objectives

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Standard negative log-likelihood (NLL) for Supervised Fine-Tuning (SFT) applies uniform token-level weighting. This rigidity creates a two-fold failure mode: (i) overemphasizing low-probability targets can amplify gradients on noisy supervision and disrupt robust priors, and (ii) uniform weighting provides weak sharpening when the model is already confident. Existing methods fail to resolve the resulting plasticity–stability dilemma, often suppressing necessary learning signals alongside harmful ones. To address this issue, we unify token-level SFT objectives within a generalized deformed-log family and expose a universal gate × error gradient structure, where the gate controls how much the model trusts its current prediction. By employing the Cayley transform, we map the model’s continuously evolving uncertainty onto a continuous focus trajectory, which enables seamless interpolation between scenarios involving uncertain novel concepts and those involving well-established knowledge. We then introduce Dynamic Entropy Fine-Tuning (DEFT), a parameter-free objective that modulates the trust gate using distribution concentration (Rényi-2 entropy) as a practical proxy for the model’s predictive state. Extensive experiments and analyses demonstrate that DEFT achieves a better balance between exploration and exploitation, leading to improved overall performance.


💡 Research Summary

The paper tackles a fundamental limitation of the standard negative log‑likelihood (NLL) objective used in Supervised Fine‑Tuning (SFT) of large language models. NLL treats every token uniformly, assigning a gradient magnitude proportional to (1‑p) where p is the model’s predicted probability for the target token. This uniform treatment leads to two major problems: (1) low‑probability tokens receive large updates regardless of whether they represent genuine knowledge gaps or noisy/conflicting supervision, which can erode the model’s pretrained priors (the plasticity‑stability dilemma); (2) as the model becomes confident, gradients decay linearly, resulting in inefficient sharpening of already‑high‑confidence predictions.
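The (1 − p) gradient magnitude mentioned above follows from the standard softmax cross‑entropy derivative: the gradient of −log p_t with respect to the target logit is p_t − 1. A minimal finite‑difference check (a sketch in plain NumPy, not the authors’ code) confirms this:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def nll(z, t):
    """Standard SFT token loss: -log p_t under a softmax over logits z."""
    return -np.log(softmax(z)[t])

z = np.array([2.0, 0.5, -1.0])  # toy logits
t = 0                            # target-token index
eps = 1e-6
bump = eps * np.eye(len(z))[t]

# Central finite difference of the loss wrt the target logit z_t
grad_fd = (nll(z + bump, t) - nll(z - bump, t)) / (2 * eps)
p_t = softmax(z)[t]
# Analytic gradient is -(1 - p_t): magnitude shrinks linearly as confidence grows
assert abs(grad_fd + (1.0 - p_t)) < 1e-5
```

The linear (1 − p_t) decay is exactly the weak-sharpening behavior the paper criticizes: a token at p = 0.9 receives only one‑ninth the gradient of a token at p = 0.1, regardless of whether that low‑probability token reflects a knowledge gap or noise.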

To address these issues, the authors unify token‑level SFT objectives under a deformed‑log family derived from the Tsallis q‑logarithm. Introducing a “focus index” α ≥ 0, they define a loss L_α(p) = (1 − p^α)/α, which recovers NLL as α → 0, the linear probability loss at α = 1, and the generalized cross‑entropy family for 0 < α < 1. The gradient of this loss with respect to the target logit decomposes into a product of a “trust gate” G(p) = p^α and an error term (1 − p). Thus, α directly controls how much the model trusts its current prediction: small α opens the gate for all confidence levels, while larger α closes it for low‑confidence tokens and emphasizes high‑confidence ones.
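Both limits of the deformed‑log loss, and the gate’s behavior at low confidence, can be checked numerically. The sketch below (plain NumPy, not the paper’s implementation) verifies that α → 0 recovers −log p, that α = 1 gives 1 − p, and that a small α keeps the gate open even for low‑probability tokens:

```python
import numpy as np

def deformed_loss(p, alpha):
    """L_alpha(p) = (1 - p**alpha) / alpha; tends to -log p as alpha -> 0."""
    if alpha == 0.0:
        return -np.log(p)
    return (1.0 - p**alpha) / alpha

def gate(p, alpha):
    """Trust gate G(p) = p**alpha; gradient wrt the target logit is
    -G(p) * (1 - p), i.e. gate times error."""
    return p**alpha

p = 0.3
# alpha -> 0 recovers NLL
assert abs(deformed_loss(p, 1e-8) - (-np.log(p))) < 1e-6
# alpha = 1 gives the linear probability loss 1 - p
assert abs(deformed_loss(p, 1.0) - (1.0 - p)) < 1e-12
# Small alpha keeps the gate open even at low confidence;
# alpha = 1 closes it down to p itself.
assert gate(0.05, 0.01) > 0.95 and gate(0.05, 1.0) == 0.05
```

This makes the plasticity–stability trade‑off concrete: at α = 0.01 a token with p = 0.05 still passes ~97% of its gradient, while at α = 1 it passes only 5%.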

The paper further establishes an optimization‑entropy duality: minimizing the expected Lα loss is equivalent to minimizing a Tsallis entropy of order q = 1 + α. When α → 0 the induced entropy is Shannon (information‑acquisition regime), encouraging coverage of low‑probability events but converging slowly. When α → 1 the induced entropy becomes order‑2 Tsallis (collision entropy), which sharpens high‑probability mass more aggressively. This duality shows that the trust gate is not a heuristic but a principled mechanism reflecting the geometry of the underlying entropy space.
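The duality can be illustrated with the standard Tsallis entropy S_q(P) = (1 − Σ_i p_i^q)/(q − 1), which recovers Shannon entropy as q → 1 (α → 0) and the order‑2 collision form 1 − Σ p_i² at q = 2 (α = 1). The check below is a sketch of these limiting cases, not the paper’s derivation, and the paper’s normalization may differ:

```python
import numpy as np

def tsallis_entropy(p, q):
    """S_q(P) = (1 - sum_i p_i**q) / (q - 1); Shannon entropy as q -> 1."""
    p = np.asarray(p, dtype=float)
    if abs(q - 1.0) < 1e-12:
        return -np.sum(p * np.log(p))
    return (1.0 - np.sum(p**q)) / (q - 1.0)

P = np.array([0.7, 0.2, 0.1])
shannon = -np.sum(P * np.log(P))
# q -> 1 (alpha -> 0): information-acquisition regime, Shannon limit
assert abs(tsallis_entropy(P, 1.0001) - shannon) < 1e-3
# q = 2 (alpha = 1): collision-entropy regime, 1 - sum p^2
assert abs(tsallis_entropy(P, 2.0) - (1.0 - np.sum(P**2))) < 1e-12
```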

A static α, however, cannot simultaneously provide coverage for unknown knowledge gaps and sharpening for confident predictions. To obtain a state‑dependent gate, the authors map the model’s predictive uncertainty onto a continuous focus trajectory using the Cayley transform. They convert token probability p into a spherical angle θ = 2·arctan((1‑p)/p), then apply a sigmoid‑scaled linear function to produce a dynamic α(p). In practice they approximate the required uncertainty measure with Rényi‑2 entropy (H₂ = ‑log ∑p_i²), which is cheap to compute and captures distribution concentration. High entropy (high uncertainty) yields small α, reproducing NLL‑like behavior; low entropy (high confidence) yields large α, moving toward probability‑loss‑like behavior.
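The ingredients of this dynamic gate can be sketched in a few lines. Note that the paper specifies only “a sigmoid‑scaled linear function”; the `scale` and `shift` parameters below are hypothetical placeholders, and the exact parameterization in the paper may differ:

```python
import numpy as np

def renyi2_entropy(probs):
    """H2 = -log sum_i p_i^2 -- low when the distribution is concentrated."""
    return -np.log(np.sum(np.asarray(probs) ** 2))

def spherical_angle(p):
    """Cayley-style map of target probability p to an angle in (0, pi)."""
    return 2.0 * np.arctan((1.0 - p) / p)

def dynamic_alpha(h2, scale=1.0, shift=1.0):
    """Illustrative sigmoid-scaled gate: high entropy -> small alpha
    (NLL-like coverage), low entropy -> alpha near 1 (probability-loss-like
    sharpening). `scale`/`shift` are hypothetical knobs, not from the paper."""
    return 1.0 / (1.0 + np.exp(scale * (h2 - shift)))

flat = np.full(10, 0.1)                  # uncertain: high H2
peaked = np.array([0.91] + [0.01] * 9)   # confident: low H2
# Uncertainty should push alpha down, confidence should push it up.
assert dynamic_alpha(renyi2_entropy(flat)) < dynamic_alpha(renyi2_entropy(peaked))
```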

The resulting parameter‑free objective is called Dynamic Entropy Fine‑Tuning (DEFT). DEFT computes the batch‑averaged Rényi‑2 entropy, derives α on the fly, and applies the Lα loss. Consequently, early training phases allocate strong gradients to uncertain tokens, facilitating knowledge acquisition, while later phases focus gradients on confident tokens, efficiently sharpening the distribution.
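Putting the pieces together, the DEFT computation described above can be sketched as follows. This is a minimal NumPy sketch under stated assumptions: the entropy‑to‑α mapping is a hypothetical sigmoid (the paper’s exact function is not reproduced here), and a single batch‑level α is used as the summary states:

```python
import numpy as np

def deft_loss(probs, targets):
    """Sketch of a DEFT-style objective: derive one alpha per batch from the
    averaged Renyi-2 entropy, then apply L_alpha(p) = (1 - p**alpha)/alpha
    to each target-token probability. The sigmoid mapping below is a
    hypothetical stand-in for the paper's parameter-free schedule."""
    probs = np.asarray(probs, dtype=float)          # (batch, vocab)
    h2 = -np.log(np.sum(probs**2, axis=-1))         # per-token Renyi-2 entropy
    alpha = 1.0 / (1.0 + np.exp(h2.mean() - 1.0))   # batch-level focus index
    p_t = probs[np.arange(len(targets)), targets]   # target-token probabilities
    return np.mean((1.0 - p_t**alpha) / alpha), alpha

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 8))
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
loss, alpha = deft_loss(probs, targets=np.array([0, 1, 2, 3]))
assert 0.0 < alpha < 1.0 and loss > 0.0
```

Early in training, high average entropy drives α toward 0 (NLL‑like coverage); as the model grows confident, entropy falls and α rises toward 1 (probability‑loss‑like sharpening), matching the schedule the paragraph describes.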

Empirical evaluation spans seven model backbones (7B‑13B parameters) and five domains (mathematics, coding, general QA, medical, legal). DEFT is compared against standard NLL, probability‑scaled losses, and entropy‑adaptive losses. Across all settings DEFT improves key metrics (accuracy, BLEU, code execution success) by 1‑3 percentage points, with especially notable gains in both “Model‑Strong” (where the base model already performs well) and “Model‑Weak” regimes. Token‑level analyses reveal that DEFT reduces “forgetting” of previously learned high‑confidence tokens while suppressing harmful updates on noisy low‑confidence tokens.

The authors’ contributions are threefold: (a) a unified deformed‑log framework that makes the trust‑gate × error structure explicit and subsumes prior SFT objectives; (b) a principled, parameter‑free dynamic trust gate derived via the Cayley transform and grounded in entropy geometry; (c) the DEFT objective, which leverages Rényi‑2 entropy to adaptively modulate the focus index, delivering consistent performance gains across models and tasks. This work provides both theoretical insight and a practical tool for safer, more efficient fine‑tuning of large language models.

