InstructDiff: Domain-Adaptive Data Selection via Differential Entropy for Efficient LLM Fine-Tuning
Supervised fine-tuning (SFT) is fundamental to adapting large language models, yet training on complete datasets incurs prohibitive costs with diminishing returns. Existing data selection methods suffer from severe domain specificity: techniques optimized for general instruction-following fail on reasoning tasks, and vice versa. We observe that measuring entropy differences between base models and minimally instruction-tuned calibration models reveals a consistent pattern: samples with the smallest absolute differential entropy |ΔH| yield optimal performance across domains. This principle manifests domain-adaptively: reasoning tasks favor entropy increase (cognitive expansion), while general tasks favor entropy decrease (cognitive compression). We introduce InstructDiff, a unified framework that operationalizes differential entropy as a domain-adaptive selection criterion through warmup calibration, bi-directional NLL filtering, and entropy-based ranking. Extensive experiments show that InstructDiff achieves a 17% relative improvement over full-data training on mathematical reasoning and 52% on general instruction-following, outperforming prior baselines while using only 10% of the data.
💡 Research Summary
InstructDiff tackles the growing inefficiency of fine‑tuning large language models (LLMs) on ever‑expanding datasets. While supervised fine‑tuning (SFT) is essential for adapting LLMs to downstream tasks, training on the full corpus quickly becomes prohibitively expensive and yields diminishing returns. Existing data‑selection methods are typically domain‑specific: heuristics that work well for general instruction‑following fail on reasoning or math tasks, and vice versa.
The authors observe a simple yet powerful information‑theoretic signal: the entropy difference (ΔH) between a base model (π_base) and a lightly instruction‑tuned “calibration” model (π_inst) obtained by fine‑tuning on a small random warm‑up subset. Across three domains—general instruction‑following, medical QA, and mathematical reasoning—ΔH exhibits opposite trends. For general tasks and medical QA, π_inst reduces uncertainty (ΔH > 0), a phenomenon the authors call “cognitive compression.” For math reasoning, π_inst often increases uncertainty (ΔH < 0), termed “cognitive expansion.”
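To make the signal concrete, ΔH can be computed from the two models' per‑token output distributions. The sketch below is a minimal pure‑Python illustration (the logit values are hypothetical stand‑ins for real model outputs): it computes the mean per‑token Shannon entropy under each model and takes the difference ΔH = H_base − H_inst.

```python
import math

def token_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution over one token's logits."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    Z = sum(exps)
    probs = [e / Z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0)

def mean_entropy(per_token_logits):
    """Average per-token entropy over a response, mirroring H_base / H_inst."""
    return sum(token_entropy(l) for l in per_token_logits) / len(per_token_logits)

# Hypothetical logits for a 2-token response under both models.
base_logits = [[2.0, 1.0, 0.5], [0.1, 0.1, 0.1]]
inst_logits = [[3.0, 0.5, 0.0], [1.5, 0.2, 0.1]]  # calibrated model is more peaked

delta_H = mean_entropy(base_logits) - mean_entropy(inst_logits)  # ΔH = H_base − H_inst
# Here ΔH > 0: the calibrated model is more certain ("cognitive compression").
```

A negative ΔH would mean the calibrated model became *less* certain on the sample, the "cognitive expansion" regime the authors associate with math reasoning.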
Crucially, the samples that yield the best downstream performance in each domain are those with the smallest absolute entropy shift, i.e., minimal |ΔH|. This “minimum differential‑entropy principle” unifies the two opposite trends: the optimal region lies at the model’s learnable frontier where new information can be integrated without overwhelming the model or providing negligible gradient signal.
InstructDiff operationalizes this principle in a two‑stage pipeline:
- Warm‑up Calibration – Randomly sample a small fraction (α ≈ 10 %) of the full dataset and fine‑tune the base model on it for a few steps, producing π_inst. This step is cheap yet sufficient to activate latent instruction‑following capabilities.
- Distribution‑Aware Selection – For every remaining candidate (x_i, y_i), compute two model‑state differences:
  • ΔNLL_i = L_inst − L_base (length‑normalized negative log‑likelihood difference), which measures raw learning‑signal strength.
  • ΔH_i = H_base − H_inst (mean per‑token entropy difference), which indicates whether the sample pushes the model toward compression or expansion.
The method first applies bi‑directional NLL filtering: samples in the bottom γ ≈ 10 % of the ΔNLL distribution (near‑duplicates with negligible gradient signal) and the top γ ≈ 10 % (nonsensical or overly noisy) are discarded. This isolates a “learnable middle range.”
From the filtered pool, samples are ranked by |ΔH| and the lowest β ≈ 10 % are selected. Because the sign of ΔH naturally reflects the domain (negative for reasoning, positive for general tasks), the same ranking rule automatically adapts without any hand‑crafted domain heuristics.
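The selection stage described above can be sketched as follows. This is an illustrative implementation, not the authors' code: the function name and dict keys are assumptions, per‑sample NLL and entropy are taken as precomputed, and taking β relative to the original pool size is one reasonable reading of the paper's description.

```python
def select_instructdiff(samples, gamma=0.10, beta=0.10):
    """Bi-directional NLL filtering followed by minimal-|ΔH| ranking.

    Each sample is a dict with precomputed scalars:
    'nll_base', 'nll_inst' (length-normalized NLL) and 'H_base', 'H_inst'.
    """
    # ΔNLL = L_inst − L_base (learning-signal strength); |ΔH| = |H_base − H_inst|.
    scored = [dict(s,
                   dnll=s["nll_inst"] - s["nll_base"],
                   adh=abs(s["H_base"] - s["H_inst"]))
              for s in samples]

    # Bi-directional NLL filter: drop the bottom and top gamma fractions of ΔNLL,
    # keeping only the "learnable middle range".
    scored.sort(key=lambda s: s["dnll"])
    k = int(len(scored) * gamma)
    middle = scored[k: len(scored) - k]

    # Rank the middle range by |ΔH| and keep the smallest beta fraction.
    middle.sort(key=lambda s: s["adh"])
    return middle[: max(1, int(len(samples) * beta))]
```

Because ranking uses the absolute value |ΔH|, the same rule selects entropy‑expanding samples on reasoning data and entropy‑compressing samples on general data, with no domain flag anywhere in the code.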
The authors also propose an iterative refinement: after fine‑tuning on the selected subset, a new calibration model is built and the selection process repeated. Empirically, the second iteration yields the largest additional gain, while later iterations show diminishing returns.
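Structurally, the refinement loop might look like the sketch below, where `warmup_and_finetune` and `score_and_select` are hypothetical callables standing in for the actual training and scoring steps (the paper does not specify this interface).

```python
def iterative_refinement(pool, warmup_and_finetune, score_and_select, rounds=2):
    """Structural sketch of InstructDiff's iterative refinement.

    warmup_and_finetune(subset) -> calibrated model pi_inst
    score_and_select(model, pool) -> selected subset (NLL filter + minimal-|ΔH| ranking)
    """
    # Initial calibration on a small random warm-up slice (alpha ≈ 10%).
    model = warmup_and_finetune(pool[: len(pool) // 10])
    subset = pool
    for _ in range(rounds):
        subset = score_and_select(model, pool)   # re-score the full pool
        model = warmup_and_finetune(subset)      # rebuild calibration on the new subset
    return subset, model
```

With `rounds=2` this matches the paper's observation that the second iteration contributes most of the additional gain, after which returns diminish.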
Experimental evaluation spans four domains, each with 10 k instruction‑response pairs: (i) mathematics (NuminaMath) with Qwen2.5‑7B, (ii) general instruction (Alpaca) with LLaMA‑3‑8B, (iii) medical QA (MedCA‑QA) with LLaMA‑3‑8B, and (iv) code generation (BigCode) with LLaMA‑3‑8B. Baselines include random sampling, perplexity‑based extremes, length‑based heuristics, and several recent selection algorithms (IFD, Superfiltering, SelectIT, ZIP).
Key results: using only 10 % of the data (20 % for code) InstructDiff outperforms full‑data training across all domains. Relative improvements over full‑data baselines are +17 % on math (average score 31.63 vs 27.05, with a +54 % boost on AIME 2024), +52 % on general instruction (LC win rate 12.09 % vs 8.15 %), +6.2 % on medical QA (accuracy 56.42 % vs 53.14 %), and +4.9 % on code generation (45.1 vs 43.0).
Ablation studies confirm the necessity of each component: removing bi‑directional NLL filtering drops performance by 2–5 points; selecting the top or middle 10 % of ΔH instead of the bottom 10 % consistently underperforms, validating the minimum‑ΔH hypothesis; varying the warm‑up size shows that a modest α (5–20 %) suffices, but too small a warm‑up harms calibration quality.
Insights and implications
- Domain‑adaptive without explicit heuristics: ΔH’s sign automatically encodes whether a task benefits from entropy expansion or compression, eliminating the need for hand‑crafted domain rules.
- Efficient frontier exploration: By focusing on samples that cause the smallest entropy shift, InstructDiff targets the region where the model can learn new patterns without catastrophic forgetting or redundant exposure.
- Scalable and model‑aware: Both ΔNLL and ΔH are cheap to compute during a single forward pass, making the method applicable to very large corpora.
Limitations and future work
The current study evaluates on relatively small (10 k) datasets; scaling to millions of examples and testing on larger models (e.g., 70 B) remains an open question. The hyper‑parameters α, β, and γ are set empirically; developing adaptive schemes could further improve robustness. Finally, integrating the selection process directly into the training loop (online selection) may yield additional gains.
In summary, InstructDiff introduces a principled, entropy‑difference‑driven data‑selection framework that adapts automatically to diverse downstream tasks, dramatically reduces the amount of data required for effective fine‑tuning, and sets a new benchmark for cost‑efficient LLM adaptation.