In Good GRACEs: Principled Teacher Selection for Knowledge Distillation

Notice: This research summary and analysis were automatically generated using AI. For full accuracy, please refer to the original arXiv source.

Knowledge distillation is an efficient strategy to use data generated by large “teacher” language models to train smaller, capable “student” models, but selecting the optimal teacher for a specific student–task combination requires expensive trial and error. We propose a lightweight score called GRACE to quantify how effective a teacher will be for post-training a student model. GRACE measures distributional properties of the student’s gradients without access to a verifier, teacher logits, teacher internals, or test data. From an information-theoretic perspective, GRACE connects to leave-one-out stability of gradient-based algorithms, which controls the generalization performance of the distilled students. On GSM8K and MATH, GRACE correlates strongly (up to 86% Spearman correlation) with the performance of the distilled LLaMA and OLMo students. In particular, training a student with the GRACE-selected teacher can improve performance by up to 7.4% over naively using the best-performing teacher. Further, GRACE provides guidance on crucial design choices in distillation, including (1) the best temperature to use when generating from the teacher, (2) the best teacher to use under a size constraint, and (3) the best teacher to use within a specific model family. Altogether, our findings demonstrate that GRACE can efficiently and effectively identify a strongly compatible teacher for a given student and provide fine-grained guidance on how to perform distillation.


💡 Research Summary

Knowledge distillation for large language models (LLMs) often follows a “guess‑and‑check” pipeline: many candidate teacher models are used to generate synthetic data, each teacher’s data is then used to fine‑tune a student, and the best teacher is selected based on the student’s downstream performance. This process is extremely costly because it requires generating billions of tokens per teacher and repeatedly training the student, while also being sensitive to hyper‑parameters such as generation temperature.

The paper introduces GRACE (Gradient Cross‑validation Evaluation), a lightweight, non‑intrusive score that predicts how effective a teacher will be for a given student and task without needing test data, teacher logits, or any external verifier. GRACE works by examining the student’s gradients on a small held‑out set of teacher‑generated examples. The procedure is as follows:

  1. For each prompt‑response pair (x, y) produced by a teacher, compute the student’s cross‑entropy gradient g(x, y).
  2. Project the high‑dimensional gradient (dimension D) down to a low dimension d ≪ D using a random sign matrix Π to obtain a compact representation.
  3. Scale the projected gradient by log(|y|), where |y| is the response length in tokens, to mitigate bias toward short responses, yielding h(x, y).
  4. Assemble all h‑vectors into a matrix G(D) and compute the mean µ(D) and covariance Σ(D). A smoothed, normalized covariance Σ̂(D) = Σ̃(D) + νI is used for numerical stability.
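Steps 1–4 can be sketched as follows. This is a minimal NumPy illustration: the dimensions, the smoothing value ν, and the random vectors standing in for real student cross-entropy gradients are illustrative assumptions, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)

D_full = 10_000   # full gradient dimension D (hypothetical)
d_proj = 64       # projected dimension d << D (hypothetical)

# Step 2: random sign projection matrix Pi with entries in {+1, -1},
# scaled so the projection approximately preserves norms.
Pi = rng.choice([-1.0, 1.0], size=(D_full, d_proj)) / np.sqrt(d_proj)

def featurize(grad: np.ndarray, response_len: int) -> np.ndarray:
    """Steps 2-3: project the gradient, then scale by log(|y|)."""
    return np.log(response_len) * (grad @ Pi)

# Step 1 stand-in: random per-example "gradients" instead of real
# cross-entropy gradients g(x, y) from a student model.
grads = rng.normal(size=(200, D_full))
lengths = rng.integers(10, 300, size=200)

# Step 4: stack the h-vectors, then compute mean and smoothed covariance.
H = np.stack([featurize(g, int(l)) for g, l in zip(grads, lengths)])
mu = H.mean(axis=0)
nu = 1e-3                                   # smoothing for stability
Sigma = np.cov(H, rowvar=False) + nu * np.eye(d_proj)
```

The νI term keeps the covariance invertible even when the number of evaluation examples is small relative to the projected dimension.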

GRACE then performs a C‑fold cross‑validation on the small evaluation set D. For each fold i, the covariance of the “training” portion D_{−i} is used as a preconditioner: the gradients from the held‑out fold D_i are weighted by Σ̂(D_{−i})^{−1/2} and their squared norm is averaged. Formally:

GRACE(D) = (1/C) ∑_{i=1}^{C} Tr( Σ̂(D_{−i})^{−1} G(D_i)^⊤ G(D_i) ) / |D_i| = (1/C) ∑_{i=1}^{C} (1/|D_i|) ∑_{(x,y)∈D_i} ‖ Σ̂(D_{−i})^{−1/2} h(x, y) ‖²


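The cross-validated score above can be sketched in NumPy as follows. The function name `grace_score`, the default fold count C, and the smoothing ν are placeholders for illustration; note that h⊤Σ̂⁻¹h = ‖Σ̂^{−1/2}h‖², so an explicit matrix square root is unnecessary.

```python
import numpy as np

def grace_score(H: np.ndarray, C: int = 5, nu: float = 1e-3) -> float:
    """C-fold cross-validated score over a matrix H whose rows are
    the projected, length-scaled gradients h(x, y)."""
    n, d = H.shape
    folds = np.array_split(np.arange(n), C)
    total = 0.0
    for i in range(C):
        held = folds[i]
        train = np.setdiff1d(np.arange(n), held)
        # Smoothed covariance of the "training" portion D_{-i}.
        Sigma = np.cov(H[train], rowvar=False) + nu * np.eye(d)
        # Average h^T Sigma^{-1} h over the held-out fold D_i,
        # i.e. the mean squared norm of the whitened gradients.
        sols = np.linalg.solve(Sigma, H[held].T)        # Sigma^{-1} h
        total += np.mean(np.sum(H[held].T * sols, axis=0))
    return total / C
```

Using `np.linalg.solve` instead of inverting Σ̂ explicitly is both cheaper and numerically safer for the quadratic form.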