Efficient Analysis of the Distilled Neural Tangent Kernel

Notice: This research summary and analysis were generated automatically using AI. For full accuracy, please refer to the original arXiv source.

Neural tangent kernel (NTK) methods are computationally limited by the need to evaluate large Jacobians across many data points. Existing approaches reduce this cost primarily by projecting and sketching the Jacobian. We show that NTK computation can also be reduced by compressing the data dimension itself using NTK-tuned dataset distillation. We demonstrate that the neural tangent space spanned by the input data can be induced by dataset distillation, yielding a 20-100$\times$ reduction in required Jacobian calculations. We further show that per-class NTK matrices have low effective rank that is preserved by this reduction. Building on these insights, we propose the distilled neural tangent kernel (DNTK), which combines NTK-tuned dataset distillation with state-of-the-art projection methods to reduce NTK computational complexity by up to five orders of magnitude while preserving kernel structure and predictive performance.
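The abstract's claim that per-class NTK matrices have low effective rank can be illustrated with a small numerical sketch. The entropy-based effective rank used below is one common definition; the paper's precise notion ("truncation rank") may differ, and the Gram matrix here is a synthetic stand-in, not data from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)

def effective_rank(K):
    """Effective rank via the entropy of the normalized eigenvalue spectrum
    (exp of the spectral entropy); one common definition, used here for illustration."""
    lam = np.clip(np.linalg.eigvalsh(K), 0, None)
    p = lam / lam.sum()
    p = p[p > 0]
    return float(np.exp(-(p * np.log(p)).sum()))

# A 50x50 Gram matrix built from highly correlated features: numerically full rank,
# but its spectrum is dominated by ~3 directions, so the effective rank is small.
F = rng.normal(size=(50, 3)) @ rng.normal(size=(3, 50)) + 0.01 * rng.normal(size=(50, 50))
K = F @ F.T
assert effective_rank(K) < 10      # far below the ambient dimension of 50
```

A kernel with this kind of spectral decay is exactly the setting where a small set of inducing points (here, distilled data) can represent the full matrix with little loss.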


💡 Research Summary

The paper tackles the longstanding computational bottleneck of Neural Tangent Kernel (NTK) methods, whose cost scales as O(n²P) in time and O(n²) in memory for a model with P parameters trained on n data points. While prior work has focused on reducing parameter‑side complexity via random sketching and feature approximations, this work introduces a complementary data‑side reduction by leveraging NTK‑tuned dataset distillation. The authors first observe that the empirical NTK exhibits substantial redundancy in three spaces: the dataset itself, the parameter space, and the gradient subspace. They formalize data redundancy as a low truncation rank of the class‑wise NTK matrix and parameter redundancy as the existence of a low‑dimensional subspace V⊂ℝ^P that captures most of the gradient covariance.
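The O(n²P) cost can be seen directly from how the empirical NTK is assembled: one parameter-gradient (Jacobian row) of length P per data point, followed by n² inner products. The toy network and sizes below are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy one-hidden-layer network; dimensions are illustrative.
D, H, n = 5, 16, 8                       # input dim, hidden width, number of points
W1 = rng.normal(size=(H, D)) / np.sqrt(D)
w2 = rng.normal(size=H) / np.sqrt(H)
X = rng.normal(size=(n, D))

def param_gradient(x):
    """Gradient of the scalar output f(x) = w2 . tanh(W1 x) w.r.t. all parameters."""
    h = np.tanh(W1 @ x)
    dW1 = np.outer(w2 * (1 - h**2), x)   # chain rule through tanh
    dw2 = h
    return np.concatenate([dW1.ravel(), dw2])  # flattened row of the Jacobian, length P

# Empirical NTK: K[i, j] = <grad f(x_i), grad f(x_j)>.
# Building J costs n evaluations of a length-P gradient; K costs n^2 * P multiply-adds.
J = np.stack([param_gradient(x) for x in X])   # (n, P) Jacobian
K = J @ J.T                                    # (n, n) kernel matrix

assert K.shape == (n, n)
assert np.allclose(K, K.T)                     # the NTK Gram matrix is symmetric PSD
```

Parameter-side methods shrink the P factor by sketching the columns of J; the data-side reduction discussed here shrinks n by replacing X with a much smaller distilled set.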

The key insight is that distilled data act as inducing points for the NTK: a small synthetic set ˜X defines a tangent‑feature subspace V(˜D)=col(˜Φ^⊤), where ˜Φ contains logit gradients at the frozen reference parameters θ. The loss gradient on the distilled set lies exactly in this subspace, making the distilled set a controllable proxy for the full gradient information. The paper provides three central theoretical results: (1) a one‑step smoothness regret bound showing that the performance gap between the distilled update and the optimal subspace‑restricted update is governed by the projection residual ‖(I‑Π)g_t‖²; (2) a characterization that the optimal r‑dimensional subspace minimizing this residual is the top‑r eigenspace of the gradient covariance matrix G = E[g_t g_t^⊤].
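Result (2) can be checked numerically: projecting gradients onto the top-r eigenspace of their covariance leaves a smaller residual ‖(I−Π)g_t‖² than projecting onto an arbitrary rank-r subspace. The synthetic gradients below are a stand-in with planted low-dimensional structure; sizes and the noise level are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
P, r, T = 40, 5, 200                      # parameter dim, subspace rank, #gradient samples

# Synthetic gradients g_t with a dominant r-dimensional component plus small noise
# (an illustrative stand-in for loss gradients collected along training).
basis = np.linalg.qr(rng.normal(size=(P, r)))[0]        # orthonormal (P, r)
G_samples = (basis @ rng.normal(size=(r, T))).T + 0.05 * rng.normal(size=(T, P))

# Gradient covariance G = E[g g^T] and the projector onto its top-r eigenspace.
G = G_samples.T @ G_samples / T
eigvals, eigvecs = np.linalg.eigh(G)      # ascending eigenvalue order
U = eigvecs[:, -r:]                       # top-r eigenvectors
Pi = U @ U.T                              # orthogonal projector, rank r

# Mean residual ||(I - Pi) g_t||^2 — the quantity controlling the regret bound —
# compared against the same residual for a random rank-r subspace.
resid_opt = np.mean(np.sum((G_samples - G_samples @ Pi) ** 2, axis=1))
R = np.linalg.qr(rng.normal(size=(P, r)))[0]
resid_rand = np.mean(np.sum((G_samples - G_samples @ (R @ R.T)) ** 2, axis=1))
assert resid_opt <= resid_rand            # Eckart–Young-type optimality of the eigenspace
```

In the paper's framing, the distilled set is tuned so that col(˜Φ^⊤) approximates this top-r eigenspace, which is what makes the one-step regret bound small.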

