Optimizing Linear Classifier Training via Dataset Distillation


📝 Abstract

The task of dataset distillation aims to find a small set of synthetic images such that training a model on them reproduces the performance of the same model trained on a much larger dataset of real samples. Existing distillation methods focus on synthesizing datasets that enable training randomly initialized models. In contrast, state-of-the-art vision approaches are increasingly building on large, pre-trained self-supervised models rather than training from scratch. In this paper, we investigate the problem of distilling datasets that enable us to optimally train linear probes on top of such large, pre-trained vision models. We introduce a method of dataset distillation for this task called Linear Gradient Matching that optimizes the synthetic images such that, when passed through a pre-trained feature extractor, they induce gradients in the linear classifier similar to those produced by the real data. Our method yields synthetic data that outperform all real-image baselines and, remarkably, generalize across pre-trained vision models, enabling us, for instance, to train a linear CLIP probe that performs competitively using a dataset distilled via a DINO backbone. Further, we show that our distilled datasets are exceptionally effective for fine-grained classification and provide a valuable tool for model interpretability, predicting, among other things, how similar two models’ embedding spaces are under the platonic representation hypothesis or whether a model is sensitive to spurious correlations in adversarial datasets.

📄 Content

The task of Dataset Distillation involves the synthesis of a small set of synthetic samples such that a model trained from scratch on this synthetic set will achieve test-time performance comparable to that of a model trained on the full real dataset. Since this problem's first introduction and proposed solution in the self-titled paper [47], many new methods [6, 27, 51, 54-56] and extensions thereof [7, 11, 16, 25, 28, 40, 46, 53] have made strides towards the lofty goal of learning a high-quality model from just a handful of synthetic images.

Meanwhile, computer vision has increasingly adopted a paradigm of using the representations of large, pre-trained self-supervised vision models for downstream tasks, either via fine-tuning or by using these models as feature extraction backbones. Given this trend, in this work, we explore dataset distillation in the regime of training models on top of features extracted by pre-trained vision foundation models. Specifically, we study linear classification on top of a pre-trained feature representation.
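To make this setting concrete, the sketch below trains a linear probe with softmax cross-entropy on frozen features. This is a minimal illustration, not the paper's implementation: the pre-trained feature extractor is stood in for by random vectors, and all function names are assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_linear_probe(feats, labels, n_classes, lr=0.5, steps=200):
    """Fit a linear classifier W (n_classes x d) on frozen features
    by gradient descent on the softmax cross-entropy loss."""
    n, d = feats.shape
    W = np.zeros((n_classes, d))
    onehot = np.eye(n_classes)[labels]
    for _ in range(steps):
        p = softmax(feats @ W.T)           # (n, n_classes) predicted probabilities
        grad = (p - onehot).T @ feats / n  # closed-form cross-entropy gradient
        W -= lr * grad
    return W

# Toy demo: two well-separated "feature" clusters stand in for backbone outputs.
rng = np.random.default_rng(0)
feats = np.vstack([rng.normal(-2, 1, (50, 8)), rng.normal(2, 1, (50, 8))])
labels = np.array([0] * 50 + [1] * 50)
W = train_linear_probe(feats, labels, n_classes=2)
acc = (softmax(feats @ W.T).argmax(1) == labels).mean()
```

In practice the features would come from a frozen backbone such as DINO or CLIP; only `W` is trained, which is what makes the distillation objective below tractable.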

In our new method, Linear Gradient Matching, we distill synthetic datasets by optimizing them such that their representations, extracted by pre-trained feature extractors, induce gradients in a linear classifier similar to those obtained from real images. We find that a single synthetic image per class suffices to train linear classifiers to competitive performance across a wide variety of large vision model backbones, outperforming all real-image baselines. Figure 1 shows samples distilled from ImageNet-1k [12] with our method using various self-supervised feature extractors.

[Figure 1] ImageNet-1k Distilled for Self-Supervised Models: Using our method of Linear Gradient Matching, we distill vision datasets to just one synthetic image per class using different pre-trained self-supervised backbone models. These learned images can then be used to train linear probes that achieve high accuracy on unseen test data, outperforming all real-image baselines. Furthermore, each backbone model seems to yield its own "style" of distilled image, giving insights into the aspects on which these models tend to focus (structure, texture, color, etc.).

Motivated by recent hypotheses that different large models converge to similar representations even when trained on different modalities [20], we investigate whether distilled datasets transfer across architectures. We find that a gradient matching objective alone leads to images that are overfit to a particular model architecture and do not yield competitive performance across foundation models. However, we overcome this issue through differentiable augmentations and a simple re-parameterization of images via a multi-scale pyramid. Compared to those retrieved via naïve pixel optimization, the resulting distilled images not only look remarkably realistic but also readily transfer across foundation models, such that a dataset distilled using, for example, a DINO backbone yields competitive performance when used to train a linear classifier on top of a different model’s representation, such as CLIP’s.
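The paper does not spell out the re-parameterization here, but a multi-scale pyramid can be sketched as follows: the image is stored as a sum of per-scale parameter tensors, each upsampled to full resolution, so that coarse levels capture low-frequency structure. The function names and the nearest-neighbor upsampling below are assumptions for illustration.

```python
import numpy as np

def upsample_nn(x, factor):
    """Nearest-neighbor upsample an (H, W, C) array by an integer factor."""
    return x.repeat(factor, axis=0).repeat(factor, axis=1)

def render_pyramid(levels, full_res):
    """Render an image as the sum of per-scale parameter tensors.

    Optimizing `levels` instead of raw pixels biases the synthetic image
    toward smooth, low-frequency structure, which helps it transfer
    across feature extractors instead of overfitting to one backbone.
    """
    image = np.zeros((full_res, full_res, 3))
    for level in levels:
        factor = full_res // level.shape[0]
        image += upsample_nn(level, factor)
    return image

# Pyramid parameters at 1/8, 1/4, 1/2, and full resolution for a 32x32 image.
rng = np.random.default_rng(0)
levels = [rng.normal(0, 0.1, (r, r, 3)) for r in (4, 8, 16, 32)]
image = render_pyramid(levels, full_res=32)
```

During distillation, gradients would flow through `render_pyramid` back into every level of the pyramid; the differentiable augmentations mentioned above would be applied to the rendered image before feature extraction.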

We also observe that our distilled datasets offer several interesting interpretability results, including predicting alignment between different models, explaining susceptibility (or robustness) to spurious correlations in adversarial datasets, and highlighting out-of-distribution capabilities.

Extensive experiments and ablations validate our Linear Gradient Matching method’s effectiveness on this new dataset distillation task and highlight its potential as an interpretability tool.

Dataset Distillation. As dataset and model sizes continue to grow, so has interest in more efficient forms of learning. To this end, researchers have worked toward methods of learning optimal training data such that one could train an effective model from scratch using as few samples as possible. The initial proposal of Dataset Distillation [47] expressed the model's final performance as a function of the synthetic training data, which was optimized end-to-end by back-propagating through many inner training iterations. Follow-up works introduced proxy losses and learned synthetic images that matched gradients [55], feature distributions [54], training trajectories [6], and more [27, 51, 56]. Some works extend dataset distillation to large models [50, 51], but these methods do not excel in the ultra-small data regime, i.e., one image per class; for such settings, methods such as Trajectory Matching [6] have been among the strongest prior approaches.

Our meta loss (L_meta) is defined as the cosine distance between the gradients of the classification losses on real and synthetic data (ℓ_real and ℓ_syn) with respect to a randomly initialized linear probe (W). This meta loss is back-propagated through the synthetic gradient calculation and used to update our synthetic images. This technique allows us to distill large datasets to just a single image per class while still achieving high performance.
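A minimal sketch of this meta loss, using the closed-form cross-entropy gradient of a linear probe (NumPy stand-in; in the actual method, this quantity would additionally be back-propagated through the frozen feature extractor to update the synthetic pixels, which a framework with autograd would handle):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def probe_grad(W, feats, labels):
    """Gradient of the softmax cross-entropy loss w.r.t. the linear probe W."""
    onehot = np.eye(W.shape[0])[labels]
    p = softmax(feats @ W.T)
    return (p - onehot).T @ feats / feats.shape[0]

def meta_loss(W, real_feats, real_labels, syn_feats, syn_labels):
    """Cosine distance between the real-data and synthetic-data probe gradients."""
    g_real = probe_grad(W, real_feats, real_labels).ravel()
    g_syn = probe_grad(W, syn_feats, syn_labels).ravel()
    cos = g_real @ g_syn / (np.linalg.norm(g_real) * np.linalg.norm(g_syn))
    return 1.0 - cos

rng = np.random.default_rng(0)
W = rng.normal(size=(10, 64))    # randomly initialized linear probe
real = rng.normal(size=(32, 64)) # stand-in for real-image features
syn = rng.normal(size=(10, 64))  # one synthetic sample per class
loss = meta_loss(W, real, rng.integers(0, 10, 32), syn, np.arange(10))
```

The loss is 0 when the synthetic gradients point in the same direction as the real ones, so minimizing it over many random draws of W pushes the synthetic set to induce the same probe updates as the full real dataset.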

This content is AI-processed based on ArXiv data.
