Path-Guided Flow Matching for Dataset Distillation

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Dataset distillation compresses large datasets into compact synthetic sets that yield comparable performance when used to train models. Despite recent progress on diffusion-based distillation, such methods typically depend on heuristic guidance or prototype assignment, which incurs time-consuming sampling and trajectory instability and thus hurts downstream generalization, especially under strong control or low IPC. We propose \emph{Path-Guided Flow Matching (PGFM)}, the first flow matching-based framework for generative distillation, which enables fast deterministic synthesis by solving an ODE in a few steps. PGFM performs flow matching in the latent space of a frozen VAE to learn class-conditional transport from Gaussian noise to the data distribution. In particular, we develop a continuous path-to-prototype guidance algorithm for ODE-consistent path control, which allows trajectories to reliably land on assigned prototypes while preserving diversity and efficiency. Extensive experiments across high-resolution benchmarks demonstrate that PGFM matches or surpasses prior diffusion-based distillation approaches with fewer sampling steps, delivering competitive performance with markedly improved efficiency, e.g., 7.6$\times$ more efficient than diffusion-based counterparts while achieving 78% mode coverage.


💡 Research Summary

Dataset distillation aims to compress a large training set into a tiny synthetic one while preserving the performance of models trained on the original data. Recent works have largely relied on diffusion‑based generative models, which achieve high fidelity but suffer from costly multi‑step sampling, complex noise schedules, and unstable trajectories, especially when the number of images per class (IPC) is low. In this paper the authors introduce Path‑Guided Flow Matching (PGFM), the first framework that leverages flow matching, a deterministic ODE‑based generative paradigm, for dataset distillation.

The method operates in the latent space of a frozen variational auto‑encoder (VAE). First, all training images are encoded into latent vectors, globally normalized, and clustered per class using K‑means++ with K equal to the IPC budget. The resulting cluster centroids serve as class‑specific prototypes that represent major modes of the data distribution. A pre‑trained conditional flow‑matching generator (GMFlow) provides a time‑dependent velocity field uϕ(t, z, y) that transports Gaussian noise to the class‑conditional latent distribution. Sampling consists of solving the ODE dz/dt = uϕ(t, z, y) with a small number of integration steps (typically 5–10) using an explicit solver such as Heun’s method.
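The few-step deterministic sampling described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the learned GMFlow network uϕ(t, z, y) is replaced by a hypothetical `toy_velocity` field that transports a point along a straight probability path toward a single latent "data point", and the integrator is plain Heun's method (explicit trapezoid).

```python
import numpy as np

def heun_sample(velocity, z0, n_steps=10, t0=0.0, t1=0.99):
    """Integrate dz/dt = velocity(t, z) from t0 to t1 with Heun's method."""
    z = z0.copy()
    dt = (t1 - t0) / n_steps
    t = t0
    for _ in range(n_steps):
        v1 = velocity(t, z)             # slope at the current point
        z_euler = z + dt * v1           # Euler predictor
        v2 = velocity(t + dt, z_euler)  # slope at the predicted point
        z = z + 0.5 * dt * (v1 + v2)    # trapezoidal corrector
        t += dt
    return z

# Hypothetical stand-in for the learned class-conditional field u_phi(t, z, y):
# for the straight-line path toward a single latent `target`, the marginal
# velocity is (target - z) / (1 - t); we stop slightly before t = 1 to avoid
# the singularity of this toy field.
target = np.array([1.0, -2.0])

def toy_velocity(t, z):
    return (target - z) / max(1.0 - t, 1e-3)

z0 = np.array([0.5, 0.5])                     # "Gaussian noise" starting point
z1 = heun_sample(toy_velocity, z0, n_steps=10)
```

Because the toy trajectories are straight lines, a handful of Heun steps already lands essentially on the target; in the paper the velocity field is a trained network and the endpoint is a class-conditional latent sample rather than a fixed point.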

To improve mode coverage without sacrificing the inherent stability of flow‑matching, PGFM augments the velocity field with a lightweight prototype‑guided term during the early phase of integration: dz/dt = uϕ(t, z, y) + g(t)·ũproto(t, z, y, k). Here ũproto points from the current latent (or its predicted clean version) toward the assigned prototype µy,k, scaled by an adaptive factor α(t) that enforces a trust‑region constraint. The gate g(t) deactivates the guidance after a preset early‑stop time, allowing the original flow to refine fine‑grained details. The trust‑region parameters (ρ0, ρmin) bound the L2 norm of the guidance relative to the base velocity, preventing over‑control that could lead to blurry or noisy samples.
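The guided update above can be sketched in a few lines. The schedule for α(t), the linear decay between ρ0 and ρmin, and the hard gate at `t_stop` are illustrative assumptions (the paper specifies only that g(t) deactivates guidance after an early-stop time and that a trust region bounds the guidance norm relative to the base velocity):

```python
import numpy as np

def guided_velocity(u_base, z, prototype, t,
                    t_stop=0.4, rho0=0.5, rho_min=0.1):
    """Base flow velocity plus gated, trust-region-scaled prototype guidance.

    u_base    : base velocity u_phi(t, z, y) at the current state (array)
    prototype : assigned cluster centroid mu_{y,k} in latent space
    t_stop    : gate g(t) switches guidance off for t >= t_stop (assumed)
    rho0, rho_min : trust-region bounds on ||guidance|| / ||u_base||
    """
    if t >= t_stop:                       # gate g(t) = 0: late phase, pure flow
        return u_base
    u_proto = prototype - z               # direction toward the prototype
    # Assumed schedule: trust-region radius decays linearly from rho0 to rho_min
    rho_t = rho_min + (rho0 - rho_min) * (1.0 - t / t_stop)
    base_norm = np.linalg.norm(u_base)
    proto_norm = np.linalg.norm(u_proto) + 1e-8
    # Adaptive scale alpha(t): cap the guidance L2 norm at rho_t * ||u_base||
    alpha = min(1.0, rho_t * base_norm / proto_norm)
    return u_base + alpha * u_proto
```

The trust-region cap is what prevents over-control: no matter how far the current latent is from its prototype, the guidance term can never dominate the base velocity, so the trajectory stays close to the learned flow while being nudged toward the assigned mode.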

Extensive experiments on high‑resolution benchmarks (ImageNette, ImageNet‑100, CIFAR‑100) demonstrate that PGFM matches or exceeds state‑of‑the‑art diffusion‑based distillation methods such as MGD³ and MinimaxDiff while requiring far fewer sampling steps. On ImageNet‑100 with IPC = 10, PGFM achieves a 78 % prototype hit rate and is 7.6× more computationally efficient than the best diffusion baseline, delivering comparable or higher test accuracy across multiple downstream backbones (ResNet‑18, ConvNet, ViT). Ablation studies confirm that the early‑stage guidance and trust‑region scaling are crucial for balancing diversity and fidelity.

The paper’s contributions are threefold: (1) introducing flow‑matching as an efficient alternative to diffusion for dataset distillation; (2) proposing a continuous path‑to‑prototype guidance mechanism that operates consistently with the ODE dynamics; and (3) showing that a combination of deterministic flow sampling and minimal, early‑stage control yields synthetic datasets that are both diverse and high‑quality under tight IPC budgets. Limitations include dependence on the quality of the frozen VAE and the fixed number of prototypes equal to IPC, which may restrict scalability to very high‑IPC regimes. Future work could explore adaptive prototype selection, joint training of the VAE and flow model, or extending the approach to non‑visual modalities. Overall, PGFM offers a compelling, computationally cheap route to high‑performance dataset distillation.

