FAST: Topology-Aware Frequency-Domain Distribution Matching for Coreset Selection

Reading time: 5 minutes
...

📝 Original Info

  • Title: FAST: Topology-Aware Frequency-Domain Distribution Matching for Coreset Selection
  • ArXiv ID: 2511.19476
  • Date: 2025-11-22
  • Authors: Jin Cui* (Xi'an Jiaotong University, State Key Laboratory of Human-Machine Hybrid Augmented Intelligence), Boran Zhao* (Xi'an Jiaotong University, School of Software Engineering), Jiajun Xu (Xi'an Jiaotong University, School of Software Engineering), Jiaqi Guo (Nankai University, School of Mathematical Sciences), Shuo Guan (Xi'an Jiaotong University, School of Software Engineering), Pengju Ren† (Xi'an Jiaotong University, State Key Laboratory of Human-Machine Hybrid Augmented Intelligence) (* equal contribution, † corresponding author)

📝 Abstract

Coreset selection compresses large datasets into compact, representative subsets, reducing the energy and computational burden of training deep neural networks. Existing methods are either: (i) DNN-based, which are tied to model-specific parameters and introduce architectural bias; or (ii) DNN-free, which rely on heuristics lacking theoretical guarantees. Neither approach explicitly constrains distributional equivalence, largely because continuous distribution matching is considered inapplicable to discrete sampling. Moreover, prevalent metrics (e.g., MSE, KL, MMD, CE) cannot accurately capture higher-order moment discrepancies, leading to suboptimal coresets. In this work, we propose FAST, the first DNN-free distribution-matching coreset selection framework that formulates the coreset selection task as a graph-constrained optimization problem grounded in spectral graph theory and employs the Characteristic Function Distance (CFD) to capture full distributional information in the frequency domain. We further discover that naive CFD suffers from a "vanishing phase gradient" issue in medium and high-frequency regions; to address this, we introduce an Attenuated Phase-Decoupled CFD. Furthermore, for better convergence, we design a Progressive Discrepancy-Aware Sampling strategy that progressively schedules frequency selection from low to high, preserving global structure before refining local details and enabling accurate matching with fewer frequencies while avoiding overfitting. Extensive experiments demonstrate that FAST significantly outperforms state-of-the-art coreset selection methods across all evaluated benchmarks, achieving an average accuracy gain of 9.12%. Compared to other baseline coreset methods, it reduces power consumption by 96.57% and achieves a 2.2x average speedup, underscoring its high performance and energy efficiency.
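
To make the central metric concrete: the empirical characteristic function of a sample set at a frequency vector t is φ(t) = (1/n) Σ_j exp(i⟨t, x_j⟩), and a CFD-style distance compares two sample sets through the gap between their empirical characteristic functions over a set of probe frequencies. The sketch below shows this plain empirical CFD, not FAST's attenuated phase-decoupled variant or its graph-constrained optimization; the helper names (`empirical_cf`, `cfd`), the Gaussian-sampled probe frequencies, and the toy data are illustrative assumptions.

```python
import numpy as np

def empirical_cf(X, T):
    """Empirical characteristic function of samples X (n, d) at frequencies T (m, d):
    phi(t_k) = (1/n) * sum_j exp(i * <t_k, x_j>)."""
    proj = X @ T.T                             # (n, m) inner products <x_j, t_k>
    return np.exp(1j * proj).mean(axis=0)      # (m,) complex CF values

def cfd(X, Y, T):
    """Plain Characteristic Function Distance between two sample sets,
    estimated over the probe frequencies T."""
    diff = empirical_cf(X, T) - empirical_cf(Y, T)
    return np.sqrt(np.mean(np.abs(diff) ** 2))

# Toy usage: compare a random 100-point subset against the full set.
rng = np.random.default_rng(0)
full = rng.normal(size=(10_000, 8))
core = full[rng.choice(len(full), size=100, replace=False)]
freqs = rng.normal(size=(256, 8))              # Gaussian-sampled probe frequencies
print(cfd(full, core, freqs))                  # smaller = better distributional match
```

Because the characteristic function determines a distribution uniquely, driving this distance toward zero over a sufficiently rich frequency set constrains all moments of the subset, which is the property the paper leverages.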

💡 Deep Analysis

Figure 1. The energy consumption of training exceeds the annual electricity usage of numerous households by several orders of magnitude.

📄 Full Content

FAST: Topology-Aware Frequency-Domain Distribution Matching for Coreset Selection

Jin Cui*1, Boran Zhao*2, Jiajun Xu2, Jiaqi Guo3, Shuo Guan2, Pengju Ren†1
1 State Key Laboratory of Human-Machine Hybrid Augmented Intelligence, National Engineering Research Center for Visual Information and Applications, and Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University
2 School of Software Engineering, Xi'an Jiaotong University
3 School of Mathematical Sciences, Nankai University
andycui@stu.xjtu.edu.cn, {boranzhao, pengjuren}@xjtu.edu.cn
* Equal contribution. † Corresponding author.

Abstract

Coreset selection compresses large datasets into compact, representative subsets, reducing the energy and computational burden of training deep neural networks. Existing methods are either: (i) DNN-based, which are inherently coupled with network-specific parameters, inevitably introducing architectural bias and compromising generalization; or (ii) DNN-free, which utilize heuristics that lack rigorous theoretical guarantees for stability and accuracy. Neither approach explicitly constrains distributional equivalence of the representative subsets, largely because continuous distribution matching is broadly considered inapplicable to discrete dataset sampling. Furthermore, prevalent distribution metrics (e.g., MSE, KL, MMD, and CE) are often incapable of accurately capturing higher-order moment differences. These deficiencies lead to suboptimal coreset performance, preventing the selected coreset from being truly equivalent to the original dataset.

We propose FAST (Frequency-domain Aligned Sampling via Topology), the first DNN-free distribution-matching coreset selection framework that formulates the coreset selection task as a graph-constrained optimization problem grounded in spectral graph theory and employs the Characteristic Function Distance (CFD) to capture full distributional information (i.e., all moments and intrinsic correlations) in the frequency domain. We further discover that naive CFD suffers from a "vanishing phase gradient" issue in medium- and high-frequency regions; to address this, we introduce an Attenuated Phase-Decoupled CFD. Furthermore, for better convergence, we design a Progressive Discrepancy-Aware Sampling strategy that progressively schedules frequency selection from low to high. This preserves global structure before refining local details, enabling accurate matching with few frequencies while preventing overfitting. Extensive experiments demonstrate that FAST significantly outperforms state-of-the-art coreset selection methods across all evaluated benchmarks, achieving an average accuracy gain of 9.12%. Compared to other baseline coreset methods, it reduces power consumption by 96.57% and achieves a 2.2× average speedup even on a CPU with 1.7 GB of memory, underscoring its high performance and energy efficiency.

1. Introduction

Deep Neural Networks (DNNs) have achieved, and in some cases even surpassed, human-level performance across diverse domains such as vision [5, 12, 26], programming [7, 24, 31], and science [18, 28]. This remarkable success is primarily driven by the availability of massive training datasets [15, 25, 35]. However, training on such large-scale data incurs prohibitive energy costs: as illustrated in Fig. 1, the total energy consumption can reach up to 10³ MWh, exceeding the annual electricity usage of numerous households by several orders of magnitude.

Figure 1. The energy consumption of training exceeds the annual electricity usage of numerous households by several orders of magnitude.

To mitigate this challenge, a variety of dataset compression techniques have been proposed to condense large-scale datasets into compact yet representative subsets [17, 41, 45, 46]. These methods have found widespread adoption in tasks such as neural architecture search, continual learning [33, 43], and transfer learning [23]. Among existing compression techniques, coreset selection [17, 19, 29] offers superior efficiency over synthesis-based distillation [41, 45, 46] by avoiding computationally intensive nested gradient descent, making it ideal for on-device deployment. Furthermore, it preserves the fidelity of data and mitigates the failure modes of synthesis methods, which often struggle to generate highly discriminative samples for classification tasks with high inter-class similarity.

Existing coreset selection methods can be broadly categorized into two paradigms: (i) DNN-based approaches that adopt a proxy DNN to evaluate each sample's contribution to training performance. While effective, these methods are intrinsically ...
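
The excerpt names the Attenuated Phase-Decoupled CFD and the low-to-high frequency schedule but cuts off before their definitions. The sketch below is therefore only a hypothetical illustration of the described idea, assuming the characteristic function is split into amplitude and phase and the phase mismatch is attenuated by the CF magnitudes (which decay at medium and high frequencies, the regime where the naive phase gradient vanishes); the weighting `lam`, the norm-sorted frequency schedule, and all names are assumptions rather than the paper's formulation.

```python
import numpy as np

def phase_decoupled_cfd(X, Y, T, lam=0.5):
    """Hypothetical amplitude/phase-decoupled CFD (not the paper's exact form).

    The empirical characteristic function is split into amplitude and phase;
    the phase mismatch is attenuated by the product of amplitudes so that
    frequencies where the CF magnitude has decayed toward zero do not
    contribute unstable phase terms.
    """
    def cf(Z):
        return np.exp(1j * (Z @ T.T)).mean(axis=0)        # (m,) complex CF values

    cf_x, cf_y = cf(X), cf(Y)
    amp_x, amp_y = np.abs(cf_x), np.abs(cf_y)
    amp_term = (amp_x - amp_y) ** 2                        # amplitude mismatch
    phase_diff = np.angle(cf_x * np.conj(cf_y))            # wrapped phase difference
    phase_term = amp_x * amp_y * phase_diff ** 2           # attenuated phase mismatch
    return np.sqrt(np.mean(amp_term + lam * phase_term))

# Low-to-high frequency scheduling (sketch of the progressive idea):
# sort candidate frequencies by norm and widen the active band step by step.
rng = np.random.default_rng(0)
full = rng.normal(size=(5_000, 8))
core = full[rng.choice(len(full), size=200, replace=False)]
T_all = rng.normal(scale=2.0, size=(512, 8))
order = np.argsort(np.linalg.norm(T_all, axis=1))          # low-frequency probes first
for k in (64, 128, 256, 512):
    print(k, phase_decoupled_cfd(full, core, T_all[order[:k]]))
```

Sorting probe frequencies by norm and progressively widening the active band mirrors the stated low-to-high scheduling: low frequencies pin down global structure first, and higher frequencies refine local detail.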


Reference

This content is AI-processed based on open access ArXiv data.
