Efficient Semi-Supervised Adversarial Training via Latent Clustering-Based Data Reduction
Learning robust models under adversarial settings is widely recognized to require considerably more training samples than standard learning. Recent work proposes semi-supervised adversarial training (SSAT), which utilizes external unlabeled or synthetically generated data and currently achieves state-of-the-art robustness. However, SSAT requires substantial extra data to attain high robustness, resulting in prolonged training time and increased memory usage. In this paper, we propose data reduction strategies to improve the efficiency of SSAT by optimizing the amount of additional data incorporated. Specifically, we design novel latent clustering-based techniques to select or generate a small, critical subset of data samples near the model’s decision boundary. While focusing on boundary-adjacent points, our methods maintain a balanced ratio between boundary and non-boundary data points, thereby avoiding overfitting. Comprehensive experiments across image benchmarks demonstrate that our methods can effectively reduce SSAT’s data requirements and computational costs while preserving its strong robustness advantages. In particular, our latent-space selection scheme based on k-means clustering and our guided diffusion-based approach with LCG-KM are the most effective, achieving nearly identical robust accuracies with 5 to 10 times less unlabeled data. When compared to full SSAT trained to convergence, our methods reduce total runtime by approximately 3 to 4 times due to strategic prioritization of unlabeled data.
💡 Research Summary
The paper tackles a critical bottleneck in semi‑supervised adversarial training (SSAT): the need for massive amounts of external unlabeled or synthetically generated data, which inflates memory usage, training time, and energy consumption. Building on the observation that examples near a model’s decision boundary are far more informative for robust learning than those far away, the authors propose a suite of data‑reduction techniques that either select a small, high‑impact subset from existing unlabeled data or generate a compact set of boundary‑adjacent samples on demand.
Three selection strategies are introduced. The simplest, Prediction‑Confidence Selection (PCS), picks samples with low soft‑max confidence. Two more sophisticated methods, LCS‑KM and LCS‑GMM, operate in the model’s latent embedding space (extracted from an intermediate layer). By clustering the embeddings with k‑means or a Gaussian mixture model, the algorithms identify regions of high uncertainty and select points that are farthest from cluster centroids (or have low mixture weight), thereby focusing on boundary‑critical examples while preserving a balanced ratio of boundary to non‑boundary data.
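To make the two flavors of selection concrete, here is a minimal numpy-only sketch of the ranking step behind PCS and an LCS-KM-style criterion. The paper's actual implementation is not reproduced here; the function names, the cluster count `k`, and the hand-rolled Lloyd iterations (used in place of an off-the-shelf k-means) are illustrative assumptions.

```python
import numpy as np

def pcs_select(logits, budget):
    """Prediction-Confidence Selection sketch: keep the `budget` samples
    whose maximum softmax probability is lowest (least confident)."""
    z = logits - logits.max(axis=1, keepdims=True)   # stable softmax
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    confidence = probs.max(axis=1)
    return np.argsort(confidence)[:budget]

def lcs_km_select(embeddings, budget, k=10, iters=50, seed=0):
    """LCS-KM-style sketch: cluster latent embeddings with plain Lloyd
    k-means, then keep the samples farthest from their assigned centroid,
    a proxy for boundary-adjacent points."""
    rng = np.random.default_rng(seed)
    centroids = embeddings[rng.choice(len(embeddings), k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(embeddings[:, None] - centroids[None], axis=2)
        assign = d.argmin(axis=1)
        for j in range(k):
            members = embeddings[assign == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    dist = np.linalg.norm(embeddings - centroids[assign], axis=1)
    return np.argsort(dist)[-budget:]   # farthest-from-centroid samples
```

Note that these functions only show the ranking criterion; per the summary above, the selected boundary-critical subset is combined with non-boundary points so that the final training pool keeps a balanced ratio between the two.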
To avoid the costly pre‑generation of millions of synthetic images using diffusion models, the authors devise guided diffusion fine‑tuning. A denoising diffusion probabilistic model (DDPM) is fine‑tuned with a guidance loss that mirrors the selection criteria (PCS, LCS‑KM, LCS‑GMM). The resulting guided generators—LCG‑KM and LCG‑GMM—produce exactly the desired number of high‑utility samples, eliminating the need to store or process a huge synthetic dataset.
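The guided fine-tuning objective itself is not reproduced here, but its shape can be sketched: the standard DDPM denoising loss is augmented with a guidance term that scores boundary proximity the same way LCS-KM does, by distance of latent embeddings to their nearest cluster centroid. The names `guided_loss` and the weight `lam` are illustrative assumptions, not the paper's notation.

```python
import numpy as np

def lcg_km_guidance(embeddings, centroids):
    """LCG-KM-flavored guidance term: reward generated samples whose
    latent embeddings lie far from their nearest cluster centroid
    (i.e., boundary-adjacent). Minimizing this term pushes the
    generator toward producing such samples."""
    d = np.linalg.norm(embeddings[:, None] - centroids[None], axis=2)
    nearest = d.min(axis=1)          # distance to the closest centroid
    return -nearest.mean()           # lower loss = farther from centroids

def guided_loss(denoise_loss, embeddings, centroids, lam=0.1):
    """Combined fine-tuning objective sketch: standard DDPM denoising
    loss plus the weighted clustering-based guidance term."""
    return denoise_loss + lam * lcg_km_guidance(embeddings, centroids)
```

In this sketch, a batch whose embeddings sit far from all centroids yields a lower total loss than an equally well-denoised batch near a centroid, which is the sense in which the generator is steered to emit only high-utility, boundary-adjacent samples.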
Extensive experiments on CIFAR‑10, SVHN, and a COVID‑19 medical imaging task demonstrate that the proposed methods achieve almost identical robust accuracies to full‑scale SSAT while using 5‑10× less unlabeled data. In particular, using only 10% of the external data selected by LCS‑KM yields robustness comparable to using the entire 500K‑image set. Training time is reduced by approximately 3‑4× (to about 25% of the original runtime) because the model converges faster on the more informative subset. The guided diffusion approach further cuts computational overhead by removing the separate data‑generation phase.
The paper also discusses limitations: the performance of latent‑space clustering depends on the choice of embedding dimensionality and the number of clusters, and highly imbalanced or extremely high‑dimensional data may cause unstable clustering. Future work is suggested on automated cluster‑number selection, multi‑stage selection pipelines, and extending the guided generation concept to other generative models such as VAEs or GANs. Overall, the work provides a practical pathway to make SSAT more scalable and environmentally friendly without sacrificing adversarial robustness.