Delving into Adversarial Transferability on Image Classification: Review, Benchmark, and Evaluation
Adversarial transferability refers to the capacity of adversarial examples generated on a surrogate model to deceive other, unseen victim models. This property eliminates the need for direct access to the victim model during an attack, raising considerable security concerns in practical applications and recently attracting substantial research attention. In this work, we identify the lack of a standardized framework and criteria for evaluating transfer-based attacks, which leads to potentially biased assessments of existing approaches. To close this gap, we conduct an exhaustive review of hundreds of related works, organizing transfer-based attacks into six distinct categories. We then propose a comprehensive framework designed to serve as a benchmark for evaluating these attacks. In addition, we delineate common strategies that enhance adversarial transferability and highlight prevalent issues that can lead to unfair comparisons. Finally, we provide a brief review of transfer-based attacks beyond image classification.
💡 Research Summary
This paper addresses a critical gap in the study of adversarial transferability for image classification: the lack of a unified evaluation framework and consistent benchmarking practices. While many works have proposed methods to improve the ability of adversarial examples generated on a surrogate model to fool unseen victim models, comparisons across papers are often unfair due to differing datasets, model sets, attack hyper‑parameters, and baseline choices.
The authors first conduct an exhaustive literature review, collecting over one hundred transfer‑based attacks. They reorganize these methods into six distinct methodological categories: (1) Gradient‑based attacks, which modify the gradient computation (e.g., momentum, variance‑tuning, Nesterov acceleration); (2) Input‑Transformation attacks, which apply stochastic or deterministic transformations to the input image before gradient estimation (e.g., resizing, padding, random noise, mixup); (3) Advanced‑Objective attacks, which replace the standard cross‑entropy loss with alternative objectives such as hinge loss, feature‑distance, or regularization terms; (4) Generation‑based attacks, which train a generator (often GAN‑style) to produce adversarial perturbations directly; (5) Model‑related attacks, which exploit architectural properties of the surrogate model (e.g., skip‑connection manipulation, weight pruning, customized forward/backward passes); and (6) Ensemble‑based attacks, which attack multiple surrogate models simultaneously and aggregate their losses or logits.
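To make the first category concrete, the momentum update at the heart of MI-FGSM can be sketched as follows. This is an illustrative implementation, not the paper's code; `loss_grad` is a placeholder callable that would return the surrogate model's input gradient dL/dx.

```python
import numpy as np

def mi_fgsm(x, loss_grad, eps=16/255, alpha=1/255, T=10, mu=1.0):
    """Sketch of momentum iterative FGSM (MI-FGSM), a gradient-based attack.

    `loss_grad(x_adv)` is an assumed callable returning the loss gradient
    w.r.t. the input on the surrogate model (a placeholder, not a library API).
    """
    g = np.zeros_like(x)  # accumulated momentum
    x_adv = x.copy()
    for _ in range(T):
        grad = loss_grad(x_adv)
        # Normalize by the L1 norm before accumulating into the momentum term
        g = mu * g + grad / (np.abs(grad).sum() + 1e-12)
        x_adv = x_adv + alpha * np.sign(g)        # untargeted ascent step
        x_adv = np.clip(x_adv, x - eps, x + eps)  # project into the eps-ball
        x_adv = np.clip(x_adv, 0.0, 1.0)          # keep a valid image
    return x_adv
```

The other five categories largely keep this loop and change one component: input-transformation attacks transform `x_adv` before calling `loss_grad`, advanced-objective attacks change the loss behind it, and ensemble attacks aggregate gradients from several surrogates.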
To enable fair comparison, the paper proposes a comprehensive benchmark. The evaluation suite includes eight modern architectures: four CNNs (ResNet‑50, VGG‑16, MobileNet‑v2, Inception‑v3) and four Vision Transformers (ViT, PiT, Swin, Visformer), plus a set of five defense mechanisms (Adversarial Training, HGD, Randomized Smoothing, NRP, DiffPure). All experiments use an ImageNet‑compatible subset of 1,000 images resized to 224×224. Perturbations are constrained by an ℓ∞ norm of ε = 16/255 with step size α = 1/255; untargeted attacks run for T = 10 iterations, while targeted attacks run for T = 300. The default surrogate model is ResNet‑50, and the baseline attack is MI‑FGSM unless a method explicitly changes the optimization scheme. Attack success rate (ASR) on each victim model is the primary metric.
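The protocol above can be collected into a single configuration sketch. All names here are hypothetical (this is not the benchmark's actual code); the values mirror the settings stated in the text.

```python
# Hypothetical config mirroring the benchmark protocol described above.
BENCHMARK_CONFIG = {
    "dataset": {
        "source": "ImageNet-compatible subset",
        "num_images": 1000,
        "image_size": (224, 224),
    },
    "perturbation": {
        "norm": "linf",
        "epsilon": 16 / 255,   # ℓ∞ budget
        "step_size": 1 / 255,  # α
    },
    "iterations": {"untargeted": 10, "targeted": 300},
    "surrogate": "ResNet-50",
    "baseline_attack": "MI-FGSM",
    "metric": "attack success rate (ASR) per victim model",
}
```

Pinning every one of these choices in one place is precisely what the authors argue is missing from prior evaluations: two papers that differ in any of these fields are not directly comparable.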
Experimental results reveal clear patterns. Gradient‑based enhancements such as momentum (MI‑FGSM), variance‑tuning (VMI‑FGSM), and Nesterov acceleration consistently boost transferability over vanilla I‑FGSM, often by 5‑15 % absolute ASR. Input‑Transformation techniques (e.g., resizing‑padding, random noise, “admix”) improve diversity and raise ASR by 3‑10 % but suffer diminishing returns when transformations become too aggressive. Advanced‑Objective methods that target intermediate feature representations (e.g., feature‑distance loss) achieve the largest gains among single‑model attacks, sometimes surpassing ensemble baselines. Generation‑based approaches provide rapid sample generation but are sensitive to the training data distribution and typically lag behind the best gradient‑based methods in transfer success. Model‑related tricks that align gradient flow with architectural shortcuts add modest improvements (≈3 %). Ensemble‑based attacks remain the most powerful, achieving the highest ASR (often >80 % on defended models) by averaging or minimizing losses across multiple surrogates.
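Since ensemble attacks are the strongest family above, here is a minimal sketch of the loss-averaging variant: the input gradient is a weighted average over several surrogates, and that aggregated gradient then drives a standard iterative attack. The per-model `loss_grads` callables are placeholders, not a real API.

```python
import numpy as np

def ensemble_grad(x_adv, loss_grads, weights=None):
    """Aggregate input gradients from several surrogate models.

    `loss_grads` is a list of assumed callables, one per surrogate, each
    returning dL/dx at `x_adv`; equal weights are used by default.
    """
    if weights is None:
        weights = [1.0 / len(loss_grads)] * len(loss_grads)
    g = np.zeros_like(x_adv)
    for w, lg in zip(weights, loss_grads):
        g += w * lg(x_adv)  # weighted sum of per-surrogate gradients
    return g
```

Averaging logits or losses before differentiating gives closely related variants; the common idea is that a perturbation that fools several surrogates at once is more likely to transfer to an unseen victim.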
Beyond performance numbers, the authors critically examine methodological shortcomings in prior work. They identify frequent unfair practices: (i) evaluating new methods only on undefended models while ignoring strong baselines on defended models; (ii) using under‑optimized baselines (e.g., default FGSM with a single step) as comparison points; (iii) reporting results on a single surrogate without testing cross‑surrogate robustness. To remedy this, the paper supplies a “fairness checklist” that mandates reporting baseline hyper‑parameters, testing across multiple victim architectures, and including defended models in the evaluation.
Finally, the paper briefly surveys transfer‑based attacks outside pure image classification, covering object detection, semantic segmentation, and cross‑modal attacks on large language models. It notes emerging trends such as cross‑task transfer (e.g., from classification to detection) and multimodal adversarial generation, indicating that the field is rapidly expanding.
In conclusion, the work delivers a valuable taxonomy, a rigorously defined benchmark, and actionable insights into what truly drives adversarial transferability. By standardizing evaluation, it paves the way for reproducible research and more meaningful progress in both attack development and defense strategies.