RobustBlack: Challenging Black-Box Adversarial Attacks on State-of-the-Art Defenses
Although adversarial robustness has been extensively studied in white-box settings, recent advances in black-box attacks (including transfer- and query-based approaches) are primarily benchmarked against weak defenses, leaving a significant gap in the evaluation of their effectiveness against more recent and moderately robust models (e.g., those featured on the RobustBench leaderboard). In this paper, we question this lack of attention from black-box attacks to robust models. We establish a framework to evaluate the effectiveness of recent black-box attacks against both top-performing and standard defense mechanisms on the ImageNet dataset. Our empirical evaluation reveals the following key findings: (1) the most advanced black-box attacks struggle to succeed even against simple adversarially trained models; (2) robust models optimized to withstand strong white-box attacks, such as AutoAttack, also exhibit enhanced resilience against black-box attacks; and (3) robustness alignment between the surrogate models and the target model is a key factor in the success rate of transfer-based attacks.
💡 Research Summary
The paper “RobustBlack: Challenging Black‑Box Adversarial Attacks on State‑of‑the‑Art Defenses” addresses a critical gap in the evaluation of black‑box adversarial attacks: most recent attacks are benchmarked against weak or outdated defenses, while modern, highly robust models (e.g., those on the RobustBench leaderboard) are rarely tested. To close this gap, the authors construct a rigorous evaluation protocol and apply it to a broad set of attacks and defenses on the ImageNet dataset.
Evaluation protocol. The authors fix a realistic ℓ∞ perturbation budget of ε = 4/255, following RobustBench recommendations, and evaluate attacks on eight top‑performing defenses. These defenses span simple adversarial training (Madry’s AT), AutoAttack‑optimized models, ConvStem‑based architectures, Vision Transformers, and recent large‑scale pre‑training schemes. By keeping the budget and dataset constant, the protocol eliminates confounding factors that have plagued earlier studies (e.g., large ε, small models, non‑robust targets).
Attack suite. Thirteen representative black‑box attacks are selected, covering the two dominant paradigms: (1) transfer‑based attacks that generate adversarial examples on a white‑box surrogate and then test them on the target, and (2) query‑based attacks that iteratively probe the target model. Transfer‑based methods include MI‑FGSM, DI‑FGSM, TI‑FGSM, VMI, ADMIX, LGV, Ghost, SGM, and others; query‑based methods include ZOO, Square, Sign‑OPT, and RayS. The authors also incorporate recent hybrid approaches such as BASES and TREMBA, which combine surrogate ensembles with query feedback.
Key findings.
- Advanced transfer attacks struggle against even modestly robust models. When evaluated under the strict ε = 4/255 budget, the best transfer attacks achieve success rates below 10 % against simple adversarially trained networks, a dramatic drop from the >80 % rates reported in earlier work that used larger perturbations.
- Robustness against strong white‑box attacks translates to black‑box resilience. Models that have been hardened specifically for AutoAttack (the de‑facto standard white‑box benchmark) also exhibit markedly higher resistance to both transfer and query attacks. On average, these models reduce black‑box success rates by more than 30 % compared to baseline defenses.
- Robustness alignment between surrogate and target is a decisive factor. The authors introduce the notion of “robustness alignment”: when the surrogate model shares the same defense strategy or training pipeline as the target, transfer success improves by an average of 6.49 percentage points. This effect is observed across multiple attack families and suggests that attackers can boost efficacy simply by selecting surrogates that are themselves robust.
- Some state‑of‑the‑art defenses unintentionally aid black‑box attacks. Certain recent robust models can themselves be repurposed as powerful surrogates, raising the success rate of surrogate‑based attacks (e.g., BASES) from 1.26 % to 12.55 % on the most robust targets. This highlights a previously overlooked risk: a defense may improve white‑box robustness while simultaneously providing a better foothold for black‑box adversaries.
Implications and recommendations. The study argues that black‑box attack research must adopt standardized, robust evaluation protocols that include strong defenses; otherwise, conclusions about attack effectiveness are misleading. Moreover, the observed correlation between white‑box and black‑box robustness suggests that improving defenses against AutoAttack can serve as a proxy for broader security. Finally, the authors call for defense designers to assess how their models might be used as surrogates, and to incorporate such “surrogate‑risk” analyses into the development pipeline.
Contribution summary.
- Demonstrates that simple adversarial training already neutralizes many state‑of‑the‑art black‑box attacks under realistic budgets.
- Shows that defenses optimized for AutoAttack provide cross‑paradigm robustness, bridging the white‑box/black‑box gap.
- Reveals that robustness alignment between surrogate and target can unintentionally increase attack success, and that some defenses may act as strong surrogates for attackers.
The authors release a full replication package (https://figshare.com/projects/RobustBlack/265786) to enable reproducibility and encourage the community to adopt more stringent benchmarking practices.