Learning Better Certified Models from Empirically-Robust Teachers
Adversarial training attains strong empirical robustness to specific adversarial attacks by training on concrete adversarial perturbations, but it produces neural networks that are not amenable to strong robustness certificates through neural network verification. On the other hand, earlier certified training schemes directly train on bounds from network relaxations to obtain models that are certifiably robust, but display sub-par standard performance. Recent work has shown that state-of-the-art trade-offs between certified robustness and standard performance can be obtained through a family of losses combining adversarial outputs and neural network bounds. Nevertheless, in contrast to empirical robustness, verifiability still comes at a significant cost in standard performance. In this work, we propose to leverage empirically-robust teachers to improve the performance of certifiably-robust models through knowledge distillation. Using a versatile feature-space distillation objective, we show that distillation from adversarially-trained teachers consistently improves on the state-of-the-art in certified training for ReLU networks across a series of robust computer vision benchmarks.
💡 Research Summary
This paper tackles the long-standing trade-off between empirical adversarial robustness—obtained by adversarial training—and provable robustness—obtained by certified training that leverages neural-network relaxations. While adversarial training yields models that resist specific attacks (e.g., PGD) with high clean accuracy, those models are notoriously hard to certify with branch-and-bound or other verification tools because the verification problem is NP-complete for ReLU networks. Certified training, on the other hand, inserts lower and upper bounds from convex relaxations (IBP, CROWN, DeepPoly, etc.) directly into the loss, guaranteeing that the same relaxations can later certify the trained network. However, purely certified models suffer a substantial drop in standard accuracy.
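To make the relaxation idea concrete, here is a minimal pure-Python sketch of interval bound propagation (IBP) through a linear layer and a ReLU. The function names and list-based representation are illustrative, not the paper's implementation; practical certified training uses batched tensor libraries:

```python
def ibp_linear(lb, ub, W, b):
    """Propagate interval bounds through y = W x + b.

    lb, ub: per-coordinate lower/upper bounds on the input x.
    W: weight matrix as a list of rows; b: bias vector.
    Each output bound picks the worst-case input bound per weight sign.
    """
    out_lb, out_ub = [], []
    for row, bias in zip(W, b):
        lo = bias + sum(w * (l if w >= 0 else u) for w, l, u in zip(row, lb, ub))
        hi = bias + sum(w * (u if w >= 0 else l) for w, l, u in zip(row, lb, ub))
        out_lb.append(lo)
        out_ub.append(hi)
    return out_lb, out_ub

def ibp_relu(lb, ub):
    """ReLU is monotone, so bounds pass through elementwise clamping."""
    return [max(l, 0.0) for l in lb], [max(u, 0.0) for u in ub]
```

For instance, with input bounds [0, 1] on both coordinates and the row W = [1, -1], IBP yields the output interval [-1, 1]; certified training penalizes the worst logit differences implied by such intervals.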
Recent work introduced the notion of “expressivity”: a parametrized loss that continuously interpolates between a pure adversarial loss and a pure relaxation-based loss via a convex combination controlled by a scalar α ∈ [0, 1].
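The convex combination described above can be sketched in a few lines. This is an illustrative form only; the paper's exact parametrization of the expressive loss may differ:

```python
def expressive_loss(loss_adv, loss_bound, alpha):
    """Convex combination of an adversarial loss and a relaxation-based
    bound loss. alpha = 0 recovers pure adversarial training;
    alpha = 1 recovers pure certified (bound-based) training.
    """
    assert 0.0 <= alpha <= 1.0, "alpha must lie in [0, 1]"
    return (1.0 - alpha) * loss_adv + alpha * loss_bound
```

Sweeping α then traces out the trade-off curve between empirical robustness and certifiability that the summary refers to.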