Safety-Efficacy Trade-Off: Robustness against Data Poisoning
Backdoor and data-poisoning attacks can achieve high attack success while evading existing spectral and optimisation-based defences. We show that this behaviour is not incidental but arises from a fundamental geometric mechanism in input space. Using kernel ridge regression as an exact model of wide neural networks, we prove that clustered dirty-label poisons induce a rank-one spike in the input Hessian whose magnitude scales quadratically with attack efficacy. Crucially, for nonlinear kernels we identify a near-clone regime in which poison efficacy remains order one while the induced input curvature vanishes, making the attack provably spectrally undetectable. We further show that input-gradient regularisation contracts poison-aligned Fisher and Hessian eigenmodes under gradient flow, yielding an explicit and unavoidable safety-efficacy trade-off by reducing data-fitting capacity. For exponential kernels, this defence admits a precise interpretation as an anisotropic high-pass filter that increases the effective length scale and suppresses near-clone poisons. Extensive experiments on linear models and deep convolutional networks across MNIST, CIFAR-10, and CIFAR-100 validate the theory, demonstrating consistent lags between attack success and spectral visibility, and showing that regularisation and data augmentation jointly suppress poisoning. Our results establish when backdoors are inherently invisible and provide the first end-to-end characterisation of poisoning, detectability, and defence through input-space curvature.
💡 Research Summary
The paper presents a rigorous geometric analysis of data-poisoning (backdoor) attacks using kernel ridge regression (KRR) as an exact proxy for infinitely wide neural networks. By modeling a cluster of dirty-label poisons as a tight block in the kernel matrix, the authors prove that the input Hessian acquires a rank-one spike whose top eigenvalue is proportional to the squared "gain" S(m;λ) = mc/(λ + k_ζ·m), where m is the number of poisoned points and λ the ridge term. The attack's efficacy Δf grows linearly with m, while the curvature (the spike) grows quadratically, establishing a "spike-efficacy law".
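A minimal numerical sketch of this mechanism, under assumed choices (Gaussian kernel, toy 2-D data; `x_trigger`, `ell`, and `lam` are illustrative, not the paper's values): injecting m identical dirty-label poisons at a trigger point and tracking how the KRR prediction shift Δf at the trigger grows with m.

```python
import numpy as np

def gaussian_kernel(X, Y, ell=1.0):
    # k(x, y) = exp(-||x - y||^2 / (2 ell^2))
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * ell**2))

def krr_predict(X_train, y_train, X_test, lam=0.1, ell=1.0):
    # Solve (K + lam I) alpha = y, then predict k(X_test, X_train) @ alpha.
    K = gaussian_kernel(X_train, X_train, ell)
    alpha = np.linalg.solve(K + lam * np.eye(len(X_train)), y_train)
    return gaussian_kernel(X_test, X_train, ell) @ alpha

rng = np.random.default_rng(0)
X_clean = rng.normal(size=(50, 2))
y_clean = np.sign(X_clean[:, 0])            # clean labels from a simple rule

x_trigger = np.array([[2.0, 2.0]])          # hypothetical trigger location
f_clean = krr_predict(X_clean, y_clean, x_trigger)

# Inject m identical dirty-label poisons at the trigger and track Δf.
deltas = []
for m in [1, 2, 4, 8, 16]:
    X_p = np.vstack([X_clean, np.repeat(x_trigger, m, axis=0)])
    y_p = np.concatenate([y_clean, -np.ones(m)])   # flipped label
    delta_f = krr_predict(X_p, y_p, x_trigger) - f_clean
    deltas.append(float(delta_f[0]))
    print(f"m={m:2d}  Δf={deltas[-1]:+.4f}")
```

The prediction at the trigger is pulled monotonically toward the flipped label as the poison cluster grows, matching the efficacy side of the spike-efficacy law.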
For linear kernels the ratio R_k = ‖∇ₓk‖²/k₀² is constant, so any increase in efficacy is accompanied by a comparable increase in curvature, making the attack spectrally visible. In contrast, for nonlinear kernels—specifically the Gaussian (exponential) kernel—the ratio becomes R_k = (‖x₀−ζ‖/ℓ)⁴. When the trigger point x₀ lies much closer than the kernel length-scale ℓ (the "near-clone" regime), the kernel value k₀ ≈ 1 while the gradient norm scales as r²/ℓ⁴, causing the Hessian spike to vanish quadratically even though Δf remains O(1). Thus backdoors can be highly effective yet spectrally undetectable. The authors link this regime to neural collapse: at near-zero training error, class features collapse to class means, so dirty-label poisons become near-duplicates of clean examples in feature space, satisfying the near-clone condition.
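The near-clone separation of the two channels can be seen directly from the Gaussian kernel's closed form (a sketch with an assumed length-scale ℓ = 1): as the trigger-to-clone distance r shrinks, the kernel value k₀ (the efficacy channel) stays order one while the input-gradient norm (the curvature channel) vanishes.

```python
import numpy as np

# Gaussian kernel k(x, ζ) = exp(-r² / (2ℓ²)) with r = ||x - ζ||.
# Its input gradient satisfies ||∇ₓk|| = (r/ℓ²)·k, so the curvature channel
# dies off with r while the kernel value approaches 1.
ell = 1.0
rows = []
for r in [1.0, 0.5, 0.1, 0.01]:
    k0 = np.exp(-r**2 / (2 * ell**2))      # efficacy channel: k(x₀, ζ) → 1
    grad_norm = (r / ell**2) * k0          # curvature channel: ||∇ₓk|| → 0
    rows.append((r, k0, grad_norm))
    print(f"r={r:5.2f}  k0={k0:.5f}  ||grad k||={grad_norm:.5f}")
```

At r = 0.01 the poison is nearly a clone: k₀ is indistinguishable from 1 (the attack still works), yet the gradient norm, and hence the Hessian spike built from it, is negligible.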
The paper then introduces input-gradient regularisation (penalising ½κ‖∇ₓL‖²). In the KRR setting this adds a positive-semidefinite matrix G to the kernel, yielding the modified system (K+λI+κG)α = y. Theorem 3.9 shows that the effective degrees of freedom df(κ) = tr[K(K+λI+κG)⁻¹] decrease monotonically in κ: stronger regularisation contracts poison-aligned eigenmodes but necessarily shrinks data-fitting capacity, making the safety-efficacy trade-off explicit.
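A sketch of the regularised system under stated assumptions (Gaussian kernel; G here is a surrogate PSD matrix built from kernel input-gradients at the training points, while the paper's exact G may differ): solving (K+λI+κG)α = y and checking that the effective degrees of freedom tr[K(K+λI+κG)⁻¹] shrink as κ grows.

```python
import numpy as np

def gaussian_kernel(X, Y, ell=1.0):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * ell**2))

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 2))
n, ell, lam = len(X), 1.0, 0.1
K = gaussian_kernel(X, X, ell)

# Surrogate PSD regulariser from kernel input-gradients ∇ₓk(x_i, x_j):
# G = J Jᵀ is PSD by construction, so it only adds curvature penalties.
diff = X[:, None, :] - X[None, :, :]          # shape (n, n, d)
gradK = -(diff / ell**2) * K[:, :, None]      # ∇ₓk for the Gaussian kernel
J = gradK.reshape(n, -1)                      # stack gradient coordinates
G = J @ J.T

dfs = []
for kappa in [0.0, 0.1, 1.0, 10.0]:
    M = K + lam * np.eye(n) + kappa * G       # modified system matrix
    alpha = np.linalg.solve(M, np.sign(X[:, 0]))  # regularised KRR fit
    df = np.trace(K @ np.linalg.inv(M))       # df(κ) = tr[K(K+λI+κG)⁻¹]
    dfs.append(df)
    print(f"kappa={kappa:5.1f}  df={df:.3f}")
```

Any PSD G produces this monotone shrinkage (d/dκ of the trace is −tr[K M⁻¹GM⁻¹] ≤ 0), which is the capacity cost the theorem turns into a trade-off.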