Out-of-the-box: Black-box Causal Attacks on Object Detectors

February 23, 2026

Reading time: 6 minute

...

📝 Original Info

Title: Out-of-the-box: Black-box Causal Attacks on Object Detectors
ArXiv ID: 2512.03730
Date: 2025-12-03
Authors: Researchers from original ArXiv paper

📝 Abstract

Adversarial perturbations are a useful way to expose vulnerabilities in object detectors. Existing perturbation methods are frequently white-box and architecture specific. More importantly, while they are often successful, it is rarely clear why they work. Insights into the mechanism of this success would allow developers to understand and analyze these attacks, as well as fine-tune the model to prevent them. This paper presents BlackCAtt, a black-box algorithm and a tool, which uses minimal, causally sufficient pixel sets to construct explainable, imperceptible, reproducible, architecture-agnostic attacks on object detectors. BlackCAtt combines causal pixels with bounding boxes produced by object detectors to create adversarial attacks that lead to the loss, modification or addition of a bounding box. BlackCAtt works across different object detectors of different sizes and architectures, treating the detector as a black box. We compare the performance of BlackCAtt with other black-box attack methods and show that identification of causal pixels leads to more precisely targeted and less perceptible attacks. On the COCO test dataset, our approach is 2.7 times better than the baseline in removing a detection, 3.86 times better in changing a detection, and 5.75 times better in triggering new, spurious, detections. The attacks generated by BlackCAtt are very close to the original image, and hence imperceptible, demonstrating the power of causal pixels.

💡 Deep Analysis

Deep Dive into Out-of-the-box: Black-box Causal Attacks on Object Detectors.

📄 Full Content

Out-of-the-box: Black-box Causal Attacks on Object Detectors Melane Navaratnarajah∗ King’s College London Department of Informatics melane.navaratnarajah@kcl.ac.uk David A. Kelly∗ King’s College London Department of Informatics david.a.kelly@kcl.ac.uk Hana Chockler King’s College London Department of Informatics hana.chockler@kcl.ac.uk Abstract Adversarial perturbations are a useful way to expose vul- nerabilities in object detectors. Existing perturbation meth- ods are frequently white-box and architecture specific. More importantly, while they are often successful, it is rarely clear why they work. Insights into the mechanism of this success would allow developers to understand and an- alyze these attacks, as well as fine-tune the model to prevent them. This paper presents BlackCAtt, a black-box algorithm and a tool, which uses minimal, causally sufficient pixel sets to construct explainable, imperceptible, reproducible, architecture-agnostic attacks on object detectors. Black- CAtt combines causal pixels with bounding boxes produced by object detectors to create adversarial attacks that lead to the loss, modification or addition of a bounding box. BlackCAtt works across different object detectors of differ- ent sizes and architectures, treating the detector as a black box. We compare the performance of BlackCAtt with other black-box attack methods and show that identification of causal pixels leads to more precisely targeted and less per- ceptible attacks. On the COCO test dataset, our approach is 2.7 times better than the baseline in removing a detection, 3.86 times better in changing a detection, and 5.75 times better in triggering new, spurious, detections. The attacks generated by BlackCAtt are very close to the original im- age, and hence imperceptible, demonstrating the power of causal pixels. 1. Introduction Picture yourself in a self-driving car when you suddenly see a dog in the road in front of you. The car has detected it as well, via its object detector. Then, for no apparent reason, the dog is no longer detected: its bounding box has van- ished. The dog is still there, you can see it, but to the car it *These authors contributed equally to this work has become invisible. You have to intervene and apply the brakes to avoid an accident. Why did the dog vanish? Object detectors (OD) are known to be vulnerable to both accidental and adversarial perturbations [33]. In fact, image classification models in general are quite easy to attack [31, 44]. What is harder to explain is what causes the attack to work. Generic attacks, such as global gaussian noise, are reproducible and demonstrate vulnerability of the models, but do not reveal the causal relationship between pixel-level perturbations and failed object detections. There is growing interest in exploring adversarial at- tacks on object detectors using eXplainable AI (XAI) tech- niques [47, 48]. These approaches are mostly white-box methods: they need access to OD hidden layers, which is an unnatural, and generous, attack model. Moreover, saliency maps produced from the hidden layers are well known to be noisy, sensitive to input perturbations and not naturally interpretable [40, 49]. In this paper, we present BlackCAtt (Black-box Causal Attacks), a black-box, causal approach to generating ad- versarial attacks on object detectors. BlackCAtt discovers minimal, sufficient pixel sets (MSPSs) for a detected object using ReX [7]. These pixels, by themselves, are enough to cause the required detection [6, 15]. BlackCAtt uses these pixels to generate low-distortion attacks that remove, alter, or introduce detections. BlackCAtt attacks the causes of the classification, not random pixels nor the entire image. One might expect that all MSPSs would be contained within the bounding box. One of the most surprising results in this paper is that MSPSs are, in fact, frequently either fully outside, or not fully contained within the OD bounding box. We exploit this phenomenon, showing that perturbing causal pixels outside the box often makes the box disappear (Section 6). This happens across different detector architec- tures (single-stage, two-stage, and transformer-based). We also compare the accuracy and precision of BlackCAtt’s native MSPSs by applying our extraction and perturba- tion techniques to another popular black box saliency tool, DRISE, and quantify attack success with a number of mea- sures, including perceptual distortion [50]. 1 arXiv:2512.03730v1 [cs.CV] 3 Dec 2025 (a) Cat and its bounding box (b) Bounding box and MSPS (c) No cat 1: blur (d) No cat 2: black Figure 1. The MSPS for cat (Figure 1b) reveals a dependency on the surrounding context. BlackCAtt starts with causal pixels outside of the bounding box and works inwards in order to maximize imperceptibility. In both Figures 1c and 1d the cat is still clearly present and complete, but YOLO no longer detects the cat. The attack works because BlackCAtt changes part of the cause of the detec

…(Full text truncated)…

📄 Read Full PDF on ArXiv

📸 Image Gallery

000000114270_added_new_pred_mixture_of_gaussian_bbox.png

000000114270_added_new_pred_mixture_of_gaussian_bbox.webp

000000146422_best_pred_changed_mixture_of_gaussian_bbox.png

000000146422_best_pred_changed_mixture_of_gaussian_bbox.webp

$analysis_conf_vs_frac_exp_inside_coco_fasterrcnn.png$ $analysis_conf_vs_frac_exp_inside_coco_fasterrcnn.webp$ $analysis_conf_vs_frac_exp_inside_coco_yolo.png$ $analysis_conf_vs_frac_exp_inside_coco_yolo.webp$