Towards Adversarially Robust Object Detection


Authors: Haichao Zhang, Jianyu Wang

Baidu Research, Sunnyvale, USA
hczhang1@gmail.com, wjyouch@gmail.com

Abstract

Object detection is an important vision task and has emerged as an indispensable component in many vision systems, rendering its robustness an increasingly important performance factor for practical applications. While object detection models have been demonstrated to be vulnerable to adversarial attacks by many recent works, very few efforts have been devoted to improving their robustness. In this work, we take an initial step in this direction. We first revisit and systematically analyze object detectors and many recently developed attacks from the perspective of model robustness. We then present a multi-task learning perspective of object detection and identify an asymmetric role of task losses. We further develop an adversarial training approach which can leverage the multiple sources of attacks for improving the robustness of detection models. Extensive experiments on PASCAL-VOC and MS-COCO verify the effectiveness of the proposed approach.

1. Introduction

Deep learning models have been widely applied to many vision tasks such as classification [45, 47, 19] and object detection [15, 14, 29, 40, 42, 3], leading to state-of-the-art performance. However, one impeding factor of deep learning models is their lack of robustness. It has been shown that deep net-based classifiers are vulnerable to adversarial attacks [49, 16], i.e., there exist adversarial examples, slightly modified but visually indistinguishable versions of the original images, that cause the classifier to generate incorrect predictions [36, 4]. Many efforts have been devoted to improving the robustness of classifiers [35, 34, 56, 17, 25, 44, 46, 38, 30].
Object detection is a computer vision technique that deals with detecting instances of semantic objects in images [54, 8, 12]. It is a natural generalization of the vanilla classification task, as it outputs not only the object label, as in classification, but also the object location. Many successful object detection approaches have been developed during the past several years [15, 14, 42, 29, 40], and object detectors powered by deep nets have emerged as an indispensable component in many vision systems of real-world applications.

Figure 1. Standard vs. robust detectors on clean and adversarial images. The adversarial image is produced using PGD-based detector attacks [23, 33] with perturbation budget 8 (out of 256). The standard model [29] fails completely on the adversarial image, while the robust model produces reasonable detection results.

Recently, it has been shown that object detectors can also be attacked by maliciously crafted inputs [57, 32, 23, 6, 55, 11, 31, 22] (c.f. Figure 1). Given their critical role in applications such as surveillance and autonomous driving, it is important to investigate approaches for defending object detectors against various adversarial attacks. However, while many works have shown it is possible to attack a detector, it remains largely unclear whether it is possible to improve the robustness of detectors, and what a practical approach for doing so would be. This work serves as an initial attempt to bridge this gap. We show that it is possible to improve the robustness of an object detector w.r.t. various types of attacks, and we propose a practical approach for achieving this by generalizing the adversarial training framework from classification to detection.
The contribution of this paper is threefold: i) we provide a categorization and analysis of different attacks on object detectors, revealing their shared underlying mechanisms; ii) we highlight and analyze the interactions between different task losses and their implications for robustness; iii) we generalize the adversarial training framework from classification to detection and develop an adversarial training approach that properly handles the interactions between task losses for improving detection robustness.

2. Related Work

Attacks and Adversarial Training for Classification. Adversarial examples have been investigated for general learning-based classifiers before [2]. As learning-based models, deep networks are also vulnerable to adversarial examples [49, 37]. Many variants of attacks [16, 36, 4] and defenses [35, 34, 56, 17, 25, 30, 44, 46, 38, 1] have been developed. The fast gradient sign method (FGSM) [16] and projected gradient descent (PGD) [33] are two representative approaches for white-box adversarial attack generation. Adversarial training [16, 21, 50, 33] is one of the effective defense methods against adversarial attacks. It achieves robust model training by solving a minimax problem, where the inner maximization generates attacks according to the current model parameters while the outer optimization minimizes the training loss w.r.t. the model parameters [16, 33].

Object Detection and Adversarial Attacks. Many successful object detection approaches have been developed during the past several years, including one-stage [29, 40] and two-stage variants [15, 14, 42]. Two-stage detectors refine proposals from the first stage by one or multiple refinement steps [42, 3]. We focus on one-stage detectors in this work due to their essential role in different variants of detectors. A number of attacks on object detectors have been developed very recently [57, 32, 6, 11, 55, 23, 22, 31].
[57] extends the attack generation method from classification to detection and demonstrates that it is possible to attack detectors using a designed classification loss. Lu et al. generate adversarial examples that fool detectors for stop-sign and face detection [32]. [6] develops physical attacks for Faster-RCNN [42] and adapts the expectation-over-transformation idea for generating physical attacks that remain effective under various transformations such as viewpoint variations. [23] proposes to attack the region-proposal network (RPN) with a specially designed hybrid loss incorporating both classification and localization terms. Apart from the full images, it is also possible to attack detectors by restricting the attacks to a local region [22, 31].

3. Object Detection and Attacks Revisited

We revisit object detection and discuss the connections between many variants of attacks developed recently.

Figure 2. One-stage detector architecture. A base-net (with parameters θ_b) is shared by the classification (parameters θ_c) and localization (parameters θ_l) tasks. θ = [θ_b, θ_c, θ_l] denotes the full set of parameters for the detector. For training, the NMS module is removed and task losses are appended for classification and localization respectively.

3.1. Object Detection as Multi-Task Learning

An object detector f(x) → {p_k, b_k}_{k=1..K} takes an image x ∈ [0, 255]^n as input and outputs a varying number K of detected objects, each represented by a probability vector p_k ∈ R^C over C classes (including background) and a bounding box b_k = [x_k, y_k, w_k, h_k]. Non-maximum suppression (NMS) [43] is applied to remove redundant detections and obtain the final detections (c.f. Figure 2). For training, we parametrize the detector f(·) by θ.
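As a concrete reference for the NMS step mentioned above, the following is a minimal greedy-NMS sketch in plain numpy. The corner-format boxes [x1, y1, x2, y2] and the 0.5 IoU threshold are illustrative choices (the paper's box parametrization is [x, y, w, h]), not the exact implementation used by the detectors under study.

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boxes in [x1, y1, x2, y2] format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: repeatedly keep the highest-scoring box and discard
    the remaining boxes that overlap it by more than iou_thresh."""
    order = np.argsort(scores)[::-1]  # candidate indices, descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        order = order[1:][[iou(boxes[i], boxes[j]) <= iou_thresh
                           for j in order[1:]]]
    return keep
```

Note that NMS consumes both class scores and box locations, which is one reason (discussed in Section 3.2) why the two task outputs are coupled at test time.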
Then the training of the detector boils down to the estimation of θ, which can be formulated as follows:

min_θ E_{(x, {y_k, b_k}) ∼ D} L(f_θ(x), {y_k, b_k}).   (1)

Here x denotes the training image and {y_k, b_k} the ground-truth (class label y_k and bounding box b_k) sampled from the dataset D. We will drop the expectation over the data and present the subsequent derivations with a single example, to avoid notation clutter and without loss of generality:

min_θ L(f_θ(x), {y_k, b_k}).   (2)

L(·) is a loss function measuring the difference between the output of f_θ(·) and the ground-truth, and its minimization (over the dataset) leads to a proper estimation of θ. In practice, it is typically instantiated as a combination of a classification loss and a localization loss [29, 40]:

min_θ loss_cls(f_θ(x), {y_k, b_k}) + loss_loc(f_θ(x), {y_k, b_k}).   (3)

As shown in Eqn. (3), the classification and localization tasks share some intermediate computations, including the base-net (c.f. Figure 2). However, they use different parts of the output of f_θ(·) for computing losses that emphasize different aspects, i.e., classification and localization performance respectively. This is a design choice for sharing features and computation between potentially relevant tasks [29, 40], and is essentially an instance of multi-task learning [5].

3.2. Detection Attacks Guided by Task Losses

Many different attack methods for object detectors have been developed very recently [57, 32, 6, 11, 55, 23, 22, 31]. Although there are many differences in the formulations of these attacks, when viewed from the multi-task learning

Attack method                  loss_cls (T/N)   loss_loc (T/N)
ShapeShifter [6]               ✓
DFool [32], PhyAttack [11]     ✓
DAG [57], Transfer [55]        ✓
DPatch [31]                    ✓                ✓
RAP [23]                       ✓                ✓
BPatch [22]                    ✓                ✓

Table 1. Analysis of existing attack methods for object detection.
“T” denotes “targeted attack” and “N” denotes “non-targeted attack”.

perspective, as pointed out in Section 3.1, they share the same framework and design principle: an attack on a detector can be achieved by utilizing variants of the individual task losses or their combinations. This provides a common ground for understanding and comparing different attacks on object detectors. From this perspective, we can categorize existing attack methods as in Table 1. It is clear that some methods use the classification loss [6, 32, 11, 57, 55] while other methods also incorporate the localization loss [31, 23, 22]. There are two perspectives for explaining the effectiveness of an individual task loss in generating attacks: i) the classification and localization tasks share a common base-net, implying that a weakness in the base-net will be shared among all tasks built upon it; ii) while the classification and localization outputs have dedicated branches beyond the shared base-net, they are coupled in the testing phase due to the usage of NMS, which jointly uses class scores and bounding box locations for pruning redundant predictions.

Although many attacks have been developed, and it is possible to come up with new combinations and configurations following the general principle, there is a lack of understanding of the role of the individual components in model robustness. Filling this gap is one of our contributions, and it naturally leads to our robust training method for object detectors as detailed in the sequel.

4. Towards Adversarially Robust Detection

4.1. The Roles of Task Losses in Robustness

As the classification and localization tasks of a detector share a base-net (c.f. Figure 2), the two tasks will inevitably affect each other, even when the input images are manipulated according to a criterion tailored for one individual task.
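To make the notion of per-task input gradients concrete before the analysis below, here is a self-contained numpy sketch of a toy two-headed model with a shared base-net, from which the two task gradients g_c = ∇_x loss_cls and g_l = ∇_x loss_loc are estimated numerically. All the weights, dimensions, and loss instantiations here are illustrative assumptions, not the paper's SSD setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a detector: a shared linear "base-net" feeding two
# separate heads (all weights are random, illustrative values).
W_base = rng.normal(size=(8, 16))
W_cls = rng.normal(size=(3, 8))   # 3 class logits
W_loc = rng.normal(size=(4, 8))   # 4 box coordinates

def loss_cls(x, y=0):
    """Softmax cross-entropy on the classification head."""
    z = W_cls @ (W_base @ x)
    z = z - z.max()
    return np.log(np.exp(z).sum()) - z[y]

def loss_loc(x, b=np.zeros(4)):
    """Squared error on the localization head."""
    return 0.5 * ((W_loc @ (W_base @ x) - b) ** 2).sum()

def task_gradient(loss, x, eps=1e-5):
    """Numerical task gradient g = grad_x loss(x) via central differences."""
    g = np.zeros_like(x)
    for i in range(x.size):
        d = np.zeros_like(x)
        d[i] = eps
        g[i] = (loss(x + d) - loss(x - d)) / (2 * eps)
    return g

x = rng.normal(size=16)            # toy "image"
g_c = task_gradient(loss_cls, x)   # classification task gradient
g_l = task_gradient(loss_loc, x)   # localization task gradient
cos = g_c @ g_l / (np.linalg.norm(g_c) * np.linalg.norm(g_l))
```

Because both heads sit on the same base-net, the two gradients are generally neither identical nor orthogonal; their cosine similarity lies strictly between -1 and 1, mirroring the partial alignment discussed in this section.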
We therefore analyze the role of the task losses in model robustness from several perspectives.

Mutual Impacts of Task Losses. Our first empirical observation is that the different tasks have mutual impacts, and adversarial attacks tailored for one task can reduce the performance of the model on the other task. To show this, we take a marginalized view over one factor while investigating the impact of the other. For example, when considering classification, we can marginalize out the location factor, and the problem reduces to a multi-label classification task [52]; on the other hand, when focusing on localization only, we can marginalize out the class information and obtain a class-agnostic object detection problem [53].

Figure 3. Mutual impacts of task losses and gradient visualization. (a) Model performance on classification and localization under different attacks: clean image, loss_cls-based attack and loss_loc-based attack. The model is a standard detector trained on clean images; the performance metric is detailed in the text. (b) Scatter plot of the task gradients for classification (g_c) and localization (g_l).

The results with single-step PGD and budget 8 are shown in Figure 3 (a). The performance is measured on the detection outputs prior to NMS to better reflect the raw performance. A candidate set is first determined as the foreground candidates whose prior boxes have an IoU value larger than 0.5 with any of the ground-truth annotations. This ensures that each selected candidate provides a relatively clean input for both tasks. For classification, we compute the classification accuracy on the candidate set.
For localization, we compute the average IoU of the predicted bounding boxes with the ground-truth bounding boxes. The attack is generated with one-step PGD and a budget of 8. It can be observed from the results in Figure 3 (a) that the two losses interact with each other. The attacks based on the classification loss (loss_cls) reduce the classification performance and decrease the localization performance at the same time. Similarly, the localization-loss-induced attacks (loss_loc) reduce not only the localization performance but the classification performance as well. This can essentially be viewed as a type of cross-task attack transfer: i.e., when using only the classification loss (task) to generate adversarial images, the attacks transfer to the localization task and reduce its performance, and vice versa. This is one of the reasons why adversarial images generated based on individual task losses (e.g., the classification loss [57]) can effectively attack object detectors.

Misaligned Task Gradients. Our second empirical observation is that the gradients of the two tasks share a certain level of common direction but are not fully aligned, leading to misaligned task gradients that can obfuscate the subsequent adversarial training. To show this, we analyze the image gradients derived from the two losses (referred to as task gradients), i.e., g_c = ∇_x loss_cls and g_l = ∇_x loss_loc. The element-wise scatter plot between g_c and g_l is shown

Figure 4. Visualization of the task domains S_cls and S_loc using t-SNE. Given a single clean image x, each dot in the picture represents one adversarial example generated by solving Eqn. (5) starting from a random point within the ε-ball around x. Different colors encode the task losses used for generating the adversarial examples (red: loss_cls, blue: loss_loc).
Therefore, the samples form empirical images of the corresponding task domains. It is observed that the two task domains have both overlaps and distinctive regions.

in Figure 3 (b). We have several observations: i) the magnitudes of the task gradients are not the same (different value ranges), indicating a potential imbalance between the two task losses; ii) the directions of the task gradients are inconsistent (non-diagonal), implying potential conflicts between the two task gradients. We further visualize the task domains, each representing the domain of loss-maximizing images for the respective task (c.f. Eqn. (5)), in Figure 4. The fact that the two domains are not fully separated (i.e., they do not collapse into two isolated clusters) further reinforces our previous observation on their mutual impacts. The fact that they also have a significant non-overlapping portion is another reflection of the misalignment between the task gradients (task domains).

4.2. Adversarial Training for Robust Detection

Motivated by the preceding analysis, we propose the following formulation for robust object detection training:

min_θ [ max_{x̄ ∈ S_cls ∪ S_loc} L(f_θ(x̄), {y_k, b_k}) ],   (4)

where the task-oriented domains S_cls and S_loc represent the permissible domains induced by the individual tasks:

S_cls ≜ { x̄ | x̄ = arg max_{x̄ ∈ S_x} loss_cls(f(x̄), {y_k}) },
S_loc ≜ { x̄ | x̄ = arg max_{x̄ ∈ S_x} loss_loc(f(x̄), {b_k}) },   (5)

where S_x is defined as S_x = { z | z ∈ B(x, ε) ∩ [0, 255]^n }, and B(x, ε) = { z | ‖z − x‖_∞ ≤ ε } denotes the ℓ_∞-ball with the clean image x as center and the perturbation budget ε as radius. We denote by P_{S_x}(·) a projection operator projecting its input into the feasible region S_x.
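A minimal numpy sketch of the projection operator P_{S_x} and of the inner maximization over a single task domain in Eqn. (5), approximated (as is common in practice) by a few signed-gradient ascent steps. The callable grad_fn is a hypothetical stand-in for ∇_x of the chosen task loss; the toy quadratic loss in the usage example below is purely illustrative.

```python
import numpy as np

def project(z, x_clean, eps):
    """P_{S_x}: project z onto B(x_clean, eps) ∩ [0, 255]^n."""
    z = np.clip(z, x_clean - eps, x_clean + eps)  # l_inf ball around x_clean
    return np.clip(z, 0.0, 255.0)                 # valid pixel range

def maximize_task_loss(x_clean, grad_fn, eps=8.0, steps=20, alpha=1.0):
    """Approximate the arg max over S_x of one task loss (Eqn. (5)) with
    PGD-style signed-gradient ascent; grad_fn(x) returns the gradient of
    the chosen task loss w.r.t. the input image."""
    x = x_clean.copy()
    for _ in range(steps):
        x = project(x + alpha * np.sign(grad_fn(x)), x_clean, eps)
    return x

# Toy usage: ascend loss(x) = 0.5 * ||x - t||^2, whose input gradient is (x - t);
# ascent pushes x away from t until it hits the boundary of the eps-ball.
t = np.full(4, 200.0)
x_clean = np.full(4, 100.0)
x_adv = maximize_task_loss(x_clean, lambda x: x - t, eps=8.0, steps=20, alpha=1.0)
```

By construction the returned example always stays within the budget, i.e., ‖x_adv − x‖_∞ ≤ ε and every pixel remains in [0, 255].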
It is important to note several crucial differences compared with conventional adversarial training for classification:

• Multi-task sources for adversary training: different from adversarial training in the classification case [16, 33], where only a single source is involved, here we have multiple (in the presence of multiple objects) and heterogeneous (both classification and localization) sources of supervision for adversary generation and training, thus generalizing adversarial training for classification.

• Task-oriented domain constraints: different from the conventional adversarial training setting, which uses a task-agnostic domain constraint S_x, we introduce a task-oriented domain constraint S_cls ∪ S_loc, which restricts the permissible domain to the set of images that maximize either the classification or the localization task loss. The final adversarial example used for training is the one that maximizes the overall loss within this set.

Algorithm 1: Adversarial Training for Robust Detection
Input: dataset D, training epochs T, batch size S, learning rate γ, attack budget ε
for t = 1 to T do
  for each random batch {x^i, {y^i_k, b^i_k}}_{i=1..S} ∼ D do
    x̃^i ∼ B(x^i, ε)                                    // random start within the ε-ball
    // compute attacks in the classification task domain
    x̄^i_cls = P_{S_x}( x̃^i + ε · sign(∇_x loss_cls(x̃^i, {y^i_k})) )
    // compute attacks in the localization task domain
    x̄^i_loc = P_{S_x}( x̃^i + ε · sign(∇_x loss_loc(x̃^i, {b^i_k})) )
    // compute the final attack examples
    m = 1[ L(x̄^i_cls, {y^i_k, b^i_k}) > L(x̄^i_loc, {y^i_k, b^i_k}) ]
    x̄^i = m ⊙ x̄^i_cls + (1 − m) ⊙ x̄^i_loc
    // perform the adversarial training step
    θ = θ − γ · ∇_θ (1/S) Σ_{i=1..S} L(x̄^i, {y^i_k, b^i_k}; θ)
  end for
end for
Output: learned model parameters θ for object detection.
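The per-image attack-selection step of Algorithm 1 can be sketched as follows. The callables grad_cls, grad_loc, and total_loss are hypothetical stand-ins for ∇_x loss_cls, ∇_x loss_loc, and the overall loss L that an actual detector would supply; the selection mask m reduces to a scalar if/else for a single image.

```python
import numpy as np

def project(z, x_clean, eps):
    """P_{S_x}: clip into the l_inf eps-ball around x_clean and [0, 255]."""
    return np.clip(np.clip(z, x_clean - eps, x_clean + eps), 0.0, 255.0)

def task_oriented_example(x, grad_cls, grad_loc, total_loss, eps=8.0, rng=None):
    """One inner step of Algorithm 1: from a random start in B(x, eps),
    take a single FGSM step in each task domain, then keep the candidate
    that maximizes the overall loss L."""
    rng = rng or np.random.default_rng(0)
    x0 = project(x + rng.uniform(-eps, eps, size=x.shape), x, eps)  # x~ in B(x, eps)
    x_cls = project(x0 + eps * np.sign(grad_cls(x0)), x, eps)  # classification candidate
    x_loc = project(x0 + eps * np.sign(grad_loc(x0)), x, eps)  # localization candidate
    # selection mask m (scalar here, since we handle one image)
    return x_cls if total_loss(x_cls) > total_loss(x_loc) else x_loc
```

The outer loop of Algorithm 1 then simply replaces each training image with the selected candidate and performs a standard SGD step on the detection loss.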
The crucial advantage of the proposed formulation with task-domain constraints is that we can benefit from generating adversarial examples guided by each task without suffering from the interference between them.

If we relax the task-oriented domain to S_x, set the coordinates of the bounding box to cover the full image, and assign a single class label to the image, then the proposed formulation in Eqn. (4) reduces to the conventional adversarial training setting for classification [16, 33]. Therefore, we can view the proposed adversarial training for robust detection as a natural generalization of conventional adversarial training under the classification setting. However, it is crucial to note that while both tasks contribute to improving the model robustness in expectation according to their overall strengths, there is no interference between the tasks when generating an individual adversarial example, due to the task-oriented domain, in contrast to S_x (c.f. Sec. 5.3).

Training object detection models that are resistant to adversarial attacks boils down to solving a minimax problem as in Eqn. (4). We solve it approximately by replacing the original training images with the adversarially perturbed ones obtained by solving the inner problem, and then conducting conventional training of the model using the perturbed images, as typically done in adversarial training [16, 33]. The inner maximization is approximately solved using a variant of FGSM [16] for efficiency. To incorporate the task-oriented domain constraint, we propose to take FGSM steps within each task domain and then select the one that maximizes the overall loss. The details of the algorithm are summarized in Algorithm 1.

Figure 5. Model performance (mAP) under different numbers of steps for (a) loss_cls-based and (b) loss_loc-based PGD attacks with ε = 8. STD is the standard model; CLS and LOC are our robust models.

5. Experiments

5.1.
Experiment and Implementation Details

We use the single-shot multi-box detector (SSD) [29] with a VGG16 [45] backbone as a representative single-shot detector in our experiments. We make the necessary modifications to the VGG16 net as detailed in [29] and keep the batch normalization layers. Experiments with different detector architectures (Receptive Field Block-based Detector (RFB) [28], Feature Fusion Single Shot Detector (FSSD) [24] and YOLO-V3 [40, 41]) and backbones (VGG16 [45], ResNet50 [19], DarkNet53 [39]) are also conducted for a comprehensive evaluation.

For the PASCAL VOC dataset, we adopt the standard “07+12” protocol (the union of the 2007 and 2012 trainval sets, ∼16k images) following [29] for training. For testing, we use the PASCAL VOC2007 test set with 4952 test images and 20 classes [10].¹ For the MS-COCO dataset [27], we train on train+valminusminival 2014 (∼120k images) and test on minival 2014 (∼5k images) with 80 classes. The mean average precision (mAP) with IoU threshold 0.5 is used for evaluating the performance of a detector [10].

All models are trained from scratch using SGD with an initial learning rate of 10⁻², momentum 0.9, weight decay 0.0005 and batch size 32 [18] with the multi-box loss [9, 48]. The learning rate schedule is [40k, 60k, 80k] for PASCAL VOC and [180k, 220k, 260k] for MS-COCO, with decay factor 0.1. The input image size is 300 × 300. The pixel value range is [0, 255], shifted according to the dataset mean. For adversarial attacks and training, we use a budget ε = 8, which roughly corresponds to a PSNR of 30 between the perturbed and original images, following [23].

¹ VOC2012 test is not used, as the annotations required for generating attacks are unavailable.

Figure 6. Model performance (mAP) under different attack budgets for (a) loss_cls-based and (b) loss_loc-based PGD attacks with 20 steps. STD is the standard model; CLS and LOC are our robust models.

Figure 7 (panels: ε = 0, 2, 4, 8).
Visualization of attacks on the STD model using a loss_cls-based 20-step PGD attack (zoom in electronically for a better view).

All the attack methods incorporate the sign(·) operator into the PGD steps for normalization and efficiency, following [16].

5.2. Impacts of Task Losses on Robustness

We investigate the role of the task losses in model robustness. For this purpose, we introduce the standard model and several variants of our proposed robust model:

• STD: standard training with the clean images as the domain;
• CLS: using S_cls only as the task domain for training;
• LOC: using S_loc only as the task domain for training.

We systematically investigate the performance of these models under attacks induced by the individual task losses, with different numbers of attack steps and budgets, as follows.

Attacks under different numbers of steps. We first evaluate the performance of the models under attacks with different numbers of PGD steps and a fixed attack budget of 8. The results are shown in Figure 5. We have several interesting observations: i) the performance of the standard model (STD) drops below that of all the robust models within just a few steps and decreases quickly (approaching zero) as the number of PGD steps increases, for both loss_cls-based and loss_loc-based attacks, implying that both types of attacks are very effective against detectors; ii) all the robust models maintain a relatively stable performance across different numbers of attack steps, indicating their improved robustness against adversarial attacks compared to the standard model.

Attacks with different budgets. We evaluate model robustness under a range of attack budgets ε ∈ {2, 4, 6, 8, 10}. The results are presented in Figure 6. It is observed that the performance of the standard model trained with natural images (STD) drops significantly, e.g.
, from ∼72% on clean images (not shown in the figure) to ∼4% with a small attack budget of 2. The robust models, on the other hand, degrade more gracefully as the attack budget increases, implying their improved robustness compared to the standard model.

attacks      clean   loss_cls   loss_loc   DAG [57]   RAP [23]
standard     72.1    1.5        0.0        0.3        6.6
ours CLS     46.7    21.8       32.2       28.0       43.4
ours LOC     51.9    23.7       26.5       17.2       43.6
ours CON     38.7    18.3       27.2       26.4       40.8
ours MTD     48.0    29.1       31.9       28.5       44.9
ours avg     46.3    23.2       29.4       25.0       43.2

Table 2. Impact of task domains on model performance (mAP) and defense against attacks from the literature (attack ε = 8).

In Figure 7, we visualize the detection results under different attack budgets on the standard model. It is observed that even with a small attack budget (e.g., ε = 2), the detection results change completely, implying that the standard model is very fragile in terms of robustness, which is consistent with our previous observation from Figure 6. It is also observed that the erroneous detections can take several forms: i) label flipping: the bounding box location is roughly correct but the class label is incorrect, e.g., “diningtable” (ε: 0 → 2); ii) disappearing: the bounding box for an object is missing, e.g., “horse” and “person” (ε: 0 → 2); iii) appearing: spurious detections of objects that do not exist in the image, with locations not well aligned with any of the dominant objects, e.g., “chair” (ε: 0 → 2) and “pottedplant” (ε: 2 → 8). As the attack budget increases, the detection output is further altered in terms of the three types of changes described above. It can also be observed from the figure that the attack image generated with ε = 8 bears noticeable, though not very severe, changes compared with the original one. We therefore use attack budget ε = 8, as it is large enough while maintaining a reasonable resemblance to the original image.

5.3.
Beyond Single-Task Domain

We further examine the impact of task domains on robustness. The following approaches with different task domains are considered in addition to STD, CLS and LOC:

• CON: using the conventional task-agnostic domain S_x, which is essentially the direct application of adversarial training for classification [16, 33] to detection;
• MTD: using the task-oriented domain S_cls ∪ S_loc.

The results are summarized in Table 2. It is observed from the comparison that different domains lead to different levels of model robustness. For example, among the methods with a single task domain, LOC leads to less robust models than CLS; on the other hand, LOC has a higher clean accuracy than CLS. Therefore, it is not straightforward to select one single domain, as it is unknown a priori whether one of the task domains is the best. Simply relaxing the task domain as done in conventional adversarial training (CON) [16, 33] leads to compromised performance. Concretely, CON, with its task-agnostic domain, achieves an in-between or inferior performance compared to the models with individual task domains under different attacks, implying that simply mixing the task domains compromises performance due to the conflicts between the task gradients (Sec. 4.1). On the other hand, the robust model MTD, using adversarial training with the task-oriented domain constraint, improves the performance over the CON baseline. More importantly, when the task-oriented multi-task domain is incorporated, a proper trade-off and better overall performance are observed compared with the single-domain-based methods, implying the importance of properly handling the heterogeneous and possibly imbalanced tasks in object detectors.

SSD backbone   DAG [57]: STD / ours   RAP [23]: STD / ours
VGG16          0.3 / 28.5             6.6 / 44.9
ResNet50       0.4 / 22.9             8.8 / 39.1
DarkNet53      0.5 / 26.2             8.2 / 46.6

Table 3. Evaluation results across different backbones.
In summary, the tasks can be imbalanced and contribute differently to model robustness. As it is unknown a priori which is better, randomly adopting one or simply combining the losses (CON) can lead to compromised performance. The MTD setting overcomes this issue and achieves performance on par with or better than the best single-domain models and the task-agnostic-domain model.

5.4. Defense against Existing White-box Attacks

To further investigate model robustness, we evaluate the models against representative attack methods from the literature. We use DAG [57] and RAP [23] as representative attacks according to Table 1. It is important to note that the attacks used in training and testing are different. The results are summarized in Table 2. It is observed that the performance of the robust models improves over the standard model by a large margin. CLS performs better in general than LOC and CON in terms of robustness against the two attacks from the literature. The model using the multi-task domain (MTD) demonstrates the best performance: it has a higher clean-image accuracy than CLS and performs uniformly well against different attacks, and is thus the best overall; it will be used for reporting performance in the following. Visualizations of example results are provided in Figure 8.

5.5. Evaluation on Different Backbones

We evaluate the effectiveness of the proposed approach with different SSD backbones, including VGG16 [45], ResNet50 [19] and DarkNet53 [39]. The average performance under the DAG [57] and RAP [23] attacks is reported in Table 3. It is observed that the proposed approach boosts the performance of the detector by a large margin (20%∼30% absolute improvement) across different backbones, demonstrating that the proposed approach performs well across backbones of different network structures, with clear and consistent improvements over the baseline models.
Figure 8. Visual comparison between the standard model and ours under the DAG [57] and RAP [23] attacks with attack budget 8.

architecture     DAG [57]: STD / ours   RAP [23]: STD / ours
SSD+VGG16        0.3 / 28.5             6.6 / 44.9
RFB+ResNet50     0.4 / 27.4             8.7 / 48.7
FSSD+DarkNet53   0.3 / 29.4             7.6 / 46.8
YOLO+DarkNet53   0.1 / 27.6             8.1 / 44.3

Table 4. Evaluation results on different detection architectures.

5.6. Results on Different Detection Architectures

Our proposed approach is also applicable to different detection architectures. To show this, we use several detection architectures, including SSD [29], RFB [28], FSSD [24] and YOLO-V3 [40, 41]. The input image size for YOLO is 416 × 416, and all others take 300 × 300 images as input. The average performance under the DAG [57] and RAP [23] attacks is summarized in Table 4. It is observed that the proposed method improves over the standard method significantly and consistently across detector architectures. This clearly demonstrates the applicability of the proposed approach across detector architectures.

5.7. Defense against Transferred Attacks

We further test the performance of the robust models under transferred attacks: attacks that are transferred from models with different backbones and/or detection architectures. Our model under test is based on SSD+VGG16. The attacks transferred from different backbones are generated under the SSD architecture but with the VGG backbone replaced by ResNet or DarkNet. For attacks transferred from different detection architectures, we use RFB [28], FSSD [24] and YOLO [40, 41].² DAG [57] and RAP [23] are used as the underlying attack generation algorithms. The results are summarized in Table 5.
²As the input image size for YOLO is 416×416, which differs from the 300×300 input size of SSD, we insert a differentiable interpolation module (300² → 416²) between the 300×300 input and YOLO.

Figure 9. Visualization of failure cases (panels: small objects; visually confusing classes; incorrect bounding box and/or class). Example challenging cases include images with small objects and visually confusing classes.

transferred attack     DAG [57]   RAP [23]   average
SSD  + ResNet50        49.3       49.4       49.4
SSD  + DarkNet53       49.2       49.4       49.3
RFB  + ResNet50        49.1       49.3       49.2
FSSD + DarkNet53       49.3       49.2       49.3
YOLO + DarkNet53       49.5       49.5       49.5

Table 5. Performance of our model (SSD+VGG16) against attacks transferred from different backbones and detector architectures.

It is observed that the proposed model is robust against transferred attacks generated with different algorithms and architectures. It is also observed that the attacks transfer to a certain degree across detectors with different backbones or structures. This reconfirms the results from [57, 23].

5.8. Results on MS-COCO

We further conduct experiments on MS-COCO [27], which is more challenging both for the standard detector and for the defense, due to its increased number of classes and data variations. The results of different models under the RAP attack [23] with attack budget 8 and 20 PGD steps are summarized in Table 6. The standard model achieves very low accuracy in the presence of the attack (compared with ~40% on clean images). Our proposed models improve over the standard model significantly and perform generally well across different backbones and detection architectures. This further demonstrates the effectiveness of the proposed approach in improving model robustness.

5.9. Failure Case Analysis

We visualize in Figure 9 some example cases that are challenging to our current model.
Images with small objects, which are challenging for standard detectors [29, 40], remain one category of challenging examples for robust detectors. Better detector architectures might be necessary to address this challenge. Another challenging category is objects with visually confusing appearance, which naturally leads to low-confidence predictions. This is more related to the classification task of the detector and can benefit from advances in classification [58]. There are also cases where the predictions are inaccurate or completely wrong, which reveals the remaining challenges in robust detector training.

model      architecture  backbone    clean   attack
standard   SSD           VGG16       39.8    2.8
ours       SSD           VGG16       27.8    16.5
ours       SSD           DarkNet53   20.9    18.8
ours       SSD           ResNet50    18.0    16.4
ours       RFB           ResNet50    24.7    21.6
ours       FSSD          DarkNet53   23.5    20.9
ours       YOLO          DarkNet53   24.0    21.5

Table 6. Comparison of standard and robust models on MS-COCO under the RAP attack [23] with attack budget 8 and 20 PGD steps.

6. Conclusions

We have presented an approach for improving the robustness of object detectors against adversarial attacks. From a multi-task view of object detection, we systematically analyzed existing attacks on object detectors and the impact of individual task components on model robustness. An adversarial training method for robust object detection was developed based on these analyses. Extensive experiments have been conducted on the PASCAL-VOC and MS-COCO datasets, and the results have demonstrated the efficacy of the proposed approach in improving model robustness compared with the standard model, across different attacks, datasets, detector backbones, and architectures.

This work serves as an initial step towards adversarially robust detector training, with promising results. More effort needs to be devoted in this direction to address the remaining challenges. New advances in object detection can be used to further improve model performance, e.g.
, better loss functions for approximating the true objective [26] and different architectures for addressing small-object issues [7, 13]. Similarly, since classification is a component task of object detection, any advances on the classification task could potentially be transferred as well [58]. There is also a trade-off between accuracy on clean images and robustness for object detection, as in the classification case [51]; how to better leverage this trade-off is another direction for future work. Furthermore, by viewing object detection as an instance of multi-task learning, this work could serve as an example of robustness improvement for other multi-task learning problems as well [20, 59].

References

[1] A. Athalye, N. Carlini, and D. Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In International Conference on Machine Learning, 2018.
[2] B. Biggio and F. Roli. Wild patterns: Ten years after the rise of adversarial machine learning. In ACM Conference on Computer and Communications Security, 2018.
[3] Z. Cai and N. Vasconcelos. Cascade R-CNN: Delving into high quality object detection. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[4] N. Carlini and D. Wagner. Towards evaluating the robustness of neural networks. In IEEE Symposium on Security and Privacy, 2017.
[5] R. Caruana. Multitask learning. Machine Learning, 28(1):41–75, 1997.
[6] S. Chen, C. Cornelius, J. Martin, and D. H. Chau. ShapeShifter: Robust physical adversarial attack on Faster R-CNN object detector. CoRR, abs/1804.05810, 2018.
[7] L. Cui. MDSSD: Multi-scale deconvolutional single shot detector for small objects. CoRR, abs/1805.07009, 2018.
[8] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In IEEE Conference on Computer Vision and Pattern Recognition, 2005.
[9] D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov.
Scalable object detection using deep neural networks. In IEEE Conference on Computer Vision and Pattern Recognition, 2014.
[10] M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes challenge: A retrospective. International Journal of Computer Vision, 111(1):98–136, 2015.
[11] K. Eykholt, I. Evtimov, E. Fernandes, B. Li, A. Rahmati, F. Tramèr, A. Prakash, T. Kohno, and D. Song. Physical adversarial examples for object detectors. CoRR, abs/1807.07769, 2018.
[12] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9):1627–1645, 2010.
[13] C.-Y. Fu, W. Liu, A. Ranga, A. Tyagi, and A. C. Berg. DSSD: Deconvolutional single shot detector. CoRR, abs/1701.06659, 2017.
[14] R. Girshick. Fast R-CNN. In IEEE International Conference on Computer Vision, 2015.
[15] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, 2014.
[16] I. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. In International Conference on Learning Representations, 2015.
[17] C. Guo, M. Rana, M. Cissé, and L. van der Maaten. Countering adversarial images using input transformations. In International Conference on Learning Representations, 2018.
[18] K. He, R. B. Girshick, and P. Dollár. Rethinking ImageNet pre-training. CoRR, abs/1811.08883, 2018.
[19] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[20] A. Kendall, Y. Gal, and R. Cipolla. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics.
In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[21] A. Kurakin, I. Goodfellow, and S. Bengio. Adversarial machine learning at scale. In International Conference on Learning Representations, 2017.
[22] Y. Li, X. Bian, and S. Lyu. Attacking object detectors via imperceptible patches on background. CoRR, abs/1809.05966, 2018.
[23] Y. Li, D. Tian, M. Chang, X. Bian, and S. Lyu. Robust adversarial perturbation on deep proposal-based models. In British Machine Vision Conference, 2018.
[24] Z. Li and F. Zhou. FSSD: Feature fusion single shot multibox detector. CoRR, abs/1712.00960, 2017.
[25] F. Liao, M. Liang, Y. Dong, and T. Pang. Defense against adversarial attacks using high-level representation guided denoiser. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[26] T.-Y. Lin, P. Goyal, R. B. Girshick, K. He, and P. Dollár. Focal loss for dense object detection. In International Conference on Computer Vision, 2017.
[27] T.-Y. Lin, M. Maire, S. J. Belongie, L. D. Bourdev, R. B. Girshick, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, 2014.
[28] S. Liu, D. Huang, and Y. Wang. Receptive field block net for accurate and fast object detection. In European Conference on Computer Vision, 2018.
[29] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single shot multibox detector. In European Conference on Computer Vision, 2016.
[30] X. Liu, M. Cheng, H. Zhang, and C.-J. Hsieh. Towards robust neural networks via random self-ensemble. In European Conference on Computer Vision, 2018.
[31] X. Liu, H. Yang, L. Song, H. Li, and Y. Chen. DPatch: Attacking object detectors with adversarial patches. CoRR, abs/1806.02299, 2018.
[32] J. Lu, H. Sibai, and E. Fabry. Adversarial examples that fool detectors.
CoRR, abs/1712.02494, 2017.
[33] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018.
[34] D. Meng and H. Chen. MagNet: A two-pronged defense against adversarial examples. In ACM SIGSAC Conference on Computer and Communications Security, 2017.
[35] J. H. Metzen, T. Genewein, V. Fischer, and B. Bischoff. On detecting adversarial perturbations. In International Conference on Learning Representations, 2017.
[36] S.-M. Moosavi-Dezfooli, A. Fawzi, and P. Frossard. DeepFool: A simple and accurate method to fool deep neural networks. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[37] A. Nguyen, J. Yosinski, and J. Clune. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[38] A. Prakash, N. Moran, S. Garber, A. DiLillo, and J. Storer. Deflecting adversarial attacks with pixel deflection. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[39] J. Redmon. Darknet: Open source neural networks in C. http://pjreddie.com/darknet/, 2013–2016.
[40] J. Redmon, S. K. Divvala, R. B. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[41] J. Redmon and A. Farhadi. YOLOv3: An incremental improvement. CoRR, abs/1804.02767, 2018.
[42] S. Ren, K. He, R. B. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, 2015.
[43] A. Rosenfeld and M. Thurston. Edge and curve detection for visual scene analysis. IEEE Transactions on Computers, 20(5):562–569, 1971.
[44] P. Samangouei, M. Kabkab, and R. Chellappa.
Defense-GAN: Protecting classifiers against adversarial attacks using generative models. In International Conference on Learning Representations, 2018.
[45] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.
[46] Y. Song, T. Kim, S. Nowozin, S. Ermon, and N. Kushman. PixelDefend: Leveraging generative models to understand and defend against adversarial examples. In International Conference on Learning Representations, 2018.
[47] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[48] C. Szegedy, S. E. Reed, D. Erhan, and D. Anguelov. Scalable, high-quality object detection. CoRR, abs/1412.1441, 2014.
[49] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. In International Conference on Learning Representations, 2014.
[50] F. Tramèr, A. Kurakin, N. Papernot, D. Boneh, and P. McDaniel. Ensemble adversarial training: Attacks and defenses. In International Conference on Learning Representations, 2018.
[51] D. Tsipras, S. Santurkar, L. Engstrom, A. Turner, and A. Madry. Robustness may be at odds with accuracy. In International Conference on Learning Representations, 2019.
[52] G. Tsoumakas and I. Katakis. Multi-label classification: An overview. 3(3):1–13, 2007.
[53] J. Uijlings, K. van de Sande, T. Gevers, and A. Smeulders. Selective search for object recognition. International Journal of Computer Vision, 104(2):154–171, 2013.
[54] P. Viola and M. J. Jones. Robust real-time face detection. International Journal of Computer Vision, 57(2):137–154, 2004.
[55] X. Wei, S. Liang, X. Cao, and J. Zhu. Transferable adversarial attacks for image and video object detection.
CoRR, abs/1811.12641, 2018.
[56] C. Xie, J. Wang, Z. Zhang, Z. Ren, and A. Yuille. Mitigating adversarial effects through randomization. In International Conference on Learning Representations, 2018.
[57] C. Xie, J. Wang, Z. Zhang, Y. Zhou, L. Xie, and A. Yuille. Adversarial examples for semantic segmentation and object detection. In International Conference on Computer Vision, 2017.
[58] C. Xie, Y. Wu, L. van der Maaten, A. Yuille, and K. He. Feature denoising for improving adversarial robustness. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[59] X. Zhao, H. Li, X. Shen, X. Liang, and Y. Wu. A modulation module for multi-task learning with applications in image retrieval. In European Conference on Computer Vision, 2018.
