Improving the robustness of ImageNet classifiers using elements of human visual cognition


Authors: A. Emin Orhan (Center for Data Science, New York University; eo41@nyu.edu), Brenden M. Lake (Center for Data Science, Department of Psychology, New York University; brenden@nyu.edu)

Abstract

We investigate the robustness properties of image recognition models equipped with two features inspired by human vision, an explicit episodic memory and a shape bias, at the ImageNet scale. As reported in previous work, we show that an explicit episodic memory improves the robustness of image recognition models against small-norm adversarial perturbations under some threat models. It does not, however, improve the robustness against more natural, and typically larger, perturbations. Learning more robust features during training appears to be necessary for robustness in this second sense. We show that features derived from a model that was encouraged to learn global, shape-based representations (Geirhos et al., 2019) not only improve the robustness against natural perturbations, but, when used in conjunction with an episodic memory, also provide additional robustness against adversarial perturbations. Finally, we address three important design choices for the episodic memory: the memory size, the dimensionality of the memories, and the retrieval method. We show that to make the episodic memory more compact, it is preferable to reduce the number of memories by clustering them, rather than to reduce their dimensionality.

1. Introduction

ImageNet-trained deep neural networks (DNNs) are state-of-the-art models for a range of computer vision tasks and are currently also the best models of the human visual system and primate visual systems more generally (Schrimpf et al., 2018). Yet, they have serious deficiencies as models of human and primate visual systems: 1) they are extremely sensitive to small adversarial perturbations imperceptible to the human eye (Szegedy et al., 2013); 2) they are much more sensitive than humans to larger, more natural perturbations (Geirhos et al., 2018); 3) they rely heavily on local texture information in making their predictions, whereas humans rely much more on global shape information (Geirhos et al., 2019; Brendel & Bethge, 2019); 4) a fine-grained, image-by-image analysis suggests that the images ImageNet-trained DNNs find hard to recognize do not match well with the images humans find hard to recognize (Rajalingham et al., 2018).

Here, we add a fifth, under-appreciated deficiency: 5) human visual recognition has a strong episodic component lacking in DNNs. When we recognize a coffee mug, for instance, we do not just recognize it as a mug, but as this particular mug that we have seen before, or as a novel mug that we have not seen before. This sense of familiarity/novelty comes automatically and involuntarily, even when we are not explicitly trying to judge the familiarity or novelty of an object we are seeing. More controlled psychological experiments confirm this observation: humans have a phenomenally good long-term recognition memory with a massive capacity, even in difficult one-shot settings (Standing, 1973; Brady et al., 2008).
Standard deep vision models, on the other hand, cannot perform this kind of familiarity/novelty computation naturally or automatically, since this information is available to a trained model only indirectly and implicitly in its parameters.

What does it take to address these deficiencies, and what are the potential benefits, if any, of doing so, other than making the models more human-like in their behavior? In this paper, we address these questions. We show that a minimal model incorporating an explicit key-value episodic memory not only becomes psychologically more realistic, but also less sensitive to small adversarial perturbations. The episodic memory does not, however, reduce the sensitivity to larger, more natural perturbations, and it does not address the heavy reliance on local texture. Using, in the episodic memory, features from DNNs that were trained to learn more global, shape-based representations (Geirhos et al., 2019) addresses these remaining issues and moreover provides additional robustness against adversarial perturbations. Together, these results suggest that two basic ideas motivated and inspired by human vision, a strong episodic memory and a shape bias, can make image recognition models more robust to both natural and adversarial perturbations at the ImageNet scale.

2. Related work

In this section, we review previous work most closely related to ours and summarize our own contributions. To our knowledge, the idea of using an episodic cache memory to improve the adversarial robustness of image classifiers was first proposed in Zhao & Cho (2018) and in Papernot & McDaniel (2018). Zhao & Cho (2018) considered a differentiable memory that was trained end-to-end with the rest of the model. This makes their model computationally much more expensive than the cache models considered here, where the cache uses pre-trained features instead. The deep k-nearest neighbor model in Papernot & McDaniel (2018) and the "CacheOnly" model described in Orhan (2018) are closer to our cache models in this respect; however, these works did not consider models at the ImageNet scale. More recently, Dubey et al. (2019) did consider cache models at the ImageNet scale (and beyond) and demonstrated substantial improvements in adversarial robustness under certain threat models. None of these earlier papers addressed the important problem of robustness to natural perturbations, and they did not investigate the effects of various cache design choices, such as the retrieval method (i.e. a continuous cache vs. nearest neighbor retrieval), the cache size, the dimensionality of the keys, or the feature type used (e.g. texture-based vs. shape-based features), on the robustness properties of the cache model.

A different line of recent work addressed the question of robustness to natural perturbations in ImageNet-trained DNNs. In well-controlled psychophysical experiments with human subjects, Geirhos et al. (2018) compared the sensitivity of humans and ImageNet-trained DNNs to several different types of natural distortions and perturbations, such as changes in the contrast, color, or spatial frequency content of images, image rotations, etc. They found that ImageNet-trained DNNs are much more sensitive to such perturbations than human subjects.
More recently, Hendrycks & Dietterich (2019) introduced the ImageNet-C and ImageNet-P benchmarks to measure the robustness of neural networks against common perturbations and corruptions that are likely to occur in the real world. We use the ImageNet-C benchmark below to measure the robustness of different models against natural perturbations. This second line of work, however, did not address the question of adversarial robustness. An adequate model of the human visual system should be robust to both natural and adversarial perturbations.[1] Moreover, both properties are clearly desirable in practical image recognition systems, independent of their value in building more adequate models of the human visual system.

[1] Two recent papers (Elsayed et al., 2018; Zhou & Firestone, 2019) suggested that humans might be vulnerable, or at least sensitive, to adversarial perturbations too. However, these results apply only in very limited experimental settings (e.g. very short viewing times in Elsayed et al. (2018)) and require relatively large and transferable perturbations, which often tend to yield meaningful features resembling the target class.

Our main contributions in this paper are as follows: 1) as reported in previous work (Zhao & Cho, 2018; Papernot & McDaniel, 2018; Orhan, 2018; Dubey et al., 2019), we show that an explicit cache memory improves the adversarial robustness of image recognition models at the ImageNet scale, but only under certain threat scenarios; 2) we investigate the effects of various design choices for the cache memory, such as the retrieval method, the cache size, the dimensionality of the keys, and the feature type used for extracting the keys; 3) we show that caching, by itself, does not improve the robustness of classifiers against natural perturbations; 4) we show that using more global, shape-based features (Geirhos et al., 2019) in the cache not only improves robustness against natural perturbations, but also provides extra robustness against adversarial perturbations.[2]

[2] Code for reproducing the results is available at: https://github.com/eminorhan/robust-vision

3. Methods

3.1. Models

Throughout the paper, we use pre-trained ResNet-50 models either on their own or as feature extractors (or "backbones") to build cache models that incorporate an explicit episodic memory storing low-dimensional embeddings (or keys) for all images seen during training (Orhan, 2018). The cache models in this paper are essentially identical to the "CacheOnly" models described in Orhan (2018). A schematic diagram of a cache model is shown in Figure 1.

[Figure 1. Schematic illustration of the cache model (adapted from Orhan (2018)). The key for a new image x is compared with the keys in the cache. A prediction is made by a linear combination of the values weighted by the similarity to the corresponding keys.]

We used one of the higher layers of a pre-trained ResNet-50 model as an embedding layer. Let φ(x) denote the d-dimensional embedding of an image x into this layer. The cache is then a key-value dictionary consisting of the keys μ_k ≡ φ(x_k) for each training image x_k in the dataset; the values are the corresponding class labels represented as one-hot vectors v_k (Orhan, 2018). We normalized all keys to have unit l2-norm.
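For concreteness, here is a minimal sketch of how such keys could be extracted with a standard torchvision ResNet-50. This is an illustration of ours, not code from the paper's repository; the forward hook on the avgpool layer and the unit l2-normalization follow the description above, while the helper names are our own:

```python
import torch
import torch.nn.functional as F
import torchvision.models as models

# Pre-trained ResNet-50 backbone (texture-based features).
backbone = models.resnet50(pretrained=True).eval()

features = {}

def hook(name):
    # Store the activation of the hooked layer under `name`.
    def fn(module, inputs, output):
        features[name] = output
    return fn

# `avgpool` is the global average pooling layer right before the final fc layer.
backbone.avgpool.register_forward_hook(hook("avgpool"))

@torch.no_grad()
def embed(images):
    """Return unit-norm cache keys for a batch of images of shape (N, 3, 224, 224)."""
    backbone(images)
    keys = features["avgpool"].flatten(1)   # (N, 2048)
    return F.normalize(keys, p=2, dim=1)    # unit l2-norm, as described above
```

Building the cache then amounts to stacking `embed(x_k)` over the training set alongside the one-hot labels v_k.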
When a new test image x is presented, the similarity between its key and all the keys in the cache is computed through (Orhan, 2018):

σ_k(x) ∝ exp(θ φ(x)ᵀ μ_k)    (1)

A probability distribution over the labels is then obtained by taking an average of the values stored in the cache, weighted by the corresponding similarity scores (Orhan, 2018):

p_cache(y|x) = Σ_{k=1}^K v_k σ_k(x) / Σ_{k=1}^K σ_k(x)    (2)

where K denotes the number of items stored in the cache. The hyperparameter θ in Equation 1 acts as an inverse temperature for this distribution, with larger θ values producing sharper distributions. We optimized θ in only one of the experimental conditions below (the gray-box adversarial setting), by searching over 9 uniformly spaced values between 10 and 90, and fixed its value for all other conditions.

Because we take all items in the cache into account in Equation 2, weighted by their similarity to the test item, we call this type of cache a continuous cache (Grave et al., 2016). An alternative (and more scalable) approach would be to perform a nearest neighbor search in the cache and consider only the most similar items in making predictions (Grave et al., 2017; Dubey et al., 2019). We compare the relative performance of these two approaches below.

For the embedding layer, we considered three choices (in descending order, using the layer names from the torchvision.models implementation of ResNet-50): fc, avgpool, and layer4_bottleneck1_relu. fc corresponds to the final softmax layer (we used the post-nonlinearity probabilities, not the logits), avgpool corresponds to the global average pooling layer right before the final layer, and layer4_bottleneck1_relu is the output of the penultimate bottleneck block of the network. We also explored the use of lower layers as embeddings; however, these layers led to substantially worse clean and adversarial accuracies, hence they were not considered further. layer4_bottleneck1_relu is a 7 × 7 × 2048-dimensional spatial layer; we applied a global spatial average pooling operation to this layer to reduce its dimensionality. This gave rise to d = 1000-dimensional keys for fc and d = 2048-dimensional keys for the other two layers.

To investigate the effect of different feature types on the robustness of the models, we also considered a ResNet-50 model jointly trained on the ImageNet and Stylized-ImageNet datasets and then fine-tuned on ImageNet (Geirhos et al., 2019) (we used the pre-trained model provided by the authors). Following Geirhos et al. (2019), we call this model Shape-ResNet-50. Geirhos et al. (2019) argue that Shape-ResNet-50 learns more global, shape-based representations than a standard ImageNet-trained ResNet-50 (which instead relies more heavily on local texture) and produces predictions more in line with human judgments in texture vs. shape cue conflict experiments.

All experiments were conducted on the ImageNet dataset, containing approximately 1.28M training images from 1000 classes and 50K validation images (Russakovsky et al., 2015). We note that using the full cache (i.e. a continuous cache) was computationally feasible in our experiments at the ImageNet scale. The largest cache we used (of size 1.28M × 2048) takes up ∼10.5 GB of disk space when stored as a single-precision floating-point array.
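As an illustration of Equations 1 and 2, a minimal NumPy sketch of the continuous-cache prediction (ours, not the paper's implementation; the default θ = 50 is an arbitrary value inside the 10–90 range searched above):

```python
import numpy as np

def cache_predict(query_keys, cache_keys, cache_values, theta=50.0):
    """Continuous-cache prediction (Equations 1 and 2).

    query_keys:   (N, d) unit-norm embeddings of the test images
    cache_keys:   (K, d) unit-norm embeddings of the training images
    cache_values: (K, C) one-hot class labels
    theta:        inverse temperature
    """
    # Equation 1: similarity scores, up to a normalizing constant.
    logits = theta * query_keys @ cache_keys.T                # (N, K)
    # Subtracting the row-wise max is safe because the constant
    # cancels in Equation 2; it only adds numerical stability.
    sims = np.exp(logits - logits.max(axis=1, keepdims=True))
    # Equation 2: similarity-weighted average of the stored one-hot values.
    probs = sims @ cache_values / sims.sum(axis=1, keepdims=True)
    return probs                                              # (N, C)
```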
3.2. Perturbations

Ideally, we want our image recognition models to be robust against both adversarial perturbations and more natural perturbations. This subsection describes the details of the natural and adversarial perturbations considered in this paper.

3.2.1. Adversarial perturbations

Our experiments on adversarial perturbations closely followed the experimental settings described in Dubey et al. (2019). In particular, we considered three different threat models: white-box attacks, gray-box attacks, and black-box attacks.

White-box attacks: This is the strongest attack scenario. In this scenario, the attacker has full knowledge of the backbone model and the items stored in the cache.

Gray-box attacks: In this scenario, the attacker has full knowledge of the backbone model, but does not have access to the items stored in the cache. In many cases, this threat scenario is more realistic than the white-box or black-box settings, since the models used as feature extractors are usually publicly available (e.g. pre-trained ImageNet models), but the database of items stored using those features is private. In practice, for the cache models, we implemented the gray-box scenario by first running white-box attacks against the backbone model and then testing the resulting adversarial examples on the cache model.

Black-box attacks: This is the weakest attack scenario, where the attacker knows neither the backbone model nor the items stored in the cache. For the cache models, we implemented the black-box scenario by running white-box attacks against a model different from the model used as the backbone, and testing the resulting adversarial examples on the cache model as well as on the backbone itself. In practice, we used an ImageNet-trained ResNet-18 model to generate adversarial examples in this setting (note that we always use a ResNet-50 backbone in our models).

We chose a strong, state-of-the-art, gradient-based attack method, projected gradient descent (PGD) with random starts (Madry et al., 2017), to generate adversarial examples in all three settings. We used the Foolbox implementation of this attack (Rauber et al., 2017), RandomStartProjectedGradientDescentAttack, with the following attack parameters: binary_search = False, stepsize = 2/225, iterations = 10, random_start = True. We also controlled the total size of the adversarial perturbation, as measured by the l∞-norm of the perturbation normalized by the l∞-norm of the clean image x: ε ≡ ||x_adv − x||_∞ / ||x||_∞. We considered six different ε values: 0.01, 0.02, 0.04, 0.06, 0.08, 0.1. In general, the attacks are expected to be more successful for larger ε values. As recommended by Athalye et al. (2018), we used targeted attacks: for each validation image, we first chose a target class label different from the correct class label and then ran the attack to return an image that was misclassified as belonging to the target class. In cases where the attack was not successful, the original clean image was returned; the model therefore had the same baseline accuracy on such failure cases as on clean images. We ran attacks starting from all validation images, hence the reported accuracies are averages over all validation images.
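To make the perturbation budget concrete, a small sketch (ours, not the paper's code) of the normalized l∞ measure ε and of the convention that a failed attack falls back to the clean image; representing a failed attack as None is our assumption about the attack interface:

```python
import numpy as np

def normalized_linf(x_adv, x):
    """Normalized perturbation size: eps = ||x_adv - x||_inf / ||x||_inf."""
    return np.abs(x_adv - x).max() / np.abs(x).max()

def evaluated_image(x, x_adv):
    """If the attack failed (here represented as None), evaluate the model
    on the original clean image, matching the convention described above."""
    return x if x_adv is None else x_adv
```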
3.2.2. Natural perturbations

To measure the robustness of image recognition models against natural perturbations, we used the recently introduced ImageNet-C benchmark (Hendrycks & Dietterich, 2019). ImageNet-C contains 15 different natural perturbations applied to each image in the ImageNet validation set at 5 different severity levels, for a total of 15 × 5 × 50K = 3.75M images. The perturbations in ImageNet-C come in four categories: 1) noise perturbations (Gaussian, shot, and impulse noise); 2) blur perturbations (defocus, glass, motion, and zoom blur); 3) weather perturbations (snow, frost, fog, and brightness); and 4) digital perturbations (contrast, elasticity, pixelation, and JPEG compression). We refer the reader to Hendrycks & Dietterich (2019) for further details about the dataset.

To measure the robustness of a model against the perturbations in ImageNet-C, we use the mCE (mean corruption error) measure (Hendrycks & Dietterich, 2019). A model's mCE is calculated as follows. For each perturbation c, we first average the model's classification error over the 5 different severity levels s and divide the result by the average error of a reference classifier (which is taken to be AlexNet):

CE_c ≡ ⟨E_{s,c}⟩_s / ⟨E^AlexNet_{s,c}⟩_s

The overall performance on ImageNet-C is then measured by the mean CE_c averaged over the 15 different perturbation types c:

mCE ≡ ⟨CE_c⟩_c

Dividing by the performance of a reference model in calculating CE_c ensures that different perturbations make roughly similar-sized contributions to the overall measure mCE. Note that smaller mCE values indicate more robust classifiers.
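The mCE computation is a short calculation; a minimal sketch under the definitions above (the (15, 5) error arrays are assumed inputs of ours, not an interface from the benchmark):

```python
import numpy as np

def mce(model_errors, alexnet_errors):
    """Mean corruption error (Hendrycks & Dietterich, 2019).

    model_errors, alexnet_errors: (15, 5) arrays of classification errors,
    one row per corruption type c, one column per severity level s.
    """
    # CE_c: severity-averaged error, normalized by the AlexNet reference.
    ce = model_errors.mean(axis=1) / alexnet_errors.mean(axis=1)  # (15,)
    # mCE: average of CE_c over the 15 corruption types.
    return ce.mean()
```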
4. Results

4.1. Caching improves robustness against adversarial perturbations

Figure 2 shows the adversarial accuracy in the gray-box, black-box, and white-box settings for cache models using different layers as embeddings. In the gray-box setting, lower layers showed more robustness at the expense of a reduction in clean accuracy, with the layer4_bottleneck1_relu layer achieving the highest gray-box accuracies.

[Figure 2. Top-1 accuracy of the ResNet-50 backbone and cache models (fc, avgpool, layer4_bottleneck1_relu) in the (a) gray-box, (b) black-box, and (c) white-box adversarial settings, as a function of the normalized l∞ perturbation size (0 corresponds to clean images). Note that the gray-box setting is meaningful for the cache models only and is not well-defined for the backbone ResNet-50 model.]

In the black-box setting, we found that even large-perturbation adversarial examples for the ResNet-18 model were not effective adversarial examples for the backbone ResNet-50 model or for the cache models, hence the models largely maintained their performance on clean images, with a slight general decrease in accuracy for larger perturbation sizes.

In the white-box setting, we observed a divergence in behavior between fc and the other layers. The PGD attack was generally unsuccessful against the fc layer cache model, whereas for the other layers it was highly successful even for small perturbation sizes. The softmax nonlinearity in fc was crucial for this effect, as it was substantially easier to run successful white-box attacks when the logits were used as keys instead. We thus attribute this effect to gradient obfuscation in the fc layer cache model (Athalye et al., 2018), rather than considering it a real sign of adversarial robustness. Indeed, the gray-box adversarial examples (generated from the backbone ResNet-50 model) were very effective against the fc layer cache model (Figure 2a).

Qualitatively similar results were observed when Shape-ResNet-50 was used as the backbone instead of ResNet-50 (Figure 3). Table 1 reports the clean and adversarial accuracies for a subset of the conditions.

[Figure 3. Similar to Figure 2, but with Shape-ResNet-50 as the backbone.]

4.2. Cache design choices

In this subsection, we consider the effect of three cache design choices on the clean and adversarial accuracy of cache models: the size and dimensionality of the cache and the retrieval method.

Dubey et al. (2019) recently investigated the adversarial robustness of cache models with very large databases (up to K = 50B items). Scaling up the cache model to very large databases requires making the cache memory as compact as possible and using a fast approximate nearest neighbor algorithm for retrieval from the cache (instead of using a continuous cache). There are at least two different ways of making the cache more compact: one can either reduce the number of items in the cache by clustering them, or alternatively one can reduce the dimensionality of the keys. Dubey et al. (2019) made the keys more compact by reducing the original 2048-dimensional embeddings to 256 dimensions (an 8-fold compression) with online PCA, and used a fast 50-nearest neighbor (50-nn) method for retrieval.

In our experiments, replacing the continuous cache with a 50-nn retrieval method did not have an adverse effect on adversarial and clean accuracies (Figure 4 and Table 2). This suggests that the continuous cache can be safely replaced with an efficient nearest neighbor algorithm to scale up the cache size without much effect on the model accuracy. On the other hand, reducing the dimensionality of the keys from 2048 to 256 using online PCA over the training data resulted in a substantial drop in both clean and adversarial accuracies (Figure 4 and Table 2). Even a 4-fold reduction to 512 dimensions resulted in a large drop in accuracy. This implies that the higher layers of the backbone used for caching are not very compressible, and that drastic dimensionality reduction should be avoided to prevent a substantial decrease in accuracy.

Reducing the cache size by the same amount (4-fold or 8-fold compression) by clustering the items in the cache with a mini-batch k-means algorithm resulted in a significantly smaller decrease in accuracy (Figure 4 and Table 2): for example, an 8-fold reduction in dimensionality led to a clean accuracy of 49.8%, whereas an 8-fold reduction in the cache size instead resulted in a clean accuracy of 61.2%. This suggests that the cluster structure in the keys is much more prominent than the linear correlation between the dimensions. Therefore, to make the cache more compact, given a choice between reducing the dimensionality and reducing the number of items by the same amount, it is preferable to choose the second option for better accuracy.
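These design choices can be sketched with standard scikit-learn components. This is a hedged illustration of ours, not the paper's exact pipeline: the paper used mini-batch k-means and online PCA but does not spell out, for example, how values are assigned to cluster centroids (here, by averaging the one-hot values within each cluster, which is our assumption):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors

def compress_by_clustering(keys, values, factor=8, seed=0):
    """Shrink the number of cache items K -> K/factor with mini-batch k-means.
    Each centroid's value is the mean of the one-hot values in its cluster
    (an assumption; no longer strictly one-hot)."""
    km = MiniBatchKMeans(n_clusters=len(keys) // factor, random_state=seed).fit(keys)
    new_keys = km.cluster_centers_
    new_keys /= np.linalg.norm(new_keys, axis=1, keepdims=True)  # re-normalize
    new_values = np.stack([values[km.labels_ == c].mean(axis=0)
                           for c in range(km.n_clusters)])
    return new_keys, new_values

def compress_by_pca(keys, factor=8):
    """Shrink the key dimensionality d -> d/factor with PCA (the less
    favorable option in our experiments)."""
    pca = PCA(n_components=keys.shape[1] // factor).fit(keys)
    return pca.transform(keys), pca

def knn_predict(query_keys, cache_keys, cache_values, k=50, theta=50.0):
    """50-nn variant of the continuous cache: apply Equations 1 and 2 over
    the k most similar items only (cosine = inner product for unit-norm keys)."""
    nn = NearestNeighbors(n_neighbors=k, metric="cosine").fit(cache_keys)
    _, idx = nn.kneighbors(query_keys)                            # (N, k)
    sims = np.exp(theta * np.einsum("nd,nkd->nk", query_keys, cache_keys[idx]))
    probs = np.einsum("nk,nkc->nc", sims, cache_values[idx])
    return probs / sims.sum(axis=1, keepdims=True)
```

At very large scale, the exact `NearestNeighbors` search here would be replaced by an approximate index such as Faiss, as discussed in Section 5.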
Table 1. Clean and adversarial accuracies of texture- and shape-based ResNet-50 backbones and cache models. The adversarial accuracies report the results for a normalized perturbation size of ε = 0.06.

Model                                        Clean   Gray-box   Black-box   White-box
ResNet-50                                    0.758   –          0.678       0.000
Cache (layer4_bottleneck1_relu, texture)     0.641   0.170      0.552       0.001
Shape-ResNet-50                              0.766   –          0.699       0.000
Cache (layer4_bottleneck1_relu, shape)       0.695   0.336      0.632       0.001

Table 2. Clean and gray-box adversarial accuracies of different cache models. As in Figure 4, only the results for the layer4_bottleneck1_relu layer are shown. The labels indicate the retrieval method (continuous or 50-nn), the key dimensionality (full, or reduced 4- or 8-fold), and the cache size (full, or reduced 4- or 8-fold).

Model                                        Clean   Gray-box (ε = 0.06)
Cache (continuous, full-dims., full-cache)   0.641   0.170
Cache (50-nn, full-dims., full-cache)        0.656   0.181
Cache (50-nn, full-dims., 1/4-cache)         0.626   0.151
Cache (50-nn, full-dims., 1/8-cache)         0.612   0.143
Cache (50-nn, 1/4-dims., full-cache)         0.516   0.109
Cache (50-nn, 1/8-dims., full-cache)         0.498   0.103

[Figure 4. The effects of three cache design choices on the clean and adversarial accuracy in the gray-box setting, comparing the continuous full cache, 50-nn retrieval with the full cache, 50-nn with a 1/8-size cache, and 50-nn with 1/8-dimensional keys. The results shown here are for the layer4_bottleneck1_relu layer; similar results were observed for other layers.]

4.3. Caching does not improve robustness against natural perturbations

We have seen that caching can improve robustness against gray-box adversarial perturbations. Does it also improve robustness against more natural perturbations? Table 3 shows that the answer is no. On ImageNet-C, the backbone ResNet-50 model yields an mCE of 0.764. The best cache model obtained approximately the same mCE score. We suggest that this is because caching improves robustness only against small-norm perturbations, whereas the natural perturbations in ImageNet-C are typically much larger. Even the smallest-size perturbations in ImageNet-C are clearly visible to the eye (Hendrycks & Dietterich, 2019), and we calculated that even these smallest-size perturbations have an average normalized l∞-norm of ε ≈ 1 for all perturbation types, compared to the largest adversarial perturbation size of ε = 0.1 considered in this paper. This result is also consistent with a similar observation made by Gu et al. (2019), suggesting that perturbations occurring between neighboring frames in natural videos are much larger in magnitude than adversarial perturbations. We conjecture that robustness against such large perturbations cannot be achieved with test-time-only interventions such as caching, and requires learning more robust backbone features in the first place.

4.4. Using shape-based features in the cache improves both adversarial and natural robustness

To investigate the effect of different kinds of features in the cache, we repeated our experiments using cache models with Shape-ResNet-50 as the backbone (see Methods for further details about Shape-ResNet-50).
It has been argued that Shape-ResNet-50 learns more global, shape-based representations than a standard ImageNet-trained ResNet-50, and it has already been shown to improve robustness on the ImageNet-C benchmark (Geirhos et al., 2019). We confirm this improvement (Table 3; ResNet-50 vs. Shape-ResNet-50) and find that caching with Shape-ResNet-50 leads to roughly the same mCE as the backbone Shape-ResNet-50 itself.

Table 3. ImageNet-C results. The numbers indicate corruption errors (CE) for specific corruption types and the mean CE (mCE) scores, as percentages; more robust models correspond to smaller numbers. For the cache models, we only show the results for the best models (the fc cache model in both cases). The corruption types fall into the noise (Gauss, Shot, Impul.), blur (Defoc., Glass, Motion, Zoom), weather (Snow, Frost, Fog, Bright), and digital (Contr., Elastic, Pixel, JPEG) categories.

Model             mCE   Gauss  Shot  Impul.  Defoc.  Glass  Motion  Zoom  Snow  Frost  Fog  Bright  Contr.  Elastic  Pixel  JPEG
ResNet-50         76.4  78     80    80      75      89     78      80    78    75     67   57      72      86       77     76
Cache (texture)   76.4  78     79    80      75      89     78      80    78    75     67   57      72      86       77     76
Shape-ResNet-50   73.5  74     75    75      72      86     74      80    75    73     67   55      68      81       75     72
Cache (shape)     73.5  74     75    75      72      86     74      80    75    73     67   55      68      81       75     72

[Figure 5. The effect of using Shape-ResNet-50 (shape) vs. ResNet-50 (texture) derived features in the cache on clean and adversarial accuracies in the (a) gray-box, (b) black-box, and (c) white-box settings. The results shown here are for the layer4_bottleneck1_relu layer.]

Remarkably, however, when used in conjunction with caching, these Shape-ResNet-50 features also substantially improved the adversarial robustness of the cache models in the gray-box and black-box settings, compared to the ImageNet-trained ResNet-50 features. Figure 5 illustrates this for the layer4_bottleneck1_relu cache model. This effect was more prominent for earlier layers.

Shape-based features, however, did not improve the adversarial robustness in the white-box setting, either for the backbone model or for the cache models (Table 1). This suggests that eliminating the heavy texture bias of DNNs does not necessarily eliminate the existence of adversarial examples for these models. The opposite, however, seems to be true: adversarially robust models do not display a texture bias; instead, they seem to be much more shape-biased, similar to humans (Zhang & Zhu, 2019).

5. Discussion

In this paper, we have shown that a combination of two basic ideas motivated by the cognitive psychology of human vision, an explicit cache-based episodic memory and a shape bias, improves the robustness of image recognition models against both natural and adversarial perturbations at the ImageNet scale. Caching alone improves (gray-box) adversarial robustness only, whereas a shape bias improves natural robustness only. In combination, they improve both, with a synergistic effect in adversarial robustness (Table 4).
Table 4. Summary of our main results; a distilled version of Tables 1 and 3. In each cell, the first number is the adversarial accuracy (gray-box accuracy for cache models and white-box accuracy for cacheless models, both with ε = 0.06); the second number is the mCE score. Better models have higher accuracy and lower mCE. Starting from a baseline model with no cache and no shape bias (bottom right), adding a cache memory (bottom left) only improves adversarial accuracy; adding a shape bias (top right) only improves natural robustness; adding both (top left) improves both natural and adversarial robustness, with a synergistic improvement in the latter.

                Cache +         Cache -
Shape bias +    33.6% / 73.5    0.0% / 73.5
Shape bias -    17.0% / 76.4    0.0% / 76.4

Why does caching improve adversarial robustness? Orhan (2018) suggested that caching acts as a regularizer. More specifically, it was shown in Orhan (2018) that caching significantly reduces the Jacobian norm at test points, which could explain its improved robustness against small-norm perturbations such as adversarial attacks. However, since the Jacobian norm only measures local sensitivity, this does not guarantee improved robustness against larger perturbations, such as the natural perturbations in the ImageNet-C benchmark, and indeed we have shown that caching, by itself, does not provide any improvement against such perturbations.
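To illustrate the local-sensitivity argument (a sketch of ours, not an analysis from Orhan (2018)), the Frobenius norm of the input-output Jacobian at a test point can be estimated in PyTorch with a standard random-projection trick, using the identity E_v ||Jᵀv||² = ||J||_F² for v ~ N(0, I):

```python
import torch

def jacobian_norm_estimate(model, x, n_samples=10):
    """Monte Carlo estimate of the Frobenius norm of the input-output
    Jacobian at x. Lower values indicate lower local sensitivity to
    small (e.g. adversarial) perturbations, but say nothing about
    robustness to large perturbations."""
    x = x.clone().requires_grad_(True)
    y = model(x)                         # (1, C) output, e.g. probabilities
    total = 0.0
    for _ in range(n_samples):
        v = torch.randn_like(y)
        # grad of v^T y w.r.t. x equals J^T v
        (g,) = torch.autograd.grad((y * v).sum(), x, retain_graph=True)
        total += g.pow(2).sum().item()   # ||J^T v||^2
    return (total / n_samples) ** 0.5
```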
It should also be emphasized that caching improves adversarial robustness only under certain threat models. We have provided evidence for improved robustness in the gray-box setting only; Zhao & Cho (2018) and Dubey et al. (2019) also provide evidence for improved robustness in the black-box setting (Orhan (2018) reports evidence for improved white-box robustness through caching in CIFAR-10 models; however, such robustness improvements are likely much easier to achieve in CIFAR-10 models than in ImageNet models). The results in Dubey et al. (2019) are particularly encouraging, since they suggest that the caching approach can scale up in the gray-box and black-box attack scenarios, in the sense that larger cache sizes lead to more robust models. On the other hand, neither of these two earlier works, nor our own results, point to any substantial improvement in adversarial robustness in the white-box setting at the ImageNet scale. The white-box setting is the most challenging setting for an adversarial defense. Theoretical results suggest that, in terms of sample complexity, robustness in the white-box setting may be fundamentally more difficult to achieve than high generalization accuracy in the standard sense (Schmidt et al., 2018; Gilmer et al., 2018), and it seems unlikely that it can be feasibly achieved via test-time-only interventions such as caching.

Why does a shape bias improve natural robustness? The natural perturbations modeled in ImageNet-C typically corrupt local information but preserve global information such as shape. Therefore, a model that can integrate information more effectively over long distances, for example by computing a global shape representation, is expected to be more robust to such natural perturbations. In Shape-ResNet-50 (Geirhos et al., 2019), this was achieved by removing the local cues to class label in the training data. In principle, a similar effect can be achieved through architectural inductive biases as well. For example, Hendrycks & Dietterich (2019) showed that so-called feature aggregating architectures, such as the ResNeXt architecture (Xie et al., 2017), are substantially more robust to natural perturbations than the ResNet architecture, suggesting that they are more effective at integrating local information into global representations. However, it remains to be seen whether such feature aggregating architectures accomplish this by computing a shape representation.

In this work, we have also provided important insights into several cache design choices. Scaling up the cache models to datasets substantially larger than ImageNet would require making the cache as compact as possible. Our results suggest that, other things being equal, this should be done by clustering the keys rather than by reducing their dimensionality. For very large datasets, the continuous-cache retrieval method that uses the entire cache in making predictions (Equations 1 and 2) can be safely replaced with an efficient k-nearest neighbor retrieval algorithm, e.g. Faiss (Johnson et al., 2017), without incurring a large cost in accuracy. Our results also highlight the importance of the backbone choice (for example, Shape-ResNet-50 vs. ResNet-50): in general, starting from a more robust backbone should make the cache more effective against both natural and adversarial perturbations.

In future work, we are interested in applications of naturally and adversarially robust features in few-shot recognition tasks and in modeling neural and behavioral data from humans and monkeys (Schrimpf et al., 2018).

References

Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. arXiv preprint arXiv:1802.00420, 2018.

Timothy F. Brady, Talia Konkle, George A. Alvarez, and Aude Oliva. Visual long-term memory has a massive storage capacity for object details. Proceedings of the National Academy of Sciences, 105(38):14325–14329, 2008.

Wieland Brendel and Matthias Bethge. Approximating CNNs with bag-of-local-features models works surprisingly well on ImageNet. arXiv preprint arXiv:1904.00760, 2019.

Abhimanyu Dubey, Laurens van der Maaten, Zeki Yalniz, Yixuan Li, and Dhruv Mahajan. Defense against adversarial images using web-scale nearest neighbor search. In CVPR, 2019.

Gamaleldin Elsayed, Shreya Shankar, Brian Cheung, Nicolas Papernot, Alexey Kurakin, Ian Goodfellow, and Jascha Sohl-Dickstein. Adversarial examples that fool both computer vision and time-limited humans. In Advances in Neural Information Processing Systems, pp. 3910–3920, 2018.

Robert Geirhos, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, Felix A. Wichmann, and Wieland Brendel. Generalisation in humans and deep neural networks. In Advances in Neural Information Processing Systems (NeurIPS), 2018.

Robert Geirhos, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, Felix A. Wichmann, and Wieland Brendel. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. In International Conference on Learning Representations (ICLR), 2019.

Justin Gilmer, Luke Metz, Fartash Faghri, Samuel S. Schoenholz, Maithra Raghu, Martin Wattenberg, and Ian Goodfellow. Adversarial spheres. arXiv preprint arXiv:1801.02774, 2018.

Edouard Grave, Armand Joulin, and Nicolas Usunier. Improving neural language models with a continuous cache. arXiv preprint arXiv:1612.04426, 2016.
Edouard Grave, Moustapha M. Cisse, and Armand Joulin. Unbounded cache model for online language modeling with open vocabulary. In Advances in Neural Information Processing Systems, pp. 6042–6052, 2017.

Keren Gu, Brandon Yang, Jiquan Ngiam, Quoc Le, and Jonathan Shlens. Using videos to evaluate image model robustness. arXiv preprint, 2019.

Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. In International Conference on Learning Representations (ICLR), 2019.

Jeff Johnson, Matthijs Douze, and Hervé Jégou. Billion-scale similarity search with GPUs. arXiv preprint arXiv:1702.08734, 2017.

Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.

Emin Orhan. A simple cache model for image recognition. In Advances in Neural Information Processing Systems, pp. 10107–10116, 2018.

Nicolas Papernot and Patrick McDaniel. Deep k-nearest neighbors: Towards confident, interpretable and robust deep learning. arXiv preprint arXiv:1803.04765, 2018.

Rishi Rajalingham, Elias B. Issa, Pouya Bashivan, Kohitij Kar, Kailyn Schmidt, and James J. DiCarlo. Large-scale, high-resolution comparison of the core visual object recognition behavior of humans, monkeys, and state-of-the-art deep artificial neural networks. Journal of Neuroscience, 38(33):7255–7269, 2018.

Jonas Rauber, Wieland Brendel, and Matthias Bethge. Foolbox: A Python toolbox to benchmark the robustness of machine learning models. arXiv preprint arXiv:1707.04131, 2017.

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015. doi: 10.1007/s11263-015-0816-y.

Ludwig Schmidt, Shibani Santurkar, Dimitris Tsipras, Kunal Talwar, and Aleksander Madry. Adversarially robust generalization requires more data. In Advances in Neural Information Processing Systems, pp. 5014–5026, 2018.

Martin Schrimpf, Jonas Kubilius, Ha Hong, Najib J. Majaj, Rishi Rajalingham, Elias B. Issa, Kohitij Kar, Pouya Bashivan, Jonathan Prescott-Roy, Kailyn Schmidt, et al. Brain-Score: Which artificial neural network for object recognition is most brain-like? BioRxiv, pp. 407007, 2018.

Lionel Standing. Learning 10,000 pictures. The Quarterly Journal of Experimental Psychology, 25(2):207–222, 1973.

Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. 2013. URL https://arxiv.org/abs/1312.6199.

Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1492–1500, 2017.

Tianyuan Zhang and Zhanxing Zhu. Interpreting adversarially trained convolutional neural networks. In International Conference on Machine Learning, pp. 7502–7511, 2019.

Jake Zhao and Kyunghyun Cho. Retrieval-augmented convolutional neural networks for improved robustness against adversarial examples.
arXiv preprint arXiv:1802.09502, 2018.

Zhenglong Zhou and Chaz Firestone. Humans can decipher adversarial images. Nature Communications, 10(1):1334, 2019.
