Recurrent Connections Aid Occluded Object Recognition by Discounting Occluders


Authors: Markus Roland Ernst, Jochen Triesch, Thomas Burwick

Markus Roland Ernst¹,², Jochen Triesch¹,², and Thomas Burwick¹,²

¹ Frankfurt Institute for Advanced Studies, Ruth-Moufang-Straße 1, 60438 Frankfurt am Main, Germany
² Goethe-Universität Frankfurt, Max-von-Laue-Straße 1, 60438 Frankfurt am Main, Germany
{mernst,triesch,burwick}@fias.uni-frankfurt.de

* This work was supported by the European Union's Horizon 2020 research and innovation programme under grant agreement No 713010 (GOAL-Robots, Goal-based Open-ended Autonomous Learning Robots).

Abstract. Recurrent connections in the visual cortex are thought to aid object recognition when part of the stimulus is occluded. Here we investigate if and how recurrent connections in artificial neural networks similarly aid object recognition. We systematically test and compare architectures comprised of bottom-up (B), lateral (L) and top-down (T) connections. Performance is evaluated on a novel stereoscopic occluded object recognition dataset. The task consists of recognizing one target digit occluded by multiple occluder digits in a pseudo-3D environment. We find that recurrent models perform significantly better than their feedforward counterparts, which were matched in parametric complexity. Furthermore, we analyze how the network's representation of the stimuli evolves over time due to recurrent connections. We show that the recurrent connections tend to move the network's representation of an occluded digit towards its un-occluded version. Our results suggest that both the brain and artificial neural networks can exploit recurrent connectivity to aid occluded object recognition.

Keywords: Object recognition · Occlusion · Recurrent neural networks

1 Introduction

Given the rapidness of invariant object recognition in primates [25,9], the process is assumed to be mostly feedforward [5]. This assumption has been corroborated by the recent success of feedforward neural networks in computer vision [12,13] and has led to modelling of the primate visual system using such networks [19,21]. However, both anatomical and functional evidence suggest that recurrent connections do influence object recognition. Densities of feedforward and recurrent connections in the ventral visual pathway are comparable in magnitude [7,23], and electrophysiological experiments have demonstrated that the processing of object information unfolds over time, beyond what would normally be attributed to a feedforward process [4,3]. In particular, recognition of degraded or occluded objects produces delayed behavioral and neural responses [10,24], believed to be caused by competitive processing due to lateral recurrent connections [1].

Other evidence suggests that recurrent top-down connections can fill in missing information in partially occluded images [18]. Additionally, recurrent convolutional neural networks have been shown to improve classification performance on occluded stimuli [22,15]. However, the stimuli used in previous research hardly resembled an authentic natural environment. The world humans live in and interact with is inherently 3D, and occlusion consists of more than just masking one stimulus with another. Rather, it is highly dependent on viewing angle, and primates perceive it stereoscopically with two eyes.
Unlike previous simulation and experimental work, where part of the input image was deleted or masked in two dimensions [18,26,24,22], we set out to test the effects of occlusion in a more natural environment. Thus, we extended the generative model for occluded stimuli presented in [22] to account for 3D perspective and stereo vision.

We test and compare a range of different recurrent convolutional neural network architectures, adopting the naming scheme of [22,15]. Bottom-up (B) and top-down (T) connections correspond to processing information from lower and higher regions in the ventral visual hierarchy, and lateral (L) connections process information within a region.

To investigate whether recurrent networks outperform feedforward models in a more naturalistic setting, the different architectures were tasked with classifying objects under varying levels of occlusion. The accuracy or error rate reflects the degree to which the networks learn to recognize the target and how well they cope with occlusion. Additionally, we train and test all networks on stereoscopic image data to quantify the benefit of binocular vision. Finally, we explore how recurrent connections shape the probability distribution over possible outcomes and analyze the internal representation of the occluded stimuli. We conduct a geometrical analysis of activations in the final hidden layer and visualize the evolution of the internal representation over time using t-distributed stochastic neighbor embedding (t-SNE) [16]. Our results demonstrate significant performance advantages of recurrent networks and reveal how recurrence helps to discount the effect of occluders.

2 Methods

2.1 Stereo-Digits Dataset

We investigate the effects of occlusion using a novel stereoscopic image dataset. Inspired by the generative image model for occlusion stimuli in [22], we focus on digit recognition. Our stereo-digits dataset is meant to bridge the gap between the somewhat artificial task of recognizing computer-rendered digits and the natural task of recognizing partially occluded objects. Contrary to past studies, occlusion is generated by overlapping the target stimulus with other digit instances in a pseudo-3D environment.

All images of the stereo-digits dataset contain digits of the same font and color. Occlusion is generated by overlaying digits on top of each other as shown in Fig. 1 A. The target object, i.e. the hindmost digit, is centered in the middle of the square canvas. Additional digits are then sequentially placed on top of the target object. These occluding objects remain fixed along the y-axis, as if standing on a surface 5 cm below the viewer. The x-coordinate is drawn from a uniform distribution. The font size of the digits was scaled to give the impression of objects 20 cm in height placed at different depths. We assumed a distance of 50 cm from the target object to the viewer, and 10 cm less for every added object. The level of occlusion can be controlled by varying the number of occluder objects, which increases the difficulty of the task. Images for the left and right eye were rendered assuming an interocular distance of 6.8 cm. Each dataset (2, 3, or 4 occluders) consists of 100,000 randomly generated images for training and 10,000 images for testing. The images were rendered at a resolution of 512 × 512 and then downsampled to 32 × 32.
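The paper specifies the physical scene parameters but not the rendering pipeline, so the following is only a rough sketch, assuming a simple pinhole projection, of how digit scale and stereo offset could depend on depth; the function names and canvas units are placeholders, not the authors' generator.

```python
import numpy as np

TARGET_DIST  = 50.0   # cm, viewer to the target (hindmost) digit
DEPTH_STEP   = 10.0   # cm, each additional occluder is placed this much closer
DIGIT_HEIGHT = 20.0   # cm, physical height of every digit
EYE_SEP      = 6.8    # cm, interocular distance

def depth(i):
    """Distance of the i-th object from the viewer (i = 0 is the target)."""
    return TARGET_DIST - i * DEPTH_STEP

def apparent_scale(i):
    """Projected size relative to the target digit; closer objects appear larger."""
    return TARGET_DIST / depth(i)

def stereo_offset(i, eye):
    """Horizontal image offset for eye = -1 (left) or +1 (right), in relative units.
    Nearer objects receive a larger offset, producing binocular disparity."""
    return eye * (EYE_SEP / 2.0) / depth(i)

def sample_scene(n_occluders, rng=np.random.default_rng(0)):
    """Back-to-front list of (digit_class, x_position, scale); the target is centered,
    occluder x-positions are drawn from a uniform distribution as described above."""
    scene = [(int(rng.integers(10)), 0.0, apparent_scale(0))]
    for i in range(1, n_occluders + 1):
        x = float(rng.uniform(-0.5, 0.5))
        scene.append((int(rng.integers(10)), x, apparent_scale(i)))
    return scene
```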
Fig. 1. The stimuli and network models used. (A) The centered target object is occluded by 2–4 digits arranged in a 3D fashion (panels show 2, 3, and 4 occluders). (B) A sketch of the four network architectures named after their connection properties: B stands for bottom-up, L for lateral, and T for top-down connections.

2.2 Network Models

Four basic network models were compared as shown in Fig. 1 B: bottom-up connections only (B), bottom-up and top-down connections (BT), bottom-up and lateral connections (BL), and bottom-up, lateral, and top-down connections (BLT). As lateral and top-down connections introduce cycles into the computational graph, these models represent recurrent neural networks and allow information to be retained within a layer or to flow back into earlier layers.

Each model consists of an input layer, two hidden recurrent layers and an output layer. Both bottom-up and lateral connections are implemented as convolutional layers [14] with a stride of 1 × 1. After convolution the activations go through a 2 × 2 maxpooling layer with a stride of 2 × 2. The top-down connections are implemented as a transposed convolution [27] with output stride 2 × 2 to match the input size of the preceding convolutional layer. Each of the recurrent network models is unrolled and trained for four time steps by backpropagation [20]. When measuring accuracy, the output at the final unrolled time step available for the particular architecture is used.

To compensate for the fact that recurrent network models have more learnable parameters than their non-recurrent counterparts, we introduce two additional feedforward models, B-F and B-K. B-F is a feedforward model in which the number of convolutional filters or kernels in the hidden layers is increased from 32 to 64. B-K has an increased convolutional kernel size of 5 × 5, compared to 3 × 3 for the standard B model. As a larger kernel effectively increases the number of connections each unit has, B-K is the more appropriate control model; B-F, on the other hand, alters the representational power of the model by adding more feature maps. The number of learnable parameters for each of the models is given in Table 1.

Table 1. Number of learnable parameters for all models and input channels.

|                               | B      | B-F    | B-K    | BT     | BL     | BLT    |
|-------------------------------|--------|--------|--------|--------|--------|--------|
| Kernel size                   | 3 × 3  | 3 × 3  | 5 × 5  | 3 × 3  | 3 × 3  | 3 × 3  |
| Hidden layer units            | 32     | 64     | 32     | 32     | 32     | 32     |
| Parameters (1 image channel)  | 9,898  | 38,218 | 26,794 | 19,146 | 28,394 | 37,642 |
| Parameters (2 image channels) | 10,186 | 38,794 | 27,594 | 19,434 | 28,682 | 37,930 |

Layers. After the stimulus enters the network, its activations pass through two hidden convolutional layers. The inputs to these layers are denoted by $h^{(t,l)}_{i,j}$. This formulation represents the vectorized input of a patch centered on location $(i, j)$ in layer $l$, computed at time step $t$, across all feature maps indexed by $k$. With this notation the input stimulus presented to the network becomes $h^{(t,0)}_{i,j}$. The activation $z$ of a hidden recurrent layer can then be written as

$$z^{(t,l)}_{i,j,k} = \left(w^{(l)}_{B,k}\right)^{\top} h^{(t,l-1)}_{i,j} + \left(w^{(l)}_{L,k}\right)^{\top} h^{(t-1,l)}_{i,j} + \left(w^{(l)}_{T,k}\right)^{\top} h^{(t-1,l+1)}_{i,j}, \tag{1}$$

where $w^{(l)}_{\cdot,k}$ is the vectorized form of the convolutional kernel at feature map $k$ in layer $l$ for bottom-up (B), lateral (L), and top-down (T) connections, respectively. These kernels are only active for architectures that use the particular connection and are zero otherwise. Note that the lateral and top-down connections depend on values one time step earlier, so their inputs are defined to be zero vectors at $t = 0$, where there is no previous time step. Top-down connections are only present between the two hidden layers (Fig. 1 B).
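To make Eq. (1) concrete, here is a minimal sketch of one recurrent hidden layer written in PyTorch for illustration. This is not the authors' implementation: padding, bias handling, channel sizes, and where the 2 × 2 maxpooling sits are assumptions. At $t = 0$ the lateral and top-down inputs are simply passed as `None`, matching the zero initialization described above.

```python
import torch.nn as nn

class RecurrentConvLayer(nn.Module):
    """One hidden layer combining bottom-up (B), lateral (L) and top-down (T)
    inputs as in Eq. (1). Batch norm, ReLU and LRN (Eqs. 2-5) are applied outside."""

    def __init__(self, in_ch, ch, higher_ch=32, lateral=True, topdown=True, k=3):
        super().__init__()
        self.bottom_up = nn.Conv2d(in_ch, ch, k, stride=1, padding=k // 2)
        self.lateral = nn.Conv2d(ch, ch, k, stride=1, padding=k // 2) if lateral else None
        # Transposed convolution with output stride 2, undoing the 2x2 maxpooling
        # between the layers so the top-down signal matches this layer's resolution.
        self.top_down = (nn.ConvTranspose2d(higher_ch, ch, k, stride=2,
                                            padding=k // 2, output_padding=1)
                         if topdown else None)

    def forward(self, bottom_input, prev_self=None, prev_higher=None):
        z = self.bottom_up(bottom_input)                 # (w_B)^T h^(t, l-1)
        if self.lateral is not None and prev_self is not None:
            z = z + self.lateral(prev_self)              # (w_L)^T h^(t-1, l)
        if self.top_down is not None and prev_higher is not None:
            z = z + self.top_down(prev_higher)           # (w_T)^T h^(t-1, l+1)
        return z
```

Stacking two such layers and feeding each layer its own previous-step output (and the higher layer's previous-step output) for four unrolled steps yields the B, BL, BT, and BLT variants, depending on which connections are enabled.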
Following the flow of information, the $z^{(t,l)}_{i,j,k}$ of the hidden layers are then batch-normalized [8]. This technique normalizes an activation $z$ using the mean $\mu_\mathcal{B}$ and standard deviation $\sigma_\mathcal{B}$ over a mini-batch of activations $\mathcal{B}$ and adds multiplicative and additive noise:

$$\mathrm{BN}_{\gamma,\beta}\!\left(z^{(t,l)}_{i,j,k}\right) = \gamma^{(l)}_{k} \cdot \frac{z^{(t,l)}_{i,j,k} - \mu_\mathcal{B}}{\sigma_\mathcal{B}} + \beta^{(l)}_{k}, \tag{2}$$

where $\gamma$ and $\beta$ are additional learnable parameters. The output is then passed through rectified linear units (ReLU, $\sigma_z$),

$$\sigma_z\!\left(z^{(t,l)}_{i,j,k}\right) = \max\!\left(0,\, z^{(t,l)}_{i,j,k}\right), \tag{3}$$

and goes through local response normalization (LRN, $\omega$),

$$\omega\!\left(a^{(t,l)}_{i,j,k}\right) = a^{(t,l)}_{i,j,k} \left( c + \alpha \sum_{k'=\max(0,\, k-\frac{n}{2})}^{\min(K-1,\, k+\frac{n}{2})} \left(a^{(t,l)}_{i,j,k'}\right)^{2} \right)^{-\beta}, \tag{4}$$

with $n = 5$, $c = 1$, $\alpha = 10^{-4}$, $\beta = 0.5$, and $K$ the number of feature maps in the layer. Similar in justification to maxpooling, LRN implements a form of lateral inhibition by inducing competition for large activities among outputs computed using different kernels [12]. Finally, the output $h^{(t,l)}_{i,j,k}$ of each hidden layer can be written as

$$h^{(t,l)}_{i,j,k} = \omega\!\left(\sigma_z\!\left(\mathrm{BN}_{\gamma,\beta}\!\left(z^{(t,l)}_{i,j,k}\right)\right)\right). \tag{5}$$

After the second hidden layer the information flows through a fully-connected segment with ten output units and softmax activation, defined as

$$\mathrm{softmax}(a)_i = \frac{\exp(a_i)}{\sum_j \exp(a_j)}. \tag{6}$$

The resulting network output can be interpreted as a probability distribution over the ten classes.

Learning. The labels to be predicted by the network are encoded as one-hot vectors. To make the networks' output $\hat{y}^{(t)}$ match the target $y$, we use the cross-entropy cost function summed across all $\tau$ time steps and all $N$ output units:

$$J(\hat{y}, y) = -\sum_{t=0}^{\tau} \sum_{i=0}^{N} \left[ y_i \cdot \log \hat{y}^{(t)}_i + (1 - y_i) \cdot \log\!\left(1 - \hat{y}^{(t)}_i\right) \right]. \tag{7}$$

The Adam algorithm [11] with an initial learning rate of $\eta = 0.003$ was used to perform gradient descent. Unless stated otherwise, training occurred for 25 epochs with mini-batches of size 400.
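The following PyTorch sketch shows how Eqs. (2)–(7) fit together: the per-time-step post-processing of a hidden layer and the cross-entropy loss summed over the unrolled time steps. The simplifications (batch statistics only, no learnable γ/β, generic `outputs`/`labels` names) are assumptions for illustration, not the authors' code.

```python
import torch.nn.functional as F

def hidden_output(z):
    """Per-time-step output of a hidden layer, cf. Eqs. (2)-(5):
    batch norm -> ReLU -> local response normalization (n=5, c=1, alpha=1e-4, beta=0.5).
    Sketch only: batch statistics are used directly and gamma=1, beta=0 are assumed."""
    z = F.batch_norm(z, None, None, training=True)   # Eq. (2)
    z = F.relu(z)                                    # Eq. (3)
    return F.local_response_norm(z, size=5, alpha=1e-4, beta=0.5, k=1.0)  # Eq. (4)

def time_summed_cross_entropy(outputs, target_onehot):
    """Eq. (7): cross-entropy between softmax outputs and one-hot (float) targets,
    summed over classes, the mini-batch, and all unrolled time steps.
    `outputs` is a list of logit tensors, one entry per time step."""
    loss = 0.0
    for logits in outputs:
        y_hat = F.softmax(logits, dim=1)             # Eq. (6)
        loss = loss + F.binary_cross_entropy(y_hat, target_onehot, reduction="sum")
    return loss

# Usage sketch (optimizer assumed: torch.optim.Adam(model.parameters(), lr=3e-3),
# mini-batches of 400 as described above):
# loss = time_summed_cross_entropy(model(images), F.one_hot(labels, 10).float())
```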
2.3 Model Performance Metrics and Evaluation Techniques

The different models were evaluated in terms of classification accuracy averaged across the test set. Test performances were compared with each other using pairwise McNemar's tests [17], as suggested in [6]. This technique does not require repeated training and therefore constitutes a computationally efficient method to evaluate a variety of different models. As multiple comparisons increase the chance of false positives, a Bonferroni-type correction procedure was employed to control the false discovery rate (FDR) at 0.05 [2].

3 Results

3.1 Performance Evaluation

Networks were trained on datasets combining all three occlusion levels to evaluate the benefit of feedback connections. Training lasted twenty-five epochs. Fig. 2 depicts the classification error $E_{cl} = 1 - \text{accuracy}$ for the models trained with monocular (A) and stereoscopic (B) input.

Fig. 2. Performance comparison of different network architectures. Error bars indicate the standard error based on five repetitions of the training and testing procedure. Matrices depict results of pairwise McNemar tests, black squares indicating significant differences at p < 0.05. (A) Monocular input. (B) Stereoscopic input.

Our results reveal that recurrent architectures perform consistently better than feedforward networks of approximately equal complexity. Notably, B-K performs significantly worse than B-F, questioning the benefits of the increased kernel size, $\chi^2(1, N = 30{,}000) = 46.29$, $p < .01$. Significant differences (FDR = 0.05) can be attested for all combinations except (B, B-K), $\chi^2(1, N = 30{,}000) = 1.69$, $p = .13$, and (BL, BLT), $\chi^2(1, N = 30{,}000) = 0.94$, $p = .26$. The lower left 3 × 3 square, highlighted by a white line, indicates that all pairwise tests between feedforward and recurrent models show a significant advantage of the recurrent architectures. The relative differences in error rate between feedforward and recurrent models are larger in the stereoscopic case.

When trained separately on the three datasets, we observe almost the same patterns, while the error rates grow with the number of occluders, as expected (see Table 2). The BLT model produces the lowest error rates for each dataset.

Table 2. Error rates for all model architectures; standard error based on five independent training runs. The 2-, 3-, and 4-occluder runs were trained for 100 epochs with mini-batches of size 100. Best performance per dataset is highlighted in bold.

| Channels   | Occ. | B           | B-F         | B-K         | BT          | BL          | BLT             |
|------------|------|-------------|-------------|-------------|-------------|-------------|-----------------|
| 1 (mono)   | 2    | .134 ± .004 | .123 ± .003 | .143 ± .002 | .109 ± .003 | .103 ± .003 | **.095 ± .002** |
|            | 3    | .330 ± .005 | .337 ± .004 | .359 ± .004 | .293 ± .003 | .282 ± .005 | **.280 ± .003** |
|            | 4    | .512 ± .005 | .519 ± .005 | .546 ± .005 | .477 ± .003 | .463 ± .005 | **.455 ± .006** |
|            | all  | .304 ± .001 | .284 ± .001 | .306 ± .002 | .277 ± .001 | .257 ± .001 | **.253 ± .000** |
| 2 (stereo) | 2    | .095 ± .003 | .078 ± .003 | .094 ± .003 | .069 ± .003 | .059 ± .002 | **.056 ± .002** |
|            | 3    | .267 ± .006 | .279 ± .003 | .287 ± .004 | .217 ± .004 | .199 ± .003 | **.189 ± .003** |
|            | 4    | .455 ± .006 | .472 ± .004 | .482 ± .005 | .395 ± .003 | .373 ± .003 | **.361 ± .003** |
|            | all  | .244 ± .000 | .230 ± .000 | .245 ± .001 | .207 ± .001 | .184 ± .000 | **.180 ± .001** |

3.2 Evolution in Time and Hidden Representation

The softmax output of the BLT architecture illustrates how recurrent feedback can revise the network's belief over time. In fact, we observe that wrong initial guesses are corrected and correct guesses are reinforced. Specific examples are shown in Fig. 3 A: while the network estimates the target digit to be 6 at $t_0$, the final output is the correct answer 1 (left panel). The mean softmax activations for specific classes, Fig. 3 B, indicate that the probabilities assigned to incorrect classes decrease over time. Additionally, this visualization reveals systematic visual similarities that the network has discovered between digits 2 and 7 and digits 4 and 1.

To better understand how recurrent connections contribute to the performance gains, we consider the activation patterns $a^{(t)}$ from the last hidden layer of the network for every time step $t$. We visualize the corresponding high-dimensional space using t-SNE [16], see Fig. 4. The black lines represent the activations caused by un-occluded stimuli evolving in time, henceforth called time-trajectories. The colored scatter plots illustrate the activations corresponding to occluded stimuli of different classes at different time steps.
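As a rough illustration of this analysis, the following sketch jointly embeds last-hidden-layer activations from all unrolled time steps with t-SNE (scikit-learn assumed; how activations are extracted and preprocessed is not specified in the paper, so the dictionary interface here is a placeholder).

```python
import numpy as np
from sklearn.manifold import TSNE

def embed_activations_over_time(activations):
    """Joint 2-D t-SNE embedding of last-hidden-layer activations.

    `activations` is a dict {time_step: array of shape (n_stimuli, n_features)},
    e.g. flattened feature maps of the second hidden layer at t0..t3.
    Embedding all time steps together makes per-stimulus trajectories
    (the black time-trajectories in Fig. 4) comparable across steps.
    """
    steps = sorted(activations)
    stacked = np.concatenate([activations[t] for t in steps], axis=0)
    embedded = TSNE(n_components=2, perplexity=30, init="random",
                    random_state=0).fit_transform(stacked)
    # Split the joint embedding back into one block per time step.
    n = activations[steps[0]].shape[0]
    return {t: embedded[i * n:(i + 1) * n] for i, t in enumerate(steps)}
```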
Interestingly, the t-SNE visualization reveals the internal representation at the first time step to be more intermingled, where arguably very similar stimuli across classes are placed close to each other. At later time steps, however, the representation becomes well-separated and the classes have seemingly no overlap. As seen in Fig. 4 A, the representation of the un-occluded stimuli moves towards the corresponding class-cluster at $t_3$. The first time step accounts for most of the distance travelled, followed by fine adjustments at $t_2$ and $t_3$ (see the detailed view in Fig. 4 B).

Fig. 3. Softmax output of BLT. (A) Specific stimuli (1, 8, 5) illustrating the effect of recurrent feedback; panels show the softmax output per class label at time steps $t_0$–$t_3$. (B) Mean softmax output over all test stimuli of specific classes (2, 4, 8). Shaded areas correspond to the standard error.

Fig. 4. t-SNE depiction of the network's representation of un-occluded stimuli (black) evolving in time. Time-trajectories are shown as black lines, the colors represent clusters of different classes. See text for details.

Class-clusters at $t_3$ are actually cluster pairs, corresponding to the occluders appearing mainly on the left or on the right (Fig. 4 C). Indeed, when we tested the network with digits that were occluded from one side only, this structure disappeared and every digit was represented by just a single cluster at $t_3$. Activation patterns that do not fall close to a cluster tend to be dominated by the occluders. This is illustrated in Fig. 4 D by a sample of class 9 (blue), shown in high resolution and as seen by the network.

We hypothesize that the recurrent connections help to discount the occluders by keeping the internal representation of the input close to that of a pure, un-occluded target stimulus. To test this, we compare the distances between activations caused by the stereo-digit input, the un-occluded target and the un-occluded occluders (Fig. 5). The resulting relative distances reveal that the representation of the input grows closer to the target stimulus, relative to the occluder stimuli, over time. This finding is consistent with the idea that the recurrent connections allow the network to discount the occluders and is also backed up by the finding that the sum of the recurrent weights (lateral and top-down) becomes slightly negative during training (not shown). A minimal sketch of this relative-distance measure is given below.

Fig. 5. Analysis of the internal representation of occluded stimuli shows discounting of occluders due to recurrent processing. (A) We define a relative distance measure (distance $d_t$ to the un-occluded target and $d_{o,1}$, $d_{o,2}$ to the un-occluded occluder stimuli) to quantify whether the activation of a stimulus is relatively closer to the un-occluded target than to the occluder stimuli. Values below one indicate relative proximity to the target. (B) Violin plots display the relative distances to occluders 1 and 2 at different time steps for stimuli occluded from the left. The dashed line represents the mean of the distribution at $t_0$.
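The sketch below makes the relative-distance analysis concrete. Euclidean distance between activation vectors is an assumption (the paper does not name the metric), and the variable names are placeholders.

```python
import numpy as np

def relative_distances(a_input, a_target, a_occluders):
    """Relative distance of an occluded stimulus' activation to its un-occluded target.

    a_input:     activation of the occluded stereo-digit input, shape (n_features,)
    a_target:    activation of the un-occluded target digit alone
    a_occluders: list of activations of the un-occluded occluder digits alone

    Returns one ratio d_t / d_{o,i} per occluder; values below one mean the
    representation lies closer to the target than to that occluder (cf. Fig. 5).
    """
    d_t = np.linalg.norm(a_input - a_target)
    return [d_t / np.linalg.norm(a_input - occ) for occ in a_occluders]

# Example: track how the ratios evolve over the unrolled time steps t0..t3 for one
# stimulus, given per-step activations a_in, a_tgt, a_o1, a_o2 from the last hidden layer:
# ratios_per_step = [relative_distances(a_in[t], a_tgt[t], [a_o1[t], a_o2[t]])
#                    for t in range(4)]
```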
4 Discussion

We studied if and how recurrent connections benefit occluded object recognition. Similar to [22], but for a more realistic dataset, we found that recurrent models significantly outperformed their non-recurrent counterparts for a near-equal number of learnable parameters. Of the feedforward ensemble, B-F performed best on the given task, suggesting that additional convolutional filters are more helpful than larger filters. Contrary to the findings in [22], B-K did not perform better than the standard B model. The advantage of larger receptive fields reported there might be linked to the stimuli used in that study, where certain irregularities of the occluders only become apparent at larger scales. Of the recurrent network ensemble, BT performed worst, suggesting that lateral connections are particularly important. As information has to pass through a convolution and maxpooling and back through a deconvolution, instead of being transferred directly within the same layer, it may suffer in quality. In contrast, the BL model takes advantage of the lateral shortcuts to pass information through time without relying on more abstract features. The combination of both models (BLT) performs best in all runs.

As previous work has shown, similar recurrent architectures also outperform parameter-matched control models when no occlusion is present [15]. This is corroborated by studies that have investigated how object information unfolds over time in the brain [3]. Therefore, some level of recurrent connectivity in artificial neural networks might be beneficial even for standard object recognition tasks.

The significant performance gains for stereoscopic input can be explained by the fact that the additional input channel introduces a new perspective of the scene, thus giving the network more information about the occluded target. Qualitatively, the results of the statistical network comparisons resemble those obtained for monocular stimuli. Interestingly, however, the performance difference between recurrent and feedforward models was substantially higher for stereoscopic stimuli.

The on average slightly negative weights of the recurrent connections might contribute to inhibiting or discounting occluders. With the network dynamics being determined by the ReLU activation function, a slight bias towards inhibitory weights might also be key to keeping activations centered around the non-linearity.

For recurrent architectures, the probability distribution over possible outcomes has been shown to evolve with time. For challenging stimuli the recurrent dynamics are able to revise the best guess for the target. In line with [18,26], we hypothesized that some of the missing information from occluded regions of the input image is recovered over time by the recurrent connections. The evolution of the output and our qualitative t-SNE analysis support this hypothesis: at the first time step, representations are more intermingled and spread out, and visually very similar stimuli across classes are represented close to each other. At later time steps the representation has evolved in such a way that the classes are well-separated.

The visualization also revealed that the recurrent network distinguishes which side of the target stimulus is occluded, indicated by the representation of one digit being separated into two well-separated clusters.
We conjecture that the network learns the left and right halves of the un-occluded stimulus and is therefore able to ignore distractors on the opposing side. The fragmentation nevertheless allows for high-accuracy classification by the final readout layer, since the subsequent fully connected layer is able to classify correctly as long as the activation patterns remain linearly separable.

Investigating the internal representation, we could also show that over time the activations of occluded stimuli are grouped together with the activation caused by the un-occluded target stimulus, relative to the un-occluded occluder digits. The representation being closer to the un-occluded target hints at the recurrent connections playing an important role in discounting the occluders.

In conclusion, recurrent convolutional neural networks have been shown to outperform feedforward networks at occluded object recognition. Building on previous work where parts of the target object were deleted [18] or occluded [22], we could show that the same performance advantages exist for a more realistic 3D occlusion scenario with stereoscopic input. In fact, the advantages were even greater for the more realistic stereoscopic input compared to monocular input. Furthermore, our analysis revealed how recurrent connections revise the network's output over time, sometimes correcting an incorrect initial output after the first feedforward pass through the network. Future work should investigate whether these advantages generalize to larger and more complex network architectures than those considered here. Given the better performance and greater biological plausibility of recurrent network architectures, they deserve more detailed study.

References

1. Adesnik, H., Scanziani, M.: Lateral competition for cortical space by layer-specific horizontal circuits. Nature 464(7292), 1155 (2010). https://doi.org/10.1038/nature08935
2. Benjamini, Y., Hochberg, Y.: Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B (Methodological) 57(1), 289–300 (1995). https://doi.org/10.1111/j.2517-6161.1995.tb02031.x
3. Brincat, S.L., Connor, C.E.: Dynamic shape synthesis in posterior inferotemporal cortex. Neuron 49(1), 17–24 (2006). https://doi.org/10.1016/j.neuron.2005.11.026
4. Cichy, R.M., Pantazis, D., Oliva, A.: Resolving human object recognition in space and time. Nature Neuroscience 17(3), 455 (2014). https://doi.org/10.1038/nn.3635
5. DiCarlo, J.J., Zoccolan, D., Rust, N.C.: How does the brain solve visual object recognition? Neuron 73(3), 415–434 (2012). https://doi.org/10.1016/j.neuron.2012.01.010
6. Dietterich, T.G.: Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation 10(7), 1895–1923 (1998). https://doi.org/10.1162/089976698300017197
7. Felleman, D.J., Van Essen, D.C.: Distributed hierarchical processing in the primate cerebral cortex. Cerebral Cortex 1(1), 1–47 (1991). https://doi.org/10.1093/cercor/1.1.1
8. Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: International Conference on Machine Learning. pp. 448–456 (2015)
9. Isik, L., Meyers, E.M., Leibo, J.Z., Poggio, T.: The dynamics of invariant object recognition in the human visual system. Journal of Neurophysiology 111(1), 91–102 (2014). https://doi.org/10.1152/jn.00394.2013
10. Johnson, J.S., Olshausen, B.A.: The recognition of partially visible natural objects in the presence and absence of their occluders. Vision Research 45(25), 3262–3276 (2005). https://doi.org/10.1016/j.visres.2005.06.007
11. Kingma, D.P., Ba, J.L.: Adam: A method for stochastic optimization. In: Proc. 3rd Int. Conf. Learn. Representations (2014)
12. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems. pp. 1097–1105 (2012)
13. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436 (2015). https://doi.org/10.1038/nature14539
14. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11), 2278–2324 (1998). https://doi.org/10.1109/5.726791
15. Liang, M., Hu, X.: Recurrent convolutional neural network for object recognition. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 3367–3375 (2015)
16. Maaten, L.v.d., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(Nov), 2579–2605 (2008)
17. McNemar, Q.: Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika 12(2), 153–157 (1947). https://doi.org/10.1007/BF02295996
18. O'Reilly, R.C., Wyatte, D., Herd, S., Mingus, B., Jilk, D.J.: Recurrent processing during object recognition. Frontiers in Psychology 4, 124 (2013). https://doi.org/10.3389/fpsyg.2013.00124
19. Riesenhuber, M., Poggio, T.: Hierarchical models of object recognition in cortex. Nature Neuroscience 2(11), 1019–1025 (1999). https://doi.org/10.1038/14819
20. Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning representations by back-propagating errors. Nature 323(6088), 533 (1986). https://doi.org/10.1038/323533a0
21. Serre, T., Oliva, A., Poggio, T.: A feedforward architecture accounts for rapid categorization. Proceedings of the National Academy of Sciences 104(15), 6424–6429 (2007). https://doi.org/10.1073/pnas.0700622104
22. Spoerer, C.J., McClure, P., Kriegeskorte, N.: Recurrent convolutional neural networks: A better model of biological object recognition. Frontiers in Psychology 8, 1551 (2017). https://doi.org/10.3389/fpsyg.2017.01551
23. Sporns, O., Zwi, J.D.: The small world of the cerebral cortex. Neuroinformatics 2(2), 145–162 (2004). https://doi.org/10.1385/NI:2:2:145
24. Tang, H., Buia, C., Madhavan, R., Crone, N.E., Madsen, J.R., Anderson, W.S., Kreiman, G.: Spatiotemporal dynamics underlying object completion in human ventral visual cortex. Neuron 83(3), 736–748 (2014). https://doi.org/10.1016/j.neuron.2014.06.017
25. Thorpe, S., Fize, D., Marlot, C.: Speed of processing in the human visual system. Nature 381(6582), 520–522 (1996). https://doi.org/10.1038/381520a0
26. Wyatte, D., Curran, T., O'Reilly, R.: The limits of feedforward vision: recurrent processing promotes robust object recognition when objects are degraded. Journal of Cognitive Neuroscience 24(11), 2248–2261 (2012). https://doi.org/10.1162/jocn_a_00282
27. Zeiler, M.D., Krishnan, D., Taylor, G.W., Fergus, R.: Deconvolutional networks. In: 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR). pp. 2528–2535. IEEE (2010). https://doi.org/10.1109/CVPR.2010.5539957
