High-Performance Neural Networks for Visual Object Classification

Authors: Dan C. Cireşan, Ueli Meier, Jonathan Masci, Luca M. Gambardella and Jürgen Schmidhuber

Technical Report No. IDSIA-01-11, January 2011

IDSIA / USI-SUPSI, Dalle Molle Institute for Artificial Intelligence, Galleria 2, 6928 Manno, Switzerland. IDSIA is a joint institute of the University of Lugano (USI) and the University of Applied Sciences of Southern Switzerland (SUPSI), and was founded in 1988 by the Dalle Molle Foundation, which promoted quality of life. This work was partially supported by the Swiss Commission for Technology and Innovation (CTI), Project n. 9688.1 IFF: Intelligent Fill in Form.

Abstract

We present a fast, fully parameterizable GPU implementation of Convolutional Neural Network variants. Our feature extractors are neither carefully designed nor pre-wired, but rather learned in a supervised way. Our deep hierarchical architectures achieve the best published results on benchmarks for object classification (NORB, CIFAR10) and handwritten digit recognition (MNIST), with error rates of 2.53%, 19.51% and 0.35%, respectively. Deep nets trained by simple back-propagation perform better than shallower ones. Learning is surprisingly rapid. NORB is completely trained within five epochs. Test error rates on MNIST drop to 2.42%, 0.97% and 0.48% after 1, 3 and 17 epochs, respectively.

1 Introduction

The human visual system efficiently recognizes and localizes objects within cluttered scenes. For artificial systems, however, this is still difficult, due to viewpoint-dependent object variability and the high in-class variability of many object types. Deep hierarchical neural models roughly mimic the nature of mammalian visual cortex, and by community consensus are among the most promising architectures for such tasks. The most successful hierarchical object recognition systems all extract localized features from input images, convolving image patches with filters. Filter responses are then repeatedly sub-sampled and re-filtered, resulting in a deep feed-forward network architecture whose output feature vectors are eventually classified. One of the first hierarchical neural systems was the Neocognitron (Fukushima, 1980), which inspired many of the more recent variants.

Unsupervised learning methods applied to patches of natural images tend to produce localized filters that resemble off-center-on-surround filters, orientation-sensitive bar detectors and Gabor filters (Schmidhuber et al., 1996; Olshausen and Field, 1997; Hoyer and Hyvärinen, 2000). These findings, in conjunction with experimental studies of the visual cortex, justify the use of such filters in the so-called standard model for object recognition (Riesenhuber and Poggio, 1999; Serre et al., 2007; Mutch and Lowe, 2008), whose filters are fixed, in contrast to those of Convolutional Neural Networks (CNNs) (LeCun et al., 1998; Behnke, 2003; Simard et al., 2003), whose weights (filters) are randomly initialized and changed in a supervised way using back-propagation (BP).
Despite the hardware progress of the past decades, computational speed is still a limiting factor for CNN architectures characterized by many building blocks typically set by trial and error. To systematically test the impact of various architectures on classification performance, we present a fast CNN implementation on Graphics Processing Units (GPUs). Previous GPU implementations of CNNs (Chellapilla et al., 2006; Uetz and Behnke, 2009) were hard-coded to satisfy GPU hardware constraints, whereas our implementation is flexible and fully online (i.e., weight updates after each image). It allows for training large CNNs within days instead of months, such that we can investigate the influence of various structural parameters by exploring large parameter spaces (Pinto et al., 2009) and performing error analysis on repeated experiments. We evaluate various networks on the handwritten digit benchmark MNIST (LeCun et al., 1998) and two image classification benchmarks: NORB (LeCun et al., 2004) and CIFAR10 (Krizhevsky, 2009).

2 Convolutional neural networks

CNNs are hierarchical neural networks whose convolutional layers alternate with subsampling layers, reminiscent of simple and complex cells in the primary visual cortex (Hubel and Wiesel, 1959). CNNs vary in how convolutional and subsampling layers are realized and how the nets are trained. The CNN architecture considered in this study differs from the one of Simard et al. (2003) in the sense that after each CNN layer an optional max-pooling layer (Scherer et al., 2010) can be used. Here we give a complete description of this independent implementation (Fig. 1).

2.1 Image processing layer

The image processing layer is an optional pre-processing layer of predefined filters that are kept fixed during training. Thus additional information besides the raw input image, such as edges and gradients, can be provided to the network. In particular, we find that a contrast-extracting layer (Fukushima, 2003) helps to improve the recognition rate for NORB.

2.2 Convolutional layer

A convolutional layer is parametrized by the size and the number of the maps, kernel sizes, skipping factors, and the connection table. Each layer has M maps of equal size (M_x, M_y). A kernel (blue rectangle in Fig. 1) of size (K_x, K_y) is shifted over the valid region of the input image (i.e., the kernel has to be completely inside the image). The skipping factors S_x and S_y define how many pixels the filter/kernel skips in x- and y-direction between subsequent convolutions. The size of the output map is then defined as:

$$M_x^n = \frac{M_x^{n-1} - K_x^n}{S_x^n + 1} + 1, \qquad M_y^n = \frac{M_y^{n-1} - K_y^n}{S_y^n + 1} + 1 \qquad (1)$$

where index n indicates the layer. Each map in layer L^n is connected to at most M^{n-1} maps in layer L^{n-1}. Neurons of a given map share their weights but have different receptive fields.
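To make the size arithmetic of Eq. (1) concrete, the following minimal Python sketch (our own illustration, not the paper's CUDA code; the function name and example values are ours) computes the output map size along one dimension from the previous map size, the kernel size and the skipping factor.

```python
def conv_output_size(m_prev: int, k: int, s: int) -> int:
    """Output map size along one dimension, following Eq. (1):
    M^n = (M^{n-1} - K^n) / (S^n + 1) + 1.
    The kernel is shifted over the valid region only, moving s + 1
    pixels between subsequent convolutions; kernel sizes and skipping
    factors are normally chosen so that the division is exact."""
    return (m_prev - k) // (s + 1) + 1

# Example: a 29 x 29 map filtered with a 5 x 5 kernel and skipping factor 1
# yields a 13 x 13 map, matching the maps shown in Fig. 2.
print(conv_output_size(29, 5, 1))  # -> 13
```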
2.3 Max-pooling layer

The biggest architectural difference between our implementation and the CNN of LeCun et al. (1998) is the use of a max-pooling layer instead of a sub-sampling layer. No such layer is used by Simard et al. (2003), who simply skip nearby pixels prior to convolution, instead of pooling or averaging. Scherer et al. (2010) found that max-pooling can lead to faster convergence, select superior invariant features, and improve generalization. The output of the max-pooling layer is given by the maximum activation over non-overlapping rectangular regions of size (K_x, K_y). Max-pooling enables position invariance over larger local regions and downsamples the input image by a factor of K_x and K_y along each direction.

2.4 Classification layer

Kernel sizes of convolutional filters and max-pooling rectangles as well as skipping factors are chosen such that either the output maps of the last convolutional layer are downsampled to 1 pixel per map, or a fully connected layer combines the outputs of the topmost convolutional layer into a 1D feature vector. The top layer is always fully connected, with one output unit per class label.

Figure 1: Architecture of a convolutional neural network. In this case, the convolutional layers are fully connected. Both convolutional layers use a kernel of 5 x 5 and skipping factors of 1.

3 GPU implementation

The latest generation of NVIDIA GPUs, the 400 and 500 series (we use GTX 480 & GTX 580), has many advantages over older GPUs, most notably the presence of a R/W L2 global cache for device memory. This permits faster programs and simplifies writing the code. In fact, the corresponding transfer of complexity into hardware alleviates many software and optimization problems. Our experiments show that the CNN program becomes 2-3 times faster just by switching from GTX 285 to GTX 480.

Manual optimization of CUDA code is very time-consuming and error-prone. We optimize for the new architecture, relying on the L2 cache for many of the device memory accesses, instead of manually writing code that uses textures and shared memory. Code obtained by this pragmatic strategy is fast enough. We use the following types of optimization: pre-computed expressions, unrolled loops within template kernels, strided matrices to obtain coalesced memory accesses, and registers wherever possible. Additional manual optimizations are possible in case future image classification problems will require even more computing power.

3.1 Data structures

Both outputs y and deltas δ of layer L^n are 2D strided. Their original size is M_x × M·M_y, but they are horizontally strided with a pitch of 32 floats (we use this stride for all 2D data), resulting in coalesced memory accesses. The vertical stride avoids additional bounding tests in CUDA kernels.

All connections between maps of consecutive layers L^{n-1} and L^n are stored in matrix C^n. Each row of C^n contains all connections that feed into a particular map in layer L^n. Because we aim for a flexible architecture with partially connected layers, in the first column we store the number of previous connections. This index is useful for the Forward Propagation (FP) and Adjusting Weights (AW) CUDA kernels. The second column stores the number of connections, followed by the corresponding indices of maps in L^{n-1} connected to the current map. For BP and FP, analogous information about connections is needed. We therefore store backward connections in C_BP. AW requires a list of all map connections (see Subsection 3.4), stored as an array of map index pairs. Dealing with biases in the BP kernel requires knowing where the weights of particular connections start; this information is stored in a 2D array WIDX_BP of size M^n × M^{n-1}.
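The following Python/NumPy sketch gives a rough flavor of these layouts. It is our own illustration, not the actual implementation: the real data lives in CUDA device memory, and the two-column encoding of C^n is simplified here to one list of source-map indices per map; all names are ours.

```python
import numpy as np

PITCH = 32  # horizontal stride in floats, for coalesced memory accesses

def strided_map(m_x: int, m_y: int) -> np.ndarray:
    """Allocate one map of size m_y x m_x, padded horizontally to a
    multiple of 32 floats, mimicking the strided layout used for the
    outputs y and deltas of a layer."""
    padded_x = ((m_x + PITCH - 1) // PITCH) * PITCH
    buf = np.zeros((m_y, padded_x), dtype=np.float32)
    return buf  # only buf[:, :m_x] holds valid data

# Simplified connection table: for every map in layer L^n, the indices of
# the maps in L^{n-1} that feed into it (partial connectivity).
forward_connections = [
    [0, 2, 3],  # map 0 of L^n reads from maps 0, 2 and 3 of L^{n-1}
    [1, 2],     # map 1 of L^n reads from maps 1 and 2 of L^{n-1}
]

# The backward table (analogous to C_BP) simply inverts this mapping.
backward_connections: dict[int, list[int]] = {}
for dst, srcs in enumerate(forward_connections):
    for src in srcs:
        backward_connections.setdefault(src, []).append(dst)
```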
3.2 Forward propagation

A straightforward way of parallelizing FP is to assign a thread block to each map that has to be computed. For maps bigger than 1024 neurons, the job is further split into smaller blocks by assigning a block to each line of the map, because the number of threads per block is limited (1024 for GTX 480). A one-to-one correspondence between threads and the map's neurons is assumed. Because of weight sharing, threads inside a block can access data in parallel, in particular the same weights and inputs from the previous layer. Each thread starts by initializing its sum with the bias, then loops over all map connections, convolving the appropriate patch of the input map with the corresponding kernel. The output is obtained by passing the sum through a scaled tanh activation function, and is then written to device memory.

3.3 Backward propagation

BP of deltas can be done in two ways: by pushing or by pulling. Pushing deltas means taking each delta from the current layer and computing the corresponding deltas for the previous layer. For an architecture with shared weights this has the disadvantage of being hard to code. Each delta from the current layer contributes to many deltas in the previous layer, which translates into a lot of programming. There are two ways of avoiding this: either writing partial deltas to a separate block of memory and then putting everything together by calling another kernel (slow because of a tremendous increase in the number of memory accesses, and the need for another kernel), or using atomic writes (to avoid data hazards) to update deltas (very slow because many writes are serialized). We implement pulling deltas, which has almost none of the above speed-limiting drawbacks, but is a bit more complicated.

The (uni- or bi-dimensional) thread grid assigns a (bi- or uni-dimensional) thread block to each map in the previous layer and a thread to each neuron in every map. Similar to FP, for maps with more than 1024 neurons, the 2D grid is further split into smaller 1D blocks by assigning a 2D block to each row of the map. Each thread computes the delta of its corresponding neuron by pulling deltas from the current layer. For every neuron in the previous layer we have to determine the list of neurons in the current layer which are connected to it. Let us consider neuron (i, j) from a map in layer L^{n-1}, and assume that (x, y) are the coordinates of neurons in maps of L^n that contribute to the delta of neuron (i, j). The (x, y) neuron is connected to K_x × K_y neurons (the kernel size) from each connected map in the previous layer. The indices in L^{n-1} of the neurons connected through a kernel to the (x, y) neuron are:

$$x (S_x + 1) \le i \le x (S_x + 1) + K_x - 1, \qquad y (S_y + 1) \le j \le y (S_y + 1) + K_y - 1.$$

We can now compute the inequalities for (x, y):

$$\frac{i - K_x + 1}{S_x + 1} \le x \le \frac{i}{S_x + 1}, \qquad \frac{j - K_y + 1}{S_y + 1} \le y \le \frac{j}{S_y + 1}.$$

Because (x, y) has to be inside the map, the final inequalities are:

$$\max\left(\left\lceil \frac{i - K_x + 1}{S_x + 1} \right\rceil, 0\right) \le x \le \min\left(\left\lfloor \frac{i}{S_x + 1} \right\rfloor, M_x - 1\right),$$
$$\max\left(\left\lceil \frac{j - K_y + 1}{S_y + 1} \right\rceil, 0\right) \le y \le \min\left(\left\lfloor \frac{j}{S_y + 1} \right\rfloor, M_y - 1\right).$$

The above inequalities state that the delta of neuron (i, j) from L^{n-1} is computed from deltas of neurons in a rectangular area in maps of L^n (Fig. 2). After summing up the deltas, each thread multiplies the result by the derivative of the activation function.
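A minimal Python rendering of the pulling rule above (our own illustration; the actual kernels are CUDA, and the function name is ours): given neuron (i, j) in a map of L^{n-1}, it returns the rectangular range of neurons (x, y) in a connected map of L^n whose deltas are pulled.

```python
import math

def pulling_range(i, j, kx, ky, sx, sy, mx_n, my_n):
    """Range of neurons (x, y) in the current layer L^n whose deltas are
    pulled when computing the delta of neuron (i, j) in L^{n-1}.
    Follows the final inequalities of Section 3.3; mx_n, my_n are the
    map dimensions of L^n."""
    x_lo = max(math.ceil((i - kx + 1) / (sx + 1)), 0)
    x_hi = min(i // (sx + 1), mx_n - 1)
    y_lo = max(math.ceil((j - ky + 1) / (sy + 1)), 0)
    y_hi = min(j // (sy + 1), my_n - 1)
    return (x_lo, x_hi), (y_lo, y_hi)

# Example with the maps of Fig. 2: 29 x 29 -> 13 x 13 via a 5 x 5 kernel,
# skipping factors 1. Neuron (6, 6) pulls deltas from a 3 x 3 block of L^n.
print(pulling_range(i=6, j=6, kx=5, ky=5, sx=1, sy=1, mx_n=13, my_n=13))
# -> ((1, 3), (1, 3))
```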
3.4 Adjusting weights

FP and BP have a grid on the list of maps, but the AW thread grid is on the list of kernels (filters) between maps of two consecutive layers. The 1D grid has a block for each connection between two maps. Thread blocks are 2D, with a corresponding thread for every kernel weight. The bias weight is included as an entire row of threads, thus requiring thread blocks to have (K_x + 1) × K_y threads. Most of the time these additional K_y threads do nothing, thread (0,0) being activated only for blocks that have to process the bias.

4 Experiments

We use a system with a Core i7-920 (2.66 GHz), 12 GB DDR3 and four graphics cards: 2 x GTX 480 and 2 x GTX 580. The correctness of the CPU version is checked by comparing the analytical gradient with its finite difference approximation. On GPU this is not possible because all computations are performed with single-precision floating point numbers. Hence the GPU implementation's correctness is checked by comparing its results to those obtained when the same randomly initialized net is trained for several epochs on the more accurate CPU version. Obtaining identical results after trillions of operations is a strong indication of correctness.

Figure 2: Back-propagating deltas. A connection between two maps from two consecutive layers is displayed. The map in L^{n-1} has 29 x 29 neurons; the map in L^n has 13 x 13 neurons. They are linked through a 5 x 5 kernel K. Skipping factors of S_x = 1 and S_y = 1 are assumed. Arrows and colors depict the correspondence between neurons in L^{n-1} and their sources in L^n.

The implemented CNN's plain feed-forward architecture is trained using on-line gradient descent. All images from the training set are used for training and also for validation. If deformations are enabled, only the images from the training set are deformed. Weights are initialized according to a uniform random distribution in the range [−0.05, 0.05]. Each neuron's activation function is a scaled hyperbolic tangent: y(a) = 1.7159 tanh(0.6666 a) (LeCun et al., 1998). We pick the trained CNN with the lowest validation error and evaluate it on the test set (Test for best Validation - TfbV). The best test error (bT) is also listed for all experiments. The reported computation times per epoch include training, validation and testing as well as all data transfers.

4.1 Experiments on MNIST

For the MNIST dataset the networks are trained on deformed images, continually generated in on-line fashion. Affine (translation, rotation, scaling, horizontal shearing) and elastic deformations (Simard et al., 2003) are combined. We use a variable learning rate that shrinks by a multiplicative constant after each epoch, from 10^−3 down to 3·10^−5 after 500 epochs.
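The per-epoch multiplier implied by this schedule can be worked out directly. The short sketch below is our own illustration (variable names are ours): it computes the constant factor and the resulting learning-rate sequence.

```python
# Anneal the learning rate from 1e-3 to 3e-5 over 500 epochs with a constant
# per-epoch factor c: eta_t = eta_0 * c**t, so c = (eta_final / eta_0)**(1/500).
eta_0, eta_final, epochs = 1e-3, 3e-5, 500
c = (eta_final / eta_0) ** (1.0 / epochs)

def learning_rate(epoch: int) -> float:
    """Learning rate used in epoch `epoch` (0-based)."""
    return eta_0 * c ** epoch

print(round(c, 5))                  # ~0.99301 per epoch
print(learning_rate(0))             # 0.001
print(round(learning_rate(500), 8)) # ~3e-05
```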
Fully connected convolutional layers lead to an exploding number of network connections and weights, making training of big and deep CNNs for hundreds of epochs impractical even on GPUs. Partial connectivity alleviates this problem and is also biologically more plausible. We reduce the number of connections between convolutional layers in a random way. Table 1 lists results of various networks with 2 to 7 hidden layers with random connections.

Additional layers result in better networks, the best one achieving a test error of 0.35% for best validation and a best test error of 0.27%. The best previous CNN result on MNIST is 0.40% (Simard et al., 2003). A 0.35% error rate was recently also obtained by a big, deep MLP (Cireşan et al., 2010) with many more free parameters. Deeper nets require more computation time to complete an epoch, but we observe that they also need fewer epochs to achieve good test errors.

Table 1: Error rates on the MNIST test set for randomly connected CNNs with 2 to 6 convolutional layers with M maps and an optional fully connected layer with N neurons. Various kernel sizes and skipping factors were used.

#M, #N in Hidden Layers             bT [%]   TfbV [%]
20M-60M                             0.95     1.02
20M-60M-150N                        0.50     0.55
20M-60M-100M-150N                   0.33     0.38
20M-40M-60M-80M-100M-120M-150N      0.27     0.35

The deepest CNN from Table 1 reaches 2.42%, 0.97% and 0.48% after one, three and seventeen epochs, respectively. On the other hand, the network with 4 instead of 7 hidden layers reaches 4.71%, 1.58% and 0.68% after one, three and seventeen epochs, achieving a test error below 0.50% after only 34 epochs. This shows once more that deep networks, contrary to common belief, can be trained successfully by back-propagation. Despite the numerous free parameters, deep networks seem to learn faster (better recognition rates after fewer epochs) than shallow ones. We consider MNIST an almost solved problem. The remaining errors stem from digits that are ambiguous or miss parts.

4.2 Experiments on NORB

NORB contains stereo images of 3D objects. Hence there are two maps on the input layer. Rotation, scaling, shearing and elastic distortions seem to have a negative impact on generalization. These deformations improve recognition rates for digits that are intrinsically 2D (Cireşan et al., 2010), but seem inadequate for 3D objects.

Initial experiments on NORB show that, unlike with MNIST where we use deformations, the CNN needs only 3 to 6 epochs to reach zero validation error. This allows us to quickly run numerous repetitive experiments with huge networks with hundreds of maps per layer. We decided to use a CNN with five hidden layers: layer1, a convolutional layer with 300 maps, kernel size 6 × 6 and skipping factors 1 × 1; layer2, a max-pooling layer over a 2 × 2 region; layer3, a convolutional layer with 500 maps, kernel size 4 × 4 and skipping factors 0 × 0; layer4, a max-pooling layer over a 4 × 4 region; layer5, a fully connected layer with 500 neurons. The learning rate is initialized by 0.001 and multiplied by 0.95 after every epoch.

Table 2 summarizes the results of four different experiments obtained by switching on/off translation as well as the fixed image processing layer. We report the average error rate as well as the standard deviation of N independent runs with identical architectures but different weight initializations. For the first experiment, without translation and without image processing (IP), an average test error rate of 7.86% is obtained. With additional translations of at most 5%, the average error rate drops to 4.71%, contradicting the common belief that CNNs are translation invariant. These results are on par with or better than others in the literature: 5.90% error rate for a combination of CNNs and SVMs (LeCun et al., 2004) and 5.20% error rate for restricted Boltzmann machines (Nair and Hinton, 2009).
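For reference, the five-hidden-layer architecture used in these NORB experiments can be written down as a plain configuration list. This is our own paraphrase of the text, not the configuration format of the actual implementation; the dictionary keys are ours.

```python
# Hidden layers of the NORB network of Section 4.2. The input layer has two
# stereo maps (six when the contrast-extracting layer described below is used);
# the output layer is fully connected, with one unit per class.
norb_hidden_layers = [
    {"type": "conv",    "maps": 300, "kernel": (6, 6), "skip": (1, 1)},
    {"type": "maxpool", "region": (2, 2)},
    {"type": "conv",    "maps": 500, "kernel": (4, 4), "skip": (0, 0)},
    {"type": "maxpool", "region": (4, 4)},
    {"type": "fc",      "neurons": 500},
]
# Training schedule from the text: learning rate 0.001, multiplied by 0.95
# after every epoch.
```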
The best previously published result on NORB (2.87%) was obtained by a hierarchical neural network which provides to every convolutional layer a subsampled version plus edge information of the original image (Uetz and Behnke, 2009). This motivated us to implement a pre-processing layer with fixed filters. We tried simple edge masks (Sobel, Scharr) but obtained best results with a contrast-extraction layer (Fukushima, 2003) realized by Mexican hat-shaped filters of size 21 × 21, one with a concentric on-center receptive field and one with a concentric off-center receptive field, similar to the filters automatically created by unsupervised Predictability Minimization (Schmidhuber, 1992) applied to natural images (Schmidhuber et al., 1996). The first filter extracts positive contrast in brightness, whereas the latter extracts negative contrast. Each image from the original NORB is filtered; consequently, the input of the CNN has six maps: the original image plus the positive and negative contrast for each of the two stereo channels. Using such a pre-processing layer results in lower average error rates, 3.94% without translation and 2.53% with translation. This result improves the previous state of the art on NORB (Uetz and Behnke, 2009).

Table 2: Average error rates and standard deviations of N runs for a five hidden layer CNN on the NORB test set (see text for details).

trans. [%]   IP    TfbV [%]       runs   time/epoch [s]
0            no    7.86 ± 0.55    50     1141
5            no    4.71 ± 0.57    50     1563
0            yes   3.94 ± 0.48    50     1658
5            yes   2.53 ± 0.40    100    2080

Experience with other image datasets tells us that NORB is unusual. The training set has only five instances per class. The resulting poor training set variability makes the nets learn quickly but generalize badly. NORB is the only dataset that profits from a fixed pre-processing layer in a substantial way. For MNIST and CIFAR10 such pre-processing has little or no effect. It is also worth noting that NORB's error rate standard deviation is bigger than CIFAR10's (see Tables 2 and 3). Identical nets with different initializations do not produce very consistent results. The best net had an error rate of 1.72%, the worst 3.69%.
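As a rough illustration of this fixed pre-processing layer, the sketch below builds 21 × 21 on-center and off-center filters and extracts positive and negative contrast maps for a stereo pair. Only the filter size, the on-/off-center shapes and the six-map input layout come from the text; the difference-of-Gaussians construction, the Gaussian widths, the rectification and all names are our assumptions.

```python
import numpy as np
from scipy.ndimage import convolve

def mexican_hat(size=21, sigma_center=2.0, sigma_surround=6.0):
    """On-center/off-surround filter of size `size` x `size`, built here as a
    difference of Gaussians (an assumption; the report only states the 21 x 21
    size and the on-center / off-center shapes)."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    gauss = lambda s: np.exp(-(xx**2 + yy**2) / (2 * s**2)) / (2 * np.pi * s**2)
    return gauss(sigma_center) - gauss(sigma_surround)

def contrast_maps(image):
    """Positive and negative contrast of one grayscale image
    (rectification via max(., 0) is our assumption)."""
    on_center = mexican_hat()
    off_center = -on_center
    pos = np.maximum(convolve(image, on_center, mode="nearest"), 0.0)
    neg = np.maximum(convolve(image, off_center, mode="nearest"), 0.0)
    return pos, neg

def norb_input_maps(left, right):
    """Six input maps per stereo pair: the two original images plus positive
    and negative contrast for each of the two stereo channels."""
    maps = [left, right]
    for img in (left, right):
        maps.extend(contrast_maps(img))
    return np.stack(maps)
```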
4.3 Experiments on CIFAR10

CIFAR10 is a collection of natural color images of 32x32 pixels. It contains 10 classes, each with 5000 samples in the training set and 1000 in the test set. The images vary greatly within each class. They are not necessarily centered, may contain only parts of the object, and have varying backgrounds. All of this makes CIFAR10 the hardest problem addressed in this paper. The CNN has three input maps, one for each color channel (RGB). The CIFAR10 images are relatively small in comparison to NORB's, and force us to use small kernels. The tested CNNs differ only in the number of maps per convolutional and max-pooling layer. All have eight hidden layers: layer1, a convolutional layer with 3 × 3 kernels and skipping factors of 0 × 0; layer2, a max-pooling layer over a 3 × 3 region; layer3, a convolutional layer with 3 × 3 kernels and skipping factors of 0 × 0; layer4, a max-pooling layer over a 2 × 2 region; layer5, a convolutional layer with 3 × 3 kernels and skipping factors of 0 × 0; layer6, a max-pooling layer over a 2 × 2 region; layer7, a fully connected layer with 300 neurons; layer8, a fully connected layer with 100 neurons.

Like for MNIST, the learning rate is initialized by 0.001 and multiplied by 0.993 after every epoch. Results in Table 3 show that without translation the error rate does not drop below 28%; adding edge information does not help at all. Translations have a very positive effect, decreasing the error rate to almost 20%. Contrast extraction filters are better than the Sobel/Scharr filters but still worse than no pre-processing layer at all. Despite some CNN-inherent translation invariance, additional training image translations cause better generalization; additional image processing proved useless though. To see if bigger nets are better, we increase the number of maps per layer from 100 to 200, 300 and 400, respectively (last three rows in Tab. 3). Training time increases exponentially, but the test error decreases, reaching a minimum for nets with 300 maps per layer. Our 19.51% error rate is better than the previous state of the art for this dataset, 20.40% (Coates et al., 2010) and 25.50% (Yu and Zhang, 2010). Unlike Coates et al. (2010), however, we use the original images without any particular input normalization. Note that the error rate standard deviations are smaller than those obtained on NORB, that is, different initializations yield consistent results.

Table 3: Average error rates and standard deviations for N runs of an eight hidden layer CNN on the CIFAR10 test set (see text for details). The first five nets have 100 maps per convolutional and max-pooling layer, whereas the sixth, seventh and eighth have 200, 300 and 400 maps per hidden layer, respectively. IP - image processing layer: edge - 3 × 3 Sobel and Scharr filters; hat - 13 × 13 positive and negative contrast extraction filters.

trans. [%]   maps   IP     TfbV [%]        runs   time/epoch [s]
0            100    no     28.87 ± 0.37    11     93
0            100    edge   29.11 ± 0.36    15     104
5            100    no     20.26 ± 0.21    11     111
5            100    edge   21.87 ± 0.57    5      120
5            100    hat    21.44 ± 0.44    4      136
5            200    no     19.90 ± 0.16    5      248
5            300    no     19.51 ± 0.18    5      532
5            400    no     19.54 ± 0.16    5      875

4.4 Speedup factor of GPU code

The GPU code scales well with network size. For small nets the speedup is small (but still over 10) since they fit better inside the CPU cache, and GPU resources are underutilized. For huge nets (e.g., Table 2) the GPU implementation is more than 60 times faster than a compiler-optimized CPU version. Given the flexibility of our GPU version, this is a significant speedup. One epoch takes 35 GPU minutes but more than 35 CPU hours.

5 Conclusion

We presented high-performance GPU-based CNN variants trained by on-line gradient descent, with sparse random connectivity, computationally more efficient and biologically more plausible than fully connected CNNs. Principal advantages include state-of-the-art generalization capabilities, great flexibility and speed. All structural CNN parameters such as input image size, number of hidden layers, number of maps per layer, kernel sizes, skipping factors and connection tables are adaptable to any particular application. We applied our networks to benchmark datasets for digit recognition (MNIST), 3D object recognition (NORB), and natural images (CIFAR10). On MNIST the best network achieved a recognition test error rate of 0.35%, on NORB 2.53% and on CIFAR10 19.51%. Our results raise the bars for all three benchmarks.
Currently the particular CNN types discussed in this paper seem to be the best adaptive image recognizers, provided there is a labeled dataset of sufficient size. No unsupervised pretraining is required. Good results require big and deep but sparsely connected CNNs, computationally prohibitive on CPUs but feasible on current GPUs, where our implementation is 10 to 60 times faster than a compiler-optimized CPU version.

Acknowledgment

This work was partially funded by the Swiss Commission for Technology and Innovation (CTI), Project n. 9688.1 IFF: Intelligent Fill in Form.

References

S. Behnke. Hierarchical Neural Networks for Image Interpretation, volume 2766 of Lecture Notes in Computer Science. Springer, 2003.

K. Chellapilla, S. Puri, and P. Simard. High performance convolutional neural networks for document processing. In International Workshop on Frontiers in Handwriting Recognition, 2006.

D. C. Cireşan, U. Meier, L. M. Gambardella, and J. Schmidhuber. Deep big simple neural nets for handwritten digit recognition. Neural Computation, 22(12):3207–3220, 2010.

A. Coates, H. Lee, and A. Ng. An analysis of single-layer networks in unsupervised feature learning. In Advances in Neural Information Processing Systems, 2010.

K. Fukushima. Neocognitron: A self-organizing neural network for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36(4):193–202, 1980.

K. Fukushima. Neocognitron for handwritten digit recognition. Neurocomputing, 51:161–180, 2003.

P. O. Hoyer and A. Hyvärinen. Independent component analysis applied to feature extraction from colour and stereo images. Network: Computation in Neural Systems, 11(3):191–210, 2000.

A. Krizhevsky. Learning multiple layers of features from tiny images. Master's thesis, Computer Science Department, University of Toronto, 2009.

Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, November 1998.

Y. LeCun, F.-J. Huang, and L. Bottou. Learning methods for generic object recognition with invariance to pose and lighting. In Proc. of Computer Vision and Pattern Recognition Conference, 2004.

J. Mutch and D. G. Lowe. Object class recognition and localization using sparse features with limited receptive fields. Int. J. Comput. Vision, 56(6):503–511, 2008.

V. Nair and G. E. Hinton. 3D object recognition with deep belief nets. In Advances in Neural Information Processing Systems, 2009.

B. A. Olshausen and D. J. Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37(23):3311–3325, December 1997.

N. Pinto, D. Doukhan, J. J. DiCarlo, and D. D. Cox. A high-throughput screening approach to discovering good forms of biologically inspired visual representation. PLoS Computational Biology, 5(11):e1000579, November 2009.

M. Riesenhuber and T. Poggio. Hierarchical models of object recognition in cortex. Nat. Neurosci., 2(11):1019–1025, 1999.

D. Scherer, A. Müller, and S. Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. In International Conference on Artificial Neural Networks, 2010.

J. Schmidhuber, M. Eldracher, and B. Foltin. Semilinear predictability minimization produces well-known feature detectors. Neural Computation, 8(4):773–786, 1996.
J. Schmidhuber. Learning factorial codes by predictability minimization. Neural Computation, 4(6):863–879, 1992.

T. Serre, L. Wolf, and T. Poggio. Object recognition with features inspired by visual cortex. In Proc. of Computer Vision and Pattern Recognition Conference, 2007.

P. Simard, D. Steinkraus, and J. Platt. Best practices for convolutional neural networks applied to visual document analysis. In Seventh International Conference on Document Analysis and Recognition, pages 958–963, 2003.

R. Uetz and S. Behnke. Large-scale object recognition with CUDA-accelerated hierarchical neural networks. In IEEE International Conference on Intelligent Computing and Intelligent Systems (ICIS), 2009.

D. H. Hubel and T. N. Wiesel. Receptive fields of single neurones in the cat's striate cortex. J. Physiol., 148:574–591, 1959.

K. Yu and T. Zhang. Improved local coordinate coding using local tangents. In Proceedings of the International Conference on Machine Learning, 2010.
