Stochastic Pooling for Regularization of Deep Convolutional Neural Networks
Authors: Matthew D. Zeiler, Rob Fergus
Stochastic Pooling for Regularization of Deep Convolutional Neural Networks

Matthew D. Zeiler
Department of Computer Science, Courant Institute, New York University
zeiler@cs.nyu.edu

Rob Fergus
Department of Computer Science, Courant Institute, New York University
fergus@cs.nyu.edu

Abstract

We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.

1 Introduction

Neural network models are prone to over-fitting due to their high capacity. A range of regularization techniques are used to prevent this, such as weight decay, weight tying and the augmentation of the training set with transformed copies [9]. These allow the training of larger capacity models than would otherwise be possible, which yield superior test performance compared to smaller un-regularized models.

Dropout, recently proposed by Hinton et al. [2], is another regularization approach that stochastically sets half the activations within a layer to zero for each training sample during training. It has been shown to deliver significant gains in performance across a wide range of problems, although the reasons for its efficacy are not yet fully understood. A drawback to dropout is that it does not seem to have the same benefits for convolutional layers, which are common in many networks designed for vision tasks.
In this paper, we propose a novel type of regularization for convolutional layers that enables the training of larger models without over-fitting, and produces superior performance on recognition tasks. The key idea is to make the pooling that occurs in each convolutional layer a stochastic process. Conventional forms of pooling such as average and max are deterministic, the latter selecting the largest activation in each pooling region. In our stochastic pooling, the selected activation is drawn from a multinomial distribution formed by the activations within the pooling region.

An alternate view of stochastic pooling is that it is equivalent to standard max pooling but with many copies of an input image, each having small local deformations. This is similar to explicit elastic deformations of the input images [13], which deliver excellent MNIST performance. Other types of data augmentation, such as flipping and cropping, differ in that they are global image transformations. Furthermore, using stochastic pooling in a multi-layer model gives an exponential number of deformations, since the selections in higher layers are independent of those below.

2 Review of Convolutional Networks

Our stochastic pooling scheme is designed for use in a standard convolutional neural network architecture. We first review this model, along with conventional pooling schemes, before introducing our novel stochastic pooling approach.

A classical convolutional network is composed of alternating layers of convolution and pooling (i.e. subsampling). The aim of the first convolutional layer is to extract patterns found within local regions of the input images that are common throughout the dataset. This is done by convolving a template or filter over the input image pixels, computing the inner product of the template at every location in the image and outputting this as a feature map c, for each filter in the layer.
This output is a measure of how well the template matches each portion of the image. A non-linear function f() is then applied element-wise to each feature map c: a = f(c). The resulting activations a are then passed to the pooling layer. This aggregates the information within a set of small local regions, R, producing a pooled feature map s (of smaller size) as output. Denoting the aggregation function as pool(), for each feature map c we have:

    s_j = pool(f(c_i)),  ∀ i ∈ R_j    (1)

where R_j is pooling region j in feature map c and i is the index of each element within it.

The motivation behind pooling is that the activations in the pooled map s are less sensitive to the precise locations of structures within the image than the original feature map c. In a multi-layer model, the convolutional layers, which take the pooled maps as input, can thus extract features that are increasingly invariant to local transformations of the input image. This is important for classification tasks, since these transformations obfuscate the object identity.

A range of functions can be used for f(), with tanh() and logistic functions being popular choices. In this paper we use a linear rectification function f(c) = max(0, c) as the non-linearity. In general, this has been shown [10] to have significant benefits over tanh() or logistic functions. However, it is especially suited to our pooling mechanism since: (i) our formulation relies on the non-negativity of elements in the pooling regions and (ii) the clipping of negative responses introduces zeros into the pooling regions, ensuring that the stochastic sampling is selecting from a few specific locations (those with strong responses), rather than all possible locations in the region.

There are two conventional choices for pool(): average and max.
The former takes the arithmetic mean of the elements in each pooling region:

    s_j = (1/|R_j|) Σ_{i ∈ R_j} a_i    (2)

while the max operation selects the largest element:

    s_j = max_{i ∈ R_j} a_i    (3)

Both types of pooling have drawbacks when training deep convolutional networks. In average pooling, all elements in a pooling region are considered, even if many have low magnitude. When combined with linear rectification non-linearities, this has the effect of down-weighting strong activations, since many zero elements are included in the average. Even worse, with tanh() non-linearities, strong positive and negative activations can cancel each other out, leading to small pooled responses. While max pooling does not suffer from these drawbacks, we find it easily over-fits the training set in practice, making it hard to generalize well to test examples. Our proposed pooling scheme has the advantages of max pooling but its stochastic nature helps prevent over-fitting.

3 Stochastic Pooling

In stochastic pooling, we select the pooled map response by sampling from a multinomial distribution formed from the activations of each pooling region. More precisely, we first compute the probabilities p for each region j by normalizing the activations within the region:

    p_i = a_i / Σ_{k ∈ R_j} a_k    (4)

We then sample from the multinomial distribution based on p to pick a location l within the region. The pooled activation is then simply a_l:

    s_j = a_l,  where l ∼ P(p_1, ..., p_{|R_j|})    (5)

The procedure is illustrated in Fig. 1. The samples for each pooling region in each layer for each training example are drawn independently of one another. When back-propagating through the network, this same selected location l is used to direct the gradient back through the pooling region, analogous to back-propagation with max pooling.

Max pooling only captures the strongest activation of the filter template with the input for each region.
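Eqns. 2-5 can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the function name and interface are our own, as is the uniform fallback for an all-zero region (where the choice of location does not affect the pooled value anyway).

```python
import numpy as np

def stochastic_pool(a, region=2, rng=None):
    """Stochastic pooling (Eqns. 4-5) over non-overlapping square regions.

    a: 2-D array of non-negative (rectified) activations, one feature map.
    Returns the pooled map s and the sampled locations l, which would be
    used to route the gradient during back-propagation.
    """
    rng = np.random.default_rng() if rng is None else rng
    H, W = a.shape
    assert H % region == 0 and W % region == 0, "map must tile evenly"
    hs, ws = H // region, W // region
    pooled = np.empty((hs, ws))
    locs = np.empty((hs, ws), dtype=int)   # flat index l within each region
    for j in range(hs):
        for k in range(ws):
            r = a[j*region:(j+1)*region, k*region:(k+1)*region].ravel()
            total = r.sum()
            if total == 0:                 # all-zero region: pooled value is 0
                p = np.full(r.size, 1.0 / r.size)   # regardless of l
            else:
                p = r / total              # p_i = a_i / sum_k a_k   (Eqn. 4)
            l = rng.choice(r.size, p=p)    # l ~ P(p_1, ..., p_|R_j|) (Eqn. 5)
            pooled[j, k] = r[l]            # s_j = a_l
            locs[j, k] = l
    return pooled, locs
```

Note how zeros introduced by the rectification drop out of the sampling: a region with a single non-zero element always returns that element, just as max pooling would.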
However, there may be additional activations in the same pooling region that should be taken into account when passing information up the network, and stochastic pooling ensures that these non-maximal activations will also be utilized.

[Figure 1: Toy example illustrating the stochastic pooling procedure within a single pooling region, including the probabilities p_i; the remaining panel content is not recoverable from this extraction.]

[…]

[Figure 4: CIFAR-10 train and test error rates for various pooling region sizes with each method.]

Stochastic pooling outperforms all other methods that do not use data augmentation methods such as jittering or elastic distortions [7]. The current state-of-the-art single model approach by Ciresan et al. [1] uses elastic distortions to augment the original training set. As stochastic pooling is a different type of regularization, it could be combined with data augmentation to further improve performance.

    Method                                                            Train Error %   Test Error %
    2-layer Conv. Net + 2-layer Classifier [3]                              –             0.53
    6-layer Conv. Net + 2-layer Classifier + elastic distortions [1]        –             0.35
    Avg Pooling                                                           0.57            0.83
    Max Pooling                                                           0.04            0.55
    Stochastic Pooling                                                    0.33            0.47

Table 2: MNIST classification performance for various pooling methods. Rows 1 & 2 show the current state-of-the-art approaches.

4.4 CIFAR-100

The CIFAR-100 dataset is another subset of the tiny images dataset, but with 100 classes [5]. There are 50,000 training examples in total (500 per class) and 10,000 test examples. As with CIFAR-10, we scale to [0,1] and subtract the per-pixel mean from each image, as shown in Fig. 2(h). Due to the limited number of training examples per class, typical pooling methods used in convolutional networks do not perform well, as shown in Table 3. Stochastic pooling outperforms these methods by preventing over-fitting and surpasses what we believe to be the state-of-the-art method by 2.66%.

    Method                          Train Error %   Test Error %
    Receptive Field Learning [4]          –            45.17
    Avg Pooling                        11.20           47.77
    Max Pooling                         0.17           50.90
    Stochastic Pooling                 21.22           42.51

Table 3: CIFAR-100 classification performance for various pooling methods compared to the state-of-the-art method based on receptive field learning.

4.5 Street View House Numbers

The Street View House Numbers (SVHN) dataset is composed of 604,388 images (using both the difficult training set and simpler extra set) and 26,032 test images [11]. The goal of this task is to classify the digit in the center of each cropped 32x32 color image. This is a difficult real-world problem, since multiple digits may be visible within each image. The practical application of this is to classify house numbers throughout Google's Street View database of images.

We found that subtracting the per-pixel mean from each image did not really modify the statistics of the images (see Fig. 2(b)) and left large variations of brightness and color that could make classification more difficult. Instead, we utilized local contrast normalization (as in [12]) on each of the three RGB channels to pre-process the images (Fig. 2(c)). This normalized the brightness and color variations and helped training proceed quickly on this relatively large dataset.

Despite having significant amounts of training data, a large convolutional network can still overfit. For this dataset, we train an additional model for 500 epochs with 64, 64 and 128 feature maps in layers 1, 2 and 3 respectively. Our stochastic pooling helps to prevent overfitting even in this large model (denoted 64-64-128 in Table 4), despite training for a long time. The existing state-of-the-art on this dataset is the multi-stage convolutional network of Sermanet et al. [12], but stochastic pooling beats this by 2.10% (relative gain of 43%).

    Method                                                     Train Error %   Test Error %
    Multi-Stage Conv. Net + 2-layer Classifier [12]                  –             5.03
    Multi-Stage Conv. Net + 2-layer Classifier + padding [12]        –             4.90
    64-64-64 Avg Pooling                                           1.83            3.98
    64-64-64 Max Pooling                                           0.38            3.65
    64-64-64 Stochastic Pooling                                    1.72            3.13
    64-64-128 Avg Pooling                                          1.65            3.72
    64-64-128 Max Pooling                                          0.13            3.81
    64-64-128 Stochastic Pooling                                   1.41            2.80

Table 4: SVHN classification performance for various pooling methods in our model with 64 or 128 layer-3 feature maps, compared to state-of-the-art results with and without data augmentation.

4.6 Reduced Training Set Size

To further illustrate the ability of stochastic pooling to prevent over-fitting, we reduced the training set size on the MNIST and CIFAR-10 datasets. Fig. 5 shows test performance when training on a random selection of only 1000, 2000, 3000, 5000, 10000, half, or the full training set. In most cases, stochastic pooling overfits less than the other pooling approaches.

[Figure 5: Test error when training with reduced dataset sizes on MNIST (left) and CIFAR-10 (right). Stochastic pooling generally overfits the least.]

4.7 Importance of Model Averaging

To analyze the importance of stochastic sampling at training time and probability weighting at test time, we use different methods of pooling when training and testing on CIFAR-10 (see Table 5). Choosing the locations stochastically at test time degrades performance slightly, as could be expected; however, it still outperforms models where max or average pooling is used at test time. To confirm that probability weighting is a valid approximation to averaging many models, we draw N samples of the pooling locations throughout the network and average the output probabilities from those N models (denoted Stochastic-N in Table 5). As N increases, the results approach the probability weighting method, but have the obvious downside of an N-fold increase in computations.
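The Stochastic-N averaging just described can be sketched as follows. The `net(x, rng)` interface is a hypothetical stand-in for a stochastic-pooling network that returns class probabilities with freshly sampled pooling locations on each call; the toy network below exists only to exercise the averaging.

```python
import numpy as np

def stochastic_n_predict(net, x, n, rng=None):
    """Stochastic-N prediction: average class probabilities over n forward
    passes, each with independently re-sampled pooling locations.
    `net(x, rng)` is a hypothetical interface, assumed for illustration."""
    rng = np.random.default_rng() if rng is None else rng
    return np.mean([net(x, rng) for _ in range(n)], axis=0)

def toy_net(x, rng):
    """Toy stand-in: stochastically 'pool' a 1-D region, then 2-class softmax."""
    p = x / x.sum()                        # Eqn. 4 over one region
    s = x[rng.choice(x.size, p=p)]         # Eqn. 5
    z = np.array([s, -s])
    e = np.exp(z - z.max())
    return e / e.sum()
```

The averaged probabilities converge toward the probability-weighting result as n grows, at an n-fold cost in forward passes.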
Using a model trained with max or average pooling and using stochastic pooling at test time performs poorly. This suggests that training with stochastic pooling, which incorporates non-maximal elements and sampling noise, makes the model more robust at test time. Furthermore, if these non-maximal elements are not utilized correctly, or the scale produced by the pooling function is not correct, such as when average pooling is used at test time, a drastic performance hit is seen.

When using probability weighting during training, the network easily over-fits and performs sub-optimally at test time using any of the pooling methods. However, the benefits of probability weighting at test time are seen when the model has specifically been trained to utilize it, through either probability weighting or stochastic pooling at training time.

    Train Method           Test Method              Train Error %   Test Error %
    Stochastic Pooling     Probability Weighting        3.20           15.20
    Stochastic Pooling     Stochastic Pooling           3.20           17.49
    Stochastic Pooling     Stochastic-10 Pooling        3.20           15.51
    Stochastic Pooling     Stochastic-100 Pooling       3.20           15.12
    Stochastic Pooling     Max Pooling                  3.20           17.66
    Stochastic Pooling     Avg Pooling                  3.20           53.50
    Probability Weighting  Probability Weighting        0.0            19.40
    Probability Weighting  Stochastic Pooling           0.0            24.00
    Probability Weighting  Max Pooling                  0.0            22.45
    Probability Weighting  Avg Pooling                  0.0            58.97
    Max Pooling            Max Pooling                  0.0            19.40
    Max Pooling            Stochastic Pooling           0.0            32.75
    Max Pooling            Probability Weighting        0.0            30.00
    Avg Pooling            Avg Pooling                  1.92           19.24
    Avg Pooling            Stochastic Pooling           1.92           44.25
    Avg Pooling            Probability Weighting        1.92           40.09

Table 5: CIFAR-10 classification performance for various train and test combinations of pooling methods. The best performance is obtained by using stochastic pooling when training (to prevent over-fitting), while using the probability weighting at test time.
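The train/test pooling variants compared in Table 5 can be sketched per pooling region. Probability weighting is taken here to be the expected activation under the sampling distribution, Σ_i p_i a_i; this is a plausible reading of the test-time procedure (defined in the paper's Section 3.1) and should be treated as an assumption, as should the function names.

```python
import numpy as np

def region_probs(r):
    """p_i = a_i / sum_k a_k for one flattened region r (uniform if all zero)."""
    t = r.sum()
    return r / t if t > 0 else np.full(r.size, 1.0 / r.size)

def pool_region(r, method, rng=None):
    """Pool one flattened region r with one of the Table 5 variants.

    'stochastic'  -- sample a location (training, or stochastic test).
    'prob_weight' -- expected activation sum_i p_i * a_i (assumed reading
                     of probability weighting).
    'max', 'avg'  -- the conventional deterministic choices.
    """
    p = region_probs(r)
    if method == "stochastic":
        rng = np.random.default_rng() if rng is None else rng
        return float(r[rng.choice(r.size, p=p)])
    if method == "prob_weight":
        return float(np.dot(p, r))
    if method == "max":
        return float(r.max())
    if method == "avg":
        return float(r.mean())
    raise ValueError(method)
```

Note that `prob_weight` up-weights large activations relative to the plain average, which is consistent with the large gap between average pooling and probability weighting at test time in Table 5.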
4.8 Visualizations

Some insight into the mechanism of stochastic pooling can be gained by using the deconvolutional network of Zeiler et al. [15] to provide a novel visualization of our trained convolutional network. The deconvolutional network has the same components (pooling, filtering) as a convolutional network, but these are inverted to act as a top-down decoder that maps the top-layer feature maps back to the input pixels. The unpooling operation uses the stochastically chosen locations selected during the forward pass. The deconvolution network filters (now applied to the feature maps, rather than the input) are the transpose of the feed-forward filters, as in an auto-encoder with tied encoder/decoder weights. We repeat this top-down process until the input pixel level is reached, producing the visualizations in Fig. 6. With max pooling, many of the input image edges are present, but average pooling produces a reconstruction with no discernible structure. Fig. 6(a) shows 16 examples of pixel-space reconstructions for different location samples throughout the network. The reconstructions are similar to the max pooling case, but as the pooling locations change they result in small local deformations of the visualized image.

Despite the stochastic nature of the model, the multinomial distributions effectively capture the regularities of the data. To demonstrate this, we compare the outputs produced by a deconvolutional network when sampling using the feed-forward (FF) probabilities versus sampling from uniform (UN) distributions. In contrast to Fig. 6(a), which uses only feed-forward probabilities, Fig. 6(b-h) replace one or more of the pooling layers' distributions with uniform distributions. The feed-forward probabilities encode significant structural information, especially in the lower layers of the model.
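The unpooling step used by the decoder can be sketched as the inverse of the pooling pass: each pooled value is placed back at the location recorded during the forward pass, with zeros elsewhere. The function name and interface are ours, chosen to mirror the pooling sketch above under the assumption that the forward pass recorded a flat index per region.

```python
import numpy as np

def unpool(s, locs, region=2):
    """Deconvnet-style unpooling: place each pooled value s_j back at the
    location l recorded during the forward pass, zeros elsewhere.

    `locs` holds the flat index of the chosen element within each
    region x region window, as recorded by a stochastic-pooling forward pass.
    """
    hs, ws = s.shape
    out = np.zeros((hs * region, ws * region))
    for j in range(hs):
        for k in range(ws):
            dy, dx = divmod(int(locs[j, k]), region)
            out[j * region + dy, k * region + dx] = s[j, k]
    return out
```

Re-sampling the locations and unpooling again yields a slightly different reconstruction each time, which is exactly the family of small local deformations visualized in Fig. 6(a).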
Additional visualizations and videos of the sampling process are provided as supplementary material at www.matthewzeiler.com/pubs/iclr2013/.

5 Discussion

We propose a simple and effective stochastic pooling strategy that can be combined with any other forms of regularization, such as weight decay, dropout, data augmentation, etc., to prevent over-fitting when training deep convolutional networks. The method is also intuitive, selecting from information the network is already providing, as opposed to methods such as dropout, which throw information away. We show state-of-the-art performance on numerous datasets when comparing to other approaches that do not employ data augmentation. Furthermore, our method has negligible computational overhead and no hyper-parameters to tune, and thus can be swapped into any existing convolutional network architecture.

[Figure 6: Top-down visualizations from the third-layer feature map activations for the horse image (far left). Max and average pooling visualizations are also shown on the left. (a)-(h): each image in a 4x4 block is one instantiation of the pooling locations using stochastic pooling. For sampling the locations, each layer (indicated in parentheses) uses either: (i) the multinomial distribution over a pooling region derived from the feed-forward (FF) activations as in Eqn. 4, or (ii) a uniform (UN) distribution. The feed-forward probabilities encode much of the structure in the image, as almost all of it is lost when uniform sampling is used, especially in the lower layers.]

References

[1] D. C. Ciresan, U. Meier, J. Masci, L. M. Gambardella, and J. Schmidhuber. Flexible, high performance convolutional neural networks for image classification. In IJCAI, 2011.
[2] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv:1207.0580, 2012.
[3] K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun. What is the best multi-stage architecture for object recognition? In ICCV, 2009.
[4] Y. Jia and C. Huang. Beyond spatial pyramids: Receptive field learning for pooled image features. In NIPS Workshops, 2011.
[5] A. Krizhevsky. Learning multiple layers of features from tiny images. Technical Report TR-2009, University of Toronto, 2009.
[6] A. Krizhevsky. cuda-convnet. http://code.google.com/p/cuda-convnet/, 2012.
[7] Y. LeCun. The MNIST database. http://yann.lecun.com/exdb/mnist/, 2012.
[8] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[9] G. Montavon, G. Orr, and K.-R. Muller, editors. Neural Networks: Tricks of the Trade. Springer, San Francisco, 2012.
[10] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML, 2010.
[11] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop, 2011.
[12] P. Sermanet, S. Chintala, and Y. LeCun. Convolutional neural networks applied to house numbers digit classification. In ICPR, 2012.
[13] P. Simard, D. Steinkraus, and J. Platt. Best practices for convolutional neural networks applied to visual document analysis. In ICDAR, 2003.
[14] http://gp-you.org/. GPUmat. http://sourceforge.net/projects/gpumat/, 2012.
[15] M. Zeiler, G. Taylor, and R. Fergus. Adaptive deconvolutional networks for mid and high level feature learning. In ICCV, 2011.