A Generative Model for Deep Convolutional Learning


Authors: Yunchen Pu, Xin Yuan, Lawrence Carin

Accepted as a workshop contribution at ICLR 2015

Yunchen Pu, Xin Yuan, and Lawrence Carin
Department of Electrical and Computer Engineering, Duke University, Durham, NC 27708, USA
{yunchen.pu, xin.yuan, lcarin}@duke.edu

ABSTRACT

A generative model is developed for deep (multi-layered) convolutional dictionary learning. A novel probabilistic pooling operation is integrated into the deep model, yielding efficient bottom-up (pretraining) and top-down (refinement) probabilistic learning. Experimental results demonstrate the powerful capabilities of the model in learning multi-layer features from images, and excellent classification results are obtained on the MNIST and Caltech 101 datasets.

1 INTRODUCTION

We develop a deep generative statistical model that starts at the highest-level features and maps them through a sequence of layers, ultimately reaching the data plane (e.g., an image). A feature at a given layer is mapped via a multinomial distribution to one feature in a block of features at the layer below (all other features in that block are set to zero). This is analogous to the method of Lee et al. (2009), in the sense of imposing at most one non-zero activation within a pooling block. We use bottom-up pretraining, in which the parameters of each layer are learned sequentially, one layer at a time from bottom to top, based on the features at the layer below. In the refinement phase, however, all model parameters are learned jointly, top-down. Each pair of consecutive layers in the model is locally conjugate in a statistical sense, so model parameters may readily be learned using sampling or variational methods.
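The pooling construction described above (at most one non-zero activation per pooling block, with its position drawn from a multinomial) can be sketched in NumPy. This is an illustrative sketch, not the authors' inference code: `unpool` draws the position uniformly at random, standing in for the learned multinomial probabilities, and `max_pool` implements the deterministic bottom-up direction.

```python
import numpy as np

def max_pool(S, nx, ny):
    """Bottom-up (pretraining) direction: each nx-by-ny block of the
    activation map S contributes its largest-magnitude element to one
    pixel of the next layer's input."""
    H, W = S.shape
    blocks = S.reshape(H // nx, nx, W // ny, ny).swapaxes(1, 2)
    flat = blocks.reshape(H // nx, W // ny, nx * ny)
    idx = np.abs(flat).argmax(axis=-1)                 # winner per block
    return np.take_along_axis(flat, idx[..., None], axis=-1)[..., 0]

def unpool(X_up, nx, ny, rng=None):
    """Top-down (generative) direction: each higher-layer pixel is placed
    at one position inside its nx-by-ny block at the layer below (uniform
    here; a learned multinomial in the full model); all other positions in
    the block are zero."""
    rng = np.random.default_rng() if rng is None else rng
    H, W = X_up.shape
    S = np.zeros((H * nx, W * ny))
    for i in range(H):
        for j in range(W):
            S[i * nx + rng.integers(nx), j * ny + rng.integers(ny)] = X_up[i, j]
    return S
```

By construction, `max_pool(unpool(X, nx, ny), nx, ny)` recovers `X`, which is the sense in which the stochastic top-down placement is consistent with the bottom-up max pooling.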
2 MODELING FRAMEWORK

Assume N gray-scale images {X^(n)}_{n=1,...,N}, with X^(n) ∈ R^{N_x × N_y}; the images are analyzed jointly to learn the convolutional dictionary {D^(k)}_{k=1,...,K}. Specifically, consider the model

$$X^{(n)} = \sum_{k=1}^{K} D^{(k)} \ast \big(Z^{(n,k)} \odot W^{(n,k)}\big) + E^{(n)}, \qquad (1)$$

where $\ast$ is the convolution operator, $\odot$ denotes the Hadamard (element-wise) product, the elements of Z^(n,k) are in {0, 1}, the elements of W^(n,k) are real, and E^(n) represents the residual. Z^(n,k) indicates which shifted versions of D^(k) are used to represent X^(n).

Assume an L-layer model, with layer L the top layer and layer 1 at the bottom, closest to the data. In the pretraining stage, the output of layer l, after pooling, is the input to layer l + 1. Layer l ∈ {1, ..., L} has K_l dictionary elements, and we have

$$X^{(n,l+1)} = \sum_{k_{l+1}=1}^{K_{l+1}} D^{(k_{l+1},l+1)} \ast \big(Z^{(n,k_{l+1},l+1)} \odot W^{(n,k_{l+1},l+1)}\big) + E^{(n,l+1)}, \qquad (2)$$

$$X^{(n,l)} = \sum_{k_l=1}^{K_l} D^{(k_l,l)} \ast \underbrace{\big(Z^{(n,k_l,l)} \odot W^{(n,k_l,l)}\big)}_{=\,S^{(n,k_l,l)}} + E^{(n,l)}. \qquad (3)$$

The quantity X^(n,l+1) may be viewed as a 3D entity, with its k_l-th plane defined by a "pooled" version of S^(n,k_l,l). The 2D activation map S^(n,k_l,l) is partitioned into contiguous n_x × n_y blocks (pooling blocks with respect to layer l + 1 of the model); see the left part of Figure 1.

Figure 1: Schematic of the proposed generative process. Left: bottom-up pretraining; right: top-down refinement. (Zoom in for the best visualization; a larger version can be found in the Supplementary Material.)
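The synthesis step in Eq. (1) (and its per-layer analogues in Eqs. (2)-(3)) can be sketched numerically. The helper below is an illustrative sketch only, assuming SciPy's 2-D convolution; it computes the noiseless reconstruction and is not the paper's learning algorithm.

```python
import numpy as np
from scipy.signal import convolve2d

def reconstruct(D, Z, W):
    # X = sum_k D[k] * (Z[k] ⊙ W[k]), the noiseless part of Eq. (1);
    # the residual E is omitted. D, Z, W are lists of K 2-D arrays:
    # dictionary elements, binary activation maps, and real weight maps.
    X = None
    for Dk, Zk, Wk in zip(D, Z, W):
        Sk = Zk * Wk                            # Hadamard product Z ⊙ W
        term = convolve2d(Sk, Dk, mode="full")  # shifted, weighted copies of Dk
        X = term if X is None else X + term
    return X
```

Each non-zero entry of Z^(n,k) selects one shift of D^(k), and the corresponding entry of W^(n,k) scales it; the full convolution sums these shifted, weighted copies across all K dictionary elements.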
Associated with each block of pixels in S^(n,k_l,l) is one pixel at plane k_l of X^(n,l+1); the relative locations of the pixels in X^(n,l+1) are the same as the relative locations of the blocks in S^(n,k_l,l). Within each block of S^(n,k_l,l), either all n_x n_y pixels are zero, or exactly one pixel is non-zero, with the position of that pixel selected stochastically via a multinomial distribution. Each pixel at plane k_l of X^(n,l+1) equals the largest-amplitude element in the associated block of S^(n,k_l,l) (i.e., max pooling).

The learning performed with the top-down generative model (right part of Figure 1) constitutes a refinement of the parameters learned during pretraining; the excellent initialization provided by pretraining is key to the subsequent performance of the model. In the refinement phase we proceed top-down, from (2) to (3): the generative process draws D^(k_{l+1},l+1) and Z^(n,k_{l+1},l+1) ⊙ W^(n,k_{l+1},l+1), and after convolution X^(n,l+1) is manifested. The residual E^(n,l) is now absent at all layers except layer l = 1, at which the fit to the data is performed. Each element of X^(n,l+1) has an associated pooling block in S^(n,k_l,l).

3 EXPERIMENTAL RESULTS

We apply our model to the MNIST and Caltech 101 datasets.

MNIST Dataset. Table 1 summarizes the classification results of our model on the MNIST data, alongside related results.

Table 1: Classification error on MNIST.
  6-layer Conv. Net + 2-layer classifier + elastic distortions (Ciresan et al., 2011): 0.35%
  MCDNN (Ciresan et al., 2012): 0.23%
  SPCNN (Zeiler & Fergus, 2013): 0.47%
  HBP, 2-layer cFA + 2-layer features (Chen et al., 2013): 0.89%
  Ours, 2-layer model + 1-layer features: 0.42%
The second-layer (top-layer) features corresponding to the refined dictionary are sent to a nonlinear support vector machine (SVM) (Chang & Lin, 2011) with a Gaussian kernel, in a one-vs-all multi-class classifier, with classifier parameters tuned via 5-fold cross-validation (no tuning on the deep feature learning).

Caltech 101 Dataset. For Caltech 101 classification we follow the setup of Yang et al. (2009), selecting 15 or 30 images per category for training and testing on the rest. The features of the test images are inferred based on the top-layer dictionaries and sent to a multi-class SVM; we again use a nonlinear SVM with a Gaussian kernel and parameters tuned via cross-validation. Our results and related work are summarized in Table 2.

Table 2: Classification accuracy on Caltech 101, with 15 / 30 training images per category.
  DN (Zeiler et al., 2010): 58.6% / 66.9%
  CBDN (Lee et al., 2009): 57.7% / 65.4%
  HBP (Chen et al., 2013): 58% / 65.7%
  ScSPM (Yang et al., 2009): 67% / 73.2%
  P-FV (Seidenari et al., 2014): 71.47% / 80.13%
  R-KSVD (Li et al., 2013): 79% / 83%
  Convnet (Zeiler & Fergus, 2014): 83.8% / 86.5%
  Ours, 2-layer model + 1-layer features: 70.02% / 80.31%
  Ours, 3-layer model + 1-layer features: 75.24% / 82.78%

4 CONCLUSIONS

A deep generative convolutional dictionary-learning model has been developed within a Bayesian setting. The proposed framework enjoys efficient bottom-up and top-down probabilistic inference. A probabilistic pooling module has been integrated into the model, a key component in developing a principled top-down generative model with efficient learning and inference. Extensive experimental results demonstrate the efficacy of the model in learning multi-layered features from images.

REFERENCES

Chang, C.-C. and Lin, C.-J. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2011.
Chen, B., Polatkan, G., Sapiro, G., Blei, D., Dunson, D., and Carin, L. Deep learning with hierarchical convolutional factor analysis. IEEE T-PAMI, 2013.
Ciresan, D., Meier, U., and Schmidhuber, J. Multi-column deep neural networks for image classification. In CVPR, 2012.
Ciresan, D. C., Meier, U., Masci, J., Gambardella, L. M., and Schmidhuber, J. Flexible, high performance convolutional neural networks for image classification. In IJCAI, 2011.
Lee, H., Grosse, R., Ranganath, R., and Ng, A. Y. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In ICML, 2009.
Li, Q., Zhang, H., Guo, J., Bhanu, B., and An, L. Reference-based scheme combined with K-SVD for scene image categorization. IEEE Signal Processing Letters, 2013.
Seidenari, L., Serra, G., Bagdanov, A., and Del Bimbo, A. Local pyramidal descriptors for image recognition. IEEE T-PAMI, 2014.
Yang, J., Yu, K., Gong, Y., and Huang, T. Linear spatial pyramid matching using sparse coding for image classification. In CVPR, 2009.
Zeiler, M. and Fergus, R. Stochastic pooling for regularization of deep convolutional neural networks. In ICLR, 2013.
Zeiler, M. and Fergus, R. Visualizing and understanding convolutional networks. In ECCV, 2014.
Zeiler, M., Krishnan, D., Taylor, G., and Fergus, R. Deconvolutional networks. In CVPR, 2010.
