Curriculum Dropout


Authors: Pietro Morerio, Jacopo Cavazza, Riccardo Volpi

Pietro Morerio (1), Jacopo Cavazza (1,2), Riccardo Volpi (1,2), René Vidal (3) and Vittorio Murino (1,4)

(1) Pattern Analysis & Computer Vision (PAVIS) - Istituto Italiano di Tecnologia - Genova, 16163, Italy
(2) Electrical, Electronics and Telecommunication Engineering and Naval Architecture Department (DITEN) - Università degli Studi di Genova - Genova, 16145, Italy
(3) Department of Biomedical Engineering - Johns Hopkins University - Baltimore, MD 21218, USA
(4) Computer Science Department - Università di Verona - Verona, 37134, Italy

{pietro.morerio,jacopo.cavazza,riccardo.volpi,vittorio.murino}@iit.it, rvidal@cis.jhu.edu

Abstract

Dropout is a very effective way of regularizing neural networks. Stochastically "dropping out" units with a certain probability discourages over-specific co-adaptations of feature detectors, preventing overfitting and improving network generalization. Besides, Dropout can be interpreted as an approximate model aggregation technique, where an exponential number of smaller networks are averaged in order to get a more powerful ensemble. In this paper, we show that using a fixed dropout probability during training is a suboptimal choice. We thus propose a time scheduling for the probability of retaining neurons in the network. This induces an adaptive regularization scheme that smoothly increases the difficulty of the optimization problem. This idea of "starting easy" and adaptively increasing the difficulty of the learning problem has its roots in curriculum learning and allows one to train better models. Indeed, we prove that our optimization strategy implements a very general curriculum scheme, by gradually adding noise to both the input and intermediate feature representations within the network architecture.
Experiments on seven image classification datasets and different network architectures show that our method, named Curriculum Dropout, frequently yields better generalization and, at worst, performs just as well as the standard Dropout method.

1. Introduction

Since [17], deep neural networks have become ubiquitous in most computer vision applications. The reason is generally ascribed to the powerful hierarchical feature representations directly learnt from data, which usually outperform classical hand-crafted feature descriptors.

As a drawback, deep neural networks are difficult to train because of non-convex optimization and the intensive computations required for learning the network parameters. Relying on the availability of both massive data and hardware resources, the aforementioned training challenges can be empirically tackled and deep architectures can be effectively trained in an end-to-end fashion, exploiting parallel GPU computation.

However, overfitting remains an issue. Indeed, such a gigantic number of parameters is likely to produce weights that are so specialized to the training examples that the network's generalization capability may be extremely poor. The seminal work of [13] argues that overfitting occurs as the result of excessive co-adaptation of feature detectors which manage to perfectly explain the training data. This leads to overcomplicated models which unsatisfactorily fit unseen testing data points.

Figure 1. From left to right, during training (red arrows), our curriculum dropout gradually increases the amount of Bernoulli multiplicative noise, generating multiple partitions (orange boxes) within the dataset (yellow frame) and the feature representation layers (not shown here). Differently, the original dropout [13, 25] (blue arrow) mainly focuses on the hardest partition only, complicating the learning from the beginning and potentially damaging the network classification performance.
To address this issue, the Dropout algorithm was proposed and investigated in [13, 25] and is nowadays extensively used in training neural networks. The method consists in randomly suppressing neurons during training according to values r sampled from a Bernoulli distribution. More specifically, if r = 1 that unit is kept unchanged, while if r = 0 the unit is suppressed. The effect of suppressing a neuron is that the value of its output is set to zero during the forward pass of training, and its weights are not updated during the backward pass. Once one forward-backward pass is completed, a new sample of r is drawn for each neuron, another forward-backward pass is done, and so on until convergence. At testing time, no neuron is suppressed and all activations are modulated by the mean value of the Bernoulli distribution. The resulting model is in fact often interpreted as an average of multiple models, and it is argued that this improves its generalization ability [13, 25].

Leveraging the Dropout idea, many works have proposed variations of the original strategy [14, 22, 31, 30, 1, 20]. However, it is still unclear which variation improves the most over the original dropout formulation [13, 25]. In many works (such as [22]) there is no real theoretical justification of the proposed approach other than favorable empirical results. Therefore, providing a sound justification still remains an open challenge. In addition, the lack of publicly available implementations (e.g., [20]) makes fair comparisons problematic.

The point of departure of our work is the intuition that the excessive co-adaptations of feature detectors, which lead to overfitting, are very unlikely to occur in the early epochs of training. Thus, Dropout seems unnecessary at the beginning of training.
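The sampling scheme just described can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' released code; only the forward pass is shown, the backward-pass weight freezing being implicit in the zeroed activations:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(x, theta, train=True):
    """Standard Dropout [13, 25]: each unit is retained with
    probability theta (r = 1) and suppressed otherwise (r = 0)."""
    if train:
        # A fresh Bernoulli mask r is sampled at every forward pass;
        # suppressed units output zero, so they receive no gradient.
        r = (rng.random(x.shape) < theta).astype(x.dtype)
        return x * r
    # At test time no unit is suppressed: activations are modulated
    # by the mean value (theta) of the Bernoulli distribution.
    return x * theta

x = np.ones(10_000)
train_out = dropout_forward(x, theta=0.5, train=True)   # roughly half the units zeroed
test_out = dropout_forward(x, theta=0.5, train=False)   # all units scaled by 0.5
```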
Inspired by these considerations, in this work we propose to dynamically increase the number of units that are suppressed as a function of the number of gradient updates. Specifically, we introduce a generalization of the dropout scheme consisting of a temporal scheduling - a curriculum - for the expected number of suppressed units. By adapting the parameter of the Bernoulli distribution used for sampling over time, we smoothly increase the suppression rate as training evolves, thereby improving the generalization of the model.

In summary, the main contributions of this paper are the following.

1. We address the problem of overfitting in deep neural networks by proposing a novel regularization strategy called Curriculum Dropout that dynamically increases the expected number of suppressed units in order to improve the generalization ability of the model.

2. We draw connections between the original dropout framework [13, 25], regularization theory [8] and curriculum learning [2]. This provides an improved justification of (Curriculum) Dropout training, relating it to existing machine learning methods.

3. We complement our foundational analysis with a broad experimental validation, where we compare our Curriculum Dropout against the original one [13, 25] and anti-Curriculum [22] paradigms, for (convolutional) neural network-based image classification. We evaluate the performance on standard datasets (MNIST [19, 26], SVHN [21], CIFAR-10/100 [16], Caltech-101/256 [9, 10]). As the results certify, the proposed method generally achieves a superior classification performance.

The remainder of the paper is outlined as follows. Relevant related works are summarized in §2; Curriculum Dropout is presented in §3 and §4, providing foundational interpretations. The experimental evaluation is carried out in §5. Conclusions and future work are presented in §6.
2. Related Work

As previously mentioned, dropout was introduced by Hinton et al. [13] and Srivastava et al. [25]. Therein, the method is detailed and evaluated with different types of deep learning models (Multi-Layer Perceptrons, Convolutional Neural Networks, Restricted Boltzmann Machines) and datasets, confirming the effectiveness of this approach against overfitting. Since then, many works [29, 20, 30, 1, 31, 14, 28, 22] have investigated the topic.

Wan et al. [29] propose Drop-Connect, a more general version of Dropout. Instead of directly setting units to zero, only some of the network connections are suppressed. This generalization is proven to be better in performance but slower to train with respect to [13, 25]. Li et al. [20] introduce data-dependent and Evolutional dropout for shallow and deep learning, respectively. These versions are based on sampling neurons from a multinomial distribution with different probabilities for different units. Results show faster training and sometimes better accuracies. Wang et al. [30] accelerate dropout: in their method, hidden units are dropped out using approximated sampling from a Gaussian distribution. Results show that [30] leads to fast convergence without deteriorating the accuracy. Bayer et al. [1] carry out a fine analysis, showing that dropout can be proficiently applied to Recurrent Neural Networks. Wu and Gu [31] analyze the effect of dropout on the convolutional layers of a CNN: they define a probabilistic weighted pooling, which effectively acts as a regularizer. Zhai and Zhang [33] investigate the idea of dropout applied to matrix factorization. Ba and Frey [14] introduce a binary belief network which is overlaid on a neural network to selectively suppress hidden units. The two networks are jointly trained, making the overall process more computationally expensive. Wager et al.
[28] apply Dropout to generalized linear models and approximately prove the equivalence between data-dependent L2 regularization and dropout training with the AdaGrad optimizer. Rennie et al. [22] propose to adjust the dropout rate, linearly decreasing the unit suppression rate during training until the network experiences no dropout.

While some of the aforementioned methods can be applied in tandem, there is still a lack of understanding about which one is superior - this is also due to the lack of publicly released code (as happens in [20]). In this respect, [22] is the most similar to our work. A few papers do not go beyond a bare experimental evaluation of the proposed dropout variation [20, 1, 31, 14, 22], omitting to justify the soundness of their approach. Conversely, while some works are much more formal than ours [30, 28, 33], all of them rely on approximations to carry out their analysis, which is biased towards shallow models (logistic [28] or linear regression [30, 28] and matrix factorization [33]). Differently, in our paper, in addition to its experimental effectiveness, we provide several natural justifications to corroborate the proposed dropout generalization for deep neural networks.

3. A Time Scheduling for the Dropout Rate

Deep Neural Networks display co-adaptations between units in terms of concurrent activations of highly organized clusters of neurons. During training, the latter specialize themselves in detecting certain details of the image to be classified, as shown by Zeiler and Fergus [32]. They visualize the high sensitivity of certain filters in different layers in detecting dogs, people's faces, wheels and more general ordered geometrical patterns [32, Fig. 2]. Moreover, such co-adaptations are highly generalizable across different datasets, as proved by Torralba's work [34].
Indeed, the filter responses of AlexNet within the conv1, pool2/5 and fc7 layers are very similar [34, Fig. 5], despite the images used for training being very different: objects from the ImageNet dataset versus scenes from the Places dataset.

These arguments support the existence of some positive co-adaptations between neurons in the network. Nevertheless, as training keeps going, some co-adaptations can also become negative if excessively specific to the training images exploited for updating the gradients. Consequently, exaggerated co-adaptations between neurons weaken the network generalization capability, ultimately resulting in overfitting. To prevent this, Dropout [13, 25] precisely contrasts those negative co-adaptations.

The latter can be removed by randomly suppressing neurons of the architecture, restoring an improved situation where the neurons are more "independent". This empirically reflects into a better generalization capability [13, 25].

Network training is a dynamic process. Although the previous interpretation is totally sound, the original Dropout algorithm cannot precisely accommodate it. Indeed, the suppression of a neuron in a given layer is modeled by a Bernoulli(θ) random variable (1), 0 < θ ≤ 1. Employing such a distribution is very natural, since it statistically models binary activation/inhibition processes. In spite of that, it seems suboptimal that θ should be fixed during the whole training stage. With this operative choice, [13, 25] is actually treating the negative co-adaptation phenomena as uniformly distributed over the whole training time.

(1) To avoid confusion in our notation, please note that θ is the equivalent of p in [13, 25, 28], i.e. the probability of retaining a neuron.

Figure 2. Curriculum functions. Eq. (1) (red), polynomial (blue) and exponential (green). (Axes: training time from t = 0 to t = T; retain probability from θ̄ to 1.)
Differently, our intuition is that, at the beginning of training, if any co-adaptation between units is displayed, it should be preserved, as it positively represents the self-organization of the network parameters towards their optimal configuration.

We can understand this by considering the random initialization of the network's weights. They are statistically independent and actually not co-adapted at all. Also, it is quite unnatural for a neural network with random weights to overfit the data. On the other hand, the risk of overdone co-adaptations increases as training proceeds, since the loss minimization can achieve a small objective value by overcomplicating the hierarchical representation learnt from data. This implies that overfitting caused by excessive co-adaptations appears only after a while.

Since a fixed parameter θ is not able to handle increasing levels of negative co-adaptations, in this work we tackle this issue by proposing a time-dependent parameter θ(t). Here, t denotes the training time, measured in gradient updates t ∈ {0, 1, 2, ...}. Since θ(t) models the probability for a given neuron to be retained, D · θ(t) counts the average number of units which remain active out of the total number D in a given layer. Intuitively, such a quantity must be higher for the first gradient updates and start decreasing as soon as training gears up. In the late stages of training, this decrease should be stopped. We thus constrain θ(t) ≥ θ̄ for any t, where θ̄ is a limit value, to be taken as 0.5 ≤ θ̄ ≤ 0.9 as prescribed by the original dropout scheme [25, §A.4] (the higher the layer in the hierarchy, the lower the retain probability).
Inspired by the previous considerations, we propose the following definition for a curriculum function θ(t) aimed at improving dropout training (as will become clear in section 4, from now on we will often use the terms curriculum and scheduling interchangeably).

Definition 1. Any function t ↦ θ(t) such that θ(0) = 1 and θ(t) ↘ θ̄ as t → ∞ is said to be a curriculum function generalizing the original dropout [13, 25] formulation with retain probability θ̄.

Starting from the initial condition θ(0) = 1, where no unit suppression is performed, dropout is gradually introduced in such a way that θ(t) ≥ θ̄ for any t. Eventually (i.e. when t is big enough), the convergence θ(t) → θ̄ models the fact that we retrieve the original formulation of [13, 25] as a particular case of our curriculum. Among the functions as in Def. 1, in our work we fix

θ_curriculum(t) = (1 − θ̄) exp(−γ t) + θ̄,  γ > 0.  (1)

By considering Figure 2, we can provide intuitive and straightforward motivations for our choice. The blue curves in Fig. 2 are polynomials of increasing degree δ = {1, ..., 10} (left to right). Despite fulfilling the initial constraint θ(0) = 1, they have to be manually thresholded to impose θ(t) → θ̄ when t → ∞. This introduces two more (undesired) parameters (δ and the threshold) with respect to [13, 25], where the only quantity to be selected is θ̄. The very same argument discourages the replacement of the variable t by t^α in (1) (green curves in Fig. 2, α = {2, ..., 10}, left to right). Moreover, by evaluating the area under the curve, we can intuitively measure how aggressively the green curves delay the dropping-out scheme they eventually converge to (as θ(t) → θ̄). Precisely, that convergence is faster for the green curves more to the left, the fastest being achieved by our scheduling function (1) (red curve, Fig. 2).
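As a minimal sketch (with illustrative values for θ̄ and T, and γ chosen by the rule of thumb (2) given below), Eq. (1) reads:

```python
import math

def theta_curriculum(t, theta_bar, gamma):
    """Curriculum retain probability of Eq. (1):
    theta(t) = (1 - theta_bar) * exp(-gamma * t) + theta_bar."""
    return (1.0 - theta_bar) * math.exp(-gamma * t) + theta_bar

T = 10_000           # illustrative total number of gradient updates
gamma = 10.0 / T     # rule of thumb of Eq. (2)
theta_bar = 0.5      # limit retain probability, 0.5 <= theta_bar <= 0.9

assert theta_curriculum(0, theta_bar, gamma) == 1.0                     # no suppression at t = 0
assert abs(theta_curriculum(T, theta_bar, gamma) - theta_bar) < 1e-4    # ~ original Dropout
```

The schedule starts at 1 (no dropout), decays monotonically, and is within 10^−4 of the target retain probability by the end of training, so standard Dropout is recovered for most of the run.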
One could still argue that the parameter γ > 0 is inconvenient since it requires cross-validation. This is not necessary: in fact, γ can be fixed according to the following heuristics. Although Def. 1 considers the limit of θ(t) for t → ∞, this condition has to be operatively replaced by t ≈ T, with T the total number of gradient updates needed for optimization. It is thus totally reasonable to assume that the order of magnitude of T is a priori known and fixed to some power of 10, such as 10^4 or 10^5. Therefore, for a curriculum function as in Def. 1, we are interested in furthermore imposing θ(t) ≈ θ̄ when t ≈ T. Actually, a rule of thumb such as

γ = 10/T  (2)

implies |θ_curriculum(T) − θ̄| < 10^−4 and was used for all the experiments in §5. Additionally, from Figure 2, we can grasp some intuition about the fact that the asymptotic convergence to θ̄ is indeed realized for a quite consistent part of the training and well before t ≈ T. This means that during a big portion of the training we are actually dropping out neurons as prescribed in [13, 25], addressing the overfitting issue. In addition to these arguments, we will provide complementary insights on our scheduled implementation of dropout training.

Smarter initialization for the network weights. The problem of optimizing deep neural networks is non-convex due to the non-linearities (ReLUs) and pooling steps. In spite of that, a few theoretical papers have investigated this issue from a sound mathematical perspective. For instance, under mild assumptions, Haeffele and Vidal [11] derive sufficient conditions to ensure that a local minimum is also a global one and that the former can be found when starting from any initialization. The theory presented in [11] cannot be straightforwardly applied to the dropout case due to the purely deterministic framework of the theoretical analysis that is carried out.
Therefore, it is still an open question whether all initializations are equivalent for the sake of dropout training and, if not, which ones are preferable. Far from providing any theoretical insight in this flavor, we posit that Curriculum Dropout can be interpreted as a smarter initialization. Indeed, we implement a soft transition between a classical dropout-free training of a network and the dropout one [13, 25]. Under this perspective, our curriculum is equivalent to performing dropout training of a network whose weights have already been slightly optimized, evidently resulting in a better initialization for them. As a naive approach, one can think of performing regular training for a certain number of gradient updates and then applying dropout during the remaining ones. We call this Switch-Curriculum. This actually induces a discontinuity in the objective value which can damage the performance with respect to the smooth transition performed by our curriculum (1) - check Fig. 4.

Curriculum Dropout as adaptive regularization. Several connections [28, 29, 25, 33] have been established between Dropout and model training with noise addition [3, 23, 33]. The common trend discovered is that when an unregularized loss function is optimized to fit artificially corrupted data, this is actually equivalent to minimizing the same loss augmented by a data-dependent penalizing term. In both [28, Table 2] for linear/logistic regression and [25, §9.1] for least squares, it is proved that Dropout induces a regularizer which is scaled by θ(1 − θ). When θ = θ̄, the impact of the regularization is just fixed, thereby raising potential over- and under-fitting issues [8].

But, for θ = θ_curriculum(t), when t is small, the regularizer is set to zero (θ_curriculum(0) = 1) and we do not perform any regularization at all.
Indeed, the latter is simply not necessary: the network weights still have values close to their random and statistically independent initialization. Hence, overfitting is unlikely to occur at early training steps. Differently, we should expect it to occur as training proceeds: by using (1), the regularizer is now weighted by

θ_curriculum(t) (1 − θ_curriculum(t)),  (3)

which is an increasing function of t. Therefore, the more gradient updates t, the heavier the effect of the regularization. This is the reason why overfitting is better tackled by the proposed curriculum. Although the overall idea of an adaptive selection of parameters is not novel in either regularization theory [12, 7, 4, 24, 6] or the tuning of network hyper-parameters (e.g. the learning rate, [5]), to the best of our knowledge, this is the first time that this concept of time-adaptive regularization is applied to deep neural networks.

Compendium. Let us conclude with some general comments. We posit that there is no overfitting at the beginning of network training. Therefore, differently from [13, 25], we allow for a scheduled retain probability θ(t) which gradually drops neurons out. Among other plausible curriculum functions as in Def. 1, the proposed choice (1) introduces no additional parameter to be tuned and implicitly provides a smarter weight initialization for dropout training.

The superiority of (1) also relates to i) the smoothly increasing amount of units suppressed and ii) the soft adaptive regularization performed to contrast overfitting.

Throughout these interpretations, we can retrieve a common idea of smoothly increasing the difficulty of the training to which the network is subjected. This fact can be better understood by drawing connections with Curriculum Learning [2], as we explain in the next section.
4. Curriculum Learning, Curriculum Dropout

For the sake of clarity, let us recall the concept of curriculum learning [2]. Within a classical machine learning algorithm, all training examples are presented to the model in an unordered manner, frequently applying a random shuffling. Actually, this is very different from what happens in the human training process, that is, education. Indeed, the latter is highly structured, so that the level of difficulty of the concepts to learn is proportional to the age of the person, managing easier knowledge for babies and harder knowledge for adults. This "start small" paradigm will likely guide the learning process [2].

Following the same intuition, [2] proposes to subdivide the training examples based on their difficulty. Learning is then configured so that easier examples come first, eventually complicating them and processing the hardest ones at the end of training. This concept is formalized by introducing a learning time λ ∈ [0, 1], so that training begins at λ = 0 and ends at λ = 1. At time λ, Q_λ(z) denotes the distribution from which a training example z is drawn. The notion of curriculum learning requires that Q_λ ensures a sampling of examples z which are easier than the ones sampled from Q_{λ+ε}, ε > 0. Mathematically, this is formalized by assuming

Q_λ(z) ∝ W_λ(z) P(z).  (4)

In (4), P(z) is the target training distribution, accounting for all examples, both easy and hard ones. The sampling from P is corrected by the factor 0 ≤ W_λ(z) ≤ 1 for any λ and z. The interpretation of W_λ(z) is as a measure of the difficulty of the training example z. The maximal complexity for a training example is fixed to 1 and reached at the end of training, i.e. W_1(z) = 1, i.e. Q_1(z) = P(z). The relationship

W_λ(z) ≤ W_{λ+ε}(z)  (5)

represents the increased complexity of training examples from instant λ to λ + ε.
Moreover, the weights W_λ(z) must be chosen in such a way that

H(Q_λ) < H(Q_{λ+ε}),  (6)

where Shannon's entropy H(Q_λ) models the fact that the quantity of information exploited by the model during training increases with λ.

In order to prove that our scheduled dropout fulfills this definition, for simplicity, we will consider it as applied to the input layer only. This is not restrictive, since the same considerations apply to any intermediate layer, by considering that each layer trains the feature representation used as input by the subsequent one.

As the images exploited for training, consider the partitions of the dataset including all the (original) clean data and all the possible ways of corrupting them through the Bernoulli multiplicative noise (see Fig. 1). Let π denote the probability of sampling an uncorrupted d-dimensional image within an image dataset (nothing more than a uniform distribution over the available training examples). Let us fix the gradient update t. Sampling a dropped-out z is equivalent to sampling the corresponding uncorrupted image z₀ from π and then overlapping it with a binary mask b (of size d), where each entry of b is zero with probability 1 − θ(t). By mapping b to the number i of its zeros,

P[z] = P[z₀, i] = (d choose i) (1 − θ(t))^i θ(t)^(d−i) · π(z₀).  (7)

Indeed, (1 − θ(t))^i θ(t)^(d−i) is the probability of sampling one binary mask b with i zeros, and (d choose i) accounts for all the possible combinations. Re-parameterizing the training time t = λT, we get

Q_λ(z) = (d choose i) (1 − θ(λT))^i θ(λT)^(d−i) · π(z₀).  (8)

By defining P(z) = Q_1(z) and

W_λ(z) = (1/P(z)) (d choose i) (1 − θ(λT))^i θ(λT)^(d−i) · π(z₀),  (9)

one can easily prove that the definition in [2] is fulfilled by the choice (8) for the curriculum learning distribution Q_λ(z).

To conclude, we give an additional interpretation of Curriculum Dropout.
At λ = 0, θ(0) = 1 and no entry of z₀ is set to zero. This clearly corresponds to the easiest available example, since learning starts at t = 0 by considering all possible available visual information. When θ starts decreasing, at θ(λT) ≈ 0.99 only 1% of z₀ is suppressed (on average) and still almost all the information of the original dataset Z₀ is available for training the network. But, as λ grows, θ(λT) decreases and a bigger number of entries are set to zero. This complicates the task, requiring an improved effort from the model to capitalize on the reduced uncorrupted information available at that stage of the training process.

After all, this connection between Dropout and Curriculum Learning was made possible by our generalization through Def. 1. Consequently, the original Dropout [13, 25] can be interpreted as considering the single specific value λ such that θ(λT) = θ̄, with θ̄ the constant retain probability of [13, 25]. This means that, as previously found for the adaptive regularization (see §3), the level of difficulty W_λ(z) of the training examples z is fixed in the original Dropout. This incurs the concrete risk of either oversimplifying or overcomplicating the learning, with detrimental effects on the model's generalization capability. Hence, the proposed method allows setting up a progressive curriculum Q_λ(z), complicating the examples z in a smooth and adaptive manner, as opposed to [13, 25], where such complication is fixed to equal the maximal one from the very beginning (Fig. 1).

To conclude, let us note that the aforementioned work [22] proposes a linear increase of the retain probability. According to equations (4-6), this implements what [2] calls an anti-curriculum: this is shown to perform slightly better or worse than the no-curriculum strategy [2] and always worse than any curriculum implementation.
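The entropy condition (6) can also be checked numerically for the mask distribution underlying (8). In the sketch below (an illustration with assumed values d = 784, T = 10^4, θ̄ = 0.5; the image-sampling term π is constant in λ and therefore omitted), the joint entropy of the d i.i.d. Bernoulli mask entries grows monotonically with λ:

```python
import math

def theta_curriculum(t, theta_bar=0.5, gamma=1e-3):
    # Eq. (1) with gamma = 10 / T for T = 10_000 updates.
    return (1.0 - theta_bar) * math.exp(-gamma * t) + theta_bar

def bernoulli_entropy(p):
    """Entropy (in nats) of a Bernoulli(p) variable."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)

def mask_entropy(lam, T=10_000, d=784):
    """Entropy of the d-dimensional mask b at learning time lambda:
    entries are i.i.d., zero with probability 1 - theta(lam * T),
    so the joint entropy is d identical Bernoulli terms."""
    return d * bernoulli_entropy(1.0 - theta_curriculum(lam * T))

lams = [0.0, 0.25, 0.5, 0.75, 1.0]
ents = [mask_entropy(lam) for lam in lams]
# H(Q_lambda) strictly increases with lambda, as condition (6) requires.
assert all(e1 < e2 for e1, e2 in zip(ents, ents[1:]))
```

At λ = 0 the mask is deterministic (entropy zero); as θ(λT) decays towards θ̄ ≥ 0.5, each entry's suppression probability approaches 1 − θ̄ ≤ 0.5 and the entropy rises accordingly.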
Our experiments confirm this finding.

5. Experiments

In this section, we applied Curriculum Dropout to neural networks for image classification problems on different datasets, using Convolutional Neural Network (CNN) architectures and Multi-Layer Perceptrons (MLPs) (2). In particular, we used two different CNN architectures: LeNet [18] and a deeper one (conv-maxpool-conv-maxpool-conv-maxpool-fc-fc-softmax), henceforth called CNN-1 and CNN-2, respectively. In the following, we detail the datasets used and the network architectures adopted in each case.

(2) Code available at https://github.com/pmorerio/curriculum-dropout.

MNIST [19] - A dataset of grayscale images of handwritten digits (from 0 to 9), of resolution 28×28. Training and test sets contain 60,000 and 10,000 images, respectively. For this dataset, we used a three-layer MLP, with 2,000 units in each hidden layer, and CNN-1.

Double MNIST - This is a static version of [26], generated by superimposing two random images of two digits (either distinct or equal) in order to generate 64×64 images. The total number of images is 70,000, with 55 total classes (10 unique digit classes + (10 choose 2) = 45 unordered couples of digits). Training and test sets contain 60,000 and 10,000 images, respectively. Training set images were generated using MNIST training images, and test set images were generated using MNIST test images. We used CNN-2.

SVHN [21] - Real-world RGB images of street-view house numbers. We used the cropped 32×32 images representing a single digit (from 0 to 9). We exploited a subset of the dataset, consisting of 6,000 images for training and 1,000 images for testing, randomly selected. We used CNN-2 also in this case.

CIFAR-10 and CIFAR-100 [16] - These datasets collect 32×32 tiny RGB natural images, with 6,000 and 600 elements for each of the 10 or 100 classes, respectively.
In both datasets, training and test sets contain 50,000 and 10,000 images, respectively. We used CNN-1 for both datasets.

Caltech-101 [9] - 300×200 resolution RGB images of 101 classes. For each class, a variable number of instances is available: from 30 to 800. To have a balanced dataset, we used 20 and 10 images per class for training and testing, respectively. Images were reshaped to 128×128 pixels. We used CNN-2 again here.

Caltech-256 [10] - 31,000 RGB images for 256 total classes. For each class, we used 50 and 20 images for training and testing, respectively. Images were reshaped to 128×128 pixels. We used CNN-2.

For training CNN-1, CNN-2 and the MLP, we exploited a cross-entropy cost function with the Adam optimizer [15] and a momentum term of 0.95, as suggested in [25]. We used mini-batches of 128 images and fixed the learning rate to 10^−4. We applied curriculum dropout using the function (1), where γ is picked using the heuristics (2) and θ̄ is fixed as follows. For both CNN-1 and CNN-2, the retain probability for the input layer was set to θ̄_input = 0.9, selecting θ̄_conv = 0.75 and θ̄_fc = 0.5 for convolutional and fully connected layers, respectively. For the MLP, θ̄_input = 0.8 and θ̄_hidden = 0.5. In all cases, we adopted the recommended values [25, §A.4].

Before reporting our results, let us emphasize that our aim is to improve the standard dropout framework [13, 25], not to compete for state-of-the-art performance in image classification tasks. For this reason, we did not use engineering tricks such as data augmentation or any particular pre-processing, nor did we try more complex (or deeper) network architectures.

Figure 3 (panels: MNIST [19] (MLP); MNIST [19] (CNN-1); Double MNIST [26], n fixed; Double MNIST [26], nθ fixed; SVHN [21], n fixed; SVHN [21], nθ fixed; CIFAR-10 [16]; CIFAR-100 [16]; Caltech-101 [9]; Caltech-256 [10]).
Curriculum Dropout (green) compared with regular Dropout [13, 25] (blue), anti-Curriculum (red) and regular training of a network with no unit suppression (black). For all cases, we plot mean test accuracy (averaged over 10 different re-trainings) as a function of gradient updates. Shadows represent standard deviation errors. Best viewed in color.

[Figure 4 panels: Double MNIST [26], Double MNIST [26]]

Figure 4. Switch-Curriculum. We compare Curriculum (green) and regular Dropout (blue) with three cases where we switch from regular to dropout training i) at the beginning (pink), ii) in the middle (violet), iii) almost at the end (purple) of learning. From left to right: curriculum functions, cross-entropy loss and test accuracy curves.

In Fig. 3, we qualitatively compare Curriculum Dropout (green) versus the original Dropout [13, 25] (blue), anti-Curriculum Dropout (red) and an unregularized, i.e., no-Dropout, training of a network (black). Since CNN-1, CNN-2 and the MLP are trained from scratch, in order to ensure a more robust experimental evaluation, we repeated the weight optimization 10 times in all cases. Hence, in Fig. 3, we report the mean accuracy curves, representing the standard deviation errors with shadows.
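The scheduling used in these experiments can be sketched in a few lines. The exact forms of function (1) and of the heuristic (2) are given earlier in the paper; here we only assume, as a plausible reading of the text, an exponential decay of the retain probability from 1 (no dropout) toward its limit value theta. The function name `retain_probability`, the constant `TOTAL_ITERATIONS` and the concrete choice of `gamma` are our own illustrative assumptions, not the authors' exact settings.

```python
import math

# Limit retain probabilities used in the experiments (from the text):
# CNN-1 / CNN-2: input 0.9, conv 0.75, fc 0.5; MLP: input 0.8, hidden 0.5.
THETA = {"input": 0.9, "conv": 0.75, "fc": 0.5}

def retain_probability(t, theta_limit, gamma):
    """Scheduled retain probability at gradient update t: starts at 1
    (no units dropped) and decays exponentially toward theta_limit."""
    return (1.0 - theta_limit) * math.exp(-gamma * t) + theta_limit

# Assumed heuristic: pick gamma so that the schedule has essentially
# converged to theta_limit well before training ends.
TOTAL_ITERATIONS = 50_000
gamma = 10.0 / TOTAL_ITERATIONS

p_start = retain_probability(0, THETA["conv"], gamma)                # 1.0
p_end = retain_probability(TOTAL_ITERATIONS, THETA["conv"], gamma)   # ~0.75
```

At each gradient update t, every layer's dropout mask is then sampled with the current retain probability, so the effective regularization strength grows smoothly over training instead of being fixed from the start.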
Additionally, in Table 1 we report the percentage accuracy improvements of Dropout [13, 25], anti-Curriculum Dropout [22] and Curriculum Dropout (proposed) versus a baseline network where no neuron is suppressed. To do that, we selected the average of the 10 highest mean accuracies obtained by each paradigm during each trial; we then averaged them over the 10 runs. We adapted the metric of [27] to measure the boost in accuracy over [13, 25]. Also, for two datasets we reproduced the cases of fixed layer size n or fixed n*theta as in [25, §7.3]. In the latter configuration, the network layers' size n is preliminarily increased by a factor 1/theta, since on average a fraction theta of the units is dropped out. However, we notice that those bigger architectures tend to overfit the data.

Table 1. Comparison of the proposed scheduling versus [13, 25] in terms of percentage accuracy improvement.

Dataset           Arch.   Config (n or n*theta)  Classes  Unregularized  Dropout [13,25]  Anti-Curriculum  Curriculum Dropout (boost [27] over Dropout [13,25])
MNIST [19]        MLP     n                      10       98.67          +0.38            +0.04            +0.36 (-5.3%)
MNIST [19]        CNN-1   n                      10       99.25          +0.15            -0.05            +0.18 (20.0%)
Double MNIST      CNN-2   n                      55       92.48          +1.42            +0.73            +2.35 (65.5%)
Double MNIST      CNN-2   n*theta                55                      +0.87            +0.53            +1.11 (27.6%)
SVHN [21]         CNN-2   n                      10       84.63          +2.35            +1.17            +2.65 (12.8%)
SVHN [21]         CNN-2   n*theta                10                      +1.59            +1.51            +2.06 (29.6%)
CIFAR-10 [16]     CNN-1   n                      10       73.06          +0.22            -0.68            +0.62 (182%)
CIFAR-100 [16]    CNN-1   n                      100      39.70          +1.01            +0.01            +1.66 (64.4%)
Caltech-101 [9]   CNN-2   n                      101      28.56          +4.21            +1.57            +4.72 (12.1%)
Caltech-256 [10]  CNN-2   n                      256      14.39          +2.36            -0.22            +3.23 (36.9%)

Switch-Curriculum. Figure 4 shows the results obtained on the Double MNIST dataset by scheduling the dropout with a step function, i.e., no suppression is performed until a certain switch-epoch is reached (§3). Precisely, we switched at 10, 20 and 50 epochs.
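The step scheduling behind Switch-Curriculum can be sketched as follows. This is a minimal illustration: the switch points (10, 20, 50 epochs) and the limit retain probabilities come from the text, while the function name `switch_retain_probability` is our own.

```python
def switch_retain_probability(epoch, theta_limit, switch_epoch):
    """Switch-Curriculum schedule: keep every unit (retain prob 1)
    until switch_epoch, then jump straight to the limit value."""
    return 1.0 if epoch < switch_epoch else theta_limit

# One of the three switch points tested on Double MNIST (10, 20, 50),
# with the fully connected retain probability theta_fc = 0.5.
schedule = [switch_retain_probability(e, 0.5, switch_epoch=20) for e in range(60)]
```

Unlike the smooth exponential schedule, this function is discontinuous at the switch-epoch; that sharp jump in the retain probabilities is what produces the accuracy spikes visible in Figure 4.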
This curriculum is similar to the one induced by the polynomial functions of Figure 2: in fact, both curves have a similar shape and share the drawback of requiring a threshold to be introduced. Yet, Switch-Curriculum shows an additional shortcoming: as highlighted by the spikes in both training and test accuracies, the sudden change in the network connections, induced by the sharp shift in the retain probabilities, makes the network lose some of the concepts learned up to that moment. While early switches recover quickly to good performance, late ones are deleterious. Moreover, we were not able to find any heuristic rule for the switch-epoch, which would thus become a parameter to be validated. This makes Switch-Curriculum a less powerful option compared to a smoothly scheduled curriculum.

Discussion. The proposed Curriculum Dropout, implemented through the scheduling function (1), improves the generalization performance of [13, 25] in almost all cases. As the only exception, on MNIST [19] with the MLP, the scheduling is merely equivalent to the original dropout framework [13, 25]. Our guess is that the simpler the learning task, the less effective Curriculum Learning is: for a task which is relatively easy in itself, there is less need for "starting easy". In any case, this comes at no additional cost or training time.

As expected, anti-Curriculum was outperformed by our scheduling by a more significant gap. Moreover, an anti-Curriculum strategy sometimes even performs worse than a non-regularized network (e.g., Caltech-256 [10]). This is coherent with the findings of [2] and with our discussion in §4 concerning Annealed Dropout [22], of which anti-Curriculum represents a generalization. In addition, while neither regular nor Curriculum Dropout ever needs early stopping, anti-Curriculum often does.

6.
Conclusions and Future Work

In this paper we have proposed a scheduling for dropout training applied to deep neural networks. By softly increasing the amount of units to be suppressed layer-wise, we achieve an adaptive regularization and provide a better, smooth initialization for weight optimization. This allows us to implement a mathematically sound curriculum [2] and justifies the proposed generalization of [13, 25].

Through a broad experimental evaluation on 7 image classification tasks, the proposed Curriculum Dropout has proved to be more effective than both the original Dropout [13, 25] and Annealed Dropout [22], the latter being an example of anti-Curriculum [2] and therefore achieving inferior performance compared to our more disciplined approach of easing dropout training. Globally, we always outperform the original Dropout [13, 25] using various architectures, and we improve on the idea of [22] by a margin.

We have tested Curriculum Dropout on image classification tasks only. However, our guess is that, like standard Dropout, our method is very general and thus applicable to different domains. As future work, we will apply our scheduling to other computer vision tasks, also extending it to the case of inter-neural connection inhibitions [29] and to Recurrent Neural Networks.

Acknowledgment

We gratefully acknowledge the support of NVIDIA Corporation with the donation of one Tesla K40 GPU used for part of this research.

References

[1] J. Bayer, C. Osendorfer, and N. Chen. On fast dropout and its applicability to recurrent networks. In CoRR:1311.0701, 2013.
[2] Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In ICML, 2009.
[3] C. M. Bishop. Training with noise is equivalent to Tikhonov regularization. Neural Comput., 7(1):108–116, 1995.
[4] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein.
Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn., 3(1):1–122, 2011.
[5] G. Caglar, S. Jose, M. Marcin, and Y. Bengio. A robust adaptive stochastic gradient method for deep learning. In CoRR:1703.00788, 2017.
[6] J. Cavazza and V. Murino. Active regression with adaptive Huber loss. In CoRR:1606.01568, 2016.
[7] K. Crammer, A. Kulesza, and M. Dredze. Adaptive regularization of weight vectors. In NIPS, 2009.
[8] T. Evgeniou, M. Pontil, and T. Poggio. Regularization networks and support vector machines. Advances in Computational Mathematics, 13(1), 2000.
[9] L. Fei-Fei, R. Fergus, and P. Perona. Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. In CVPR workshop, 2004.
[10] G. Griffin, A. Holub, and P. Perona. Caltech-256 object category dataset. Technical Report 7694, California Institute of Technology, 2007.
[11] B. D. Haeffele and R. Vidal. Global optimality in tensor factorization, deep learning, and beyond. In CoRR:1506.07540, 2015.
[12] L. Hansen and C. Rasmussen. Pruning from adaptive regularization. Neural Computation, 6(6):1222–1231, 1994.
[13] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. In CoRR:1207.0580, 2012.
[14] B. Jimmy and B. Frey. Adaptive dropout for training deep neural networks. In NIPS, 2016.
[15] D. Kingma and J. Ba. Adam: A method for stochastic optimization. In CoRR:1412.6980, 2014.
[16] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Master's thesis, Department of Computer Science, University of Toronto, 2009.
[17] A. Krizhevsky, I. Sutskever, and G. E. Hinton.
ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[18] Y. LeCun, B. Boser, J. Denker, D. Henderson, R. Howard, W. Hubbard, and L. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, pages 541–551, 1989.
[19] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[20] Z. Li, B. Gong, and T. Yang. Improved dropout for shallow and deep learning. In NIPS, 2016.
[21] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS workshop, 2011.
[22] S. J. Rennie, V. Goel, and S. Thomas. Annealed dropout training of deep networks. In Proceedings of the IEEE Workshop on SLT, pages 159–164, 2014.
[23] S. Rifai, X. Glorot, Y. Bengio, and P. Vincent. Adding noise to the input of a model trained with a regularized objective. In CoRR:1104.3250, 2011.
[24] M. Soltanolkotabi, E. Elhamifar, and E. J. Candès. Robust subspace clustering. In CoRR:1301.2603, 2013.
[25] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res., 15(1):1929–1958, 2014.
[26] N. Srivastava, E. Mansimov, and R. Salakhutdinov. Unsupervised learning of video representations using LSTMs. In CoRR:1502.04681, 2015.
[27] A. Torralba and A. A. Efros. Unbiased look at dataset bias. In CVPR, 2011.
[28] S. Wager, S. Wang, and P. S. Liang. Dropout training as adaptive regularization. In NIPS, 2013.
[29] L. Wan, M. Zeiler, S. Zhang, Y. LeCun, and R. Fergus. Regularization of neural networks using dropconnect. In ICML, 2013.
[30] S. Wang and C. D. Manning. Fast dropout training.
In ICML, pages 118–126, 2013.
[31] H. Wu and X. Gu. Towards dropout training for convolutional neural networks. Neural Networks, 71:1–10, 2015.
[32] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In ECCV, 2014.
[33] S. Zhai and Z. M. Zhang. Dropout training of matrix factorization and autoencoders for link prediction in sparse graphs. In CoRR:1512.04483, 2015.
[34] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. Learning deep features for scene recognition using places database. In NIPS, 2014.
