Normalization Before Shaking Toward Learning Symmetrically Distributed Representation Without Margin in Speech Emotion Recognition
Authors: Che-Wei Huang, Shrikanth S. Narayanan
SUBMISSION TO THE IEEE TRANSACTIONS

Abstract: Regularization is crucial to the success of many practical deep learning models, in particular in the common scenario where there are only a few to a moderate number of accessible training samples. In addition to weight decay, data augmentation and dropout, regularization based on multi-branch architectures, such as Shake-Shake regularization, has proven successful in many applications and attracted more and more attention. However, beyond model-based representation augmentation, it is unclear how Shake-Shake regularization provides further improvement on classification tasks, let alone the baffling interaction between batch normalization and shaking. In this work, we present our investigation of Shake-Shake regularization, drawing connections to the vicinal risk minimization principle and to discriminative feature learning in verification tasks. Furthermore, we identify a strong resemblance between batch normalized residual blocks and batch normalized recurrent neural networks: both share a similar convergence behavior, which can be mitigated by a proper initialization of batch normalization. Based on these findings, our experiments on speech emotion recognition demonstrate simultaneously an improvement in classification accuracy and a reduction of the generalization gap, both with statistical significance.
Index Terms: Shake-Shake Regularization, Batch Normalization, Symmetrically Distributed Representation, Vicinal Risk Minimization, Prediction Uncertainty Minimization, Affective Computing, Speech Emotion Recognition

1 INTRODUCTION

Deep convolutional neural networks have been successfully applied to several pattern recognition tasks such as image recognition [1], machine translation [2] and speech emotion recognition [3]. Currently, to successfully train a deep neural network, one needs either a sufficient number of training samples to implicitly regularize the learning process, or techniques such as weight decay and dropout [4] and its variants to explicitly keep the model from over-fitting.

In recent years, one of the most popular and successful architectures has been the residual neural network (ResNet) [1] and its variant ResNeXt [5] with multiple residual branches. The ResNet architecture was designed on the key assumption that it is more efficient to optimize the residual term than the original task mapping. Since then, a great deal of effort in machine learning and computer vision has been dedicated to studying the multi-branch architecture.

Deep convolutional neural networks have also gained much attention in the affective computing community, mainly because of their outstanding ability to formulate discriminative features for the top-layer classifier. Usually the number of parameters in a model far exceeds the number of training samples, and thus heavy regularization is required to train deep neural networks for affective computing.
However, since the introduction of batch normalization [6], the gains obtained by using dropout for regularization have decreased [6], [7], [8]. A recent work dedicated to studying the disharmony between dropout and batch normalization [9] suggests that dropout introduces a variance shift between training and testing, which causes batch normalization to malfunction if it is placed after dropout; this severely limits the application of successful architectures such as ResNet, or restricts dropout to the top-most fully connected layers. Yet, multi-branch architectures have emerged as a promising alternative for regularizing convolutional layers.

Regularization techniques based on multi-branch architectures such as Shake-Shake [10] and ShakeDrop [11] have delivered impressive performances on standard image datasets such as the CIFAR-10 [12] dataset. In a clever way, both of them utilize multiple branches to learn different aspects of the relevant information, followed by a summation for information alignment among the branches. Both Shake-Shake and ShakeDrop regularization emphasize the important interaction between batch normalization and shaking. However, neither of them gave an explanation for this phenomenon, other than a brief discussion on limiting the strength of shaking. Instead of using multiple branches, a recent work [13] based on a mixture of experts showed that randomly projecting samples is able to break the structure of adversarial noise that could easily confound the model and, as a result, mislead the learning process.

• The authors are with the Signal Analysis and Interpretation Laboratory (SAIL), Department of Electrical Engineering and Signal and Image Processing Institute, University of Southern California, Los Angeles, CA 90089 USA. E-mail: cheweihu@usc.edu; shri@sipi.usc.edu
• This work is supported by NSF.
Despite not being an end-to-end approach, it shares the same idea of integrating multiple streams of model-based diversity.

In addition, a recent trend of studies on data augmentation, based on the Vicinal Risk Minimization (VRM) principle [14], proposed to interpolate and/or extrapolate training samples in feature space, for example [15] and [16]. Szegedy et al. [17] studied regularization of training by label smoothing. Furthermore, Mixup [18] performed convex combinations of pairs of features and labels to demonstrate that expanding the coverage of training samples in feature space, leaving little to no margin between classes, could not only improve the performance of a model but also make it robust to adversarial samples. Effectively, Mixup reduces the prediction uncertainty for a testing sample lying outside the coverage of the training samples in feature space by linear interpolation. Based on Mixup, Manifold Mixup [19] called for mixing intermediate representations instead of raw inputs.

Our work follows the model-based representation augmentation thread of Shake-Shake, ShakeDrop regularization and Manifold Mixup. In this work, we study the Shake-Shake regularized ResNeXt for speech emotion recognition. In addition to shaking the entire spectral-temporal feature maps with the same strength, we propose to address different spectral sub-bands independently, based on our hypothesis of a non-uniform distribution of affective information over the spectral axis. Furthermore, we investigate and come up with an explanation for the ability of shaking regularization to improve classification tasks and for its crucial interaction with batch normalization.
To achieve our goal, we conduct ablation studies on the MNIST [20] and CIFAR-10 datasets to highlight a subtle difference in the requirements on optimal embeddings between classification tasks based on the VRM principle and verification tasks. In addition, we identify a strong resemblance in the mathematical formulation between batch normalized residual blocks and batch normalized recurrent neural networks, where both of them suffer from a shared issue, faster convergence but more over-fitting, which can be fixed by the same technique.

Our contributions are multi-fold. First, our work explains with visualization the key factor behind the success of shaking regularization and the crucial role that batch normalization plays in a shaking regularized architecture, drawing a connection between shaking and the VRM principle, and between batch normalization and discriminative embedding learning. Second, to the best of our knowledge, our work is the first to highlight the subtle difference in the requirements on optimal embeddings between classification tasks and verification tasks. Third, our work identifies the resemblance between batch normalized residual blocks and batch normalized recurrent neural networks, and the shared issue they have. Based on the solution for batch normalized recurrent neural networks, we demonstrate a significant reduction of the generalization gap, i.e. reduced over-fitting, in a batch-normalized shaking-regularized ResNeXt for speech emotion recognition without sacrificing the validation accuracy.

The outline of this paper is as follows. We review related work in the next section, including Shake-Shake regularization and its variants, discriminative feature learning in verification tasks and the vicinal risk minimization principle. Section 3 introduces sub-band shaking.
Section 4 presents batch normalized shaking, including ablation studies on the MNIST and CIFAR-10 datasets, and the identification of batch normalized residual blocks with batch normalized recurrent neural networks. Sections 5 and 6 cover the datasets, the network architecture, the experimental setup and the results for speech emotion recognition. Section 7 concludes our work.

Fig. 1: An overview of a 3-branch Shake-Shake regularized residual block. (a) Forward propagation during the training phase. (b) Backward propagation during the training phase. (c) Testing phase. The coefficients α and β are sampled from the uniform distribution over [0, 1] to scale down the forward and backward flows during the training phase.

2 RELATED WORK

In this section, we review some of the recently developed techniques on which we base our work and those that are related to our findings. The first subsection covers Shake-Shake regularization [10] and its variants, including Sub-band Shaking [22], ShakeDrop [11] and Stochastic Shake-Shake [23] regularization. Another subsection is devoted to end-to-end discriminative feature learning algorithms in face verification, in particular the thread of work that focuses on large-margin and symmetrical representation learning in [24], [25], [26], [27], [28]. In the last subsection, we talk about a recent success in supervised learning based on the Vicinal Risk Minimization (VRM) principle [14], called Mixup [18]. Moreover, we highlight one subtle but definite difference in the desired quality of intermediate representations between classification and verification tasks.
2.1 Shake-Shake Regularization and Its Variants

Shake-Shake regularization is a recently proposed technique to regularize the training of deep convolutional neural networks for image recognition tasks. This regularization technique based on multi-branch architectures promotes stochastic mixtures of forward and backward propagations from network branches in order to create a flow of model-based adversarial learning samples/gradients during the training phase. Owing to its excellent ability to combat over-fitting even in the presence of batch normalization, the Shake-Shake regularized 3-branch residual neural network [10] has achieved one of the current state-of-the-art performances on the CIFAR-10 image dataset.

An overview of a 3-branch Shake-Shake regularized ResNeXt is depicted in Fig. 1. Shake-Shake regularization adds to the aggregate of the branch outputs an additional layer, called the shaking layer, to randomly generate adversarial flows in the following way:

    ResBlock_N(X) = X + Shaking({B_n(X)}_{n=1}^N),

where in the forward propagation, for a = [α_1, ..., α_N] sampled from the (N−1)-simplex (Fig. 1 (a)),

    ResBlock_N(X) = X + Σ_{n=1}^N α_n B_n(X),

while in the backward propagation, for b = [β_1, ..., β_N] sampled from the (N−1)-simplex and g the gradient from the top layer, the gradient entering into B_n(X) is β_n g (Fig. 1 (b)).

Fig. 2: Shaking regularized ResNeXt architectures with different layouts introduced in [21]: (a) original, (b) ReLU-only pre-activation, (c) full pre-activation, (d) pre-activation + BN.
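For illustration, the forward and inference rules above can be sketched in a few lines of numpy. This is a sketch under our own conventions (branch outputs are precomputed arrays, and a flat Dirichlet samples the coefficients from the (N−1)-simplex, which reduces to α and 1−α with α uniform for two branches); it is not the released implementation.

```python
import numpy as np

def shake_combine(branch_outputs, training=True, rng=None):
    """Aggregate N branch outputs as in ResBlock_N(X) - X:
    a random convex combination (alpha on the (N-1)-simplex)
    during training, and the expectation 1/N at test time."""
    rng = rng or np.random.default_rng()
    n = len(branch_outputs)
    if training:
        alpha = rng.dirichlet(np.ones(n))  # point on the (N-1)-simplex
    else:
        alpha = np.full(n, 1.0 / n)        # expected coefficients
    return sum(a * b for a, b in zip(alpha, branch_outputs))
```

During back-propagation an independent β would rescale the branch gradients in the same way; multi-branch implementations typically realize this with a custom autograd function that uses α in the forward pass and β in the backward pass.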
At testing time, the expected model is evaluated for inference by taking the expectation of the random sources in the architecture (Fig. 1 (c)). In each mini-batch, applying the scaling coefficients α or β either to the entire mini-batch or to each individual sample independently can also make a difference.

Instead of the commonly known original or the full pre-activation residual block, Shake-Shake regularization was proposed with the ReLU-only pre-activation residual block; refer to Fig. 2 for more details. In addition, it has been shown that shaking in the absence of both batch normalization layers could cause the training process to diverge. One proposed remedy to this situation in [10] is to employ a shallower architecture and, more importantly, to reduce the range of values α can take on, i.e. to reduce the strength of shaking.

ShakeDrop regularization [11], based on the same idea of model-based representation augmentation but on the deep pyramidal residual architecture, reached an improved accuracy on the CIFAR-10/100 datasets. The authors further empirically observed that each residual block should end with batch normalization before the shaking layer to prevent training from diverging.

In our previous work on acoustic sub-band shaking [22] and stochastic Shake-Shake regularization [23] for affective computing from speech, we found that in a full pre-activation architecture without a batch normalization layer right before shaking, the shaking mechanism contributes much more to constraining the learning process than to boosting the generalization power. In addition, we showed methods to relax or control the impact of shaking, either deterministically by sub-band shaking or probabilistically by randomly turning off shaking, in order to trade off between a higher accuracy and reduced over-fitting.
Fortunately, with appropriate hyper-parameters, we could achieve both with statistical significance, compared to the baseline. All these findings indicated a close interaction between shaking and batch normalization. However, these studies did not give an explanation for the crucial location of batch normalization in a shaking regularized architecture, other than a brief discussion of the range of α.

2.2 End-to-End Discriminative Feature Learning

Recently, there has been a trend to focus on the design of loss functions such that a neural network supervised by such a loss function is able to formulate more discriminative features, usually compared to the ordinary softmax loss, for face verification. Inspired by the contrastive loss [29] and the triplet loss [30], the main design objective is to simultaneously minimize intra-class dispersion and maximize inter-class margin. However, using the contrastive loss or the triplet loss often involves training with pairs and triplets on the order of O(N^2) and O(N^3), where N is the number of training samples, or one has to rely on a carefully selected sampling strategy.

A series of works hence revisited the interpretation of the softmax loss as a normalized exponential of inner products between the feature vector and class center vectors, and came up with various modifications for achieving the aforementioned design objective, including the large-margin softmax [24], SphereFace [25], CosFace [26], ArcFace [27] and Centralized Coordinate Learning (CCL) [28].

Different from the other four modifications, CCL distributes features dispersedly by centralizing the features to the origin of the space during the learning process, so that feature vectors from different classes can be more separable in terms of a large angle between neighboring classes, and ideally symmetrically distributed over the whole feature space.
The CCL loss is presented as follows:

    L = − (1/N) Σ_{i=1}^N log [ exp(‖Φ(x_i)‖ cos(θ_{y_i})) / Σ_{j=1}^K exp(‖Φ(x_i)‖ cos(θ_{y_j})) ],   (1)

where

    Φ(x_i)_j = (x_{ij} − o_j) / σ_j.   (2)

Φ(x_i)_j and x_{ij} are the j-th coordinates of Φ(x_i) and x_i, respectively, and θ_{y_j} is the angle between x_i and the class center vector w_{y_j}. It is immediately clear that Eq. (2) resembles the famous batch normalization:

    Φ(x_i)_j = γ_j (x_{ij} − o_j) / σ_j + β_j,   (3)

except that the trainable affine transformation after the normalization, defined by γ and β, is missing in the formulation. In Eq. (2) and (3), σ and o are the running standard deviation and running mean updated per mini-batch during training. In fact, it was shown that there would be a slight degradation in performance when the trainable affine transformation is employed. In the rest of this work, we use batch normalization and CCL interchangeably to refer to this discriminative feature learning.

2.3 Vicinal Risk Minimization and Mixup

In statistical learning theory, because the data distribution P(x, y) is unknown in most practical applications, one may approximate it by the empirical distribution

    P_δ(x, y) = (1/N) Σ_n δ(x = x_n, y = y_n),   (4)

where δ(x = x_n, y = y_n) is a Dirac mass centered at (x_n, y_n), and D = {(x_n, y_n) : 1 ≤ n ≤ N} is a given training set sampled from P(x, y). This paradigm of learning via minimizing a risk based on the approximation in Eq. (4) is referred to as Empirical Risk Minimization (ERM) [31]. Instead of a Dirac function, one may employ a vicinity function to take into consideration the vicinity of a true training pair (x_n, y_n); this alternative is therefore called the Vicinal Risk Minimization (VRM) principle [14]. The Gaussian vicinity [14], ν(x, y | x_n, y_n) = N(x − x_n, σ^2) δ(y = y_n), is one of the well-known examples and is equivalent to data augmentation with additive Gaussian noise.
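To make the vicinity idea concrete, the following numpy sketch draws one augmented pair from the Gaussian vicinity above and, for comparison, from the convex-combination vicinity of Mixup [18] introduced next. The function names and the one-hot label handling are our own illustration, not from [14] or [18], and Mixup itself draws λ from a Beta distribution rather than the uniform used here for simplicity.

```python
import numpy as np

def gaussian_vicinity(x_n, y_n, sigma=0.1, rng=None):
    """Sample from nu(x, y | x_n, y_n) = N(x - x_n, sigma^2) delta(y = y_n):
    additive Gaussian noise on the features, label unchanged."""
    rng = rng or np.random.default_rng()
    return x_n + rng.normal(0.0, sigma, size=x_n.shape), y_n

def mixup_vicinity(x_n, y_n, x_m, y_m, rng=None):
    """Sample a convex combination (u, v) of two training pairs,
    with lambda in [0, 1]; y_n and y_m are one-hot label vectors."""
    rng = rng or np.random.default_rng()
    lam = rng.uniform(0.0, 1.0)
    u = lam * x_n + (1.0 - lam) * x_m
    v = lam * y_n + (1.0 - lam) * y_m
    return u, v
```

The mixed label v remains a probability vector, which is what allows Mixup to fill the feature space between classes with softly labeled samples.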
Recently, a novel vicinity function, called Mixup [18], was proposed to cover the entire convex hull of the training pairs:

    ν(x, y | x_n, y_n) = (1/N) Σ_m E_λ [δ(x = u_m, y = v_m)],   (5)
    u_m = λ · x_n + (1 − λ) · x_m,   (6)
    v_m = λ · y_n + (1 − λ) · y_m,   (7)

where λ ∈ [0, 1]. This data augmentation scheme by Mixup is believed to reduce the amount of undesirable oscillation when predicting outside the original training examples, as it fills in the feature space between classes with convex combinations of true training pairs, and hence contributes both to an improvement in classification accuracy and to robustness against adversarial examples.

Classification versus Verification: One can easily observe that the directions for improving classification tasks based on VRM and for refining embeddings in verification tasks based on the aforementioned design objective are drastically different: the former strives to minimize or eliminate the margin between classes while the latter aims to ensure a definite minimum margin. We provide more details in Section 4.

3 SUB-BAND SHAKING

Shake-Shake regularization delivers different results depending on the strength of the shaking, in terms of the range of values α (or β) takes on, as well as whether the same α (or β) is shared within a mini-batch. In addition to batch- or sample-wise shaking, when it comes to the area of acoustic processing there is another orthogonal dimension to consider: the spectral domain. Leveraging domain knowledge, our first proposed models are based on a simple but plausible hypothesis that affective information is distributed non-uniformly over the spectral axis [32]. Therefore, there is no reason to enforce that the entire spectral axis be shaken with the same strength concurrently. Furthermore, adversarial noise may exist and extend over the spectral axis.
By deliberately shaking spectral sub-bands independently, the structure of adversarial noise may be broken and become less confounding to the model.

There has been work on multi-stream frameworks in speech processing. For example, Mallidi et al. [33] designed a robust speech recognition system using multiple streams, each of them attending to a different part of the feature space, to fight against noise. However, lacking both multiple branches and the final information alignment, the design philosophy is fundamentally different from that of multi-branch architectures. In fact, sub-band shaking can be viewed as a bridge between the multi-stream framework and the multi-branch architecture.

Fig. 3: An illustration for the sub-band definitions (upper and lower sub-bands, and the full band, on the time-frequency outputs of the residual branches).

Before we formally define the proposed models, we first introduce the definition of sub-bands. Fig. 3 depicts the definition of sub-bands in a 3-branch residual block. Here we slightly abuse the notations of frequency and time because after two convolutional layers these axes are not exactly the same as those of the input to the branches; however, since convolution is a local operation, they still hold the corresponding spectral and temporal nature. At the output of each branch, we define the high-frequency half to be the upper sub-band and the low-frequency half to be the lower sub-band. We take the middle point on the spectral axis to be the border line for simplicity. The entire output is called the full band. Having defined these concepts, we denote X the input to a residual block, X^i the full band from the i-th branch, X^i_u the upper sub-band from the i-th branch and X^i_l the lower sub-band from the i-th branch. Naturally, the relationship between them is given by X^i = X^i_u | X^i_l.
We also denote Y the output of a Shake-Shake regularized residual block. To study the effectiveness of sub-band shaking for speech emotion recognition, we propose the following models for benchmarking:

1) Shake the full band (Full):

    Y = X + ShakeShake({X^n}_{n=1}^N).   (8)

2) Shake both sub-bands, but independently (Both):

    Y = X + [Y_u | Y_l],   (9)
    Y_u = ShakeShake({X^n_u}_{n=1}^N),
    Y_l = ShakeShake({X^n_l}_{n=1}^N).

4 BATCH NORMALIZED SHAKING

It has been shown that without batch normalization in a residual block, the shaking operation can easily cause training to diverge [10]. Even with various methods to reduce the strength of shaking, such as limiting the range of values α can take on or decreasing the number of residual blocks and hence the number of shaking layers, applying shaking regularization without batch normalization leads to a much inferior model compared to an unregularized ResNeXt model with batch normalization.

Batch Normalization for Symmetrically Distributed Representations: From the perspective of discriminative feature learning, one may assume that without batch normalization, feature vectors from different classes may lie close to each other in an uneven fashion. In this situation, the appropriate strength of shaking may be limited by the margin of the two closest classes, as in the experiments reported in [10]. In fact, a recent study [34] refuted the commonly believed view on the role of batch normalization in reducing the so-called internal covariate shift. Instead, they pointed out that batch normalization actually helps to significantly smooth the optimization landscape during training. Intuitively, it is expectedly easier to explore the neighborhood of well-separated class center vectors than that of a set of unevenly distributed class center vectors, since it is less likely to have overlapping embeddings from different classes in the former case.
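The Full and Both models of Eqs. (8) and (9) differ only in whether one set of shaking coefficients is shared across the whole spectral axis. A minimal numpy sketch of the forward pass of the Both model follows, under assumed conventions (feature maps indexed as (frequency, time), sub-bands split at the spectral midpoint, coefficients from a flat Dirichlet); this is an illustration, not the authors' implementation.

```python
import numpy as np

def shake(branches, rng):
    """Alpha-weighted sum of N branch outputs (forward pass)."""
    alpha = rng.dirichlet(np.ones(len(branches)))
    return sum(a * b for a, b in zip(alpha, branches))

def subband_shake_both(x, branches, rng=None):
    """'Both' model (Eq. 9): shake the two halves of the spectral
    axis independently, concatenate, then add the skip connection.
    Which half is 'upper' (high frequency) depends on the array
    orientation; here rows are assumed to index frequency."""
    rng = rng or np.random.default_rng()
    mid = branches[0].shape[0] // 2  # midpoint of the spectral axis
    y_u = shake([b[:mid] for b in branches], rng)
    y_l = shake([b[mid:] for b in branches], rng)
    return x + np.concatenate([y_u, y_l], axis=0)  # X + [Y_u | Y_l]
```

Sharing one call to `shake` over the un-split branch outputs instead would recover the Full model of Eq. (8).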
Without batch normalization, any inappropriately large strength of shaking may lead to overshot embeddings, cause two or more classes to overlap and, most importantly, result in a slower and inferior convergence. Empirically, we find that the key factor is to keep a distribution of dispersed representations that spans the entire feature space in a symmetric fashion and extends away from the origin, even when shaking is applied. Therefore, we hypothesize that the batch normalization (or CCL) layer right before shaking serves to disperse embeddings and thereby prevents them from overlapping each other under the perturbation around class center vectors that shaking encourages.

Model-Based Representation Augmentation for Coverage of Between-Class Feature Space: From the review in Section 2, we have noticed that classification tasks based on the VRM principle and verification tasks ask for subtly different distributions of representations. The success of Mixup in classification tasks advocates for minimizing or eliminating the gap between classes in feature space via augmented training samples. By doing so, the chance of predicting samples lying outside the coverage of the original or augmented training samples, which leaves room for attacks by adversarial samples [35] or results in uncertain predictions, is reduced. On the other hand, CCL and the other discriminative feature learning algorithms for verification could leave a larger margin between classes (cf. Fig. 4 (a) and (c)). We hypothesize that shaking following a batch normalization layer expands the coverage of training samples and effectively eliminates the margin between classes in feature space, achieving a similar distribution as in Mixup.
One caveat is that shaking does not smooth labels during exploration, and it is therefore more similar to a precursor of Mixup, in which interpolation and extrapolation of the nearest neighbors of the same class in feature space were proposed to enhance generalization [15]. However, shaking operates solely on a single sample at a time, unlike [18] and [15].

4.1 Embedding Learning on MNIST and CIFAR-10

To demonstrate the close interaction between shaking and batch normalization in representation learning, we present two ablation studies, one on the MNIST dataset and the other on the CIFAR-10 dataset. For both sets of experiments, we employ the ordinary softmax loss to examine the effectiveness of batch normalization in representation learning when it is not coupled with the softmax function.

The first set aims to visualize the embeddings of hand-written digit images learned under the influence of batch normalization and shaking. We employ a ResNeXt (20, 2×4d) architecture, where the last residual block reduces the feature dimension to 2 for the purpose of visualization. Fig. 4 depicts embeddings by four layouts of residual blocks, where the top and bottom rows correspond to embeddings of the training and testing samples, respectively. From left to right, the columns represent embeddings learned by models of full pre-activation (Fig. 2(c), PreAct) without shaking, PreAct with shaking, full pre-activation + BN (Fig. 2(d), PreActBN) without shaking, and PreActBN with shaking, respectively.

The first column serves as the baseline in this set of experiments. Immediately, we can observe a severe degradation in separability when applying shaking to PreAct, comparing Fig. 4(a,e) with Fig. 4(b,f). Also notice that shaking without a directly preceding batch normalization can perturb or destroy the symmetric distribution, which is obvious when there is no shaking (the symmetry in Fig.
4(a,e)). This is rather interesting, as batch normalization still exists in the PreAct residual block, only not directly connected to the shaking layer. It seems the exploration encouraged by shaking around each class center has expanded its coverage, but without a directly preceding batch normalization to maintain a good dispersion between classes, each class only expands to overlap with neighboring classes, and the resulting distribution is heavily tilted. Consequently, PreAct with shaking delivers a much inferior performance compared to PreAct.

Fig. 4: MNIST embeddings based on different layouts of residual blocks. We set the feature dimension entering into the output layer to be two and train the models in an end-to-end fashion. The top and bottom rows depict embeddings of the training samples extracted in the train mode (i.e. α ∈ [0, 1]) without updating parameters, and of the testing samples extracted in the eval mode (α = 0.5), respectively. (a,e) full pre-activation (Fig. 2(c)) without shaking; (b,f) full pre-activation (Fig. 2(c)) with shaking; (c,g) full pre-activation + BN (Fig. 2(d)) without shaking; (d,h) full pre-activation + BN (Fig. 2(d)) with shaking. Test accuracies: (e) 98.28%, (f) 93.22%, (g) 98.66%, (h) 98.90%.

The comparison between PreAct (Fig. 4(a,e)) and PreActBN (Fig. 4(c,g)), both without shaking, demonstrates the effectiveness of CCL in discriminative feature learning although it is not coupled with the loss function. In Fig.
4(c), not only does it maintain a symmetric distribution of classes, but it also encourages each class to expand outward and to leave more margin between neighboring classes. As a result, PreActBN without shaking is able to reach a better performance compared to the baseline.

Finally, PreActBN with shaking (Fig. 4(d,h)), on the contrary, does not lead to a larger margin between classes as PreActBN without shaking does. On the other hand, it seems that the shaking has expanded the coverage of each class so that all of them are directly adjacent to each other with a minimal or zero margin. We can also observe that, although batch normalization tries to maintain a symmetric distribution of feature vectors, some of the classes in Fig. 4(d,h) are slightly tilted around the outermost region. However, the most salient difference is the distribution of testing samples, where each class becomes more compact. The performance of PreActBN with shaking is therefore the highest among all of the models.

Fig. 5: Embeddings of training samples extracted in the (a) train and (b) eval mode from PreActBN with shaking.

Fig. 6: Percentage of class samples within a relative distance to the class center, for PreAct, PreAct w/ Shake, PreActBN and PreActBN w/ Shake.

Based on the VRM principle, the region close to boundaries between classes in feature space is covered by augmented training samples. Since in supervised learning we assume testing samples are drawn from a similar or the same distribution as the training samples, the majority of testing samples are mapped to embeddings close to the class center, where most original training samples are mapped to, and thus the distribution seems more compact. In Fig.
5, from the comparison of training embeddings in the train and eval modes, it is visually clear that shaking is expanding the coverage of the training embeddings. To show quantitatively that this is the case, for each layout we calculate the distances of the original training embeddings to their respective class center vectors in the eval mode and plot, in Fig. 6, the percentage of class samples within distances relative to the largest distance in the class calculated in the train mode. It is clear that with shaking most of the original training embeddings are concentrated around the class center. For example, almost 100% of the original training embeddings lie within 0.3 × the largest distance for PreActBN with shaking, and just a small number of original training samples lie close to the boundaries.

TABLE 1: Test error (%) and model size on CIFAR-10

Model                           Depth   Params   Error (%)
ResNeXt (29, 16×64d) [5]         29     68.1M     3.58
Wide ResNet [7]                  28     36.5M     3.80
Shake-Shake (26, 2×96d) [10]     26     26.2M     2.86
Shake-Shake (26, 2×64d) [10]     26     11.7M     2.98
ResNeXt (26, 2×64d) [10]         26     11.7M     3.76
PreActBN (26, 2×64d)†            26     11.7M*    2.95
BN-Shake (26, 2×64d)             26     11.7M*    3.65
PreAct (26, 2×64d)†              26     11.7M*    6.92

* average over three runs   † with shaking

The second set of experiments on CIFAR-10 is designed to measure the contribution of the last batch normalization in PreActBN with shaking. In order to do so, we remove the first two batch normalization layers from the PreActBN residual block and name the new one the BN-Shake residual block (ReLU-Conv-ReLU-Conv-BN-Mul), assuming shaking is applied. Along with PreActBN and PreAct, by presenting BN-Shake, all of them with shaking, we are able to quantitatively demonstrate the crucial location of batch normalization in a shaking-regularized architecture.
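The statistic plotted in Fig. 6 can be sketched as follows. This is a minimal numpy sketch (not the paper's code) that, for each class, measures eval-mode distances to the class center relative to the largest train-mode distance observed in that class; the array names and the use of the eval-mode mean as the class center are our assumptions.

```python
import numpy as np

def within_radius_percentages(emb_eval, emb_train_mode, labels, ratios=(0.1, 0.2, 0.3)):
    """For each class, report the percentage of eval-mode embeddings whose
    distance to the class center falls within ratio * (largest distance
    observed for that class in the train mode)."""
    stats = {}
    for c in np.unique(labels):
        pts_eval = emb_eval[labels == c]
        pts_train = emb_train_mode[labels == c]
        center = pts_eval.mean(axis=0)                  # assumed class center vector
        d_eval = np.linalg.norm(pts_eval - center, axis=1)
        d_max = np.linalg.norm(pts_train - center, axis=1).max()  # train-mode radius
        stats[c] = [100.0 * np.mean(d_eval <= r * d_max) for r in ratios]
    return stats
```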
We modify the open-sourced Torch-based Shake-Shake implementation¹ that was released with [10] to build these three architectures. All of the remaining parameters, such as the cosine learning rate scheduling and the number of epochs, remain unchanged. Only the part that involves the residual block definition is modified to serve our need. We run each experiment three times with different random seeds to obtain a robust estimate of the performance.

Table 1 presents the results of our experiments as well as the quoted performances on CIFAR-10 from [10]. Note that the Shake-Shake ResNeXt in [10] is based on the ReLU-only pre-activation residual block (Fig. 2(b), RPreAct) and is thus different from the PreAct we have here. Although the ResNeXt-26 2×64d is based on the RPreAct structure, with a shallow depth of 26 it should be comparable to one based on PreAct when no shaking is applied [21]. Therefore, we also take it as the baseline for the pre-activation layout.

The performance of PreActBN with shaking (2.95%, mean of 2.89%, 3.00% and 2.95%) is comparable to the reported performance of RPreAct with shaking (2.98%), where both of them have a directly preceding batch normalization layer before shaking. As expected, the performance of PreAct with shaking (6.92%, mean of 6.76%, 6.82% and 7.19%) is much worse than every model in Table 1, including the baseline ResNeXt-26 2×64d. On the other hand, the result of BN-Shake (3.65%, mean of 3.56%, 3.76% and 3.62%) is rather positive. With only one batch normalization layer, it outperforms not only PreAct with shaking but also the baseline ResNeXt-26 2×64d.

1. https://github.com/xgastaldi/shake-shake
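The core of the residual-block modification is the shaking layer itself. Below is a minimal numpy sketch of the forward-pass Shake-Shake combination, for illustration rather than the released Torch code; the full regularization additionally draws an independent coefficient for the backward pass, which requires an autograd framework to express.

```python
import numpy as np

def shake_shake(branch1, branch2, training, rng=None):
    """Stochastic convex combination of two residual branch outputs.

    Train mode: a fresh alpha ~ U[0, 1] is drawn for every forward pass.
    Eval mode: the branches are averaged (alpha = 0.5).
    """
    if rng is None:
        rng = np.random.default_rng()
    alpha = rng.uniform() if training else 0.5
    return alpha * branch1 + (1.0 - alpha) * branch2
```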
This finding highlights the fact that the directly preceding batch normalization plays a crucial role in keeping a good dispersion of intermediate representations when shaking is applied to explore unseen feature space, while the dispersing effect of any other batch normalization that is separated from the shaking layer by convolutional layers is reduced or only auxiliary.

Based on these two sets of experiments, it is safe to state that in the close interaction between batch normalization and shaking, batch normalization is mainly responsible for keeping a dispersed, symmetric distribution of intermediate representations under the perturbation by shaking, while the shaking mechanism expands the coverage of training samples to eliminate the margin between classes by promoting stochastic convex combinations of model branches, similar to [18] and [15]. For speech emotion recognition, we conduct experiments with 3-branch ResNeXt networks based on the original (Fig. 2(a), PostAct), RPreAct, PreActBN and PreAct layouts, with and without shaking, to benchmark these architectures.

4.2 Relation with Batch Normalized Recurrent Neural Networks

So far, we have presented arguments and analyses of experiments to clarify the interplay of batch normalization and shaking, and drawn connections from them to discriminative feature learning and VRM, respectively. In our previous work on sub-band shaking [22] and stochastic Shake-Shake regularization [23] in affective computing, we found that shaking regularization contributes more to constraining the training process, measured in terms of the generalization gap between the training and validation accuracies, than to improving generalization when using PreAct with shaking, which is understandable based on the previous analyses. However, when employing PreActBN for speech emotion recognition, we further observe another interesting behavior of the resulting models.
Compared to PreAct, although the validation performance is improved thanks to the addition of batch normalization, the gap between the training and validation accuracies also drastically increases, which suggests a significant amount of additional over-fitting.

TABLE 2: Performances of PreAct and PreActBN with and without shaking for speech emotion recognition*

Architecture   Shake   Valid UA (%)   Gap (%)
PreAct          ✗       61.342         7.485
                ✓       62.989        −1.128
PreActBN        ✗       63.407         7.410
                ✓       66.194         8.348

* average over three runs

For example, Table 2 gives a comparison between PreAct and PreActBN with and without shaking. With shaking, PreActBN achieves an improvement of 3.205% on the unweighted accuracy (UA) over PreAct at the cost of an increment of 9.476% on the generalization gap (Gap) between the training and validation UAs, simply because of the addition of batch normalization. On the other hand, the addition of shaking to PreActBN brings a 2.787% increment on the UA and a 0.938% increment on the Gap, respectively. Apparently, the addition of batch normalization triggers something that causes this significant change in the generalization gap when shaking is applied.

We suspect that this behavior partially resembles the reported situation in the application of batch normalization to recurrent neural networks [36], [37]. Laurent et al. [36] attempted to apply batch normalization to the recurrent formulation of temporal modeling and concluded that it leads to faster convergence but more over-fitting. To address this issue, Cooijmans et al. [37] attributed the difficulties with recurrent batch normalization to gradient vanishing due to improper initialization of the batch normalization parameters, and proposed to initialize the standard deviation parameter γ in Eq. (3) to a small value such as 0.1, contrary to the common practice of unit initialization.
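Concretely, the proposed remedy only changes how the scale parameter is initialized. The toy numpy batch normalization below (not the authors' code) makes the effect explicit: after standardization, the per-feature output standard deviation is approximately γ, so initializing γ to 0.1 starts training with small-magnitude normalized activations. In PyTorch this would amount to something like `nn.init.constant_(bn.weight, 0.1)`.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Per-feature batch normalization: standardize over the batch,
    then rescale by gamma and shift by beta (cf. Eq. (3) in the text)."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=3.0, size=(4096, 8))
y = batch_norm(x, gamma=0.1, beta=0.0)   # small init proposed in [37]
# per-feature std of y is approximately gamma = 0.1, not 1.0
```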
With a proper initialization, they demonstrated successful applications of batch normalized Long Short-Term Memory (LSTM) networks that converge faster and generalize better.

In fact, the formulations of batch normalized residual blocks and batch normalized recurrent neural networks are strikingly similar: both of them contain a batch normalized sum of multiple batch normalized streams. For example, in PreActBN with shaking the output of the l-th block is as follows:

X_{l+1} = α BN(f_l^1(X_l)) + (1 − α) BN(f_l^2(X_l)) + X_l,   (10)

and in the (l+1)-th block,

f_{l+1}^i(X_{l+1}) = Conv(ReLU(BN(Conv(Y^i)))),   (11)
Y^i = ReLU(BN(X_{l+1})).   (12)

On the other hand, the batch normalized LSTM is given by the following equations [37]:

(f_t, i_t, o_t, g_t) = BN(W_h h_{t−1}) + BN(W_x x_t) + b,   (13)
c_t = σ(f_t) ⊙ c_{t−1} + σ(i_t) ⊙ tanh(g_t),   (14)
h_t = σ(o_t) ⊙ tanh(BN(c_t)),   (15)

where x_t is the input, W_x and W_h are the input and recurrent weights, f_t, i_t, o_t and g_t are the forget, input and output gates and the cell candidate, c_t and h_t are the cell and output vectors, and ⊙ and σ are the Hadamard product and the logistic sigmoid function, respectively.

In these two batch normalized formulations, we find Y^i and h_t rather similar to each other, up to some component-wise scaling and clipping by activations. In addition to benchmarking different layouts of residual blocks, we also present speech emotion recognition experiments to investigate the convergence behavior of shaking-regularized ResNeXt networks with the batch normalization parameter γ initialized to different values.
5 DATABASE AND EXPERIMENTS FOR SPEECH EMOTION RECOGNITION

We use six publicly available emotion corpora to demonstrate the effectiveness of the proposed models, including the eNTERFACE'05 Audio-Visual Emotion Database [38], the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) [39], the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database [40], the Berlin Database of Emotional Speech (Emo-DB) [41], the EMOVO Corpus [42] and the Surrey Audio-Visual Expressed Emotion (SAVEE) database [43]. Some of these corpora are multi-modal, in which speech, facial expression and text all convey a certain degree of affective information. However, in this paper we solely focus on the acoustic modality for experiments.

We formulate the experimental task as a 4-class sequence classification over joy, anger, sadness and fear. We perform sub-utterance sampling [44] by dividing long utterances into several short segments of 6.4 seconds with the same label, in order to limit the length of the longest utterance; for example, a 10-second angry utterance is replaced by two angry segments corresponding to the first 6.4 and the last 6.4 seconds of the original utterance, with some overlapping part. In this way, we also slightly benefit from data augmentation. As a result, we obtain 6803 emotional utterances from the aggregated corpora. Table 3 summarizes the information about these six corpora.

TABLE 3: An overview of the selected corpora, including the number of actors and the distribution of utterances over the emotional classes

Corpus       No. Actors   No. Utterances
                          Joy    Anger   Sadness   Fear
eNTERFACE       42        207    210     210       210
RAVDESS         24        376    376     376       376
IEMOCAP         10        720    1355    1478      0
Emo-DB          10        71     127     66        69
EMOVO           6         84     84      84        84
SAVEE           4         60     60      60        60
Total           96        1518   2212    2274      799

TABLE 4: The gender and corpus distributions in each actor set partition of the cross validation (F: female, M: male)

Corpus       Partition 1   Partition 2   Partition 3   Partition 4
eNTERFACE    2 F, 8 M      2 F, 9 M      3 F, 8 M      2 F, 8 M
RAVDESS      3 F, 3 M      3 F, 3 M      3 F, 3 M      3 F, 3 M
IEMOCAP      1 F, 2 M      1 F, 1 M      1 F, 1 M      2 F, 1 M
Emo-DB       1 F, 2 M      1 F, 1 M      1 F, 1 M      2 F, 1 M
EMOVO        1 F, 0 M      1 F, 1 M      1 F, 1 M      0 F, 1 M
SAVEE        0 F, 1 M      0 F, 1 M      0 F, 1 M      0 F, 1 M
Total        8 F, 16 M     8 F, 16 M     9 F, 15 M     9 F, 15 M

For the evaluation, we adopt a 4-fold cross validation strategy. To begin with, we split the actor set into 4 partitions. In addition, we impose extra constraints to make sure that each partition is as uniform as possible with respect to gender and corpus. For example, each actor set partition is randomly assigned 2-3 female actors and 8-9 male actors from the eNTERFACE'05 corpus. More details are provided in Table 4. By partitioning the actor set, it becomes easier to maintain speaker independence between training and validation throughout all of the experiments.

TABLE 5: ResNeXt (12, 2×8d) network architecture

Layer Name    Structure                  Stride   No. Params
prelim-conv   [2×16, 4] × 1              [1, 1]   132
res-8         [2×16, 8; 2×16, 8] × 3     [1, 1]   22.84K
res-16        [2×16, 16; 2×16, 16] × 1   [1, 1]   24.88K
res-32        [2×16, 32; 2×16, 32] × 1   [1, 1]   99.17K
average       –                          –        –
affine        256 × 4                    –        1024
total         –                          –        148K

We extract the spectrograms of each utterance with a 25 ms window every 10 ms using the Kaldi [45] library.
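The sub-utterance sampling described above can be sketched as follows; this is a minimal, hypothetical helper (not the paper's code) that only computes segment boundaries in seconds.

```python
def sub_utterance_segments(duration, seg_len=6.4):
    """Split an utterance into segments of seg_len seconds sharing its label.

    Utterances not longer than seg_len are kept whole; a 10-second utterance
    yields two overlapping segments, (0, 6.4) and (3.6, 10), i.e. the first
    and the last 6.4 seconds.
    """
    if duration <= seg_len:
        return [(0.0, duration)]
    segments, start = [], 0.0
    while start + seg_len < duration:
        segments.append((start, start + seg_len))
        start += seg_len
    segments.append((duration - seg_len, duration))  # last segment, may overlap
    return segments
```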
Cepstral mean and variance normalization is then applied to the spectrogram frames per utterance. To equip each frame with a certain context, we splice it with 10 frames on the left and 5 frames on the right. Therefore, a resulting spliced frame has a resolution of 16 × 257. Since emotion involves a longer-term mental state transition, we further down-sample the frame rate by a factor of 8 to simplify and expedite the training process.

We build a ResNeXt (12, 2×8d) architecture that consists of only 3 residual stages, res-8, res-16 and res-32, in which the filter size is fixed to 2 × 16. There are 3 residual blocks in res-8, and one residual block in each of res-16 and res-32. Before the affine transformation, we add a dropout layer with probability 0.5. Table 5 contains the details of the architecture. For each utterance, a simple mean pooling is taken at the output of the final residual block to form an utterance representation before it is mapped by the affine layer. We avoid explicit temporal modeling layers such as a long short-term memory recurrent network because our focus is to investigate the effectiveness of shaking the ResNeXt. Note that a shaking layer has no parameters to learn, and hence the model size in this work stays almost constant and only changes slightly when the number of batch normalization layers differs, since batch normalization has a set of trainable scaling and shift parameters.

We implement the shaking layer as well as the entire network architecture using the PyTorch [46] library. Only the Shake-Shake combination [10] is used, and shaking is applied independently per frame. For simplicity, we may also refer to the Shake-Shake regularization as shaking regularization. Due to class imbalance in the aggregated corpora, the objective function for training is the weighted cross-entropy, where the class weight is inversely proportional to the class size.
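With class weights inversely proportional to class size, the training objective can be sketched as below; this is a numpy illustration (the normalization of the weights to mean one is our choice), with the PyTorch equivalent being `nn.CrossEntropyLoss(weight=...)`.

```python
import numpy as np

def class_weights(labels, n_classes):
    """Weights inversely proportional to class size, normalized to mean 1."""
    counts = np.bincount(labels, minlength=n_classes).astype(float)
    w = 1.0 / counts
    return w * n_classes / w.sum()

def weighted_cross_entropy(logits, labels, weights):
    """Mean of per-sample negative log-likelihoods scaled by class weight."""
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    nll = -log_probs[np.arange(len(labels)), labels]
    return np.mean(weights[labels] * nll)
```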
The models are learned using the Adam optimizer [47] with an initial learning rate of 0.001, and the training is carried out on an NVIDIA Tesla P100 GPU. We use a mini-batch of 8 utterances across all model training, let each experiment run for 1000 epochs, and repeat each experiment for three runs with different random seeds to obtain a robust estimate of the performance.

For the sub-band shaking experiments, we initially employ the fully pre-activation layout (Fig. 2(c)). PreAct denotes the model without shaking, while PreAct-Full and PreAct-Both correspond to the models with the Full and Both shaking (Eq. 8, 9), respectively. We only consider two sub-bands, defined by the middle point on the spectral axis, to present our point. However, this does not rule out the feasibility of experiments with more sub-bands. In fact, in the multi-stream framework for speech recognition, more sub-bands were considered to generate more combinations and diversity.

We conduct experiments to benchmark different layouts, including PostAct (Fig. 2(a)), RPreAct (Fig. 2(b)) and PreActBN (Fig. 2(d)), with and without (Full) shaking. Furthermore, we choose three different initialization values of γ in batch normalization and apply them to each of the PostAct, RPreAct and PreActBN layouts to investigate whether batch normalized ResNeXt networks also behave like batch normalized recurrent neural networks in terms of faster convergence and better generalization, in addition to the resemblance in mathematical formulations. In this set of experiments, we use the affine transformation in batch normalization even though CCL recommends against it, except in the last batch normalization before the output layer, where we only standardize the input and do not rescale it.
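The sub-band variants described above can be sketched as follows. This numpy sketch reflects our reading of the Both variant (Eq. 8-9 are not reproduced in this section): the two branch outputs are split at the middle of the spectral axis and an independent α is drawn per sub-band; the Full variant is the ordinary Shake-Shake with a single α for the whole band.

```python
import numpy as np

def sub_band_shake(b1, b2, training, rng=None, freq_axis=-1):
    """'Both' sub-band shaking: split two branch outputs at the spectral
    midpoint and draw an independent alpha for each sub-band."""
    if rng is None:
        rng = np.random.default_rng()
    mid = b1.shape[freq_axis] // 2
    lo1, hi1 = np.split(b1, [mid], axis=freq_axis)
    lo2, hi2 = np.split(b2, [mid], axis=freq_axis)
    if training:
        a_lo, a_hi = rng.uniform(), rng.uniform()  # independent per sub-band
    else:
        a_lo = a_hi = 0.5                          # eval mode: plain average
    lo = a_lo * lo1 + (1 - a_lo) * lo2
    hi = a_hi * hi1 + (1 - a_hi) * hi2
    return np.concatenate([lo, hi], axis=freq_axis)
```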
Finally, we revisit sub-band shaking with the PreActBN layout and with different initialization values of γ to present a complete comparison between shaking on the full band and shaking on the sub-bands independently. With this set of experiments, we may also examine the effectiveness of the trade-off between the unweighted accuracy (UA) and the generalization gap (Gap) between the training and validation UAs when batch normalization is applied right before shaking.

6 EXPERIMENTAL RESULTS

6.1 Sub-band Shaking

The results of sub-band shaking are presented in Table 6. It is clear that both PreAct-Full and PreAct-Both are able to improve over PreAct, unlike in the experiments on the MNIST and CIFAR-10 datasets. For both shaking-regularized models, however, the improvement on the UA is comparably smaller than the reduction on the generalization gap, which suggests the models are learning a harder problem in the presence of shaking. It could also be the difference in the nature of the tasks or in the number of classes that makes PreAct with shaking improve instead of degrade.

TABLE 6: Performances of PreAct, PreAct-Full and PreAct-Both*

Model          Valid UA (%)   Gap (%)
PreAct          61.342          7.485
PreAct-Full    †62.989        †−1.128
PreAct-Both    †64.973        †1.791

* average over three runs   † significantly (p < 0.05) outperforms PreAct

Furthermore, we conduct statistical hypothesis testing to examine the significance of the observed improvements. The resulting p-values from a one-sided paired t-test indicate that the improvements of PreAct-Full and PreAct-Both over PreAct are statistically significant, in terms of both the UA and the Gap. In addition, compared to PreAct-Full, PreAct-Both reaches a higher UA but also a larger gap. In fact, the improvement of PreAct-Both over PreAct-Full is also significant in terms of the UA.
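The tests above are one-sided paired t-tests over the 4 folds × 3 runs = 12 paired scores (df = 11). A minimal pure-Python sketch of the statistic is given below; the resulting value can be compared against the one-sided critical value t(0.95, 11) ≈ 1.796, or converted to a p-value with a t-distribution survival function such as `scipy.stats.t.sf`.

```python
import math

def paired_t_statistic(a, b):
    """t statistic of a one-sided paired t-test for H1: mean(a - b) > 0."""
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)  # unbiased sample variance
    return mean / math.sqrt(var / n)
```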
On the other hand, PreAct-Full has a significant reduction on the Gap compared to the reduction of PreAct-Both on the Gap. More details are presented in Table 7.

TABLE 7: P-values resulting from a one-sided paired t-test with df = 4 × 3 − 1 = 11

Model Pair                   Valid UA       Gap
PreAct vs PreAct-Full        1.44 × 10⁻²    2.7 × 10⁻⁴
PreAct vs PreAct-Both        1.22 × 10⁻⁵    3.0 × 10⁻³
PreAct-Full vs PreAct-Both   2.60 × 10⁻³    9.9 × 10⁻¹

Based on these observations, we can think of sub-band shaking as a method to relax the strength of shaking. This way, one may trade off between the amount of improvement on the UA and the amount of reduction on the Gap. Our previous work on stochastic Shake-Shake regularization [23] also investigated another method to trade off between these two performance metrics by randomly turning off shaking. Since sub-band shaking and stochastic Shake-Shake behave similarly and the latter does not contribute directly to the focus of this work, we choose to present only sub-band shaking here.

6.2 Experiments with Different Layouts

The results of benchmarking the layouts are presented in Table 8. First of all, we notice that even without shaking, all of PostAct, RPreAct and PreActBN achieve higher UAs with significance (denoted by ⋆) compared to that of PreAct, while the UAs of these three models are comparable among themselves, i.e., there is no statistical significance in their differences. These improvements on the UA could be attributed to discriminative feature learning, i.e., CCL, since we additionally apply batch normalization without rescaling before the final affine layer in these three models. We deliberately employ CCL in these three models to set up a set of competitive baselines.

Next, we look at the layouts with shaking at γ0 = 1.0. We immediately observe that even with the common practice of unit initialization for γ, all three layouts give further improvements on the UA.
However, none of them is able to reduce the generalization gap with statistical significance. The closest one is RPreAct with shaking at γ0 = 1.0, whose p-value for the test is 0.088 due to a high variation of the generalization gap across folds. With the common practice of unit initialization, PreActBN converges to the most accurate model with a UA as high as 66.194%, but it also results in the largest generalization gap.

In fact, we could see that for all values of γ0 we choose, these three layouts all outperform their respective baseline models, the same layout without shaking, in terms of the UA with significance (denoted by †), and all of them show a trend of generalization gap reduction, only that the majority of them are unable to reduce the generalization gap with significance, except for two settings. Both RPreAct and PreActBN outperform their baseline models simultaneously on the UA and on the Gap, achieving a higher accuracy as well as reduced over-fitting at the same time, when γ is initialized as 0.05.

TABLE 8: Performances of PreAct, PostAct, RPreAct and PreActBN with and without shaking for speech emotion recognition, where γ0 is the initialization value of the standard deviation parameter γ in batch normalization*

Architecture   Shake   γ0      Valid UA (%)   Gap (%)
PreAct          ✗      –        61.342         7.485
                ✓      –       †62.989        †−1.128
PostAct         ✗      –       ⋆62.782         7.822
                ✓      1.00    †65.536         7.657
                ✓      0.20    †65.490         5.431
                ✓      0.10    †65.483         7.255
                ✓      0.05    †65.512         7.205
RPreAct         ✗      –       ⋆63.939         8.517
                ✓      1.00    †65.859         7.063
                ✓      0.20    †65.821         8.305
                ✓      0.10    †64.899         7.958
                ✓      0.05    †65.052        †5.821
PreActBN        ✗      –       ⋆63.407         7.410
                ✓      1.00    †66.194         8.348
                ✓      0.20    †65.789        ‡5.817
                ✓      0.10    †66.097        ‡6.040
                ✓      0.05    †66.418        †‡3.416

* average over three runs
⋆ significantly (p < 0.05) outperforms PreAct without shaking
† significantly (p < 0.05) outperforms the same layout without shaking
‡ significantly (p < 0.05) outperforms the same layout at γ0 = 1.0

PreActBN with shaking at γ0 = 0.05 is also the best model, both the highest on the UA and the lowest on the Gap (except for PreAct with shaking, which is known to be difficult to learn). In other words, the result of this model validates with statistical significance our hypothesis that batch normalized ResNeXt networks are locally similar in formulation to batch normalized recurrent neural networks, and that both of them require a careful selection of initialization values for the batch normalization parameters to avoid the aforementioned difficulties.

For the comparison between layouts, we find that PreActBN with shaking at γ0 = 0.05 also outperforms most of the settings of the PostAct and RPreAct layouts with statistical significance, except for the reduction on the generalization gap by PostAct at γ0 = 0.2, where PreActBN@0.05 is not better with significance. We summarize the complete comparison results in Table 9.

TABLE 9: P-values from one-sided paired t-tests between PreActBN at γ0 = 0.05 and PostAct and RPreAct with various values of γ0

Architecture Pair            γ0      Valid UA        Gap
PreActBN@0.05 vs PostAct     1.00    2.37 × 10⁻³     4.12 × 10⁻²
                             0.20    2.96 × 10⁻³     6.44 × 10⁻²
                             0.10    1.71 × 10⁻²     2.15 × 10⁻²
                             0.05    2.21 × 10⁻²     1.27 × 10⁻²
PreActBN@0.05 vs RPreAct     1.00    1.29 × 10⁻²     1.00 × 10⁻²
                             0.20    2.85 × 10⁻²     2.44 × 10⁻³
                             0.10    5.07 × 10⁻⁴     1.34 × 10⁻²
                             0.05    4.14 × 10⁻³     4.23 × 10⁻²

Finally, when comparing the γ0-initialized batch normalized networks (γ0 = 0.05, 0.1, 0.2) with their counterparts under unit initialization, we find that although both RPreAct and PreActBN show a tendency to reduce the Gap with a smaller γ0, only PreActBN manages to outperform its unit-initialized counterpart (denoted by ‡). This may suggest that the extra batch normalization between residual blocks in PreActBN makes it structurally more similar to batch normalized recurrent neural networks. On the other hand, any two batch normalization layers in PostAct or RPreAct are always separated by a convolutional layer.

TABLE 10: Performances of PreAct and PreActBN with sub-band shaking for speech emotion recognition, where γ0 is the initialization value of the standard deviation parameter γ in batch normalization*

Architecture   Shake   γ0      Valid UA (%)   Gap (%)
PreAct          ✓      1.00     64.973         1.791
PreActBN        ✓      1.00     65.544         8.034
                ✓      0.20     65.679        ‡4.408
                ✓      0.10     66.170         6.257
                ✓      0.05    ‡66.432        ‡5.539

* average over three runs
‡ significantly (p < 0.05) outperforms the same layout at γ0 = 1.0

In Section 6.1, we have seen that sub-band shaking leads to an improvement on the UA but also enlarges the generalization gap compared to shaking on the full band. However, there we employed PreAct instead of PreActBN and did not experiment with different initialization values of γ. To present a complete comparison between shaking on the full band and on the sub-bands independently, we conduct experiments on sub-band shaking with the PreActBN layout and with the chosen initialization values of γ. The results are summarized in Table 10. Again, comparing PreAct and PreActBN both at γ0 = 1.00, we can easily observe a significant increment of the generalization gap along with the additional batch normalization. With a smaller initialization value, not only does PreActBN reduce the generalization gap, but it also improves the UA when γ0 = 0.05 with statistical significance.
Yet, the resulting 5.539% on the Gap is still significantly larger than the 1.791% on the Gap by the PreAct layout. We further compare the performance of PreActBN with shaking on the full band (66.418% / 3.416%) and on the sub-bands independently (66.432% / 5.539%). The p-values for these two settings are 0.51 and 0.079 for testing between the UAs and the Gaps, respectively. In the end, shaking on the full band and on the sub-bands independently give comparable performances, with shaking on the full band slightly better at reducing the generalization gap.

7 CONCLUSION

We investigate the recently proposed Shake-Shake regularization and its variants for classification in general and speech emotion recognition in particular. In order to explain the observed interaction between batch normalization and shaking regularization, we base our ablation analysis of batch normalization in the MNIST experiments on discriminative feature learning. Our experiments show that the batch normalization right before shaking regularization is crucial in that it keeps a dispersed, symmetric distribution of intermediate representations from being tilted by the random perturbation due to shaking. Without this batch normalization, shaking regularization could easily tilt the distribution of each class and cause the classes to overlap with each other.

In addition, we highlight the subtle difference in the requirements on embeddings for classification tasks and verification tasks: according to the vicinal risk minimization (VRM) principle and the recent success of Mixup [18], classification tasks should try to minimize the margin between classes in feature space, while verification tasks aim to find an embedding distribution that has a minimal intra-class variation but a maximal inter-class dispersion.
From this perspective, we find that the embeddings by PreActBN with shaking are indeed distributed with a small or zero margin between classes (Fig. 4(d)), while the embeddings by PreActBN without shaking are distributed with large margins (Fig. 4(c)). Moreover, since the original embeddings are distributed close to the class center vectors, the distribution of testing embeddings, which is often assumed to be the same as or similar to the distribution of training samples, becomes more compact and hence leads to a higher accuracy. This finding provides a direct explanation, based on the VRM principle, for the ability of shaking regularization to help improve classification tasks. Another set of experiments on CIFAR-10 further validates that it is the direct concatenation of batch normalization and shaking that contributes the most to the improvement on classification accuracy, and that the other layers of batch normalization are auxiliary.

In addition, we find that the formulation of batch normalized residual blocks resembles that of batch normalized recurrent neural networks. Moreover, both of these architectures suffer from a reported issue of fast convergence but more over-fitting. To reduce the observed increment of the generalization gap in our speech emotion recognition experiments, we properly initialize the γ parameter with a smaller value. The experimental results validate our hypothesis and give a significant reduction on the generalization gap while achieving the same or a higher improvement on the UA. A final comparison between shaking on the full band and on the sub-bands independently shows that with the additional batch normalization in PreActBN and a proper initialization value of γ, the difference between these two kinds of shaking is minimized, with shaking on the full band slightly better at reducing the generalization gap.

REFERENCES

[1] K. He, X. Zhang, S. Ren, and J.
Sun, "Deep Residual Learning for Image Recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[2] J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin, "Convolutional Sequence to Sequence Learning," arXiv:1705.03122, 2017.
[3] C. Huang and S. S. Narayanan, "Deep Convolutional Recurrent Neural Network with Attention Mechanism for Robust Speech Emotion Recognition," in IEEE International Conference on Multimedia and Expo (ICME), 2017.
[4] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A Simple Way to Prevent Neural Networks from Overfitting," J. Mach. Learn. Res., vol. 15, no. 1, Jan. 2014.
[5] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, "Aggregated Residual Transformations for Deep Neural Networks," arXiv:1611.05431, 2016.
[6] S. Ioffe and C. Szegedy, "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift," in Proceedings of the 32nd International Conference on Machine Learning (ICML), 2015.
[7] S. Zagoruyko and N. Komodakis, "Wide Residual Networks," in BMVC, 2016.
[8] G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Weinberger, "Deep Networks with Stochastic Depth," in ECCV, 2016.
[9] X. Li, S. Chen, X. Hu, and J. Yang, "Understanding the Disharmony Between Dropout and Batch Normalization by Variance Shift," arXiv:1801.05134, 2018.
[10] X. Gastaldi, "Shake-Shake Regularization," in International Conference on Learning Representations Workshop, 2017.
[11] Y. Yamada, M. Iwamura, and K. Kise, "ShakeDrop Regularization," 2018.
[12] A. Krizhevsky, "Learning Multiple Layers of Features from Tiny Images," Technical Report, 2009.
[13] N. X. Vinh, S. M. Erfani, S. Paisitkriangkrai, J. Bailey, C. Leckie, and K. Ramamohanarao, "Training Robust Models Using Random Projection," in Proceedings of the 23rd International Conference on Pattern Recognition (ICPR), 2016.
[14] O. Chapelle, J. Weston, L. Bottou, and V. Vapnik, "Vicinal Risk Minimization," in Advances in Neural Information Processing Systems, 2000.
[15] T. DeVries and G. W. Taylor, "Dataset Augmentation in Feature Space," in International Conference on Learning Representations Workshop, 2017.
[16] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, "SMOTE: Synthetic Minority Over-sampling Technique," Journal of Artificial Intelligence Research, 2002.
[17] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the Inception Architecture for Computer Vision," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[18] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, "mixup: Beyond Empirical Risk Minimization," in International Conference on Learning Representations, 2018.
[19] V. Verma, A. Lamb, C. Beckham, A. Courville, I. Mitliagkas, and Y. Bengio, "Manifold Mixup: Encouraging Meaningful On-Manifold Interpolation as a Regularizer," 2018.
[20] Y. LeCun and C. Cortes, "MNIST Handwritten Digit Database," 2010.
[21] K. He, X. Zhang, S. Ren, and J. Sun, "Identity Mappings in Deep Residual Networks," in Proceedings of the European Conference on Computer Vision (ECCV), B. Leibe, J. Matas, N. Sebe, and M. Welling, Eds., 2016.
[22] C. Huang and S. S. Narayanan, "Shaking Acoustic Spectral Sub-bands Can Better Regularize Learning in Affective Computing," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2018.
[23] C.-W. Huang and S. S. Narayanan, "Stochastic Shake-Shake Regularization for Affective Learning from Speech," in Proceedings of Interspeech, 2018.
[24] W. Liu, Y. Wen, Z. Yu, and M. Yang, "Large-Margin Softmax Loss for Convolutional Neural Networks," in Proceedings of the 33rd International Conference on Machine Learning, 2016, pp. 507-516.
[25] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song, "SphereFace: Deep Hypersphere Embedding for Face Recognition," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[26] H. Wang, Y. Wang, Z. Zhou, X. Ji, D. Gong, J. Zhou, Z. Li, and W. Liu, "CosFace: Large Margin Cosine Loss for Deep Face Recognition," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[27] J. Deng, J. Guo, and S. Zafeiriou, "ArcFace: Additive Angular Margin Loss for Deep Face Recognition," 2018.
[28] X. Qi and L. Zhang, "Face Recognition via Centralized Coordinate Learning," arXiv:1801.05678, 2018.
[29] R. Hadsell, S. Chopra, and Y. LeCun, "Dimensionality Reduction by Learning an Invariant Mapping," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2006.
[30] F. Schroff, D. Kalenichenko, and J. Philbin, "FaceNet: A Unified Embedding for Face Recognition and Clustering," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[31] V. Vapnik, Statistical Learning Theory. J. Wiley, 1998.
[32] C. M. Lee, S. Yildirim, M. Bulut, A. Kazemzadeh, C. Busso, Z. Deng, S. Lee, and S. Narayanan, "Emotion Recognition Based on Phoneme Classes," in Proceedings of InterSpeech, 2004.
[33] H. Mallidi and H. Hermansky, "Novel Neural Network Based Fusion for Multistream ASR," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, March 2016.
[34] S. Santurkar, D. Tsipras, A. Ilyas, and A. Madry, "How Does Batch Normalization Help Optimization? (No, It Is Not About Internal Covariate Shift)," arXiv:1805.11604, 2018.
[35] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. J. Goodfellow, and R.
Fergus , “Intriguing Pro perties of Neural Networks,” in International Conference on Learning Representations , 2014 . [36] C. Laurent, G. Pereyra, P . Brakel, Y . Zhang, and Y . Bengio, “Batch Normali zed Rec urrent Neural Netw orks,” in Proceedings o f the IEEE Internati o nal Conference o n Audio, Speech and Signal Processing , 2016. [37] T . Cooijmans, N. Ballas, C. Laurent, C ¸ . G ¨ ulc ¸eh re, and A. Courville, “Recurrent Batch Normalizatio n,” in Internatio nal Conference o n Learning Represent ations , 2017. [38] O. Martin, I. Kotsia, B. M. Macq, and I. Pitas , “The eNTERF ACE’05 Audio-Visua l Emotion Database,” in Proceedings of the International Conference on Data Engineering Workshops , 2006 . [39] S. R. Livingstone, K. Pec k, and F . A. Russo, “RA VDESS: The Ryers on Audio-Visual Database of Emotional Speech and Song,” in Proceedings of the 22nd Annual Meeting of the Canadian Society for Brain, Behaviour and Cogniti ve Science , 2012. [40] C. Busso, M. B ulut, C.-C. Lee , A. Kazemzadeh, E. Mower , S. Kim, J. N. Chang, S. Lee, and S. S. Narayana n, “IEMOCAP: Interactive Emotional D yadic Motion Capture Database,” Language Resources and Evaluation , vol. 42, no. 4, p. 335, Nov 2008. [41] F . B urkhar dt, A. Paeschke, M. Rolfes, W . F . Sendlmeier , and B. W eiss, “A Database of German E motional Speech, ” in Proceed- ings of Interspeech , 2005. [42] G . Costantini, I. Iaderola, andrea Paoloni, and M. T odisco, “EMOVO Corpus: an Italian Emotional Speech Database,” in Proceedings of the 9th Internationa l Conference on Language Resources and Evaluation , 2014. [43] S. -U. Haq and P . J. Jackson, Machine Audition: Principl es, Algorithms and Systems . IGI Global, 2010, ch. Multimodal Emotion Recogni- tion. [44] G . Keren, J. Deng, J. Pohjalainen, and B. Schuller , “Convolutional Neural Networks w ith Data Augmentation for Classifying Speak- ers Native Language,” in P roceedings of Interspeech , 2016. [45] D . Povey , A. Gh oshal, G. Boulianne, N. Goel, M. 
Hannemann, Y . Qian, P . Sc hwarz, and G. Stemmer , “The KALDI Speech Recog- nition Toolkit,” in P roceedings of the IEEE Workshop o n Automatic Speech Recognition and Understanding , 2011. [46] “Pytorch: T ensors and Dynamic Neural Networks in Python with Strong GPU Acceleration,” 201 7, https://github.com/pytorch/pytor ch. [47] D . P . Kingma and J. Ba, “Adam: A Method for Stochastic Opti- mization,” , 2014.
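To make the shaking operation discussed above concrete, the following is a minimal numpy sketch of the forward pass of Shake-Shake regularization: during training, the fixed 0.5/0.5 sum of two residual branches is replaced by a random convex combination, and at test time the expectation is used. This is an illustrative simplification, not the authors' implementation; in the full method a second, independent random coefficient is also drawn for the backward pass, which a framework with automatic differentiation (e.g. PyTorch) would handle separately.

```python
import numpy as np

def shake(branch_a, branch_b, training=True, rng=None):
    """Forward-pass shaking of two residual branches.

    Training: a random convex combination alpha * a + (1 - alpha) * b,
    with one alpha drawn per forward pass.
    Evaluation: the expectation, i.e. the usual 0.5/0.5 average.
    (The backward-pass shaking of the full method is omitted here.)
    """
    if rng is None:
        rng = np.random.default_rng()
    alpha = rng.uniform(0.0, 1.0) if training else 0.5
    return alpha * branch_a + (1.0 - alpha) * branch_b
```

Because the combination is convex, the shaken output always lies in the envelope spanned by the two branch outputs, which is one way to view it as generating vicinal training points in feature space.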
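The effect of initializing γ with a smaller value can be sketched numerically. In the toy example below (not the paper's model; the function and values are illustrative), a plain batch normalization is applied with the usual γ = 1 and with a smaller γ = 0.1: the smaller scale shrinks the standard deviation of the normalized activations by the same factor, damping what each residual branch contributes before shaking mixes them.

```python
import numpy as np

def batch_norm(x, gamma, beta=0.0, eps=1e-5):
    # Standard batch normalization over the batch axis (inference-style,
    # using the statistics of the batch itself).
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

x = np.random.default_rng(0).normal(loc=3.0, scale=2.0, size=(256, 8))
y_default = batch_norm(x, gamma=1.0)   # usual initialization: gamma = 1
y_small = batch_norm(x, gamma=0.1)     # smaller initialization of gamma
# y_default has (per-feature) standard deviation ~1.0, y_small ~0.1.
```

In a residual block this smaller initial scale keeps the branch outputs close to the identity path early in training, which is the mechanism invoked above to slow the fast convergence and shrink the generalization gap.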