Shaking Acoustic Spectral Sub-bands Can Better Regularize Learning in Affective Computing


Authors: Che-Wei Huang, Shrikanth Narayanan

SHAKING ACOUSTIC SPECTRAL SUB-BANDS CAN BETTER REGULARIZE LEARNING IN AFFECTIVE COMPUTING

Che-Wei Huang, Shrikanth Narayanan
University of Southern California, Los Angeles, CA 90089
cheweihu@usc.edu, shri@sipi.usc.edu

ABSTRACT

In this work, we investigate a recently proposed regularization technique based on multi-branch architectures, called Shake-Shake regularization, for the task of speech emotion recognition. In addition, we also propose variants to incorporate domain knowledge into model configurations. The experimental results demonstrate that: 1) independently shaking sub-bands delivers favorable models compared to shaking the entire spectral-temporal feature maps; and 2) with proper patience in early stopping, the proposed models can simultaneously outperform the baseline and maintain a smaller performance gap between training and validation.

Index Terms — Shake-Shake Regularization, Sub-band Shaking, Adversarial Training, Affective Computing, Speech Emotion Recognition

1. INTRODUCTION

Deep convolutional neural networks have been successfully applied to several pattern recognition tasks such as image recognition [1], machine translation [2] and speech emotion recognition [3]. Currently, to successfully train a deep neural network, one needs either a sufficient number of training samples to implicitly regularize the learning process, or to employ techniques such as weight decay and dropout [4] and its variants to explicitly keep the model from over-fitting. In recent years, one of the most popular and successful architectures has been the residual neural network (ResNet) [1]. The ResNet architecture was designed based on a key assumption that it is more efficient to optimize the residual term than the original task mapping. Since then, a great deal of effort in machine learning and computer vision has been dedicated to the study of multi-branch architectures.
Deep convolutional neural networks have also gained much attention in the affective computing community, mainly because of their outstanding ability to formulate discriminative features for the top-layer classifier. Usually the number of parameters in a model is far greater than the number of training samples, and thus heavy regularization is required to train deep neural networks for affective computing. However, since the introduction of batch normalization [5], the gains obtained by using dropout for regularization have decreased [5, 6, 7]. Yet, multi-branch architectures have emerged as a promising alternative.

Regularization techniques based on multi-branch architectures, such as Shakeout [8] and Shake-Shake [9], have delivered impressive performances on standard image datasets such as CIFAR-10 [10]. In a clever way, both of them utilize multiple branches to learn different aspects of the relevant information, followed by a summation for information alignment among the branches. Instead of using multiple branches, a recent work [11] based on a mixture of experts showed that randomly projecting samples is able to break the structure of adversarial noise that could easily confound the model and, as a result, mislead the learning process. Despite not being an end-to-end approach, it shares the same idea of integrating multiple streams of model-based diversity.

In this work, we study the Shake-Shake regularized ResNet for speech emotion recognition. In addition to shaking the entire spectral-temporal feature maps with the same strength, we propose to treat different spectral sub-bands independently, based on our hypothesis of a non-uniform distribution of affective information over the spectral axis. There has been work on multi-stream frameworks in speech processing. For example, Mallidi et al.
[12] designed a robust speech recognition system using multiple streams, each of them attending to a different part of the feature space, to fight against noise. However, lacking both multiple branches and the final information alignment, its design philosophy is fundamentally different from that of multi-branch architectures. In fact, we intend this work to serve as a bridge between the multi-stream framework and the multi-branch architecture.

1.1. Shake-Shake Regularization

Shake-Shake regularization [9] is a recently proposed technique to regularize the training of deep convolutional neural networks for image recognition tasks. This regularization technique, based on multi-branch architectures, promotes stochastic mixtures of forward and backward propagations from network branches in order to create a flow of model-based adversarial learning samples/gradients during the training phase. Owing to its excellent ability to combat over-fitting, even in the presence of batch normalization, the Shake-Shake regularized 3-branch residual neural network [9] has achieved the current state-of-the-art performance on the CIFAR-10 image dataset. An overview of a 3-branch shake-shake regularized ResNet is depicted in Fig. 1.

Fig. 1. An overview of a 3-branch shake-shake regularized residual block [9]. (a) Forward propagation during the training phase. (b) Backward propagation during the training phase. (c) Testing phase.
The coefficients α and β are sampled from the uniform distribution over [0, 1] to scale down the forward and backward flows during the training phase.

In addition to the short-cut flow (in light gray), there are two other residual branches B(x), each of them consisting of a sequence of layers stacked in order: Conv H×W, Batch Normalization, ReLU, Conv H×W, Batch Normalization, where Conv H×W represents a convolutional layer with filters of size H×W without pooling, and ReLU is the rectified linear unit ReLU(x) = max(0, x). The Shake-Shake regularization adds, on top of the aggregate of the branch outputs, an additional layer, called the ShakeShake layer, to randomly generate adversarial flows in the following way:

ShakeResNet_N(X) = X + \sum_{n=1}^{N} ShakeShake\big(\{B_n(X)\}_{n=1}^{N}\big),

where in the forward propagation, for a = [α_1, ..., α_N] sampled from the (N−1)-simplex (Fig. 1(a)),

ShakeResNet_N(X) = X + \sum_{n=1}^{N} α_n B_n(X),

while in the backward propagation, for b = [β_1, ..., β_N] sampled from the (N−1)-simplex and g the gradient from the top layer, the gradient entering B_n(X) is β_n g (Fig. 1(b)). At testing time, the expected model is evaluated for inference by taking the expectation over the random sources in the architecture (Fig. 1(c)). In each mini-batch, applying the scaling coefficients α or β either to the entire mini-batch or to each individual sample independently can also make a difference [9].

2. PROPOSED MODELS

In addition to batch- or sample-wise shaking, when it comes to acoustic processing, there is another orthogonal dimension to consider: the spectral domain. Leveraging domain knowledge, our proposed models are based on a simple but plausible hypothesis that affective information is distributed non-uniformly over the spectral axis [13].
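The forward, backward, and inference behaviors described above can be sketched as follows. This is a minimal NumPy sketch, not the authors' PyTorch implementation; the branch outputs are stand-ins for B_n(X), and `simplex_sample` is a hypothetical helper for drawing the coefficients from the (N−1)-simplex:

```python
import numpy as np

rng = np.random.default_rng(0)

def simplex_sample(n):
    """Draw one point uniformly from the (n-1)-simplex (hypothetical helper)."""
    w = rng.exponential(1.0, size=n)
    return w / w.sum()

def shake_forward(x, branches):
    """Forward pass: y = x + sum_n alpha_n * B_n(x), alpha on the simplex."""
    alpha = simplex_sample(len(branches))
    return x + sum(a * b for a, b in zip(alpha, branches))

def shake_backward_grads(g, n_branches):
    """Backward pass: the gradient entering branch n is beta_n * g,
    with beta drawn independently of alpha."""
    beta = simplex_sample(n_branches)
    return [b * g for b in beta]

def shake_inference(x, branches):
    """Test time: replace the random coefficients by their expectation 1/N."""
    n = len(branches)
    return x + sum(b / n for b in branches)

x = np.ones((4, 8))                                      # a toy feature map
branches = [rng.normal(size=x.shape) for _ in range(2)]  # stand-ins for B_n(x)
y_train = shake_forward(x, branches)                     # stochastic during training
y_test = shake_inference(x, branches)                    # deterministic at test time
```

Because α and β are drawn independently, the backward flow deliberately mismatches the forward one, which is the source of the model-based adversarial gradients.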
Therefore, there is no reason to enforce the entire spectral axis to be shaken with the same strength concurrently. Furthermore, adversarial noise may exist and extend over the spectral axis. By deliberately shaking spectral sub-bands independently, the structure of adversarial noise may be broken and become less confounding to the model.

Fig. 2. An illustration of the sub-band definitions.

Before we formally define the proposed models, we first introduce the definition of sub-bands. Fig. 2 depicts the definition of sub-bands in a 3-branch residual block. Here we slightly abuse the notions of frequency and time because, after two convolutional layers, these axes are not exactly the same as those of the input to the branches; however, since convolution is a local operation, they still retain the corresponding spectral and temporal nature. At the output of each branch, we define the high-frequency half to be the upper sub-band and the low-frequency half to be the lower sub-band. We take the middle point on the spectral axis to be the border line for simplicity. The entire output is called the full band.

Having defined these concepts, we denote by X the input to a residual block, by X^n the full band from the n-th branch, by X^n_u the upper sub-band from the n-th branch and by X^n_l the lower sub-band from the n-th branch. Naturally, the relationship between them is given by X^n = [X^n_u | X^n_l]. We also denote by Y the output of a Shake-Shake regularized residual block.

To demonstrate that shaking sub-bands can better regularize the learning process, we propose the following models for benchmarking:

1. Shake the full band (Full):

Y = X + \sum_{n=1}^{N} ShakeShake\big(\{X^n\}_{n=1}^{N}\big).   (1)

2.
Shake the upper sub-band (Upper):

Y = X + \Big[ \sum_{n=1}^{N} ShakeShake\big(\{X^n_u\}_{n=1}^{N}\big) \;\Big|\; \sum_{n=1}^{N} X^n_l \Big].   (2)

3. Shake the lower sub-band (Lower):

Y = X + \Big[ \sum_{n=1}^{N} X^n_u \;\Big|\; \sum_{n=1}^{N} ShakeShake\big(\{X^n_l\}_{n=1}^{N}\big) \Big].   (3)

4. Shake both sub-bands, but independently (Both):

Y = X + [Y_u | Y_l],   (4)
Y_u = \sum_{n=1}^{N} ShakeShake\big(\{X^n_u\}_{n=1}^{N}\big),
Y_l = \sum_{n=1}^{N} ShakeShake\big(\{X^n_l\}_{n=1}^{N}\big).

3. EXPERIMENTS

3.1. Experiment I

3.1.1. Datasets

We use four publicly available emotion corpora to demonstrate the effectiveness of the proposed models: the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) [14], the eNTERFACE'05 Audio-Visual Emotion Database [15], the EMOVO Corpus [16] and the Surrey Audio-Visual Expressed Emotion (SAVEE) database [17]. All of these corpora are multi-modal, in which speech, facial expression and text all convey a certain degree of affective information. However, in this paper we focus solely on the acoustic modality for the experiments.

The intersection of the emotional classes in these four corpora consists of joy, anger, sadness and fear. Therefore, we formulate the experimental task as a sequence classification over 4 classes. In particular, we employ both the speaking and singing sets from the RAVDESS corpus, and all of the eNTERFACE'05, EMOVO and SAVEE corpora. However, one of the female actors in the RAVDESS corpus does not have the singing part, and we thus leave her speech part out of the experiments as well. Actor 23 in eNTERFACE'05 has only 3 utterances portraying joy, which makes the emotional class distribution slightly imbalanced. As a result, we have 2885 utterances in total. Table 1 summarizes the information about these four corpora.
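The four configurations defined in Section 2 (Eqs. 1-4) can be sketched as follows. This is a NumPy sketch under the assumption that axis 0 of each branch output is the spectral axis, split at the midpoint; per-call simplex weights stand in for the full ShakeShake layer:

```python
import numpy as np

rng = np.random.default_rng(1)

def shake_sum(branch_list):
    """sum_n alpha_n * X^n with alpha drawn uniformly from the simplex."""
    w = rng.exponential(1.0, size=len(branch_list))
    w /= w.sum()
    return sum(a * b for a, b in zip(w, branch_list))

def split_bands(x):
    """Upper = high-frequency half, lower = low-frequency half (spectral axis 0)."""
    mid = x.shape[0] // 2
    return x[:mid], x[mid:]

def residual_out(x, branches, mode="Both"):
    """Shake-Shake residual output for the Full/Upper/Lower/Both variants."""
    if mode == "Full":
        return x + shake_sum(branches)            # Eq. (1)
    uppers, lowers = zip(*(split_bands(b) for b in branches))
    if mode == "Upper":                           # Eq. (2): shake only the upper sub-band
        y_u, y_l = shake_sum(uppers), sum(lowers)
    elif mode == "Lower":                         # Eq. (3): shake only the lower sub-band
        y_u, y_l = sum(uppers), shake_sum(lowers)
    else:                                         # Eq. (4): shake both sub-bands independently
        y_u, y_l = shake_sum(uppers), shake_sum(lowers)
    return x + np.concatenate([y_u, y_l], axis=0)

x = np.zeros((16, 10))                                   # (frequency, time)
branches = [rng.normal(size=x.shape) for _ in range(2)]  # stand-ins for X^n
y = residual_out(x, branches, mode="Both")
```

Note that in the Upper and Lower variants the unshaken sub-band is a plain sum over the branches, following Eqs. (2) and (3), while the shaken sub-band draws its own simplex weights.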
Table 1. An overview of the selected corpora, including the number of actors and the distribution of utterances in the emotional classes.

Corpus      Actors   joy   anger   sadness   fear
RAVDESS       23     368    368      368      368
eNTERFACE     42     207    210      210      210
EMOVO          6      84     84       84       84
SAVEE          4      60     60       60       60
Total         75     719    722      722      722

For the evaluation, we adopt a 4-fold cross-validation strategy. To begin with, we split the actor set into 4 partitions. Moreover, we impose extra constraints to make sure that each partition is as gender- and corpus-uniform as possible. For example, each actor set partition is randomly assigned 2-3 female actors and 8-9 male actors from the eNTERFACE'05 corpus. More details are provided in Table 2. By partitioning the actor set, it becomes easier to maintain speaker independence between training and validation throughout all of the experiments.

Table 2. F: female, M: male. The gender and corpus distributions in each actor set partition of the cross validation.

Corpus      Partition 1   Partition 2   Partition 3   Partition 4
RAVDESS       3F, 3M        3F, 3M        3F, 3M        2F, 3M
eNTERFACE     2F, 9M        2F, 8M        2F, 8M        3F, 8M
EMOVO         1F, 0M        1F, 1M        1F, 1M        0F, 1M
SAVEE         0F, 1M        0F, 1M        0F, 1M        0F, 1M
Total         6F, 13M       6F, 13M       6F, 13M       5F, 13M

Table 3. Network architecture, layers and the number of parameters in the baseline and proposed models. Conv2d(N, H, W) stands for a 2D convolutional layer with N filters of size H×W and Linear(N) for a fully connected layer with N nodes.

Model: 3-Branch ResNet (w/ shake reg. on Both), 1.17M parameters:
  Conv2d(4, 2, 16) + BatchNorm2d + ReLU +
  (Shortcut, Branch ×2) + [ShakeShake{×2}] + ReLU +
  Mean-Pooling + Dropout(0.5) + Linear(256) + ReLU +
  Dropout(0.25) + Linear(256) + ReLU + Linear(4)

Branch, 10.2K parameters:
  Conv2d(4, 2, 64) + BatchNorm2d + ReLU + Conv2d(4, 4, 128) + BatchNorm2d
The Mean-Pooling layer represents the temporal pooling for generating an utterance representation.

3.1.2. Experimental Setup

To start with, we extract the spectrogram of each utterance with a 25 ms window every 10 ms using the Kaldi [18] library. Cepstral mean and variance normalization is then applied to the spectrogram frames per utterance. To equip each frame with a certain context, we splice it with 10 frames on the left and 5 frames on the right. Therefore, a resulting spliced frame has a resolution of 16 × 257. Since emotion involves a longer-term mental state transition, we further down-sample the frame rate by a factor of 8 to simplify and expedite the training process.

We establish a baseline of a 3-branch ResNet and list the details in Table 3. For each utterance, a simple mean pooling is taken at the output of the residual block to form an utterance representation.

Table 4. Averaged unweighted accuracy (%) on the validation partition over 4-fold cross validation.

Patience    9     11    13    15    17    19    21    26    31    36    41    46    51
Baseline  48.01 48.01 48.57 50.59 51.78 56.03 56.03 56.49 57.66 57.66 57.66 58.85 58.85
Full      47.26 47.26 48.46 52.58 53.23 53.23 53.23 56.24 56.24 56.94 56.94 57.34 57.34
Upper     47.30 48.62 52.66 53.31 54.73 55.26 55.47 56.00 57.50 57.50 57.79 57.79 57.79
Lower     45.62 46.55 47.66 48.04 48.79 48.79 49.21 51.34 51.48 54.18 54.18 54.18 54.18
Both      46.97 49.66 50.72 51.61 54.13 54.13 54.13 54.58 55.08 57.20 57.66 57.79 57.79
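The splicing and down-sampling steps of the setup above can be sketched as follows. This is a NumPy sketch; the edge padding at utterance boundaries is an assumption, as the paper does not specify the boundary handling:

```python
import numpy as np

def splice_frames(frames, left=10, right=5):
    """Concatenate each frame with `left` past and `right` future frames,
    edge-padding at the utterance boundaries (an assumption)."""
    padded = np.pad(frames, ((left, right), (0, 0)), mode="edge")
    n = frames.shape[0]
    ctx = left + right + 1          # 10 + 5 + 1 = 16 frames of context
    return np.stack([padded[i:i + ctx] for i in range(n)])

def downsample(spliced, factor=8):
    """Keep every `factor`-th spliced frame to reduce the frame rate."""
    return spliced[::factor]

# A toy utterance: 100 frames of 257-dimensional spectrogram features
# (25 ms window every 10 ms), so each spliced frame is 16 x 257.
utt = np.random.randn(100, 257)
spliced = splice_frames(utt)        # shape (100, 16, 257)
inputs = downsample(spliced)        # shape (13, 16, 257)
```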
Table 5. Averaged gap between the unweighted accuracy (%) on the training and validation partitions over 4-fold cross validation.

Patience    9     11    13    15    17    19    21    26    31    36    41    46    51
Baseline  -0.14 -0.14  1.45  1.90  4.50  8.93  8.93 11.42 14.74 14.74 14.74 16.34 16.34
Full       0.08  0.08  0.84  3.16  3.33  3.33  3.33  8.05  8.05 10.57 10.57 11.13 11.13
Upper      2.35  2.68  4.98  5.85  7.38 10.47 11.45 14.09 16.08 16.08 18.62 18.62 18.62
Lower      0.13 -0.18  0.84  1.66  3.34  3.34  3.99  8.56  8.90 14.27 14.27 14.27 14.27
Both       1.41  2.90  2.74  2.22  2.73  2.73  2.73  4.57  6.13  8.16 10.14 11.53 11.53

The utterance representation is then fed to the fully connected layers. We avoid explicit temporal modeling layers, such as a long short-term memory recurrent network, because our focus is on shaking the ResNet. Note that the ShakeShake layer has no parameter to learn, and hence the model size does not change throughout this work. We implement the ShakeShake layer as well as the entire network architecture using the PyTorch [19] library. Only the Shake-Shake combination [9] is used, and shaking is applied independently per frame. Due to the space limit, we leave other combinations for future work. The models are learned using the Adam optimizer [20] with an initial learning rate of 0.001, and the training is carried out on an NVIDIA Tesla K80 GPU. We use a mini-batch of 64 utterances across all model training and let each experiment run for 200 epochs in order to investigate the regularization power when over-training occurs.

3.1.3. Results

Tables 4 and 5 summarize the benchmarking of the unweighted accuracy (UA) of cross validation and the gap of UA between training and validation with respect to different patience values in early stopping. In Table 4, the underlined numbers indicate when a model performs better than the baseline.
A clear trend is that if the training process is stopped early, models with regularization tend to outperform the baseline. On the other hand, if the training goes too far, the situation is almost entirely the opposite. However, even when over-trained, the margin that the baseline has over the other regularized models is only around 1%, except for the model Lower. One thing to note, in particular, is that the model Lower seems to struggle to capture the affective patterns from the beginning.

In Table 5, the underlined numbers indicate when a model has a smaller gap than that of the baseline under the same patience. The apparent trend here is that if we let the training keep going, almost all regularized models tend to have a smaller gap compared to the baseline; in other words, the baseline tends to overfit more under the same patience in early stopping. We also note that the model Upper, despite being regularized, appears to have a larger gap than the baseline from the beginning of learning. Fortunately, these two trends overlap at about the point where the patience equals 17.

In both Tables 4 and 5, the boldfaced numbers represent when a model performs no worse than the baseline and has a smaller gap. Based on these two criteria, the models Full and Both both demonstrate a superior performance while staying far from being over-trained. Moreover, the model Both is able to match the performance of the baseline even in the over-trained region where the patience equals 41, while still achieving a smaller gap compared to the baseline. Another observation is that regularized models generally require more patience to reach the same gap, especially the model Both. This suggests that early stopping under the same patience may not be a universally optimal strategy.
When benchmarked against the model Full, the model Both always has a higher accuracy whenever they achieve comparable gaps (e.g., 3.33 versus 2.73, 8.05 versus 8.16, etc.), and most of the time when under the same patience. This phenomenon corroborates our hypothesis that independently shaking the sub-bands helps to learn a better model for affective computing. Nevertheless, the concerning fact that the models Upper and Lower show totally different characteristics requires further investigation in the future.

Unfortunately, the improvement in Experiment I is not statistically significant, possibly due to a small degree of freedom (df = 4 − 1 = 3). The strong regularization by shaking may also suppress the expressiveness of the shallow architecture in Table 3 in some folds.

3.2. Experiment II

To better demonstrate the effectiveness of shaking sub-bands, we present another set of experiments, similar to Experiment I, in this subsection, except that we employ two extra corpora and a deeper architecture. Table 6 summarizes the corpora for Experiment II, including the additional IEMOCAP [?] and EMO-DB [?] corpora. A similar actor set partition is presented in Table 7 to ensure speaker independence and a gender- and corpus-uniform distribution over each actor partition.

3.2.1. Datasets

Corpus      Actors   joy    anger   sadness   fear
eNTERFACE     42      207     210      210     210
RAVDESS       24      376     376      376     376
IEMOCAP       10      720    1355     1478       0
Emo-DB        10       71     127       66      69
EMOVO          6       84      84       84      84
SAVEE          4       60      60       60      60
Total         96     1518    2212     2274     799

Table 6.
6803 utterances in total.

Table 7. F: female, M: male.

Corpus      Partition 1   Partition 2   Partition 3   Partition 4
eNTERFACE     2F, 8M        2F, 9M        3F, 8M        2F, 8M
RAVDESS       3F, 3M        3F, 3M        3F, 3M        3F, 3M
IEMOCAP       1F, 2M        1F, 1M        1F, 1M        2F, 1M
Emo-DB        1F, 2M        1F, 1M        1F, 1M        2F, 1M
EMOVO         1F, 0M        1F, 1M        1F, 1M        0F, 1M
SAVEE         0F, 1M        0F, 1M        0F, 1M        0F, 1M
Total         8F, 16M       8F, 16M       9F, 15M       9F, 15M

We employ a residual neural network with 3 stages, res-8, res-16 and res-32, for Experiment II. There are 3 residual blocks in res-8, and one residual block each in res-16 and res-32. We remove the fully connected layers and keep only the output layer. Table 3.2.1 contains the details of the architecture.

layer name    structure                 stride    no. params
prelim-conv   [2×16, 4] ×1              [1, 1]    132
res-8         [2×16, 8; 2×16, 8] ×3     [1, 1]    22.84K
res-16        [2×16, 16; 2×16, 16] ×1   [1, 1]    24.88K
res-32        [2×16, 32; 2×16, 32] ×1   [1, 1]    99.17K
average       -                         -         -
affine        256 × 4                   -         1024
total         -                         -         148K

We do not use early stopping in Experiment II; we let each model training run for 1000 epochs, and for 3 runs, to obtain a robust estimate of the performance. Table 3.2.1 presents the experimental results for UA and Gap. It is clear that both Full and Both are able to improve over Baseline, even with a much deeper architecture and a longer training time.

Model      Valid UA    Gap
Baseline    61.342     7.485
Full        62.989    −1.128
Both        64.973     1.791
* averaged over three runs

Furthermore, we conduct statistical hypothesis testing to examine the significance of the observed improvement.

Table 8. p-values resulting from a one-sided paired t-test with df = 4 × 3 − 1 = 11.

Model Pair          Valid UA       Gap
Baseline vs Full    1.44 × 10^-2   2.7 × 10^-4
Baseline vs Both    1.22 × 10^-5   3 × 10^-3
Full vs Both        2.6 × 10^-3    9.9 × 10^-1
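The significance test above can be sketched as follows. This NumPy sketch computes the one-sided paired t-statistic over the 4 folds × 3 runs = 12 paired scores; the scores below are illustrative placeholders, not the paper's per-fold results, and the reported p-value would come from a t distribution with df = 11:

```python
import numpy as np

def paired_t_statistic(a, b):
    """One-sided paired t-statistic for H1: mean(a) > mean(b)."""
    d = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    n = d.size
    t = d.mean() / (d.std(ddof=1) / np.sqrt(n))
    return t, n - 1  # statistic and degrees of freedom

# Illustrative per-(fold, run) UA scores, NOT the paper's actual values.
ua_both = [64.1, 65.3, 64.8, 65.9, 64.2, 65.0, 64.7, 65.5, 64.9, 65.1, 64.4, 65.8]
ua_base = [61.0, 61.9, 61.2, 62.1, 60.8, 61.5, 61.3, 61.8, 61.1, 61.6, 60.9, 62.0]

t, df = paired_t_statistic(ua_both, ua_base)  # df = 4 * 3 - 1 = 11
```

Pairing the scores by (fold, run) removes the between-fold variance from the comparison, which is what makes the paired test more sensitive than an unpaired one here.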
The resulting p-values from the one-sided paired t-test indicate that the improvements of Full and Both over Baseline are statistically significant, in terms of both UA and Gap, in which Both reaches a higher UA but also a larger gap, compared to Full. In fact, the improvement of Both over Full is also significant in terms of UA. On the other hand, Full achieves a significantly smaller gap, compared to Both. More details are presented in Table 3.2.1.

4. CONCLUSIONS

We have proposed a Shake-Shake regularized multi-branch ResNet model for speech emotion recognition. In particular, we have experimented with different configurations for the Shake-Shake regularization: on the full band, on the upper and the lower sub-bands alone, and on both simultaneously. We conducted two experiments to study the shaking regularization mechanism for affective computing and demonstrated that sub-band shaking can lead to an improved UA and reduced overfitting. The results support our hypothesis that shaking different sub-bands with independent strengths benefits learning in affective computing. With a commonly used patience, say 17, the shallow models Full and Both are able to deliver competitive performances over the baseline with reduced over-fitting. When experimenting with deeper models, we find that the shaking regularization keeps models from overfitting even with a substantially longer training time. The results from hypothesis testing show not only the significance of the improvement over Baseline but also that different configurations of the shaking regularization manifest a trade-off between expressiveness and compression. However, given the opposite behaviors of the models Upper and Lower, further investigation is necessary in the future.

5.
REFERENCES

[1] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Deep Residual Learning for Image Recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[2] Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin, "Convolutional Sequence to Sequence Learning," 2017, arXiv:1705.03122.
[3] Che-Wei Huang and Shrikanth S. Narayanan, "Deep Convolutional Recurrent Neural Network with Attention Mechanism for Robust Speech Emotion Recognition," in IEEE International Conference on Multimedia and Expo (ICME), 2017.
[4] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov, "Dropout: A Simple Way to Prevent Neural Networks from Overfitting," J. Mach. Learn. Res., vol. 15, no. 1, Jan. 2014.
[5] Sergey Ioffe and Christian Szegedy, "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift," in Proceedings of the 32nd International Conference on Machine Learning (ICML), 2015.
[6] Sergey Zagoruyko and Nikos Komodakis, "Wide Residual Networks," in BMVC, 2016.
[7] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Weinberger, "Deep Networks with Stochastic Depth," in ECCV, 2016.
[8] Guoliang Kang, Jun Li, and Dacheng Tao, "Shakeout: A New Regularized Deep Neural Network Training Scheme," in Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, 2016.
[9] Xavier Gastaldi, "Shake-Shake Regularization," 2017, arXiv:1705.07485.
[10] Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton, "CIFAR-10 (Canadian Institute for Advanced Research)."
[11] Nguyen Xuan Vinh, Sarah M.
Erfani, Sakrapee Paisitkriangkrai, James Bailey, Christopher Leckie, and Kotagiri Ramamohanarao, "Training Robust Models Using Random Projection," in Proceedings of the 23rd International Conference on Pattern Recognition (ICPR), 2016.
[12] Harish Mallidi and Hynek Hermansky, "Novel Neural Network Based Fusion for Multistream ASR," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, March 2016.
[13] Chul Min Lee, Serdar Yildirim, Murtaza Bulut, Abe Kazemzadeh, Carlos Busso, Zhigang Deng, Sungbok Lee, and Shrikanth Narayanan, "Emotion Recognition Based on Phoneme Classes," in Proceedings of InterSpeech, 2004.
[14] S. R. Livingstone, K. Peck, and F. A. Russo, "RAVDESS: The Ryerson Audio-Visual Database of Emotional Speech and Song," in Proceedings of the 22nd Annual Meeting of the Canadian Society for Brain, Behaviour and Cognitive Science, 2012.
[15] Olivier Martin, Irene Kotsia, Benoit M. Macq, and Ioannis Pitas, "The eNTERFACE'05 Audio-Visual Emotion Database," in Proceedings of the International Conference on Data Engineering Workshops, 2006.
[16] Giovanni Costantini, Iacopo Iaderola, Andrea Paoloni, and Massimiliano Todisco, "EMOVO Corpus: an Italian Emotional Speech Database," in Proceedings of the 9th International Conference on Language Resources and Evaluation, 2014.
[17] Sanaul Haq and Philip J.B. Jackson, Machine Audition: Principles, Algorithms and Systems, chapter Multimodal Emotion Recognition, IGI Global, 2010.
[18] Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Nagendra Goel, Mirko Hannemann, Yanmin Qian, Petr Schwarz, and Georg Stemmer, "The Kaldi Speech Recognition Toolkit," in Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding, 2011.
[19] "PyTorch: Tensors and Dynamic Neural Networks in Python with Strong GPU Acceleration," 2017, https://github.com/pytorch/pytorch.
[20] Diederik P. Kingma and Jimmy Ba, "Adam: A Method for Stochastic Optimization," 2014, arXiv:1412.6980.
