Learning Representations of Emotional Speech with Deep Convolutional Generative Adversarial Networks


Authors: Jonathan Chang, Stefan Scherer

Jonathan Chang, Harvey Mudd College, Computer Science Department, 1250 North Dartmouth Ave, Claremont, CA 91711
Stefan Scherer, University of Southern California, Institute for Creative Technologies, 12015 Waterfront Dr, Playa Vista, CA 90094

ABSTRACT

Automatically assessing emotional valence in human speech has historically been a difficult task for machine learning algorithms. The subtle changes in the voice of the speaker that are indicative of positive or negative emotional states are often "overshadowed" by voice characteristics relating to emotional intensity or emotional activation. In this work we explore a representation learning approach that automatically derives discriminative representations of emotional speech. In particular, we investigate two machine learning strategies to improve classifier performance: (1) utilization of unlabeled data using a deep convolutional generative adversarial network (DCGAN), and (2) multitask learning. Within our extensive experiments we leverage a multitask annotated emotional corpus as well as a large unlabeled meeting corpus (around 100 hours). Our speaker-independent classification experiments show that in particular the use of unlabeled data in our investigations improves the performance of the classifiers, and both fully supervised baseline approaches are outperformed considerably. We improve the classification of emotional valence on a discrete 5-point scale to 43.88% and on a 3-point scale to 49.80%, which is competitive with state-of-the-art performance.

Index Terms: Machine Learning, Affective Computing, Semi-supervised Learning, Deep Learning

This material is based upon work supported by the U.S. Army Research Laboratory under contract number W911NF-14-D-0005. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Government, and no official endorsement should be inferred.

1. INTRODUCTION

Machine Learning, in general, and affective computing, in particular, rely on good data representations or features that have a good discriminatory faculty in classification and regression experiments, such as emotion recognition from speech. To derive efficient representations of data, researchers have adopted two main strategies: (1) carefully crafted and tailored feature extractors designed for a particular task [1] and (2) algorithms that learn representations automatically from the data itself [2]. The latter approach is called Representation Learning (RL); it has received growing attention in the past few years and is highly reliant on large quantities of data. Most approaches for emotion recognition from speech still rely on the extraction of standard acoustic features such as pitch, shimmer, jitter, and Mel-frequency cepstral coefficients (MFCCs), with a few notable exceptions [3, 4, 5, 6]. In this work we leverage RL strategies and automatically learn representations of emotional speech from the spectrogram directly using a deep convolutional neural network (CNN) architecture.

To learn strong representations of speech we seek to leverage as much data as possible. However, emotion annotations are difficult to obtain and scarce [7]. We leverage the USC-IEMOCAP dataset, which comprises around 12 hours of highly emotional and partly acted data from 10 speakers [8]. However, we aim to improve the learned representations of emotional speech with unlabeled speech data from an unrelated meeting corpus, which consists of about 100 hours of data [9].
While the meeting corpus is qualitatively quite different from the highly emotional USC-IEMOCAP data (the AMI meeting corpus is not highly emotional, and strong emotions such as anger do not appear), we believe that the learned representations will improve through the use of these additional data. This combination of two separate data sources leads to a semi-supervised machine learning task, and we extend the CNN architecture to a deep convolutional generative adversarial network (DCGAN) that can be trained in an unsupervised fashion [10].

Within this work, we particularly target emotional valence as the primary task, as it has been shown to be the most challenging emotional dimension for acoustic analyses in a number of studies [11, 12]. Apart from solely targeting valence classification, we further investigate the principle of multitask learning. In multitask learning, a set of related tasks (e.g., emotional activation) is learned along with a primary task (e.g., emotional valence); both tasks share parts of the network topology and are hence jointly trained, as depicted in Figure 1. It is expected that data for the secondary task models information that would also be discriminative in learning the primary task. In fact, this approach has been shown to improve generalizability across corpora [13].

The remainder of this paper is organized as follows: First we introduce the DCGAN model and discuss prior work in Section 2. Then we describe our specific multitask DCGAN model in Section 3, introduce the datasets in Section 4, and describe our experimental design in Section 5. Finally, we report our results in Section 6 and discuss our findings in Section 7.

2. RELATED WORK

The proposed model builds upon previous results in the field of emotion recognition and leverages prior work in representation learning. Multitask learning has been effective in some prior experiments on emotion detection.
In particular, Xia and Liu proposed a multitask model for emotion recognition which, like the investigated model, has activation and valence as targets [14]. Their work uses a Deep Belief Network (DBN) architecture to classify the emotion of audio input, with valence and activation as secondary tasks. Their experiments indicate that the use of multitask learning produces improved unweighted accuracy on the emotion classification task. Like Xia and Liu, the proposed model uses multitask learning with valence and activation as targets. Unlike them, however, we are primarily interested not in emotion classification but in valence classification as a primary task. Thus, our multitask model has valence as the primary target and activation as a secondary target. Also, while our experiments use the IEMOCAP database as Xia and Liu do, our method of speaker split differs from theirs. Xia and Liu use a leave-one-speaker-out cross-validation scheme with separate train and test sets that have no speaker overlap. This method lacks a distinct validation set; instead, they validate and test on the same set. Our experimental setup, on the other hand, splits the data into distinct train, validation, and test sets, still with no speaker overlap. This is described in greater detail in Section 5.

The unsupervised learning part of the investigated model builds upon an architecture known as the deep convolutional generative adversarial network, or DCGAN. A DCGAN consists of two components, known as the generator and the discriminator, which are trained against each other in a minimax setup. The generator learns to map samples from a random distribution to output matrices of some pre-specified form. The discriminator takes an input which is either a generator output or a "real" sample from a dataset. The discriminator learns to classify the input as either generated or real [10].

For training, the discriminator uses a cross-entropy loss function based on how many inputs were correctly classified as real and how many were correctly classified as generated. The cross-entropy loss between true labels $y$ and predictions $\hat{y}$ is defined as:

$$L(w) = -\frac{1}{N}\sum_{n=1}^{N}\left[y_n \log \hat{y}_n + (1 - y_n)\log(1 - \hat{y}_n)\right] \qquad (1)$$

where $w$ is the learned vector of weights and $N$ is the number of samples. For purposes of this computation, labels are represented as numerical values of 1 for real and 0 for generated. Then, letting $\hat{y}_r$ represent the discriminator's predictions for all real inputs, the cross entropy for correct predictions of "real" simplifies to

$$L_r(w) = -\frac{1}{N}\sum_{n=1}^{N}\log \hat{y}_{r,n},$$

because in this case the correct labels are all ones. Similarly, letting $\hat{y}_g$ represent the discriminator's predictions for all generated inputs, the cross entropy for correct predictions of "generated" simplifies to

$$L_f(w) = -\frac{1}{N}\sum_{n=1}^{N}\log(1 - \hat{y}_{g,n}),$$

because here the correct labels are all zeroes. The total loss for the discriminator is given by the sum of the previous two terms, $L_d = L_r + L_f$.

The generator also uses a cross-entropy loss, but its loss is defined in terms of how many generated outputs were incorrectly classified as real:

$$L_g(w) = -\frac{1}{N}\sum_{n=1}^{N}\log \hat{y}_{g,n}$$

Thus, the generator's loss decreases the better the generator is able to produce outputs that the discriminator classifies as real. Given sufficient training iterations, this leads the generator to eventually produce outputs that look like real samples of speech.

3. MULTITASK DEEP CONVOLUTIONAL GENERATIVE ADVERSARIAL NETWORK

The investigated multitask model is based upon the DCGAN architecture described in Section 2 and is implemented in TensorFlow (we will make the code available on GitHub upon publication). For emotion classification, a fully connected layer is attached to the final convolutional layer of the DCGAN's discriminator.
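The discriminator and generator losses defined in Section 2 reduce to a few lines of code; the following is a minimal NumPy sketch of our own for illustration, not the TensorFlow implementation used in the experiments:

```python
import numpy as np

def discriminator_loss(y_real_pred, y_gen_pred):
    """L_d = L_r + L_f: cross entropy for classifying real inputs
    as 1 and generated inputs as 0, as in Section 2."""
    l_r = -np.mean(np.log(y_real_pred))        # correct "real" predictions
    l_f = -np.mean(np.log(1.0 - y_gen_pred))   # correct "generated" predictions
    return l_r + l_f

def generator_loss(y_gen_pred):
    """L_g: decreases as generated outputs are classified as real."""
    return -np.mean(np.log(y_gen_pred))
```

Note how the generator loss shrinks as the discriminator's predictions on generated samples approach 1, i.e., as the generator fools the discriminator.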
The output of this layer is then fed to two separate fully connected layers, one of which outputs a valence label and the other of which outputs an activation label. This setup is shown visually in Figure 1. Through this setup, the model is able to take advantage of unlabeled data during training by feeding it through the DCGAN layers in the model, and is also able to take advantage of multitask learning by training the valence and activation outputs simultaneously.

In particular, the model is trained by iteratively running the generator, discriminator, valence classifier, and activation classifier, and back-propagating the error for each component through the network. The loss functions for the generator and discriminator are unaltered and remain as shown in Section 2. Both the valence classifier and the activation classifier use the cross-entropy loss of Equation 1. Since the valence and activation classifiers share layers with the discriminator, the model learns features and convolutional filters that are effective for the tasks of valence classification, activation classification, and discriminating between real and generated samples.

Fig. 1. Visual representation of the deep convolutional generative adversarial network with multitask valence and activation classifiers: the generator produces generated samples; the discriminator receives generated and real samples; its final convolutional layer feeds a shared layer from which the real-vs-generated decision, the predicted valence, and the predicted activation are produced.

4. DATA CORPUS

Due to the semi-supervised nature of the proposed Multitask DCGAN model, we utilize both labeled and unlabeled data. For the unlabeled data, we use audio from the AMI [9] and IEMOCAP [8] datasets. For the labeled data, we use audio from the IEMOCAP dataset, which comes with labels for activation and valence, both measured on a 5-point Likert scale by three distinct annotators.
Although IEMOCAP provides per-word activation and valence labels, in practice these labels do not generally change over time in a given audio file, and so for simplicity we label each audio clip with the average valence and activation. Since valence and activation are both measured on a 5-point scale, the labels are encoded in 5-element one-hot vectors. For instance, a valence of 5 is represented with the vector [0, 0, 0, 0, 1]. The one-hot encoding can be thought of as a probability distribution representing the likelihood of the correct label being some particular value. Thus, in cases where the annotators disagree on the valence or activation label, this can be represented by assigning probabilities to multiple positions in the label vector. For instance, a label of 4.5 conceptually means that the "correct" valence is either 4 or 5 with equal probability, so the corresponding vector would be [0, 0, 0, 0.5, 0.5]. These "fuzzy labels" have been shown to improve classification performance in a number of applications [15, 16]. It should be noted here that we had generally greater success with this fuzzy label method than with training the neural network model on the valence label directly, i.e., a classification task rather than a regression task.

Pre-processing. Audio data is fed to the network models in the form of spectrograms. The spectrograms are computed using a short-time Fourier transform with a window size of 1024 samples, which at the 16 kHz sampling rate is equivalent to 64 ms. Each spectrogram is 128 pixels high, representing the frequency range 0-11 kHz. Due to the varying lengths of the IEMOCAP audio files, the spectrograms vary in width, which poses a problem for the batching process of the neural network training. To compensate for this, the model randomly crops a region of each input spectrogram. The crop width is determined in advance.
To ensure that the selected crop region contains at least some data (i.e., is not entirely silence), cropping occurs using the following procedure: a random word in the transcript of the audio file is selected, and the corresponding time range is looked up. A random point within this time range is selected, which is then treated as the center line of the crop. The crop is then made using the region defined by the center line and the crop width.

Early on, we found that there is a noticeable imbalance in the valence labels for the IEMOCAP data, in that the labels skew heavily towards the neutral (2-3) range. In order to prevent the model from overfitting to this distribution during training, we normalize the training data by oversampling underrepresented valence data, such that the overall distribution of valence labels is more even.

5. EXPERIMENTAL DESIGN

Investigated Models. We investigate the impact of both unlabeled data for improved emotional speech representations and multitask learning on emotional valence classification performance. To this end, we compare four different neural network models:

1. BasicCNN: a fully supervised, single-task valence classifier.
2. MultitaskCNN: a fully supervised, multitask valence classifier with activation as secondary task.
3. BasicDCGAN: a semi-supervised DCGAN with single-task valence classifier.
4. MultitaskDCGAN: a semi-supervised DCGAN with multitask valence classifier and activation as secondary task.

The BasicCNN represents a "bare minimum" valence classifier and thus sets a lower bound for expected performance. Comparison with MultitaskCNN indicates the effect of the inclusion of a secondary task, i.e., emotional activation recognition. Comparison with BasicDCGAN indicates the effect of the incorporation of unlabeled data during training. For fairness, the architectures of all three baselines are based upon the full MultitaskDCGAN model.
BasicDCGAN, for example, is simply the MultitaskDCGAN model with the activation layer removed, while the two fully supervised baselines were built by taking the convolutional layers from the discriminator component of the MultitaskDCGAN and adding fully connected layers for valence and activation output. Specifically, the discriminator contains four convolutional layers; there is no explicit pooling, but the kernel stride size is 2, so the image size gets halved at each step. Thus, by design, all four models have this same convolutional structure. This ensures that potential performance gains do not stem from a larger complexity or higher number of trainable weights within the DCGAN models, but rather from improved representations of speech.

Table 1. Final parameters used for each model, as found by random parameter search.

Parameter      | BasicCNN | MultitaskCNN | BasicDCGAN | MultitaskDCGAN
---------------|----------|--------------|------------|---------------
Crop Width     | 128      | 64           | 128        | 64
Learning Rate  | 1e-4     | 1e-3         | 1e-4       | 1e-4
Batch Size     | 64       | 128          | 64         | 128
Filter Size    | 15       | 8            | 9          | 6
No. of Filters | 84       | 32           | 72         | 88

Experimental Procedure. The parameters for each model, including batch size, filter size (for convolution), and learning rate, were determined by randomly sampling different parameter combinations, training the model with those parameters, and computing accuracy on a held-out validation set. For each model, we kept the parameters that yielded the best accuracy on the held-out set. This procedure ensures that each model is fairly represented during evaluation. Our hyper-parameters included the crop width of the input signal ∈ {64, 128}, convolutional layer filter sizes ∈ [2, w/8] (where w is the selected crop width, divided by 8 to account for each halving of the image size in the three convolutional layers leading up to the last one), the number of convolutional filters ∈ [32, 90] (step size 4), the batch size ∈ {64, 128, 256}, and learning rates ∈ {1e-3, 1e-4, 1e-5}.
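The random parameter search over these ranges can be sketched as follows. This is an illustrative Python sketch of our own; the function names (sample_hyperparameters, train_and_validate) are hypothetical, not the actual implementation:

```python
import random

def sample_hyperparameters():
    """Randomly sample one hyper-parameter combination from the
    ranges described in Section 5 (illustrative sketch)."""
    crop_width = random.choice([64, 128])
    return {
        "crop_width": crop_width,
        # Filter size is bounded by crop_width / 8 to account for the
        # three stride-2 convolutions each halving the image size.
        "filter_size": random.randint(2, crop_width // 8),
        "num_filters": random.choice(range(32, 91, 4)),
        "batch_size": random.choice([64, 128, 256]),
        "learning_rate": random.choice([1e-3, 1e-4, 1e-5]),
    }

def random_search(train_and_validate, num_trials=20):
    """Keep the combination with the best held-out validation accuracy.
    train_and_validate is assumed to train a model with the given
    parameters and return its validation accuracy."""
    best_acc, best_params = -1.0, None
    for _ in range(num_trials):
        params = sample_hyperparameters()
        acc = train_and_validate(params)
        if acc > best_acc:
            best_acc, best_params = acc, params
    return best_params, best_acc
```

Each of the four models keeps whichever sampled combination scores best on its held-out validation set, which is how the values in Table 1 would be obtained.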
Identified parameters per model are shown in Table 1.

For evaluation, we utilized 5-fold leave-one-session-out validation. Each fold leaves one of the five sessions in the labeled IEMOCAP data out of the training set entirely. From this left-out conversation, one speaker's audio is used as a validation set, while the other speaker's audio is used as a test set. For each fold, the evaluation procedure is as follows: the model being evaluated is trained on the training set, and after each full pass through the training set, accuracy is computed on the validation set. This process continues until the accuracy on the validation set no longer increases; in other words, we locate a local maximum in validation accuracy. To increase the certainty that this local maximum is truly representative of the model's best performance, we continue to run more iterations after a local maximum is found and look for 5 consecutive iterations with lower accuracy values. If, in the course of these 5 iterations, a higher accuracy value is found, it is treated as the new local maximum and the search restarts from there. Once a best accuracy value is found in this manner, we restore the model's weights to those of the iteration corresponding to the best accuracy and evaluate the accuracy on the test set. We evaluated each model on all 5 folds using the methodology described above, recording test accuracies for each fold.

Evaluation Strategy. We collected several statistics about our models' performance. We were primarily interested in the unweighted per-class accuracy. In addition, we converted the network's output from probability distributions back into numerical labels by taking the expected value, that is, $v = \sum_{i=1}^{5} i \, p_i$, where $p$ is the model's prediction in its original form as a 5-element probability distribution. We then used this to compute the Pearson correlation ($\rho$ measure) between predicted and actual labels.
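The conversion from a 5-element output distribution to a numerical label, and the subsequent correlation computation, can be sketched as follows (a minimal NumPy illustration with our own variable names):

```python
import numpy as np

def expected_label(p):
    """Convert a 5-element probability distribution over valence
    classes 1..5 into a numerical label: v = sum_i i * p_i."""
    p = np.asarray(p, dtype=float)
    return float(np.dot(np.arange(1, 6), p))

def pearson_rho(predicted, actual):
    """Pearson correlation between predicted and actual labels."""
    return float(np.corrcoef(predicted, actual)[0, 1])
```

For example, a prediction split evenly between valence 4 and 5 maps to the numerical label 4.5, matching the fuzzy-label interpretation described in Section 4.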
Table 2. Evaluation metrics for all four models, averaged across the 5 test folds. Speaker-independent unweighted accuracies in % for both 5-class and 3-class valence performance as well as the Pearson correlation ρ are reported.

Model          | Accuracy (5 class) | Accuracy (3 class) | Pearson Correlation (ρ)
---------------|--------------------|--------------------|------------------------
BasicCNN       | 38.52%             | 46.59%             | 0.1639
MultitaskCNN   | 36.78%             | 40.57%             | 0.0737
BasicDCGAN     | 43.88%             | 49.80%             | 0.2618
MultitaskDCGAN | 43.69%             | 48.88%             | 0.2434

Some pre-processing was needed to obtain accurate measures. In particular, in cases where human annotators were perfectly split on what the correct label for a particular sample should be, both possibilities should be accepted as correct predictions. For instance, if the correct label is 4.5 (vector form [0, 0, 0, 0.5, 0.5]), a correct prediction could be either 4 or 5 (i.e., the maximum index in the output vector should be either 4 or 5).

The above measures are for a 5-point labeling scale, which is how the IEMOCAP data is originally labeled. However, prior experiments have evaluated performance on valence classification on a 3-point scale [17]. The authors provide an example of this, with valence levels 1 and 2 being pooled into a single "negative" category, valence level 3 remaining untouched as the "neutral" category, and valence levels 4 and 5 being pooled into a single "positive" category. Thus, to allow for comparison with our models, we also report results on such a 3-point scale. We construct these results by taking the results of the 5-class comparison and pooling them as just described.

6. RESULTS

Table 2 shows the unweighted per-class accuracies and Pearson correlation coefficients (ρ values) between actual and predicted labels for each model. All values shown are averages across the test sets of all 5 folds. The results indicate that the use of unsupervised learning yields a clear improvement in performance.
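The 5-to-3 class pooling and the acceptance of either maximal index under a split label, both described in Section 5, might look as follows in code (an illustrative sketch of our own, not the released implementation):

```python
import numpy as np

def pool_to_3_class(label_5):
    """Map 5-point valence labels to 3 classes:
    1-2 -> negative (0), 3 -> neutral (1), 4-5 -> positive (2)."""
    if label_5 <= 2:
        return 0
    if label_5 == 3:
        return 1
    return 2

def is_correct(prediction, target):
    """A prediction counts as correct if its argmax matches any
    maximal position of the (possibly fuzzy) target vector."""
    target = np.asarray(target, dtype=float)
    correct_indices = np.flatnonzero(target == target.max())
    return int(np.argmax(prediction)) in correct_indices
```

For a split label such as [0, 0, 0, 0.5, 0.5], both index 3 and index 4 (valence 4 or 5) are accepted as correct.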
Both BasicDCGAN and MultitaskDCGAN have considerably better accuracies and linear correlations than the fully supervised CNN models. This is a strong indication that the use of large quantities of task-unrelated speech data improved the filter learning in the CNN layers of the DCGAN discriminator.

Multitask learning, on the other hand, does not appear to have any positive impact on performance. Comparing the two CNN models, the addition of multitask learning actually appears to impair performance, with MultitaskCNN doing worse than BasicCNN on all three metrics. The difference is smaller when comparing BasicDCGAN and MultitaskDCGAN, and may not be enough to decidedly conclude that the use of multitask learning has a net negative impact there, but there is certainly no indication of a net positive impact.

The observed performance of both BasicDCGAN and MultitaskDCGAN using 3 classes is comparable to the state of the art, with 49.80% compared to the 49.99% reported in [17]. It needs to be noted that in [17] data from the test speaker's session partner was utilized in the training of the model. Our models, in contrast, are trained on only four of the five sessions, as discussed in Section 5. Further, the models presented here are trained on the raw spectrograms of the audio, and no feature extraction was employed whatsoever. This representation learning approach is employed in order to allow the DCGAN component of the model to train on vast amounts of unsupervised speech data.

We further report the confusion matrix of the best performing model, BasicDCGAN, in Table 3. It is noted that the "negative" class (i.e., the second row) is classified best. However, it appears that this class is also picked most frequently by the model, resulting in high recall (0.7701) and low precision (0.3502). The class with the highest F1 score is "very positive" (i.e., the last row), with F1 = 0.5284. The confusion of "very negative" valence with "very positive" valence in the top right corner is interesting and has been previously observed [5].

Table 3. Confusion matrix for 5-class valence classification with the BasicDCGAN model. Predictions are reported in columns and actual targets in rows. Valence classes are sorted from very negative to very positive; these classes correspond to the numeric labels 1 through 5.

Actual \ Pred | Very Neg. | Neg.   | Neu.   | Pos.   | Very Pos.
Very Neg.     | 0.3316    | 0.1633 | 0.1837 | 0.2092 | 0.1122
Neg.          | 0.0293    | 0.7701 | 0.0796 | 0.0910 | 0.0299
Neu.          | 0.0106    | 0.5090 | 0.3507 | 0.0903 | 0.0393
Pos.          | 0.0285    | 0.4881 | 0.1395 | 0.3027 | 0.0412
Very Pos.     | 0.0366    | 0.2683 | 0.0732 | 0.1829 | 0.4390

7. CONCLUSIONS

We investigated the use of unsupervised and multitask learning to improve the performance of an emotional valence classifier. Overall, we found that unsupervised learning yields considerable improvements in classification accuracy for the emotional valence recognition task. The best performing model achieves 43.88% in the 5-class case and 49.80% in the 3-class case, with a significant Pearson correlation between continuous target label and prediction of ρ = 0.2618 (p < .001). There is no indication that multitask learning provides any advantage.

The results for multitask learning are somewhat surprising. It may be that the valence and activation classification tasks are not sufficiently related for multitask learning to yield improvements in accuracy. Alternatively, a different neural network architecture may be needed for multitask learning to work. Further, the alternating update strategy employed in the present work might not have been the optimal strategy for training: the iterative swapping of the target tasks valence/activation might have created instabilities in the weight updates of the backpropagation algorithm. There may yet be other explanations; further investigation may be warranted.
Lastly, it is important to note that this model's performance only approaches the state of the art, which employs potentially better suited sequential classifiers such as Long Short-Term Memory (LSTM) networks [18]. However, basic LSTMs are not suited to learning from entirely unsupervised data, which we leveraged for the proposed DCGAN models. For future work, we hope to adapt the technique of using unlabeled data to sequential models, including LSTMs. We expect that combining our work here with the advantages of sequential models may result in further performance gains, which may be more competitive with today's leading models and potentially outperform them. For the purposes of this investigation, the key takeaway is that the use of unsupervised learning yields clear performance gains on the emotional valence classification task, and that this represents a technique that may be adapted to other models to achieve even higher classification accuracies.

8. REFERENCES

[1] Piotr Dollár, Zhuowen Tu, Hai Tao, and Serge Belongie, "Feature mining for image classification," in 2007 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2007, pp. 1-8.
[2] Yoshua Bengio, Aaron Courville, and Pascal Vincent, "Representation learning: A review and new perspectives," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1798-1828, 2013.
[3] George Trigeorgis, Fabien Ringeval, Raymond Brueckner, Erik Marchi, Mihalis A. Nicolaou, Stefanos Zafeiriou, et al., "Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network," in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 5200-5204.
[4] Sayan Ghosh, Eugene Laksana, Louis-Philippe Morency, and Stefan Scherer, "Learning representations of affect from speech," CoRR, vol. abs/1511.04747, 2015.
[5] S. Ghosh, E. Laksana, L.-P. Morency, and S. Scherer, "Representation learning for speech emotion recognition," in Proceedings of Interspeech 2016, 2016.
[6] S. Ghosh, L.-P. Morency, E. Laksana, and S. Scherer, "An unsupervised approach to glottal inverse filtering," in Proceedings of EUSIPCO 2016, 2016.
[7] Gary McKeown, Michel Valstar, Roddy Cowie, Maja Pantic, and Marc Schroder, "The SEMAINE database: Annotated multimodal records of emotionally colored conversations between a person and a limited agent," IEEE Transactions on Affective Computing, vol. 3, no. 1, pp. 5-17, 2012.
[8] Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N. Chang, Sungbok Lee, and Shrikanth S. Narayanan, "IEMOCAP: Interactive emotional dyadic motion capture database," Language Resources and Evaluation, vol. 42, no. 4, pp. 335-359, 2008.
[9] Jean Carletta, Simone Ashby, Sebastien Bourban, Mike Flynn, Mael Guillemot, Thomas Hain, Jaroslav Kadlec, Vasilis Karaiskos, Wessel Kraaij, Melissa Kronenthal, et al., "The AMI meeting corpus: A pre-announcement," in International Workshop on Machine Learning for Multimodal Interaction. Springer, 2005, pp. 28-39.
[10] Alec Radford, Luke Metz, and Soumith Chintala, "Unsupervised representation learning with deep convolutional generative adversarial networks," arXiv preprint arXiv:1511.06434, 2015.
[11] T. Dang, V. Sethu, and E. Ambikairajah, "Factor analysis based speaker normalisation for continuous emotion prediction," in Proceedings of Interspeech 2016, 2016.
[12] Carlos Busso and Tauhidur Rahman, "Unveiling the acoustic properties that describe the valence dimension," in INTERSPEECH, 2012, pp. 1179-1182.
[13] Sayan Ghosh, Eugene Laksana, Stefan Scherer, and Louis-Philippe Morency, "A multi-label convolutional neural network approach to cross-domain action unit detection," in Affective Computing and Intelligent Interaction (ACII), 2015 International Conference on. IEEE, 2015, pp. 609-615.
[14] Rui Xia and Yang Liu, "A multi-task learning framework for emotion recognition using 2D continuous space," IEEE Transactions on Affective Computing, 2015.
[15] S. Scherer, J. Kane, C. Gobl, and F. Schwenker, "Investigating fuzzy-input fuzzy-output support vector machines for robust voice quality classification," Computer Speech and Language, vol. 27, no. 1, pp. 263-287, 2013.
[16] C. Thiel, S. Scherer, and F. Schwenker, "Fuzzy-input fuzzy-output one-against-all support vector machines," in 11th International Conference on Knowledge-Based and Intelligent Information and Engineering Systems (KES'07), vol. 3 of Lecture Notes in Artificial Intelligence, pp. 156-165, Springer, 2007.
[17] Angeliki Metallinou, Martin Wollmer, Athanasios Katsamanis, Florian Eyben, Bjorn Schuller, and Shrikanth Narayanan, "Context-sensitive learning for enhanced audiovisual emotion classification," IEEE Transactions on Affective Computing, vol. 3, no. 2, pp. 184-198, 2012.
[18] Sepp Hochreiter and Jürgen Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
