Adversarial Machine Learning And Speech Emotion Recognition: Utilizing Generative Adversarial Networks For Robustness


Authors: Siddique Latif, Rajib Rana, Junaid Qadir

Siddique Latif 1,2, Rajib Rana 2, and Junaid Qadir 1
1 Information Technology University (ITU)-Punjab, Pakistan
2 University of Southern Queensland, Australia

Abstract

Deep learning has undoubtedly offered tremendous improvements in the performance of state-of-the-art speech emotion recognition (SER) systems. However, recent research on adversarial examples poses enormous challenges for the robustness of SER systems by showing that deep neural networks are susceptible to adversarial examples built from only small and imperceptible perturbations. In this study, we evaluate how adversarial examples can be used to attack SER systems and propose the first black-box adversarial attack on SER systems. We also explore potential defenses, including adversarial training and generative adversarial networks (GANs), to enhance robustness. Experimental evaluations suggest various interesting aspects of the effective utilization of adversarial examples for achieving robustness in SER systems, opening up opportunities for researchers to further innovate in this space.

1 Introduction

Recent progress in machine learning (ML) is reinventing the future of intelligent systems, enabling a plethora of speech-controlled applications [1, 2, 3]. In particular, emotion-aware systems are on the rise. The breakthrough in deep learning is largely fueling the development of highly accurate and robust emotion recognition systems [4, 5]. Despite the superior performance of deep neural networks (DNNs), recent studies demonstrate that DNNs are highly vulnerable to malicious attacks that use adversarial examples. Adversarial examples are developed by malicious adversaries through the addition of unperceived perturbations, with the intention of eliciting wrong responses from ML models.
These adversarial examples can debilitate the performance of image recognition, object detection, and speech recognition models [6]. Adversarial attacks can also be used to undermine the performance of speech-based emotion recognition (SER) systems [7], putting security-sensitive paralinguistic applications of SER systems at high risk. In this paper, we investigate the utility of adversarial examples for achieving robustness of speech emotion classification against adversarial attacks. We consider a "black-box" attack that directly perturbs speech utterances with small and imperceptible noise. The generated adversarial examples are utilized within different schemes highlighting different aspects of the robustness of SER systems. We further propose a GAN-based defense for SER systems and show that it resists adversarial examples better than previously proposed defense solutions such as adversarial training and random noise addition.

2 Background Literature and Motivation

Existing methods of adversarial attack, including the fast gradient sign method (FGSM) [8], the Jacobian-based saliency map attack (JSMA) [9], DeepFool [10], and the Carlini and Wagner attacks [11], compute the perturbation noise based on the gradient of the targeted output with respect to the input. This gradient is computed using backpropagation, under the implicit assumption that the attacker has complete knowledge of the network and its parameters (such methods are called white-box attacks).
While backpropagation, which computes the derivative of each layer of the network with respect to the input, can be efficiently applied in image recognition due to the differentiability of all layers, applying such methods to SER systems is difficult because these systems rely on complex acoustic features of the input audio utterances, such as Mel-frequency cepstral coefficients (MFCCs), spectrograms, and the extended Geneva Minimalistic Acoustic Parameter Set (eGeMAPS) [12]. The SER system's first layer is the pre-processing or feature-extraction layer, which does not offer an efficient way to compute derivatives; therefore, gradient-based methods [9, 10, 11, 13] are not directly applicable to SER systems.

Adversarial attacks on ML have provoked an active area of research focused on understanding the adversarial attack phenomenon [14] and on techniques that can make ML models robust [15]. For speech-based systems, Carlini [6] proposed a white-box iterative optimization-based attack on DeepSpeech [16], a state-of-the-art speech-to-text model, with a 100% success rate. Alzantot et al. [17] proposed an adversarial attack on a speech-commands classification model by adding a small random (background) noise to the audio files; they achieved an 87% success rate without any information about the underlying model. Song et al. [18] proposed a mechanism that directly attacks the microphone used for sensing voice data, and showed that an adversary can exploit the microphone's non-linearity to control the targeted device with inaudible voice commands. Gong et al. [7] presented an architecture to craft adversarial examples for computational paralinguistic applications; they perturbed raw audio files and were able to cause a significant reduction in performance. Various other studies [19, 20, 21] have also developed adversarial attacks for speech recognition systems.
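For illustration, the perturbation that such white-box gradient methods compute can be sketched on a toy differentiable model. The logistic classifier below is an assumption for demonstration only (it is not the authors' setup); it shows the FGSM recipe of taking the sign of the loss gradient with respect to the input.

```python
import numpy as np

def fgsm_perturb(x, w, b, y, eps):
    """White-box FGSM sketch on a toy logistic classifier: perturb the
    input by eps times the sign of the loss gradient w.r.t. the input."""
    z = float(np.dot(w, x) + b)
    p = 1.0 / (1.0 + np.exp(-z))   # predicted probability of class 1
    grad_x = (p - y) * w           # d(cross-entropy)/dx for this model
    return x + eps * np.sign(grad_x)
```

This is exactly the step that is hard to reproduce for SER: the gradient must flow back through the feature-extraction front end, which is not efficiently differentiable.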
However, most of the previous research on targeted attacks for speech-based applications [6, 7, 17, 18, 19, 20, 21] has considered attacks on the model without investigating how adversarial examples may be utilized to make the ML models more robust. Our work is different in that we not only propose an adversarial attack on SER systems using adversarial examples but also leverage adversarial examples to make ML models more robust.

3 Proposed Audio Adversarial Examples

In this work, we adopt a simple approach to preparing adversarial examples: adding imperceptible noise (δ) to legitimate samples. We take an audio utterance x with label y and generate an adversarial example x′ = x + δ such that the SER system fails to correctly classify the given input, while ensuring that x and x′ are very similar when perceived by humans. Previous speech-related studies have mostly considered "non-real-world" random noise as adversarial noise. DolphinAttack exploits inaudible ultrasound as adversarial noise to control the victim device inconspicuously, but the attack sound lies outside human perception. Similarly, Alzantot et al. [17] used random noise to create an adversarial attack on speech recognition. It has been observed, however, that state-of-the-art classifiers are relatively robust to random noise [14]. We therefore propose a black-box attack on SER systems in which an adversary adds "real-world" noise as the adversarial perturbation. We empirically show that speech samples imputed with real-world noise can fool the classifier while not being perceptible to the human ear.

Generation of δ: We use three noises (café, meeting, and station) from the DEMAND noise database [22]; their imputation level is based on the background noise (microphone noise and discussion noise) already present in the utterances.
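The additive scheme x′ = x + ε·δ can be sketched as follows. The function names, the statistics-matching step, and the scale parameter eps are illustrative assumptions based on the description in this section, not the authors' released code.

```python
import numpy as np

def match_statistics(noise, ref_mean, ref_var):
    """Shift and scale a real-world noise clip so its mean and variance
    match those estimated for the utterance's existing background noise."""
    standardized = (noise - noise.mean()) / (noise.std() + 1e-12)
    return standardized * np.sqrt(ref_var) + ref_mean

def make_adversarial(x, noise, ref_mean, ref_var, eps=1.0):
    """x' = x + eps * delta: delta is the statistics-matched real-world
    noise clip, eps the perturbation (variation) factor."""
    delta = match_statistics(noise[:len(x)], ref_mean, ref_var)
    return x + eps * delta
```

Because δ is matched to the utterance's own background-noise statistics, the perturbation stays at the level of noise that is already present in the recording.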
We estimate the existing noise in the utterances using the well-known technique proposed in [23], which estimates noise using short-time spectral amplitude and log-amplitude estimation. We make the mean and variance of the above three noises equal to those of the reference noise. We also use ε as a variation parameter to further control the extent of the perturbation, adding the perturbation noise δ to the utterances as x_i + ε × δ_i, where x_i is the i-th utterance and δ_i is the noise generated for it. In this way, the adversarial noise has a very small value, similar to the existing noise, and the adversarial example is unrecognizable to the human ear in the human perception test. Because this noise acts as background noise, it does not change the emotional context of a given audio file.

Human Perception and Classifier Test: To assess the effect of the added adversarial noise on human listeners, we asked five adult listeners (age: 23-30 years) to listen to 200 adversarial examples generated with different perturbation factors (ε) and to differentiate them from the original audio files. For the IEMOCAP and FAU-AIBO datasets, 96% and 91% of the samples, respectively, were indistinguishable from the original utterances. When these examples were given to the classifier, the attack success rate was 72% and 79% for IEMOCAP and FAU-AIBO, respectively.

4 Experimental Setup and Results

We evaluated the generated adversarial examples using two well-known emotional corpora: IEMOCAP and FAU-AIBO. We consider a binary classification problem (positive vs. negative) by mapping emotions to binary valence classes as in [4] and [24]. Table 1 shows the considered emotions and their binary class mapping for both datasets.

Table 1: Binary class mapping of different emotions

Dataset  | Positive class              | Negative class
IEMOCAP  | happiness, excited, neutral | anger, sadness
FAU-AIBO | neutral, motherese, joyful  | angry, touchy, reprimanding, emphatic
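The class mapping of Table 1 can be expressed directly as a lookup; the helper name is illustrative.

```python
# Table 1 as a lookup: map each categorical emotion label to its
# binary valence class for the two corpora.
VALENCE_MAP = {
    "IEMOCAP": {
        "positive": {"happiness", "excited", "neutral"},
        "negative": {"anger", "sadness"},
    },
    "FAU-AIBO": {
        "positive": {"neutral", "motherese", "joyful"},
        "negative": {"angry", "touchy", "reprimanding", "emphatic"},
    },
}

def to_valence(dataset, emotion):
    """Return 'positive' or 'negative' for a raw emotion label."""
    for valence, emotions in VALENCE_MAP[dataset].items():
        if emotion in emotions:
            return valence
    raise KeyError(f"{emotion!r} is not mapped for {dataset}")
```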
We use the eGeMAPS features, a popular feature set specifically suited for paralinguistic applications, to represent the audio samples.

Classification Model: We use an LSTM-RNN for emotion classification. LSTM is a popular RNN variant widely employed in audio [25] and emotion classification [5] due to its ability to model contextual information. We found the best model structure by evaluating different numbers of layers; the best results were obtained with two LSTM layers, one dense layer, and a softmax output layer. We started training with a learning rate of 0.002 and halved this rate after every 5 epochs if performance did not improve on the test set; training stopped when the learning rate fell below 0.00001.

Emotion Classification Results: We evaluated the model in a speaker-independent scheme. The IEMOCAP dataset consists of five sessions; we used four sessions for training and one for testing, consistent with the methodology of previous studies [4, 5]. For FAU-AIBO, we followed the speaker-independent training strategy proposed in the 2009 Interspeech Emotion Challenge [24]. For emotion classification on legitimate examples, we achieved 68.35% and 56.41% unweighted accuracy (UA) on the FAU-AIBO and IEMOCAP datasets, respectively. The results on adversarial examples are compared against these baselines. We generated adversarial examples with different values of ε (0.1-2) to evaluate the model's performance under different perturbation factors. Figure 1 presents the emotion classification error on adversarial samples for different values of ε.

Figure 1: The error rate (%) with different perturbation factors for speech emotion classification on the FAU-AIBO (left) and IEMOCAP (right) datasets.

Figure 1 shows that the proposed attack is effective in fooling the classifier on emotion classification tasks.
With a perturbation factor of 2.0, the classification error rate increases from 31.65% to 56.87% on FAU-AIBO and from 43.59% to 66.87% on IEMOCAP.

5 Defense Mechanisms

5.1 Training with Adversarial Examples

Adversarial training of a model is considered a possible defense against adversarial attacks when the exact nature of the attack is known. Training on a mixture of clean and adversarial examples can also provide some regularization [26]. Training on adversarial samples differs from data augmentation methods, which are based on the translations expected in the test data. To the best of our knowledge, adversarial training has not previously been explored for SER or other speech/audio classification systems. We explore this phenomenon by mixing adversarial examples into the training data to assess the robustness of the model against the attack. We trained the model on training data comprising a varying percentage of adversarial examples (10% to 100% of the training data). Figure 2 shows that the classification error rate (%) significantly decreases as the percentage of adversarial examples in the training data increases.

Figure 2: The error rate (%) when varying the percentage of adversarial samples in the training data for the FAU-AIBO (left) and IEMOCAP (right) datasets.

5.2 Training with Random Noise

It is reported in [27] that adding a random noise layer to a neural network can prevent strong gradient-based attacks in the image domain. We evaluated this phenomenon for speech emotion classification by adding a small random noise to the entire training data and evaluating the performance against the proposed attacks. Table 2 shows that the emotion classification error reduces only slightly with the addition of random noise to the training data, indicating that this strategy is not particularly effective in the SER setting.
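Both defenses in Sections 5.1 and 5.2 amount to augmenting the training set before fitting the classifier. A minimal sketch follows; the mixing fraction, noise scale, and function names are illustrative assumptions, not the authors' exact procedure.

```python
import numpy as np

def mix_adversarial(x_clean, y_clean, x_adv, y_adv, fraction, seed=0):
    """Section 5.1 sketch: augment the clean training set with a given
    fraction (relative to the clean set size) of adversarial examples."""
    rng = np.random.default_rng(seed)
    n_adv = int(fraction * len(x_clean))
    idx = rng.choice(len(x_adv), size=n_adv, replace=False)
    x = np.concatenate([x_clean, x_adv[idx]])
    y = np.concatenate([y_clean, y_adv[idx]])
    perm = rng.permutation(len(x))          # shuffle clean/adversarial together
    return x[perm], y[perm]

def add_random_noise(batch, scale=0.005, seed=0):
    """Section 5.2 sketch: inject small random noise into every training
    sample; the noise scale here is an illustrative assumption."""
    rng = np.random.default_rng(seed)
    return batch + scale * rng.standard_normal(batch.shape)
```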
Table 2: Emotion classification error (%) when adding random noise to the training data

Dataset  | Adversarial perturbation | Error (max) under adversarial attack | Error after training with random noise
FAU-AIBO | Café    | 56.87 | 54.02
FAU-AIBO | Meeting | 52.58 | 49.24
FAU-AIBO | Station | 53.57 | 48.51
IEMOCAP  | Café    | 64.58 | 56.73
IEMOCAP  | Meeting | 63.88 | 52.57
IEMOCAP  | Station | 66.87 | 60.87

5.3 Using Generative Adversarial Networks

Generative adversarial networks (GANs) [28] are deep models that learn to generate samples, ideally indistinguishable from the real data x, drawn from an unknown data distribution p_data(x). A GAN consists of two networks, a generator (G) and a discriminator (D). The generator G maps latent vectors z from a known prior p_z to samples, and the discriminator is tasked with differentiating between real samples x and fake samples G(z). Mathematically, this is the min-max optimization program

min_G max_D  E_x[log(D(x))] + E_z[log(1 − D(G(z)))]   (1)

in which G and D play a game to fool each other. In our case, the G network is tasked with removing the adversarial noise from the adversarial examples z. The G network is structured as an autoencoder with LSTM layers: the encoder compresses the contextual (emotional) information of the input speech features, and the decoder uses this representation for reconstruction. The D network follows the same encoder-decoder architecture. To train G and D for the different possible scenarios, we used the training data from both datasets. For each G step, the discriminator was updated twice. For faster convergence, we pretrained the G network in each case. We trained the GAN using the RMSProp optimizer with a learning rate of 1 × 10⁻⁴ and a batch size of 32, until convergence. For training, the utterances corrupted by the three adversarial noises (café, meeting, and station) served as the noisy data, and G was tasked with cleaning the utterances.
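The objective in Equation (1) and the update schedule described above can be sketched as follows. The helper names are illustrative; a real implementation would compute D's outputs from the LSTM encoder-decoder networks rather than take them as arguments.

```python
import numpy as np

def gan_value(d_real, d_fake):
    """Value of the min-max objective in Eq. (1):
    E_x[log D(x)] + E_z[log(1 - D(G(z)))], with D outputs in (0, 1).
    D ascends this quantity; G descends the second term."""
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

def training_schedule(n_g_steps, d_steps_per_g=2):
    """Sketch of the paper's schedule: per generator update, the
    discriminator is updated twice (RMSProp, lr 1e-4, batch size 32)."""
    for _ in range(n_g_steps):
        for _ in range(d_steps_per_g):
            yield "D"
        yield "G"
```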
Data cleaned by the GAN was given to the classifier for emotion classification. Table 3 shows the emotion classification results on audio utterances cleaned by the GAN. The classification error is significantly reduced by removing the adversarial noise from the data prior to classification.

Table 3: Emotion classification error (%) when utilizing a GAN to remove adversarial noise as a defense

Dataset  | Adversarial perturbation | Error (max) under adversarial attack | Error when employing GAN prior to classification
FAU-AIBO | Café    | 68.82 | 38.31
FAU-AIBO | Meeting | 62.58 | 36.02
FAU-AIBO | Station | 66.87 | 35.14
IEMOCAP  | Café    | 65.87 | 49.20
IEMOCAP  | Meeting | 67.70 | 48.18
IEMOCAP  | Station | 69.87 | 46.24

6 Discussion

From the experimental evaluations, we find that the GAN-based defense withstands adversarial audio examples better than the other approaches. Figure 3 compares the different defense mechanisms on the two well-known datasets, IEMOCAP and FAU-AIBO. The addition of random noise to the training utterances reduces the speech emotion classification error only slightly, whereas adversarial training reduces it significantly. This supports the conclusion that training with random noise is not adequate to avoid adversarial attacks.

Figure 3: The error rate (%) for three different defenses against adversarial examples on the FAU-AIBO (left) and IEMOCAP (right) datasets.

The best results, however, are achieved using the GAN. This motivates further research on its utilization in other speech-based intelligent systems for the minimization of adversarial perturbations. It is worth pointing out that a GAN requires information about the exact type and nature of the adversarial examples for its training, but this is also an essential requirement for the adversarial training mechanism.

7 Conclusions

In this paper, we proposed a black-box method to generate adversarial perturbations in audio examples for speech emotion recognition (SER) systems.
We also proposed a defense strategy using a generative adversarial network (GAN) to enhance the robustness of SER systems: the perturbed utterances are first cleaned by the GAN, and a classifier is then run on them. We compared our GAN-based defense against adversarial training and against the addition of random noise to the training examples, and showed that the GAN-based defense provides consistently better results for speech emotion recognition. We anticipate that the attack and defense we propose can also be utilized more generally in other speech-based intelligent systems.

References

[1] Erik Cambria. Affective computing and sentiment analysis. IEEE Intelligent Systems, 31(2):102-107, 2016.
[2] Soujanya Poria, Erik Cambria, Amir Hussain, and Guang-Bin Huang. Towards an intelligent framework for multimodal affective data analysis. Neural Networks, 63:104-116, 2015.
[3] Rajib Rana. Poster: Context-driven mood mining. In Proceedings of the 14th Annual International Conference on Mobile Systems, Applications, and Services Companion, pages 143-143. ACM, 2016.
[4] Siddique Latif, Rajib Rana, Shahzad Younis, Junaid Qadir, and Julien Epps. Transfer learning for improving speech emotion classification accuracy. In Proc. Interspeech 2018, pages 257-261, 2018.
[5] Siddique Latif, Rajib Rana, Junaid Qadir, and Julien Epps. Variational autoencoders for learning latent representations of speech emotion: A preliminary study. In Proc. Interspeech 2018, pages 3107-3111, 2018.
[6] Nicholas Carlini and David Wagner. Audio adversarial examples: Targeted attacks on speech-to-text. arXiv preprint, 2018.
[7] Yuan Gong and Christian Poellabauer. Crafting adversarial examples for speech paralinguistics applications. arXiv preprint arXiv:1711.03280, 2017.
[8] Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint, 2014.
[9] Nicolas Papernot, Patrick McDaniel, Somesh Jha, Matt Fredrikson, Z. Berkay Celik, and Ananthram Swami. The limitations of deep learning in adversarial settings. In Security and Privacy (EuroS&P), 2016 IEEE European Symposium on, pages 372-387. IEEE, 2016.
[10] Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. DeepFool: A simple and accurate method to fool deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2574-2582, 2016.
[11] Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pages 39-57. IEEE, 2017.
[12] Florian Eyben, Klaus R. Scherer, Björn W. Schuller, Johan Sundberg, Elisabeth André, Carlos Busso, Laurence Y. Devillers, Julien Epps, Petri Laukka, Shrikanth S. Narayanan, et al. The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing. IEEE Transactions on Affective Computing, 7(2):190-202, 2016.
[13] Jiawei Su, Danilo Vasconcellos Vargas, and Kouichi Sakurai. One pixel attack for fooling deep neural networks. arXiv preprint, 2017.
[14] Alhussein Fawzi, Seyed-Mohsen Moosavi-Dezfooli, and Pascal Frossard. Robustness of classifiers: from adversarial to random noise. In Advances in Neural Information Processing Systems, pages 1632-1640, 2016.
[15] Moustapha Cisse, Piotr Bojanowski, Edouard Grave, Yann Dauphin, and Nicolas Usunier. Parseval networks: Improving robustness to adversarial examples. arXiv preprint, 2017.
[16] Awni Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, et al. Deep Speech: Scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567, 2014.
[17] Moustafa Alzantot, Bharathan Balaji, and Mani Srivastava. Did you hear that? Adversarial examples against automatic speech recognition.
arXiv preprint, 2018.
[18] Liwei Song and Prateek Mittal. Inaudible voice commands. arXiv preprint, 2017.
[19] Nirupam Roy, Haitham Hassanieh, and Romit Roy Choudhury. BackDoor: Making microphones hear inaudible sounds. In Proceedings of the 15th Annual International Conference on Mobile Systems, Applications, and Services, pages 2-14. ACM, 2017.
[20] Dan Iter, Jade Huang, and Mike Jermann. Generating adversarial examples for speech recognition. http://web.stanford.edu/class/cs224s/reports/Dan_Iter.pdf, 2017.
[21] Lea Schönherr, Katharina Kohls, Steffen Zeiler, Thorsten Holz, and Dorothea Kolossa. Adversarial attacks against automatic speech recognition systems via psychoacoustic hiding. arXiv preprint arXiv:1808.05665, 2018.
[22] Joachim Thiemann, Nobutaka Ito, and Emmanuel Vincent. The diverse environments multi-channel acoustic noise database: A database of multichannel environmental noise recordings. The Journal of the Acoustical Society of America, 133(5):3591-3591, 2013.
[23] Yariv Ephraim and David Malah. Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(6):1109-1121, 1984.
[24] Björn Schuller, Stefan Steidl, and Anton Batliner. The INTERSPEECH 2009 Emotion Challenge. In Tenth Annual Conference of the International Speech Communication Association, 2009.
[25] Siddique Latif, Muhammad Usman, Rajib Rana, and Junaid Qadir. Phonocardiographic sensing using deep learning for abnormal heartbeat detection. IEEE Sensors Journal, 2018.
[26] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint, 2013.
[27] Xuanqing Liu, Minhao Cheng, Huan Zhang, and Cho-Jui Hsieh. Towards robust neural networks via random self-ensemble. arXiv preprint arXiv:1712.00673, 2017.
[28] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672-2680, 2014.
