Imperceptible, Robust, and Targeted Adversarial Examples for Automatic Speech Recognition



Yao Qin 1, Nicholas Carlini 2, Ian Goodfellow 2, Garrison Cottrell 1, Colin Raffel 2

1 Department of CSE, University of California, San Diego, USA. 2 Google Brain, USA.
Correspondence to: Yao Qin <yaq007@eng.ucsd.edu>, Colin Raffel <craffel@google.com>.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

Abstract

Adversarial examples are inputs to machine learning models designed by an adversary to cause an incorrect output. So far, adversarial examples have been studied most extensively in the image domain. In this domain, adversarial examples can be constructed by imperceptibly modifying images to cause misclassification, and are practical in the physical world. In contrast, current targeted adversarial examples applied to speech recognition systems have neither of these properties: humans can easily identify the adversarial perturbations, and they are not effective when played over-the-air. This paper makes advances on both of these fronts. First, we develop effectively imperceptible audio adversarial examples (verified through a human study) by leveraging the psychoacoustic principle of auditory masking, while retaining a 100% targeted success rate on arbitrary full-sentence targets. Next, we make progress towards physical-world over-the-air audio adversarial examples by constructing perturbations which remain effective even after applying realistic simulated environmental distortions.

1. Introduction

Adversarial examples (Szegedy et al., 2013) are inputs that have been specifically designed by an adversary to cause a machine learning algorithm to produce a misclassification (Biggio et al., 2013). Initial work on adversarial examples focused mainly on the domain of image classification. In order to differentiate properties of adversarial examples on neural networks in general from properties which hold true only on images, it is important to study adversarial examples in different domains. Indeed, adversarial examples are known to exist on domains ranging from reinforcement learning (Huang et al., 2017) to reading comprehension (Jia & Liang, 2017) to speech recognition (Carlini & Wagner, 2018). This paper focuses on the last of these domains, where Carlini & Wagner (2018) showed that any given source audio sample can be perturbed slightly so that an automatic speech recognition (ASR) system will transcribe the audio as any different target sentence.

To date, adversarial examples on ASR differ from adversarial examples on images in two key ways. First, adversarial examples on images are imperceptible to humans: it is possible to generate an adversarial example without changing the 8-bit brightness representation (Szegedy et al., 2013). Conversely, adversarial examples on ASR systems are often perceptible. While the perturbation introduced is often small in magnitude, upon listening it is obvious that the added perturbation is present (Schönherr et al., 2018). Second, adversarial examples on images work in the physical world (Kurakin et al., 2016), e.g., even when taking a picture of them. In contrast, adversarial examples on ASR systems do not yet work in such an "over-the-air" setting where they are played by a speaker and recorded by a microphone.

In this paper [1], we improve the construction of adversarial examples on ASR systems and match the power of attacks on images by developing adversarial examples which are imperceptible, and make steps towards robust adversarial examples. In order to generate imperceptible adversarial examples, we depart from the common l_p distance measure widely used for adversarial example research.
Instead, we make use of the psychoacoustic principle of auditory masking, and only add the adversarial perturbation to regions of the audio where it will not be heard by a human, even if this perturbation is not "quiet" in terms of absolute energy.

Further investigating properties of adversarial examples which appear to be different from images, we examine the ability of an adversary to construct physical-world adversarial examples (Kurakin et al., 2016). These are inputs that, even after taking into account the distortions introduced by the physical world, remain adversarial upon classification. We make initial steps towards developing audio which can be played over-the-air by designing audio which remains adversarial after being processed by random room-environment simulators (Scheibler et al., 2018).

Finally, we additionally demonstrate that our attack is capable of attacking a modern, state-of-the-art Lingvo ASR system (Shen et al., 2019).

[1] The project webpage is at http://cseweb.ucsd.edu/~yaq007/imperceptible-robust-adv.html

2. Related Work

We build on a long line of work studying the robustness of neural networks. This research area largely began with Biggio et al. (2013) and Szegedy et al. (2013), who first studied adversarial examples for deep neural networks.

This paper focuses on adversarial examples on automatic speech recognition systems. Early work in this space (Gong & Poellabauer, 2017; Cisse et al., 2017) was successful when generating untargeted adversarial examples that produced incorrect, but arbitrary, transcriptions.
A concurrent line of work succeeded at generating targeted attacks in practice, even when played over a speaker and recorded by a microphone (a so-called "over-the-air" attack), but only by both (a) synthesizing completely new audio and (b) targeting older, traditional (i.e., not neural network based) speech recognition systems (Carlini et al., 2016; Zhang et al., 2017; Song & Mittal, 2017).

These two lines of work were partially unified by Carlini & Wagner (2018), who constructed adversarial perturbations for speech recognition systems targeting arbitrary (multi-word) sentences. However, this attack was neither effective over-the-air, nor was the adversarial perturbation completely inaudible; while the perturbations it introduces are very quiet, they can be heard by a human (see Section 7.2). Concurrently, the CommanderSong attack (Yuan et al., 2018) developed adversarial examples that are effective over-the-air, but at the cost of introducing a significant perturbation to the original audio.

Following this, concurrent work with ours develops attacks on deep learning ASR systems that either work over-the-air or are less obviously perceptible.

- Yakura & Sakuma (2018) create adversarial examples which can be played over-the-air. These attacks are highly effective on short two- or three-word phrases, but not on the full-sentence phrases originally studied. Further, these adversarial examples often have a significantly larger perturbation, and in one case, the perturbation introduced had a higher magnitude than the original audio.

- Schönherr et al. (2018) work towards developing attacks that are less perceptible through "Psychoacoustic Hiding" and attack the Kaldi system, which is partially based on neural networks but also uses some "traditional" components such as a Hidden Markov Model instead of an RNN for final classification.
Because of the system differences, we cannot directly compare our results to those of this paper, but we encourage the reader to listen to examples from both papers.

Our concurrent work manages to achieve both of these results (almost) simultaneously: we generate adversarial examples that are both nearly imperceptible and also remain effective after simulated distortions. Simultaneously, we target a state-of-the-art network-based ASR system, Lingvo, as opposed to Kaldi, and generate full-sentence adversarial examples as opposed to targeting short phrases.

A final line of work extends adversarial example generation on ASR systems from the white-box setting (where the adversary has complete knowledge of the underlying classifier) to the black-box setting (Khare et al., 2018; Taori et al., 2018), where the adversary is only allowed to query the system. This work is complementary to and independent of ours: we assume a white-box threat model.

3. Background

3.1. Problem Definition

Given an input audio waveform x, a target transcription y, and an automatic speech recognition (ASR) system f(·) which outputs a final transcription, our objective is to construct an imperceptible and targeted adversarial example x' that can attack the ASR system when played over-the-air. That is, we seek a small perturbation δ which enables x' = x + δ to meet three requirements:

- Targeted: the classifier is fooled so that f(x') = y and f(x) ≠ y. Untargeted adversarial examples on ASR systems often only introduce spelling errors and so are less interesting to study.

- Imperceptible: x' sounds so similar to x that humans cannot differentiate x' and x when listening to them.

- Robust: x' is still effective when played by a speaker and recorded by a microphone in an over-the-air attack. (We do not achieve this goal completely, but do succeed in simulated environments.)

3.1.1. ASR Model

We mount our attacks on the Lingvo classifier (Shen et al., 2019), a state-of-the-art sequence-to-sequence model (Sutskever et al., 2014) with attention (Bahdanau et al., 2014) whose architecture is based on the Listen, Attend and Spell model (Chan et al., 2016). It feeds filter bank spectra into an encoder consisting of a stack of convolutional and LSTM layers, which conditions an LSTM decoder that outputs the transcription. The use of the sequence-to-sequence framework allows the entire model to be trained end-to-end with the standard cross-entropy loss function.

3.1.2. Threat Model

In this paper, as is done in most prior work, we consider the white-box threat model, where the adversary has full access to the model as well as its parameters. In particular, the adversary is allowed to compute gradients through the model in order to generate adversarial examples. When we mount over-the-air attacks, we do not assume we know the exact configuration of the room in which the attack will be performed. Instead, we assume we know the distribution from which the room will be drawn, and generate adversarial examples so as to be effective on any room drawn from this distribution.

3.2. Adversarial Example Generation

Adversarial examples are typically generated by performing gradient descent with respect to the input on a loss function designed to be minimized when the input is adversarial (Szegedy et al., 2013). Specifically, let x be an input to a neural network f(·), let δ be a perturbation, and let l(f(x), y) be a loss function that is minimized when f(x) = y. Most work on adversarial examples focuses on minimizing the max-norm (l_∞ norm) of δ. The typical adversarial example generation algorithm (Szegedy et al., 2013; Carlini & Wagner, 2017; Madry et al., 2017) then solves

    minimize l(f(x + δ), y) + α · ‖δ‖   such that ‖δ‖_∞ < ε

(where in some formulations α = 0). Here, ε controls the maximum perturbation introduced. To generate adversarial examples on ASR systems, Carlini & Wagner (2018) set l to the CTC loss and use the max-norm, which has the effect of adding a small amount of adversarial perturbation consistently throughout the audio sample.

4. Imperceptible Adversarial Examples

Unlike on images, where minimizing the l_p distortion between an image and the nearest misclassified example yields a visually indistinguishable image, on audio this is not the case (Schönherr et al., 2018). Thus, in this work, we depart from l_p distortion measures and instead rely on the extensive work which has been done in the audio space for capturing the human perceptibility of audio.

4.1. Psychoacoustic Models

A good understanding of the human auditory system is critical in order to be able to construct imperceptible adversarial examples. In this paper, we use frequency masking, which refers to the phenomenon that a louder signal (the "masker") can make other signals at nearby frequencies (the "maskees") imperceptible (Mitchell, 2004; Lin & Abdulla, 2015). In simple terms, the masker can be seen as creating a "masking threshold" in the frequency domain. Any signals which fall under this threshold are effectively imperceptible.

Because the masking threshold is measured in the frequency domain, and because audio signals change rapidly over time, we first compute the short-time Fourier transform of the raw audio signal to obtain the spectrum of overlapping sections (called "windows") of the signal. The window size N is 2048 samples; windows are extracted with a "hop size" of 512 samples and are windowed with the modified Hann window. We denote by s_x(k) the k-th bin of the spectrum of frame x.
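As a concrete reference for the framing just described (window size N = 2048, hop size 512), the short-time Fourier transform can be sketched in a few lines of numpy. Note this sketch uses a standard Hann window, whereas the paper specifies a modified Hann window, so it is an approximation rather than the paper's exact preprocessing:

```python
import numpy as np

def stft_frames(audio, window_size=2048, hop=512):
    """Return the spectrum s_x(k) of each overlapping window of the signal.
    A standard Hann window is used here; the paper specifies a modified
    Hann window, so this is an approximation."""
    window = np.hanning(window_size)
    n_frames = 1 + (len(audio) - window_size) // hop
    frames = np.stack([audio[i * hop : i * hop + window_size] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)     # s[frame, k] is the k-th bin

rng = np.random.default_rng(0)
audio = rng.standard_normal(16000)         # one second of noise at 16 kHz
s = stft_frames(audio)
assert s.shape == (28, 1025)               # 28 frames, N/2 + 1 frequency bins
```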
Then, we compute the log-magnitude power spectral density (PSD) as follows:

    p_x(k) = 10 log10 |(1/N) s_x(k)|^2.    (1)

The normalized PSD estimate p̄_x(k) is defined by Lin & Abdulla (2015) as

    p̄_x(k) = 96 - max_k{p_x(k)} + p_x(k).    (2)

Masking threshold. Given an audio input, in order to compute its masking threshold we first need to identify the maskers, whose normalized PSD estimates p̄_x(k) must satisfy three criteria: 1) they must be local maxima in the spectrum; 2) they must be higher than the threshold in quiet; and 3) they must have the largest amplitude within 0.5 Bark (a psychoacoustically-motivated frequency scale) around the masker's frequency. Then, each masker's masking threshold can be approximated using the simple two-slope spread function, which is derived to mimic the excitation patterns of maskers. Finally, the global masking threshold θ_x(k) is a combination of the individual masking thresholds as well as the threshold in quiet via addition (because the effect of masking is additive in the logarithmic domain). We refer interested readers to our appendix and Lin & Abdulla (2015) for specifics on computing the masking threshold.

When we add the perturbation δ to the audio input x, if the normalized PSD estimate of the perturbation p̄_δ(k) is under the frequency masking threshold of the original audio θ_x(k), the perturbation will be masked out by the raw audio and therefore be inaudible to humans. The normalized PSD estimate of the perturbation p̄_δ(k) can be calculated via

    p̄_δ(k) = 96 - max_k{p_x(k)} + p_δ(k),    (3)

where p_δ(k) = 10 log10 |(1/N) s_δ(k)|^2 and p_x(k) = 10 log10 |(1/N) s_x(k)|^2 are the PSD estimates of the perturbation and the original audio input, respectively.

4.2. Optimization with Masking Threshold

Loss function. Given an audio example x and a target phrase y, we formulate the problem of constructing an imperceptible adversarial example x' = x + δ as minimizing the loss function l(x, δ, y), defined as

    l(x, δ, y) = l_net(f(x + δ), y) + α · l_θ(x, δ),    (4)

where l_net requires that the adversarial example fool the speech recognition system into making the targeted prediction y, where f(x) ≠ y. In the Lingvo model, the simple cross-entropy loss function is used for l_net. The term l_θ constrains the normalized PSD estimate of the perturbation p̄_δ(k) to be under the frequency masking threshold of the original audio θ_x(k). A hinge loss is used to compute the masking threshold loss:

    l_θ(x, δ) = (1 / (⌊N/2⌋ + 1)) · Σ_{k=0}^{⌊N/2⌋} max{p̄_δ(k) - θ_x(k), 0},    (5)

where N is the predefined window size and ⌊x⌋ denotes the greatest integer no larger than x. The adaptive parameter α balances the relative importance of these two criteria.

4.2.1. Two-Stage Attack

Empirically, we find it is difficult to directly minimize the masking threshold loss function via backpropagation without any constraint on the magnitude of the perturbation δ. This is reasonable because it is very challenging to fool the neural network and, at the same time, force a very large perturbation under the masking threshold in the frequency domain. In contrast, if the perturbation δ is relatively small in magnitude, then it is much easier to push the remaining distortion under the frequency masking threshold. Therefore, we divide the optimization into two stages: the first stage focuses on finding a relatively small perturbation to fool the network (as was done in prior work (Carlini & Wagner, 2018)), and the second stage makes the adversarial examples imperceptible.
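The PSD normalization of Eqns. 1-3 and the hinge loss of Eqn. 5 can be sketched directly in numpy. The threshold values below are hypothetical placeholders, not the output of a real psychoacoustic model:

```python
import numpy as np

def psd(spectrum, window_size=2048):
    """Eqn. 1: log-magnitude power spectral density of a frame."""
    return 10 * np.log10(np.abs(spectrum / window_size) ** 2 + 1e-20)

def normalized_psd(p, p_ref=None):
    """Eqn. 2: shift so the reference maximum sits at 96 dB. For the
    perturbation, the reference is the original audio's PSD (Eqn. 3)."""
    p_ref = p if p_ref is None else p_ref
    return 96.0 - p_ref.max() + p

def masking_loss(p_delta_bar, theta_x):
    """Eqn. 5: average hinge penalty over bins where the perturbation's
    normalized PSD exceeds the masking threshold."""
    return np.mean(np.maximum(p_delta_bar - theta_x, 0.0))

# Normalization places the loudest bin at 96 dB.
p_bar = normalized_psd(psd(np.array([2048.0 + 0j, 204.8 + 0j])))
assert np.isclose(p_bar.max(), 96.0)

# Hypothetical three-bin threshold: the first perturbation is fully masked;
# the second exceeds the threshold by 6 dB in a single bin.
theta = np.array([70.0, 60.0, 50.0])
assert masking_loss(np.array([65.0, 55.0, 45.0]), theta) == 0.0
assert np.isclose(masking_loss(np.array([76.0, 55.0, 45.0]), theta), 2.0)
```

Only bins that poke above the threshold contribute, so a perturbation that stays fully masked incurs zero loss regardless of its absolute energy.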
In the first stage, we set α in Eqn. 4 to zero and clip the perturbation to be within a relatively small range. As a result, the first stage solves:

    minimize l_net(f(x + δ), y)   such that ‖δ‖_∞ < ε,    (6)

where ‖δ‖_∞ is the max-norm of δ. Specifically, we begin by setting δ = 0 and then on each iteration:

    δ ← clip_ε(δ - lr_1 · sign(∇_δ l_net(f(x + δ), y))),    (7)

where lr_1 is the learning rate and ∇_δ l_net is the gradient of l_net with respect to δ. We initially set ε to a large value and then gradually reduce it during optimization, following Carlini & Wagner (2018).

The second stage focuses on making the adversarial examples imperceptible, with an unbounded max-norm; in this stage, δ is constrained only by the masking threshold. Specifically, we initialize δ with the perturbation δ* optimized in the first stage and then on each iteration:

    δ ← δ - lr_2 · ∇_δ l(x, δ, y),    (8)

where lr_2 is the learning rate and ∇_δ l is the gradient of l with respect to δ. The loss function l(x, δ, y) is defined in Eqn. 4. The parameter α, which balances the network loss l_net(f(x + δ), y) and the imperceptibility loss l_θ(x, δ), is initialized with a small value, e.g., 0.05, and is adaptively updated according to the performance of the attack. Specifically, every twenty iterations, if the current adversarial example successfully fools the ASR system (i.e., f(x + δ) = y), then α is increased to attempt to make the adversarial example less perceptible. Correspondingly, every fifty iterations, if the current adversarial example fails to produce the targeted prediction, we decrease α. We check for attack failure less frequently than success (fifty vs. twenty iterations) to allow more iterations for the network to converge. The details of the optimization algorithm are further explained in the appendix.

5. Robust Adversarial Examples

5.1. Acoustic Room Simulator

In order to improve the robustness of adversarial examples when played over-the-air, we use an acoustic room simulator to create artificial utterances (speech with reverberations) that mimic playing the audio over-the-air. The transformation function in the acoustic room simulator, denoted t, takes the clean audio x as input and outputs the simulated speech with reverberation t(x). First, the room simulator applies the classic Image Source Method (Allen & Berkley, 1979; Scheibler et al., 2018) to create the room impulse response r based on the room configuration (the room dimensions, the locations of the audio source and the target microphone, and the reverberation time). Then, the generated room impulse response r is convolved with the clean audio to create the speech with reverberation: t(x) = x * r, where * denotes the convolution operation. To make the generated adversarial examples robust to various environments, multiple room impulse responses r are used. Therefore, the transformation function t follows a chosen distribution T over different room configurations.

5.2. Optimization with Reverberations

In this section, our objective is to make the perturbed speech with reverberation (rather than the clean audio) fool the ASR system. As a result, the generated adversarial example x' = x + δ is first passed through the room simulator to create the simulated speech with reverberation t(x'), mimicking playing the adversarial example over-the-air, and then the simulated t(x') is fed as the new input to fool the ASR system, aiming at f(t(x')) = y. Simultaneously, the adversarial perturbation δ should be relatively small in order not to be audible to humans. In the same manner as the Expectation over Transformation in Athalye et al. (2018), we optimize the expectation of the loss function over different transformations t ~ T as follows:

    minimize l(x, δ, y) = E_{t~T}[l_net(f(t(x + δ)), y)]   such that ‖δ‖_∞ < ε.    (9)

Rather than directly targeting f(x + δ) = y, we apply the loss function l_net (the cross-entropy loss in the Lingvo network) to the classification of the transformed speech, targeting f(t(x + δ)) = y. We approximate the gradient of the expected value by independently sampling a transformation t from the distribution T at each gradient descent step.

In the first I_r1 iterations, we initialize ε with a sufficiently large value and gradually reduce it, following Carlini & Wagner (2018). We consider the adversarial example successful if it fools the ASR system under a single random room configuration; that is, if f(t(x + δ)) = y for just one t(·). Once this optimization is complete, we obtain the max-norm bound for δ, denoted ε*_r. We then use the resulting perturbation δ*_r as the initialization for δ in the next stage.

Then, in the following I_r2 iterations, we finetune the perturbation δ with a much smaller learning rate. The max-norm bound ε is increased to ε**_r = ε*_r + Δ, where Δ > 0, and held constant during optimization. During this phase, we only consider the attack successful if the adversarial example successfully fools a set of randomly chosen transformations Ω = {t_1, t_2, ..., t_M}, where t_i ~ T and M is the size of the set Ω. The transformation set Ω is randomly resampled from the distribution T at each gradient descent step. In other words, the adversarial example x' = x + δ generated in this stage satisfies f(t_i(x + δ)) = y for all t_i ∈ Ω. In this way, we can generate robust adversarial examples that successfully attack ASR systems when the exact room environment, whose configuration is drawn from a pre-defined distribution, is not known ahead of time.
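The room simulation t(x) = x * r and the single-sample approximation of Eqn. 9 can be sketched as follows. The scalar "impulse responses" and the quadratic loss are toy stand-ins for the Image Source Method output and the network's cross-entropy loss:

```python
import numpy as np

def simulate_room(audio, rir):
    """t(x) = x * r: convolve the audio with a room impulse response,
    truncating to the original length for convenience."""
    return np.convolve(audio, rir)[: len(audio)]

def eot_step(delta, rirs, loss_grad_fn, lr, eps, rng):
    """One gradient step approximating Eqn. 9: sample a single impulse
    response per step, descend the sign of the loss gradient with respect
    to delta, and clip delta back into the max-norm ball."""
    rir = rirs[rng.integers(len(rirs))]
    return np.clip(delta - lr * np.sign(loss_grad_fn(delta, rir)), -eps, eps)

# Toy setup: "rooms" that simply scale the audio, and a hypothetical
# quadratic loss pulling t(x + delta) toward a fixed target.
x = np.zeros(4)
target = np.array([0.2, -0.2, 0.0, 0.1])
rirs = [np.array([0.9]), np.array([1.1])]

def loss_grad_fn(delta, rir):
    y = simulate_room(x + delta, rir)     # t(x + delta)
    return rir[0] * (y - target)          # gradient of 0.5 * ||y - target||^2

rng = np.random.default_rng(0)
delta = np.zeros(4)
for _ in range(300):
    delta = eot_step(delta, rirs, loss_grad_fn, lr=0.005, eps=0.5, rng=rng)

assert np.max(np.abs(delta)) <= 0.5       # the max-norm bound holds
assert delta[0] > 0 and delta[1] < 0      # pushed toward the target
```

Because each step samples a fresh transformation, the perturbation settles on a value that works across the whole family of "rooms" rather than overfitting to one.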
More details of the algorithm are given in the appendix.

It should be emphasized that there is a tradeoff between imperceptibility and robustness (as we show experimentally in Section 7.2). If we increase the maximal amplitude of the perturbation ε**_r, the robustness can always be further improved. Correspondingly, it becomes much easier for humans to perceive the adversarial perturbation. In order to keep these adversarial examples mostly imperceptible, we therefore limit the l_∞ amplitude of the perturbation to a reasonable range.

6. Imperceptible and Robust Attacks

By combining the two techniques developed above, we now develop an approach to generate adversarial examples that are both imperceptible and robust. This can be achieved by minimizing the loss

    l(x, δ, y) = E_{t~T}[l_net(f(t(x + δ)), y) + α · l_θ(x, δ)],    (10)

where the cross-entropy loss function l_net(·) is again the loss used for Lingvo, and the imperceptibility loss l_θ(·) is the same as that defined in Eqn. 5. Since we need to fool the ASR system when the speech is played under a random room configuration, the cross-entropy loss l_net(f(t(x + δ)), y) forces the transcription of the transformed adversarial example t(x + δ) to be y (again, as done earlier).

To further make these adversarial examples imperceptible, we optimize l_θ(x, δ) to constrain the perturbation δ to fall under the masking threshold of the clean audio in the frequency domain. This is much easier than optimizing the hinge loss l_θ(t(x), t(δ)) = max{p̄_{t(δ)}(k) - θ_{t(x)}(k), 0}, because the frequency masking threshold of the clean audio θ_x(k) can be pre-computed, while the masking threshold of the speech with reverberation θ_{t(x)}(k) varies with the room reverberation r.
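A single-sample estimate of the combined objective in Eqn. 10 might look like the following sketch, where the model and PSD computations are replaced by hypothetical stand-ins:

```python
import numpy as np

def combined_loss(x, delta, t, net_loss_fn, p_bar_delta_fn, theta_x, alpha):
    """Single-sample estimate of Eqn. 10: network loss on the transformed
    adversarial input plus alpha times the masking hinge loss measured
    against the clean audio's threshold theta_x. The *_fn arguments are
    hypothetical stand-ins for the real network loss and the perturbation's
    normalized PSD."""
    robust_term = net_loss_fn(t(x + delta))
    masking_term = np.mean(np.maximum(p_bar_delta_fn(delta) - theta_x, 0.0))
    return robust_term + alpha * masking_term

# Toy check: a "room" that doubles the signal, a quadratic stand-in loss,
# and a fixed three-bin PSD exceeding the threshold by 5 dB in one bin.
x = np.zeros(3)
t = lambda a: 2.0 * a
net_loss_fn = lambda y: float(np.sum(y ** 2))
p_bar_delta_fn = lambda d: np.array([70.0, 55.0, 45.0])
theta_x = np.array([65.0, 60.0, 50.0])

loss = combined_loss(x, np.array([1.0, 0.0, 0.0]), t, net_loss_fn,
                     p_bar_delta_fn, theta_x, alpha=0.1)
assert np.isclose(loss, 4.0 + 0.1 * 5.0 / 3.0)
```

Note that only the network term sees the transformed input; the masking term is computed against the clean audio's threshold, mirroring the simplification argued for above.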
In addition, optimizing l_θ(x, δ) and l_θ(t(x), t(δ)) has similar effects, based on the convolution theorem: the Fourier transform of a convolution of two signals is the pointwise product of their Fourier transforms. Note that the speech with reverberation t(x) is a convolution of the clean audio x and a simulated room reverberation r, hence:

    F{t(x)} = F{x * r} = F{x} · F{r},    (11)

where F is the Fourier transform, * denotes the convolution operation, and · represents the pointwise product. We apply the short-time Fourier transform to the perturbation and the raw audio signal first in order to compute the power spectral density p̄_{t(δ)} and the masking threshold θ_{t(x)} in the frequency domain. Since most of the energy in the room impulse response falls within the spectral analysis window size, the convolution theorem in Eqn. 11 is approximately satisfied. Therefore, we arrive at:

    (p̄_{t(δ)} - θ_{t(x)}) ≈ (p̄_δ - θ_x) · F{r}.    (12)

As a result, simply optimizing the imperceptibility loss l_θ(x, δ) can help in finding the optimal δ and in constructing imperceptible adversarial examples that can attack ASR systems in the physical world.

Specifically, we first initialize δ with the perturbation δ**_r that makes the adversarial examples robust in Section 5. Then, in each iteration, we randomly sample a transformation t from the distribution T and update δ according to:

    δ ← δ - lr_3 · ∇_δ [l_net(f(t(x + δ)), y) + α · l_θ(x, δ)],    (13)

where lr_3 is the learning rate and α, a parameter that balances the importance of robustness and imperceptibility, is adaptively changed based on the performance of the adversarial examples.
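The convolution-theorem identity behind Eqn. 11 can be checked numerically in the circular case, which is what the STFT approximately reduces to when the analysis window is long relative to the impulse response:

```python
import numpy as np

# Eqn. 11 in the circular case: the DFT of a circular convolution equals
# the pointwise product of the DFTs.
rng = np.random.default_rng(0)
n = 64
x = rng.standard_normal(n)
r = rng.standard_normal(n)

via_fft = np.fft.ifft(np.fft.fft(x) * np.fft.fft(r)).real

direct = np.zeros(n)
for i in range(n):
    for m in range(n):
        direct[i] += x[m] * r[(i - m) % n]   # circular convolution by hand

assert np.allclose(via_fft, direct)
```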
Specifically, if the constructed adversarial example can successfully attack a set of randomly chosen transformations, then α is increased to focus more on the imperceptibility loss. Otherwise, α is decreased to make the attack more robust to multiple room environments. The implementation details are illustrated in the appendix.

7. Evaluation

7.1. Datasets and Evaluation Metrics

Datasets. We use the LibriSpeech dataset (Panayotov et al., 2015) in our experiments, which is a corpus of 16 kHz English speech derived from audiobooks and is used to train the Lingvo system (Shen et al., 2019). We randomly select 1000 audio examples from the test-clean dataset as source examples, and 1000 separate transcriptions to be the targeted transcriptions. We ensure that each target transcription is around the same length as the original transcription, because it is unrealistic and overly challenging to perturb a short audio clip (e.g., 10 words) to have a much longer transcription (e.g., 20 words). Examples of the original and targeted transcriptions are available in the appendix.

Evaluation metrics. For automatic speech recognition, we evaluate our model using the standard word error rate (WER) metric, defined as

    WER = (S + D + I) / N_W × 100%,

where S, D, and I are the numbers of substitutions, deletions, and insertions of words, respectively, and N_W is the total number of words in the reference. We also calculate the success rate (sentence-level accuracy) as

    Accuracy = N_s / N_a × 100%,

where N_a is the number of audio examples that we test and N_s is the number of audio examples that are correctly transcribed. Here, "correctly transcribed" means the original transcription for clean audio and the targeted transcription for adversarial examples.

7.2. Imperceptibility Analysis

To attack the Lingvo ASR system, we construct 1000 imperceptible and targeted adversarial examples, one for each of the examples we sampled from the LibriSpeech test-clean dataset. Table 1 shows the performance of the clean audio and the constructed adversarial examples. We can see that the word error rate (WER) of the clean audio is just 4.47% on the 1000 test examples, indicating the model is of high quality. Our imperceptible adversarial examples perform even better, and reach a 100% targeted success rate.

[Figure 1: Results of the human study for imperceptibility, showing (a) the percentage of examples judged noisy, (b) the percentage of times A is chosen as more natural in an A-vs-B comparison, and (c) the percentage of pairs judged identical. Here the baseline is the adversarial examples generated by Carlini & Wagner (2018), and ours denotes the imperceptible adversarial examples generated following the algorithm in Section 4.]

7.2.1. Qualitative Human Study

Of the 1000 examples selected from the test set, we randomly selected 100 along with their corresponding imperceptible adversarial examples. We then generate an adversarial example using the prior work of Carlini & Wagner (2018) for the same target phrase; this attack again succeeds with a 100% success rate. We perform three experiments to validate that our adversarial examples are imperceptible, especially compared to prior work.

Experimental design. We recruit 80 users online from Amazon Mechanical Turk. We give each user one of three (nearly identical) experiments, each of which we describe below. In all cases, the experiments consist of 20 "comparison tasks", where we present the evaluator with some audio samples and ask them questions (described below) about the samples.
We ask the users to listen to each sample with headphones on, and to answer a simple question about the audio samples (the question is determined by which experiment we run, as given below). We do not explain the purpose of the study other than that it is a research study, and we do not record any personally identifying information. [2] We randomly include a small number of questions with known, obvious answers; we remove 3 users from the study who failed to answer these questions correctly.

In all experiments, users have the ability to listen to audio files multiple times when they are unsure of the answer, making it as difficult as possible for our adversarial examples to pass as clean data. Users additionally have the added benefit of hearing 20 examples back-to-back, effectively "training" them to recognize subtle differences. Indeed, a permutation test finds users are statistically significantly better at distinguishing adversarial examples from clean audio during the second half of the experiment compared to the first half, although the magnitude of the difference is small: only about 3%.

[2] Unfortunately, for this reason, we are unable to report aggregate statistics such as age or gender, slightly harming potential reproducibility.

Table 1. Sentence-level accuracy and WER for 1000 clean and (imperceptibly) adversarially perturbed examples, fed without over-the-air simulation into the Lingvo model. For "Clean", the ground truth is the original transcription; for "Adversarial", the ground truth is the targeted transcription.

    Input           Clean    Adversarial
    Accuracy (%)    58.60    100.00
    WER (%)          4.47      0.00

Table 2. Sentence-level accuracy and WER for 100 clean and adversarially perturbed examples, fed with over-the-air simulation into the Lingvo model. The ground truth for "clean" inputs is the original transcription, while the ground truth for the adversarial inputs is the targeted transcription. The perturbation is bounded by ‖δ‖_∞ < ε*_r + Δ.

    Input           Clean    Robust (Δ=300)    Robust (Δ=400)    Imperceptible & Robust
    Accuracy (%)    31.37    62.96             64.64             49.65
    WER (%)         15.42    14.45             13.83             22.98

Figure 1 summarizes the statistical results given below.

Experiment 1: clean or noisy. We begin with what we believe is the most representative experiment of how an attack would work in practice. We give users one audio sample and ask them to tell us if it has any background noise (e.g., static, echoing, people talking in the background). As a baseline, users believed that 19% of the original clean audio samples contained some amount of noise, and 66% of users believed that the adversarial examples generated by Carlini & Wagner (2018) contained some amount of noise. In comparison, only 23% of users believe that the adversarial examples we generate contain any noise, a result that is not statistically significantly different from clean audio (p > .05). That is, when presented with just one audio sample in isolation, users do not believe the adversarial examples we generate are any noisier than the clean samples.

Experiment 2: identify the original. We give users two audio samples and inform them that one is a modified version of the other; we ask the user to select the audio sample which sounds more natural. This setup is much more challenging: when users can listen to both the before and the after, it is often possible to pick up on the small amount of distortion that has been added. When comparing the original audio to the adversarial examples generated by Carlini & Wagner (2018), the evaluator chose the original audio 82% of the time.
When we have the evaluator compare the imperceptible adversarial examples we generate to those of Carlini & Wagner (2018), our imperceptible examples are selected as the better audio samples 83% of the time, a difference that is not statistically distinguishable from comparing against the clean audio. However, when directly comparing the adversarial examples we generate to the clean audio, users still prefer the clean audio 66% of the time. Observe that the baseline percentage, when the samples are completely indistinguishable, is 50%. Thus, users perform only 16% better than random guessing at distinguishing our examples from clean examples.

Experiment 3: identical or not. Finally, we perform the most difficult experiment: we present users with two audio files, and ask them if the audio samples are identical, or if there are any differences. As the baseline, when given the same audio sample twice, users agreed it was identical 85% of the time. (That is, in 15% of cases the evaluator wrongly heard a difference between the two samples.) When given a clean audio sample and comparing it to the audio generated by Carlini & Wagner (2018), users believed them to be identical only 24% of the time. Comparing clean audio to the adversarial examples we generate, users believed them to be completely identical 76% of the time, 3× more often than for the adversarial examples generated by the baseline, but below the 85%-identical value for actually-identical audio.

7.3. Robustness Analysis

To mount our simulated over-the-air attacks, we consider a challenging setting in which the exact configuration of the room where the attack will be performed is unknown. Instead, we are only aware of the distribution from which the room configuration will be drawn. First, we generate 1000 random room configurations sampled from the distribution as the training room set. The test room set includes another 100 random room configurations sampled from the same distribution.
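Each room configuration induces a room transformation t(x) = x * r, the convolution of the audio with that room's impulse response r. A toy numpy sketch of this sampling-and-convolution step, where a synthetic exponentially decaying impulse response stands in for the image-method simulator (Allen & Berkley, 1979; Scheibler et al., 2018) used in the paper:

```python
import numpy as np

def sample_rir(rng, length=2000, decay=4.0):
    """Draw a synthetic room impulse response: a direct-path spike
    followed by exponentially decaying random reflections.
    (A stand-in for a proper image-method simulation.)"""
    t = np.arange(length) / length
    rir = rng.standard_normal(length) * np.exp(-decay * t)
    rir[0] = 1.0  # direct path
    return rir / np.max(np.abs(rir))

def simulate_room(audio, rng):
    """Apply one random room transformation t(x) = x * r."""
    rir = sample_rir(rng)
    # convolve and truncate back to the original length
    return np.convolve(audio, rir)[: len(audio)]

rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)  # one second of audio at 16 kHz
training_rooms = [simulate_room(clean, rng) for _ in range(5)]
```

Optimizing the perturbation against many such sampled transformations (expectation over transformations) is what makes the attack survive rooms it has never seen.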
Adversarial examples are created to attack the Lingvo ASR system when played in the simulated test rooms. We randomly choose 100 audio examples from the LibriSpeech dataset to perform this robustness test.

As shown in Table 2, when fed non-adversarial audio played in simulated test rooms, the WER of the Lingvo ASR degrades to 15.42%, which suggests some robustness to reverberation. In contrast, the success rates of the adversarial examples of Carlini & Wagner (2018) and of our imperceptible adversarial examples from Section 4 are both 0% in this setting. The success rate of our robust adversarial examples generated with the algorithm in Section 5 is over 60%, and the WER is smaller than that of the clean audio. Both the success rate and the WER demonstrate that our constructed adversarial examples remain effective when played in the highly-realistic simulated environment.

In addition, the robustness of the constructed adversarial examples can be improved further at the cost of increased perceptibility. As presented in Table 2, when we increase the max-norm bound of the amplitude of the adversarial perturbation, ε**_r = ε*_r + Δ (Δ is increased from 300 to 400), both the success rate and the WER improve correspondingly. Since our final objective is to generate imperceptible and robust adversarial examples that can be played over-the-air in the physical world, we limit the max-norm bound of the perturbation to a relatively small range to avoid a large distortion of the clean audio.

To construct imperceptible as well as robust adversarial examples, we start from the robust attack (Δ = 300) and finetune it with the imperceptibility loss.
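The finetuning step penalizes only the perturbation energy that rises above the masking threshold. A sketch of one plausible form of the imperceptibility loss ℓ_θ, a hinge on the normalized perturbation PSD against the threshold θ_x in the linear domain of Appendix B (the paper's exact weighting may differ, so treat this as an assumption-laden stand-in):

```python
import numpy as np

def imperceptibility_loss(p_delta_bar, theta_x):
    """Hinge-style penalty: only perturbation-PSD bins that exceed the
    masking threshold theta_x contribute to the loss; bins already
    masked by the clean audio are free."""
    return np.mean(np.maximum(p_delta_bar - theta_x, 0.0))
```

During finetuning this term is added to the attack loss as ℓ_net + α · ℓ_θ, with α adapted according to whether the attack is still succeeding.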
In our experiments, we observe that 81% of the robust adversarial examples [Footnote 3: The other 19% of adversarial examples lose their robustness because they cannot successfully attack the ASR system in 8 randomly chosen training rooms in any iteration during optimization.] can be further improved to be much less perceptible while still retaining high robustness (around a 50% success rate and 22.98% WER).

7.3.1. Qualitative Human Study

We run identical experiments (as described earlier) on the robust and robust-and-imperceptible adversarial examples.

In experiment 1, where we ask evaluators if there is any noise, only 6% heard any noise on the clean audio, compared to 100% on the robust (but perceptible) adversarial examples and 83% on the robust and imperceptible adversarial examples. [Footnote 4: Evaluators stated they heard noise on clean examples 3× less often compared to the baseline in the prior study. We believe this is due to the fact that when primed with examples which are obviously different, the baseline becomes more easily distinguishable.]

In experiment 2, where we ask evaluators to identify the original audio, when comparing clean to robust adversarial examples the evaluator correctly identified the original audio 97% of the time, versus 89% when comparing the clean audio to the imperceptible and robust adversarial examples.

Finally, in experiment 3, where we ask evaluators if the audio is identical, the baseline clean audio was judged different 95% of the time when compared to the robust adversarial examples, and the clean audio was judged different 71% of the time when compared to the imperceptible and robust adversarial examples.

In all cases, the imperceptible and robust adversarial examples are statistically significantly less perceptible than the merely robust adversarial examples, but also statistically significantly more perceptible than the clean audio.
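The "statistically significant" judgments above can be made with a two-sample permutation test of the kind mentioned in Section 7.2.1. A self-contained sketch on synthetic per-user accuracies (the real responses are not published, so the numbers below are illustrative only):

```python
import numpy as np

def permutation_test(a, b, n_perm=10000, seed=0):
    """p-value for H0: groups a and b share a mean, estimated by
    random relabelings of the pooled observations."""
    rng = np.random.default_rng(seed)
    observed = abs(np.mean(a) - np.mean(b))
    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        stat = abs(np.mean(pooled[: len(a)]) - np.mean(pooled[len(a):]))
        count += stat >= observed
    return count / n_perm

rng = np.random.default_rng(1)
first_half = rng.normal(0.60, 0.05, 40)   # synthetic per-user accuracy
second_half = rng.normal(0.63, 0.05, 40)  # ~3% better, as in Section 7.2.1
p = permutation_test(first_half, second_half)
```

The test is nonparametric, which suits small human-study samples where a normality assumption would be hard to defend.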
Directly comparing the imperceptible and robust adversarial examples to the robust examples, evaluators believed the imperceptible examples had less distortion 91% of the time. Clearly, the adversarial examples that are robust are significantly easier to distinguish from clean audio, even when we apply the masking threshold. However, this result is consistent with work on adversarial examples for images, where completely imperceptible physical-world adversarial examples have not been successfully constructed. On images, physical attacks require over 16× as much distortion to be effective in the physical world (see, for example, Figure 4 of Kurakin et al. (2016)).

8. Conclusion

In this paper, we successfully construct imperceptible adversarial examples (verified by a human study) for automatic speech recognition based on the psychoacoustic principle of auditory masking, while retaining a 100% targeted success rate on arbitrary full-sentence targets. Simultaneously, we also make progress towards developing robust adversarial examples that remain effective after being played over-the-air (processed by random room environment simulators), increasing the practicality of actual real-world attacks using adversarial examples targeting ASR systems.

We believe that future work is still required: our robust adversarial examples do not yet work when actually played over-the-air, despite working in simulated room environments. Resolving this difficulty while maintaining a high targeted success rate is necessary for demonstrating a practical security concern.

As a final contribution of potentially independent interest, this work demonstrates how one might go about constructing adversarial examples for non-ℓp-based metrics. Especially on images, nearly all adversarial example research has focused on this highly-limited distance measure.
Devoting effort to identifying the different methods that humans use to assess similarity, and generating adversarial examples exploiting those metrics, is an important research direction that we hope future work will explore.

Acknowledgements

The authors would like to thank Patrick Nguyen, Jonathan Shen, and Rohit Prabhavalkar for helpful discussions on the Lingvo ASR system, and Arun Narayanan for suggestions on room impulse response simulations. We also want to thank the reviewers for their useful comments. This work was greatly supported by Google Brain. GWC and YQ were also partially supported by the Guangzhou Science and Technology Planning Project (Grant No. 201704030051).

References

Allen, J. B. and Berkley, D. A. Image method for efficiently simulating small-room acoustics. The Journal of the Acoustical Society of America, 65(4):943–950, 1979.

Athalye, A., Engstrom, L., Ilyas, A., and Kwok, K. Synthesizing robust adversarial examples. In ICML, 2018.

Bahdanau, D., Cho, K., and Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv preprint, 2014.

Biggio, B., Corona, I., Maiorca, D., Nelson, B., Šrndić, N., Laskov, P., Giacinto, G., and Roli, F. Evasion attacks against machine learning at test time. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 387–402. Springer, 2013.

Carlini, N. and Wagner, D. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pp. 39–57. IEEE, 2017.

Carlini, N. and Wagner, D. A. Audio adversarial examples: Targeted attacks on speech-to-text. In 2018 IEEE Security and Privacy Workshops (SPW), pp. 1–7, 2018.

Carlini, N., Mishra, P., Vaidya, T., Zhang, Y., Sherr, M., Shields, C., Wagner, D., and Zhou, W. Hidden voice commands. In USENIX Security Symposium, pp. 513–530, 2016.
Chan, W., Jaitly, N., Le, Q., and Vinyals, O. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4960–4964. IEEE, 2016.

Cisse, M., Adi, Y., Neverova, N., and Keshet, J. Houdini: Fooling deep structured prediction models. arXiv preprint arXiv:1707.05373, 2017.

Gong, Y. and Poellabauer, C. Crafting adversarial examples for speech paralinguistics applications. arXiv preprint arXiv:1711.03280, 2017.

Huang, S., Papernot, N., Goodfellow, I., Duan, Y., and Abbeel, P. Adversarial attacks on neural network policies. arXiv preprint, 2017.

Jia, R. and Liang, P. Adversarial examples for evaluating reading comprehension systems. arXiv preprint arXiv:1707.07328, 2017.

Khare, S., Aralikatte, R., and Mani, S. Adversarial black-box attacks for automatic speech recognition systems using multi-objective genetic optimization. arXiv preprint arXiv:1811.01312, 2018.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint, 2014.

Kurakin, A., Goodfellow, I., and Bengio, S. Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533, 2016.

Lin, Y. and Abdulla, W. H. Principles of psychoacoustics. In Audio Watermark, pp. 15–49. Springer, 2015.

Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. Towards deep learning models resistant to adversarial attacks. arXiv preprint, 2017.

Mitchell, J. L. Introduction to digital audio coding and standards. Journal of Electronic Imaging, 13(2):399, 2004.

Panayotov, V., Chen, G., Povey, D., and Khudanpur, S. Librispeech: An ASR corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210. IEEE, 2015.

Scheibler, R., Bezzam, E., and Dokmanić, I. Pyroomacoustics: A Python package for audio room simulation and array processing algorithms. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 351–355. IEEE, 2018.

Schönherr, L., Kohls, K., Zeiler, S., Holz, T., and Kolossa, D. Adversarial attacks against automatic speech recognition systems via psychoacoustic hiding. arXiv preprint arXiv:1808.05665, 2018.

Shen, J., Nguyen, P., Wu, Y., Chen, Z., Chen, M. X., Jia, Y., Kannan, A., Sainath, T., Cao, Y., Chiu, C.-C., et al. Lingvo: A modular and scalable framework for sequence-to-sequence modeling. arXiv preprint, 2019.

Song, L. and Mittal, P. Inaudible voice commands. arXiv preprint, 2017.

Sutskever, I., Vinyals, O., and Le, Q. V. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pp. 3104–3112, 2014.

Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., and Fergus, R. Intriguing properties of neural networks. arXiv preprint, 2013.

Taori, R., Kamsetty, A., Chu, B., and Vemuri, N. Targeted adversarial examples for black box audio systems. arXiv preprint, 2018.

Yakura, H. and Sakuma, J. Robust audio adversarial example for a physical attack. arXiv preprint arXiv:1810.11793, 2018.

Yuan, X., Chen, Y., Zhao, Y., Long, Y., Liu, X., Chen, K., Zhang, S., Huang, H., Wang, X., and Gunter, C. A. Commandersong: A systematic approach for practical adversarial voice recognition. arXiv preprint arXiv:1801.08535, 2018.

Zhang, G., Yan, C., Ji, X., Zhang, T., Zhang, T., and Xu, W. Dolphinattack: Inaudible voice commands. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, pp. 103–117. ACM, 2017.

Appendix A.
Frequency Masking Threshold

In this section, we detail how we compute the frequency masking threshold used to construct imperceptible adversarial examples. This procedure is based on psychoacoustic principles that were refined over many years of human studies. For further background on psychoacoustic models, we refer the interested reader to (Lin & Abdulla, 2015; Mitchell, 2004).

Step 1: Identification of maskers

In order to compute the frequency masking threshold of an input signal x(n), where 0 ≤ n ≤ N, we first need to identify the maskers. There are two different classes of maskers, tonal and nontonal, where nontonal maskers have stronger masking effects than tonal maskers. Here we simply treat all maskers as tonal ones to make sure the threshold we compute can always mask out the noise.

The normalized PSD estimate of a tonal masker $\bar{p}^m_x(k)$ must meet three criteria. First, it must be a local maximum in the spectrum, satisfying

$$\bar{p}_x(k-1) \le \bar{p}^m_x(k) \quad \text{and} \quad \bar{p}^m_x(k) \ge \bar{p}_x(k+1), \tag{14}$$

where $0 \le k < N/2$. Second, the normalized PSD estimate of any masker must be higher than the threshold in quiet ATH(k):

$$\bar{p}^m_x(k) \ge \mathrm{ATH}(k), \tag{15}$$

where ATH(k) is approximated by the following frequency-dependent function:

$$\mathrm{ATH}(f) = 3.64\left(\frac{f}{1000}\right)^{-0.8} - 6.5\exp\left\{-0.6\left(\frac{f}{1000} - 3.3\right)^2\right\} + 10^{-3}\left(\frac{f}{1000}\right)^4. \tag{16}$$

The quiet threshold only applies to the human hearing range of 20 Hz ≤ f ≤ 20 kHz. When we apply a short-time Fourier transform (STFT) to a signal, the relation between the frequency f and the index of sampling points k is

$$f = \frac{k}{N} \cdot f_s, \qquad 0 \le f < \frac{f_s}{2}, \tag{17}$$

where $f_s$ is the sampling frequency and N is the window size. Last, a masker must have the highest PSD within a range of 0.5 Bark around the masker's frequency, where the Bark is a psychoacoustically-motivated frequency scale.
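Equation (16) and the bin-to-frequency mapping of (17) translate directly into code; a minimal sketch:

```python
import numpy as np

def ath(f):
    """Absolute threshold of hearing in dB, Eq. (16); valid for
    20 Hz <= f <= 20 kHz."""
    f = np.asarray(f, dtype=float) / 1000.0  # work in kHz
    return (3.64 * f ** -0.8
            - 6.5 * np.exp(-0.6 * (f - 3.3) ** 2)
            + 1e-3 * f ** 4)

def bin_to_freq(k, n_fft, fs):
    """Map an STFT bin index k to its frequency in Hz, Eq. (17)."""
    return k / n_fft * fs
```

The curve is lowest around 3–4 kHz, where human hearing is most sensitive, which is exactly where the perturbation must stay smallest to remain inaudible.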
The main human hearing range between 20 Hz and 16 kHz is divided into 24 non-overlapping critical bands, measured in Barks, varying as a function of frequency f as follows:

$$b(f) = 13\arctan\left(\frac{0.76 f}{1000}\right) + 3.5\arctan\left[\left(\frac{f}{7500}\right)^2\right]. \tag{18}$$

As the effect of masking is additive in the logarithmic domain, the PSD estimate of the masker is further smoothed with its neighbors by:

$$\bar{p}^m_x(\bar{k}) = 10\log_{10}\left[10^{\bar{p}_x(k-1)/10} + 10^{\bar{p}^m_x(k)/10} + 10^{\bar{p}_x(k+1)/10}\right]. \tag{19}$$

Step 2: Individual masking thresholds

An individual masking threshold is more conveniently computed with frequency expressed on the Bark scale, because the spreading functions of maskers are similar at different Barks. We use b(i) to represent the Bark scale of the frequency index i. A number of spreading functions have been introduced to imitate the characteristics of maskers; here we choose the simple two-slope spreading function:

$$\mathrm{SF}[b(i), b(j)] = \begin{cases} 27\,\Delta b_{ij}, & \text{if } \Delta b_{ij} \le 0, \\ G(b(i)) \cdot \Delta b_{ij}, & \text{otherwise,} \end{cases} \tag{20}$$

where

$$\Delta b_{ij} = b(j) - b(i), \tag{21}$$

$$G(b(i)) = -27 + 0.37\max\{\bar{p}^m_x(b(i)) - 40, 0\}, \tag{22}$$

and b(i) and b(j) are the Bark scales of the masker at frequency index i and the maskee at frequency index j, respectively. Then T[b(i), b(j)] denotes the contribution of the masker at Bark index b(i) to the masking effect on the maskee at Bark index b(j). Empirically, the threshold T[b(i), b(j)] is calculated by:

$$T[b(i), b(j)] = \bar{p}^m_x(b(i)) + \Delta_m[b(i)] + \mathrm{SF}[b(i), b(j)], \tag{23}$$

where $\Delta_m[b(i)] = -6.025 - 0.275\,b(i)$ and SF[b(i), b(j)] is the spreading function.

Step 3: Global masking threshold

The global masking threshold is a combination of the individual masking thresholds and the threshold in quiet, added together in the power domain.
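The Bark mapping (18) and the two-slope spreading function (20)–(23) can be sketched as follows (our own transcription of the formulas; Δb in Barks, PSD values in dB):

```python
import numpy as np

def bark(f):
    """Critical-band (Bark) scale of frequency f in Hz, Eq. (18)."""
    return 13.0 * np.arctan(0.76 * f / 1000.0) + 3.5 * np.arctan((f / 7500.0) ** 2)

def spread(delta_b, masker_psd_db):
    """Two-slope spreading function SF[b(i), b(j)], Eqs. (20)-(22):
    a fixed 27 dB/Bark slope below the masker and a level-dependent
    slope above it."""
    if delta_b <= 0:
        return 27.0 * delta_b
    g = -27.0 + 0.37 * max(masker_psd_db - 40.0, 0.0)
    return g * delta_b

def individual_threshold(masker_psd_db, b_i, b_j):
    """T[b(i), b(j)], Eq. (23): masker level, a Bark-dependent
    offset, and the spread toward the maskee."""
    delta_m = -6.025 - 0.275 * b_i
    return masker_psd_db + delta_m + spread(b_j - b_i, masker_psd_db)
```

Note the asymmetry: masking falls off at a fixed 27 dB/Bark toward lower frequencies, but louder maskers spread their masking further upward in frequency.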
The global masking threshold at frequency index i, measured in decibels (dB), is calculated according to:

$$\theta_x(i) = 10\log_{10}\left[10^{\mathrm{ATH}(i)/10} + \sum_{j=1}^{N_m} 10^{T[b(j),\, b(i)]/10}\right], \tag{24}$$

where $N_m$ is the number of selected maskers. The computed $\theta_x$ is used as the frequency masking threshold for the input audio x to construct imperceptible adversarial examples.

B. Stability in Optimization

To avoid instability during back-propagation caused by the log function in the threshold $\theta_x(k)$ and in the normalized PSD estimate of the perturbation $\bar{p}_\delta(k)$, we remove the $10\log_{10}$ term from the PSD estimates $p_\delta(k)$ and $p_x(k)$, which then become:

$$p_\delta(k) = \left|\frac{1}{N} s_\delta(k)\right|^2, \qquad p_x(k) = \left|\frac{1}{N} s_x(k)\right|^2, \tag{25}$$

and the normalized PSD of the perturbation turns into

$$\bar{p}_\delta(k) = 10^{9.6}\,\frac{p_\delta(k)}{\max_k\{p_x(k)\}}. \tag{26}$$

Correspondingly, the threshold $\theta_x(k)$ is converted to the same linear domain:

$$\theta_x(k) \leftarrow 10^{\theta_x(k)/10}. \tag{27}$$

C. Notations and Definitions

The notations and definitions used in our proposed algorithms are listed in Table 3.

D. Implementation Details

The adversarial examples generated in our paper are all optimized with the Adam optimizer (Kingma & Ba, 2014). The hyperparameters used in each section are given below.

D.1. Imperceptible Adversarial Examples

In order to construct imperceptible adversarial examples, we divide the optimization into two stages. In the first stage, the learning rate lr_1 is set to 100 and the number of iterations T_1 is 1000, as in (Carlini & Wagner, 2018). The max-norm bound ε starts from 2000 and is gradually reduced during optimization. In the second stage, the number of iterations T_2 is 4000.
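The additive combination in (24) happens in the power domain, not in dB; a minimal sketch:

```python
import numpy as np

def global_threshold_db(ath_db, individual_db):
    """Eq. (24): combine the threshold in quiet with all individual
    masking thresholds by summing their powers, then return dB."""
    powers = 10.0 ** (np.asarray(ath_db) / 10.0)
    for t in individual_db:
        powers = powers + 10.0 ** (np.asarray(t) / 10.0)
    return 10.0 * np.log10(powers)

# two equal 40 dB contributions combine to ~43 dB, not 80 dB
theta = global_threshold_db(np.array([40.0]), [np.array([40.0])])
```

Summing in the power domain is what makes two equal contributions raise the threshold by only about 3 dB, matching the additivity assumption behind Eq. (19).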
The learning rate lr_2 starts from 1 and is reduced to 0.1 after 3000 iterations. The adaptive parameter α, which balances the importance of ℓ_net and ℓ_θ, begins at 0.05 and is gradually updated based on the performance of the adversarial examples. Algorithm 1 shows the details of the two-stage optimization.

Algorithm 1: Optimization with Masking Threshold

Input: audio waveform x, target phrase y, ASR system f(·), perturbation δ, loss function ℓ(x, δ, y), hyperparameters ε and α, learning rates lr_1 and lr_2 for the first and second stages, numbers of iterations T_1 and T_2 for the first and second stages.

  # Stage 1: minimize ‖δ‖
  Initialize δ = 0, ε = 2000, α = 0
  for i = 0 to T_1 - 1 do
      δ ← δ - lr_1 · sign(∇_δ ℓ(x, δ, y))
      Clip ‖δ‖ ≤ ε
      if i mod 10 = 0 and f(x + δ) = y then
          if ε > max(|δ|) then ε ← max(|δ|) end if
          ε ← 0.8 · ε
      end if
  end for

  # Stage 2: minimize the perceptibility
  Reassign α = 0.05
  for i = 0 to T_2 - 1 do
      δ ← δ - lr_2 · ∇_δ ℓ(x, δ, y)
      if i mod 20 = 0 and f(x + δ) = y then α ← 1.2 · α end if
      if i mod 50 = 0 and f(x + δ) ≠ y then α ← 0.8 · α end if
  end for

Output: adversarial example x' = x + δ

D.2. Robust Adversarial Examples

To develop robust adversarial examples that still work after being played over-the-air, we also optimize the adversarial perturbation in two stages. The first stage aims to find a relatively small perturbation, while the second stage focuses on making the constructed adversarial example more robust to random room configurations. The learning rate lr_1 in the first stage is 50 and δ is updated for 2000 iterations. The max-norm bound ε for the adversarial perturbation δ likewise starts from 2000 and is gradually reduced. In the second stage, the number of iterations is set to 4000 and the learning rate lr_2 is 5. In this stage, ε is fixed and equals the optimized ε*_r from the first stage plus Δ. The size of the transformation set Ω is set to M = 10.
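Stage 1 of Algorithm 1 is signed-gradient descent with a shrinking max-norm bound. A toy end-to-end sketch, where a scalar-threshold "model" stands in for the ASR and we additionally keep the smallest successful δ (a practical convenience not spelled out in the pseudocode):

```python
import numpy as np

def toy_model(x):
    """Stand-in for the ASR f(.): 'transcribes' hello iff the mean is positive."""
    return "hello" if np.mean(x) > 0 else "goodbye"

def stage1_attack(x, target="hello", lr=0.05, eps=1.0, steps=200):
    """Stage-1 loop: signed-gradient steps, clip to the max-norm ball,
    shrink eps each time the attack succeeds."""
    delta = np.zeros_like(x)
    best = None
    for i in range(steps):
        # for this toy hinge-like loss the gradient sign is constant:
        # it always pushes the mean toward the target side
        grad = -np.ones_like(x) if target == "hello" else np.ones_like(x)
        delta = np.clip(delta - lr * np.sign(grad), -eps, eps)
        if i % 10 == 0 and toy_model(x + delta) == target:
            best = delta.copy()               # smallest successful delta so far
            eps = min(eps, np.max(np.abs(delta))) * 0.8
    return best, eps

x = -0.3 * np.ones(100)   # clean audio stand-in, transcribed as "goodbye"
delta, eps = stage1_attack(x)
```

The geometric ε schedule drives the bound down until the attack starts failing, at which point the last successful δ gives the smallest perturbation found.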
Table 3. Notations and definitions used in our algorithms.

    x                 The clean audio input
    δ                 The adversarial perturbation added to the clean audio
    x'                The constructed adversarial example
    y                 The targeted transcription
    f(·)              The attacked neural network (ASR)
    F(·)              Fourier transform
    k                 The index of the spectrum
    N                 The window size in the short-time Fourier transform
    s_x(k)            The k-th bin of the spectrum for audio x
    s_δ(k)            The k-th bin of the spectrum for perturbation δ
    p_x(k)            The log-magnitude power spectral density (PSD) for audio x at index k
    p̄_x(k)            The normalized PSD estimate for audio x at index k
    p_δ(k)            The log-magnitude power spectral density (PSD) for perturbation δ at index k
    p̄_δ(k)            The normalized PSD estimate for perturbation δ at index k
    θ_x(k)            The frequency masking threshold for audio x at index k
    ℓ(x, δ, y)        Loss function optimized to construct adversarial examples
    ℓ_net(·, y)       Loss function for fooling the neural network with input (·) and output y
    ℓ_θ(x, δ)         Imperceptibility loss function
    α                 A hyperparameter balancing the importance of ℓ_net and ℓ_θ
    ‖·‖               Max-norm
    ε                 Max-norm bound on the perturbation δ
    ∇_δ(·)            The gradient of (·) with respect to δ
    lr_1, lr_2, lr_3  The learning rates in gradient descent
    r                 Room reverberation
    t(·)              The room transformation associated with a room configuration
    T                 The distribution from which the transformation t(·) is sampled
    δ*_im             The optimized δ from the first stage of constructing imperceptible adversarial examples
    ε*_r              The optimized ε from the first stage of constructing robust adversarial examples
    δ*_r              The optimized δ from the first stage of constructing robust adversarial examples
    ε**_r             The max-norm bound on δ used in the second stage of constructing robust adversarial examples
    δ**_r             The optimized δ from the second stage of constructing robust adversarial examples
    Δ                 The difference ε**_r − ε*_r
    Ω                 A set of transformations sampled from distribution T
    M                 The size of the transformation set Ω

Table 4. Examples of original and targeted phrases from the LibriSpeech dataset.

    Original phrase 1: the more she is engaged in her proper duties the less leisure will she have for it even as an accomplishment and a recreation
    Targeted phrase 1: old will is a fine fellow but poor and helpless since missus rogers had her accident
    Original phrase 2: a little cracked that in the popular phrase was my impression of the stranger who now made his appearance in the supper room
    Targeted phrase 2: her regard shifted to the green stalks and leaves again and she started to move away

D.3. Imperceptible and Robust Attacks

To construct imperceptible as well as robust adversarial examples, we begin with the robust adversarial examples generated in Section D.2. In the first stage, we focus on reducing the perceptibility by setting the initial α to 0.01 and the learning rate to 1. We update the adversarial perturbation δ for 4000 iterations. If the adversarial example successfully attacks the ASR system in 4 out of 10 randomly chosen rooms, α is scaled up by a factor of 2; otherwise, every 50 iterations, α is scaled down by a factor of 0.5.

In the second stage, we focus on making the less perceptible adversarial examples more robust. The learning rate is 1.5 and α starts from a very small value, 0.00005. The perturbation is further updated for 6000 iterations. If the adversarial example successfully attacks the ASR system in 8 out of 10 randomly chosen rooms, α is scaled up by a factor of 1.2.

E. Transcription Examples

Some examples of original phrases and targeted transcriptions from the LibriSpeech dataset (Panayotov et al., 2015) are shown in Table 4.
