CommanderSong: A Systematic Approach for Practical Adversarial Voice Recognition

Xuejing Yuan 1,2, Yuxuan Chen 3, Yue Zhao 1,2, Yunhui Long 4, Xiaokang Liu 1,2, Kai Chen* 1,2, Shengzhi Zhang 3,5, Heqing Huang, XiaoFeng Wang 6, and Carl A. Gunter 4

1 SKLOIS, Institute of Information Engineering, Chinese Academy of Sciences, China
2 School of Cyber Security, University of Chinese Academy of Sciences, China
3 Department of Computer Science, Florida Institute of Technology, USA
4 Department of Computer Science, University of Illinois at Urbana-Champaign, USA
5 Department of Computer Science, Metropolitan College, Boston University, USA
6 School of Informatics and Computing, Indiana University Bloomington, USA

Abstract

The popularity of automatic speech recognition (ASR) systems, like Google Assistant and Cortana, brings in security concerns, as demonstrated by recent attacks. The impacts of such threats, however, are less clear, since they are either less stealthy (producing noise-like voice commands) or require the physical presence of an attack device (using ultrasound speakers or transducers). In this paper, we demonstrate that not only are more practical and surreptitious attacks feasible but they can even be automatically constructed. Specifically, we find that voice commands can be stealthily embedded into songs, which, when played, can effectively control the target system through ASR without being noticed. For this purpose, we developed novel techniques that address a key technical challenge: integrating the commands into a song in a way that can be effectively recognized by ASR through the air, in the presence of background noise, while not being detected by a human listener. Our research shows that this can be done automatically against real-world ASR applications¹.
We also demonstrate that such CommanderSongs can be spread through the Internet (e.g., YouTube) and radio, potentially affecting millions of ASR users. Finally, we present mitigation techniques that defend existing ASR systems against such threats.

* Corresponding author: chenkai@iie.ac.cn
¹ Demos of the attacks are uploaded on the website (https://sites.google.com/view/commandersong/).

1 Introduction

Intelligent voice control (IVC) has been widely used in human-computer interaction, such as Amazon Alexa [1], Google Assistant [6], Apple Siri [3], Microsoft Cortana [14] and iFLYTEK [11]. Running the state-of-the-art ASR techniques, these systems can effectively interpret natural voice commands and execute the corresponding operations, such as unlocking the doors of homes or cars, making online purchases, and sending messages. This has been made possible by recent progress in machine learning, deep learning [31] in particular, which vastly improves the accuracy of speech recognition. In the meantime, these deep learning techniques are known to be vulnerable to adversarial perturbations [37, 21, 27, 25, 20, 49, 28, 44]. Hence, it becomes imperative to understand the security implications of ASR systems in the presence of such attacks.

Threats to ASR. Prior research shows that carefully-crafted perturbations, even in small amounts, can cause a machine learning classifier to misbehave in unexpected ways. Although such adversarial learning has been extensively studied in image recognition, little has been done in speech recognition, potentially due to a new challenge in this domain: unlike adversarial images, which include perturbations of less noticeable background pixels, changes to voice commands often introduce noise that a modern ASR system is designed to filter out, and therefore the system cannot be easily misled.
Indeed, a recent attack on ASR utilizes noise-like hidden voice commands [22], but that white-box attack is based on a traditional speech recognition system that uses a Gaussian Mixture Model (GMM), not the DNN behind today's ASR systems. Another attack transmits inaudible commands through ultrasonic sound [53], but it exploits microphone hardware vulnerabilities instead of weaknesses of the DNN. Moreover, an attack device, e.g., an ultrasonic transducer or speaker, needs to be placed close to the target ASR system. So far little success has been reported in generating "adversarial sound" that practically fools deep learning techniques yet remains inconspicuous to human ears, while also being playable from a remote source (e.g., through YouTube) to attack a large number of ASR systems.

To find practical adversarial sound, a few technical challenges need to be addressed: (C1) the adversarial audio sample must be effective in a complicated, real-world audible environment, in the presence of electronic noise from speakers and other noises; (C2) it should be stealthy, unnoticeable to ordinary users; (C3) impactful adversarial sound should be remotely deliverable and playable by popular devices from online sources, so it can affect a large number of IVC devices. All these challenges have been found in our research to be completely addressable, indicating that the threat of audio adversarial learning is indeed realistic.

CommanderSong. More specifically, in this paper, we report a practical and systematic adversarial attack on real-world speech recognition systems. Our attack can automatically embed a set of commands into a (randomly selected) song, to spread to a large audience (addressing C3). This revised song, which we call CommanderSong, can sound completely normal to ordinary users, but will be interpreted as commands by ASR, leading to attacks on real-world IVC devices.
To build such an attack, we leverage an open-source ASR system, Kaldi [13], which includes an acoustic model and a language model. By carefully synthesizing the outputs of the acoustic model from both the song and the given voice command, we are able to generate the adversarial audio with minimum perturbations through gradient descent, so that the CommanderSong can be less noticeable to human users (addressing C2; named the WTA attack). To make such adversarial samples practical, our approach has been designed to capture the electronic noise produced by different speakers, and integrate a generic noise model into the algorithm for seeking adversarial samples (addressing C1; called the WAA attack).

In our experiment, we generated over 200 CommanderSongs that contain different commands, and attacked Kaldi with a 100% success rate in a WTA attack and a 96% success rate in a WAA attack. Our evaluation further demonstrates that such a CommanderSong can be used to perform a black-box attack on a mainstream ASR system, iFLYTEK² [11] (neither source code nor model is available). iFLYTEK has been used as the voice input method by many popular commercial apps, including WeChat (a social app with 963 million users), Sina Weibo (another social app with 530 million users), JD (an online shopping app with 270 million users), etc. To demonstrate the impact of our attack, we show that CommanderSong can be spread through YouTube, which might impact millions of users. To understand the human perception of the attack, we conducted a user study³ on Amazon Mechanical Turk [2]. Among over 200 participants, none identified the commands inside our CommanderSongs. We further developed defense solutions against this attack and demonstrated their effectiveness.

² We have reported this to iFLYTEK, and are waiting for their response.
³ The study is approved by the IRB.

Contributions.
The contributions of this paper are summarized as follows:

• Practical adversarial attack against ASR systems. We designed and implemented the first practical adversarial attacks against ASR systems. Our attack is demonstrated to be robust, working across the air in the presence of environmental interference; transferable, effective on a black-box commercial ASR system (i.e., iFLYTEK); and remotely deliverable, potentially impacting millions of users.

• Defense against CommanderSong. We design two approaches (audio turbulence and audio squeezing) to defend against the attack, which prove to be effective in our preliminary experiments.

Roadmap. The rest of the paper is organized as follows: Section 2 gives the background information of our study. Section 3 presents the motivation and overviews our approach. In Section 4, we elaborate the design and implementation of CommanderSong. In Section 5, we present the experimental results, with emphasis on the difference between machine and human comprehension. Section 6 investigates deeper understanding of CommanderSongs. Section 7 presents the defense against the CommanderSong attack. Section 8 compares our work with prior studies and Section 9 concludes the paper.

2 Background

In this section, we overview existing speech recognition systems, and discuss recent advances in attacks against both image and speech recognition systems.

2.1 Speech Recognition

Automatic speech recognition is a technique that allows machines to recognize/understand the semantics of human voice. Besides commercial products like Amazon Alexa, Google Assistant, Apple Siri, iFLYTEK, etc., there are also open-source platforms such as the Kaldi toolkit [13], Carnegie Mellon University's Sphinx toolkit [5], the HTK toolkit [9], etc.
Figure 1 presents an overview of a typical speech recognition system, with two major components: feature extraction and decoding based on pre-trained models (e.g., acoustic models and language models).

Figure 1: Architecture of Automatic Speech Recognition System.

After the raw audio is amplified and filtered, acoustic features need to be extracted from the preprocessed audio signal. The features contained in the signal change significantly over time, so short-time analysis is used to evaluate them periodically. Common acoustic feature extraction algorithms include Mel-Frequency Cepstral Coefficients (MFCC) [40], Linear Predictive Coefficients (LPC) [34], Perceptual Linear Prediction (PLP) [30], etc. Among them, MFCC is the most frequently used in both open-source toolkits and commercial products [42]. A GMM can be used to analyze the properties of the acoustic features. The extracted acoustic features are matched against pre-trained acoustic models to obtain the likelihood probability of phonemes. Hidden Markov Models (HMMs) are commonly used for statistical speech recognition. As GMMs are limited in describing a non-linear manifold of the data, the Deep Neural Network-Hidden Markov Model (DNN-HMM) has been widely used for speech recognition in the academic and industry communities since 2012 [32].

Recently, end-to-end deep learning has come into use in speech recognition systems. It applies a large-scale dataset and uses the CTC (Connectionist Temporal Classification) loss function to directly obtain characters rather than phoneme sequences. CTC locates the alignment of text transcripts with input speech using an all-neural, sequence-to-sequence neural network. Traditional speech recognition systems involve many engineered processing stages, while CTC can supersede these processing stages via deep learning [17].
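To make the MFCC-based short-time analysis described above concrete, the following is a minimal numpy sketch (framing, windowing, power spectrum, triangular mel filterbank, DCT). It is an illustration only, not the extraction code of Kaldi or any other toolkit, and it omits refinements such as pre-emphasis and cepstral liftering that production front-ends apply; the frame sizes correspond to the common 25 ms window / 10 ms hop at 16 kHz.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_fft=512, n_mels=26, n_ceps=13):
    """Short-time analysis: overlapping frames, then per-frame cepstral features."""
    # Frame the signal (25 ms windows, 10 ms hop at 16 kHz) and apply a window
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    # Power spectrum of each frame
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Triangular mel filterbank between 0 Hz and the Nyquist frequency
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    log_mel = np.log(power @ fbank.T + 1e-10)
    # DCT-II over log filterbank energies; keep the first n_ceps coefficients
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_mels))
    return log_mel @ dct.T

# One second of a 440 Hz tone yields 98 frames of 13 coefficients each
sig = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
feats = mfcc(sig)
```

The output matrix (frames by coefficients) is exactly the kind of short-time feature sequence that the acoustic model consumes, one row per analysis frame.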
The architecture of end-to-end ASR systems typically includes an encoder network corresponding to the acoustic model and a decoder network corresponding to the language model [47]. DeepSpeech [17] and Wav2Letter [24] are popular open-source end-to-end speech recognition systems.

2.2 Existing Attacks against Image and Speech Recognition Systems

Nowadays people enjoy the convenience of integrating image and speech as new input methods into mobile devices. Hence, the accuracy and dependability of image and speech recognition have a critical impact on the security of such devices. Intuitively, adversaries can compromise the integrity of the training data if they have either physical or remote access to it. By either revising existing data or inserting extra data into the training dataset, the adversaries can certainly tamper with the dependability of the trained models [38].

When adversaries do not have access to the training data, attacks are still possible. Recent research has deceived image recognition systems into making wrong decisions by slightly revising the input data. The fundamental idea is to revise an image slightly so that it "looks" different to machines than to human beings. Depending on whether the adversary knows the algorithms and parameters used in the recognition systems, there exist white-box and black-box attacks. Note that the adversary always needs to be able to interact with the target system to observe the corresponding output for any input, in both white-box and black-box attacks. Early research [50, 48, 19] focused on the revision and generation of the digital image file, which is directly fed into the image recognition systems. The state-of-the-art research [37, 21, 27] advances in terms of practicality by printing the adversarial image and presenting it to a device with image recognition functionality.
However, the success of attacks against image recognition systems had not been ported to speech recognition systems until very recently, due to the complexity of the latter. Speech, a time-domain continuous signal, contains many more features than static images. Hidden voice command [22] launched both black-box (i.e., inverse MFCC) and white-box (i.e., gradient descent) attacks against speech recognition systems, and generated obfuscated commands to ASR systems. Though seminal in attacking speech recognition systems, it is limited as a practical attack. For instance, a large amount of human effort is involved as feedback for the black-box approach, and the white-box approach is based on GMM-based acoustic models, which have been replaced by DNN-based ones in most modern speech recognition systems. The recent work DolphinAttack [53] proposed a completely inaudible voice attack by modulating commands on ultrasound carriers and leveraging microphone vulnerabilities (i.e., the nonlinearity of the microphones). As noted by the authors, such an attack can be eliminated by an enhanced microphone that can suppress acoustic signals on an ultrasound carrier, like the iPhone 6 Plus.

3 Overview

In this section, we present the motivation of our work, and overview the proposed approach to generating the practical adversarial attack.

3.1 Motivation

Recently, adversarial attacks on image classification have been extensively studied [21, 27]. Results show that even the state-of-the-art DNN-based classifier can be fooled by small perturbations added to the original image [37], producing erroneous classification results. However, the impact of adversarial attacks on the most advanced speech recognition systems, such as those integrating DNN models, has never been systematically studied. Hence, in this paper, we investigated DNN-based speech recognition systems, and explored adversarial attacks against them.
Research shows that commands can be transmitted to IVC devices through inaudible ultrasonic sound [53] and noise [22]. Even though the existing works against ASR systems are seminal, they are limited in some aspects. Specifically, ultrasonic sound can be defeated by using a low-pass filter (LPF) or by analyzing the signal frequency range, and noises are easily noticed by users. Therefore, the research in this paper is motivated by the following questions: (Q1) Is it possible to build a practical adversarial attack against ASR systems, given that most ASR systems are becoming more intelligent (e.g., by integrating DNN models) and that the generated adversarial samples must work in a very complicated physical environment, e.g., with electronic noise from speakers, background noise, etc.? (Q2) Is it feasible to generate adversarial samples (including the target commands) that are difficult, or even impossible, for ordinary users to notice, so that control over the ASR systems happens in a "hidden" fashion? (Q3) If such adversarial audio samples can be produced, is it possible to impact a large number of victims in an automated way, rather than solely relying on attackers to play the adversarial audio and affect victims nearby? Below, we detail how our attack is designed to address the above questions.

3.2 The Philosophy of Designing Our Attack

To address Q3, our idea is to choose songs as the "carrier" of the voice commands recognizable by ASR systems. The reason for choosing such a "carrier" is at least two-fold. On one hand, enjoying songs is always a preferred way for people to relax, e.g., listening to the music station, streaming music from online libraries, or just browsing YouTube for favorite programs. Moreover, such entertainment is no longer restricted to radios, CD players, or desktop computers. A mobile device, e.g., an Android phone or Apple iPhone, allows people to enjoy songs everywhere.
Hence, choosing the song as the "carrier" of the voice command automatically helps impact millions of people. On the other hand, "hiding" the desired command in the song also makes the command much more difficult for victims to notice, as long as Q2 can be reasonably addressed. Note that we do not rely on the lyrics in the song to help integrate the desired command. Instead, we intend to avoid songs with lyrics similar to our desired command. For instance, if the desired command is "open the door", choosing a song with the lyrics "open the door" would easily catch the victims' attention. Hence, we decided to use random songs as the "carrier" regardless of the desired commands.

Actually, choosing songs as the "carrier" of desired commands makes Q2 even more challenging. Our basic idea is that, when generating the adversarial samples, we revise the original song using the pure voice audio of the desired command as a reference. In particular, we find that revising the original song to generate the adversarial samples is always a trade-off between preserving the fidelity of the original song and getting the desired commands recognized from the generated sample by ASR systems. To better obfuscate the desired commands in the song, in this paper we emphasize the former over the latter. In other words, we designed our revision algorithm to maximally preserve the fidelity of the original song, at the expense of losing some recognition success rate for the desired commands. However, this expense can be compensated by integrating the same desired command multiple times into one song (the command "open the door" may only last for two seconds), and the successful recognition of one instance suffices to impact the victims.

Technically, in order to address Q2, we need to investigate the details of an ASR system.
As shown in Figure 1, an ASR system is usually composed of two pre-trained models: an acoustic model describing the relationship between audio signals and phonetic units, and a language model representing statistical distributions over sequences of words. In particular, given a piece of pure voice audio of the desired command and a "carrier" song, we can feed them into an ASR system separately, and intercept the intermediate results. By investigating the output from the acoustic model when processing the audio of the desired command, and the details of the language model, we can determine the "information" in the output that is necessary for the language model to produce the correct text of the desired command. When designing our approach, we want to ensure such "information" is only a small subset (hopefully the minimum subset) of the output from the acoustic model. Then, we carefully craft the output from the acoustic model when processing the original song, to make it "include" such "information" as well. Finally, we invert the acoustic model and the feature extraction together, to directly produce the adversarial sample based on the crafted output (with the "information" necessary for the language model to produce the correct text of the desired command).

Theoretically, the adversarial samples generated above can be recognized by the ASR systems as the desired command if directly fed as input to such systems. Since such input usually is in the form of a wave file (in "WAV" format) and the ASR systems need to expose APIs to accept the input, we define such an attack as the WAV-To-API (WTA) attack.

Figure 2: Result of decoding "Echo".

However, to implement a practical attack as in Q1, the adversarial sample should be played by a speaker to interact with IVC devices over the air. In this paper, we define such a practical attack as the WAV-Air-API (WAA) attack.
The challenge of the WAA attack is that, when the adversarial samples are played by a speaker, the electronic noise produced by the loudspeaker and the background noise in the open air have a significant impact on the recognition of the desired commands from the adversarial samples. To address this challenge, we improve our approach by integrating a generic noise model into the above algorithm, with the details in Section 4.3.

4 Attack Approach

We implement our attack by addressing two technical challenges: (1) minimizing the perturbations to the song, so that the distortion between the original song and the generated adversarial sample is as unnoticeable as possible, and (2) making the attack practical, which means CommanderSong should be played over the air to compromise IVC devices. To address the first challenge, we propose pdf-id sequence matching to incur minimum revision at the output of the acoustic model, and use gradient descent to generate the corresponding adversarial samples, as in Section 4.2. The second challenge is addressed by introducing a generic noise model to simulate both the electronic noise and background noise, as in Section 4.3. Below we elaborate the details.

4.1 Kaldi Platform

We chose the open-source speech recognition toolkit Kaldi [13], due to its popularity in the research community. Its source code on GitHub has 3,748 stars and 1,822 forks [4]. Furthermore, the corpus trained by Kaldi on "Fisher" is also used by IBM [18] and Microsoft [52].

In order to use Kaldi to decode audio, we need a trained model to begin with. There are some models on the Kaldi website that can be used for research. We took advantage of the "ASpIRE Chain Model" (referred to as the "ASpIRE model" for short), which was one of the latest released decoding models when we began our study⁴. After manually analyzing the source code of Kaldi (about 301,636 lines of shell scripts and 238,107 C++ SLOC), we completely explored how Kaldi processes audio and decodes it to text.

Firstly, Kaldi extracts acoustic features like MFCC or PLP from the raw audio. Then, based on the trained probability density function (p.d.f.) of the acoustic model, those features are taken as input to the DNN to compute the posterior probability matrix. The p.d.f. is indexed by the pdf identifier (pdf-id), which exactly indicates the column of the output matrix of the DNN.

A phoneme is the smallest unit composing a word. There are three states (each denoted as an HMM state) of sound production for each phoneme, and a series of transitions among those states can identify a phoneme. A transition identifier (transition-id) is used to uniquely identify an HMM state transition. Therefore, a sequence of transition-ids can identify a phoneme, so we name such a sequence a phoneme identifier in this paper. Note that each transition-id is also mapped to a pdf-id. During the procedure of Kaldi decoding, the phoneme identifiers can be obtained. By referring to the pre-obtained mapping between transition-ids and pdf-ids, any phoneme identifier can also be expressed as a specific sequence of pdf-ids. Such a specific sequence of pdf-ids actually is a segment of the posterior probability matrix computed by the DNN.

Table 1: Relationship between transition-id and pdf-id.

Phoneme | HMM-state | Pdf-id | Transition-id | Transition
eh_B    | 0         | 6383   | 15985         | 0 -> 1
        |           |        | 15986         | 0 -> 2
eh_B    | 1         | 5760   | 16189         | self-loop
        |           |        | 16190         | 1 -> 2
k_I     | 0         | 6673   | 31223         | 0 -> 1
        |           |        | 31224         | 0 -> 2
k_I     | 1         | 3787   | 31379         | self-loop
        |           |        | 31380         | 1 -> 2
ow_E    | 0         | 5316   | 39643         | 0 -> 1
        |           |        | 39644         | 0 -> 2
ow_E    | 1         | 8335   | 39897         | self-loop
        |           |        | 39898         | 1 -> 2
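The transition-id to pdf-id mapping in Table 1 amounts to a simple lookup: every transition-id belongs to exactly one (phoneme, HMM-state) pair, and hence to one pdf-id. A minimal sketch (using only the Table 1 entries; not Kaldi's own data structures):

```python
# Mapping taken from Table 1: each transition-id belongs to exactly one
# (phoneme, HMM-state) pair and therefore maps to a single pdf-id.
TRANSITION_TO_PDF = {
    15985: 6383, 15986: 6383,   # eh_B, state 0
    16189: 5760, 16190: 5760,   # eh_B, state 1
    31223: 6673, 31224: 6673,   # k_I,  state 0
    31379: 3787, 31380: 3787,   # k_I,  state 1
    39643: 5316, 39644: 5316,   # ow_E, state 0
    39897: 8335, 39898: 8335,   # ow_E, state 1
}

def phoneme_identifier_to_pdf_ids(transition_ids):
    """Express a phoneme identifier (transition-id sequence) as a pdf-id sequence."""
    return [TRANSITION_TO_PDF[t] for t in transition_ids]

# The phoneme eh_B entering state 0 once, then self-looping in state 1 for
# nine frames, becomes one 6383 followed by nine 5760s.
seq = phoneme_identifier_to_pdf_ids([15985] + [16189] * 9)
```

This is how a decoded transition-id sequence translates into the pdf-id sequence that the posterior probability matrix must exhibit.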
This implies that, to make Kaldi decode any specific phoneme, we need to have the DNN compute a posterior probability matrix containing the corresponding sequence of pdf-ids.

⁴ There are three decoding models on the Kaldi platform currently. The ASpIRE Chain Model used in this paper was released on October 15th, 2016, while the SRE16 Xvector Model was released on October 4th, 2017, which was not available when we began our study. The CVTE Mandarin Model, released on June 21st, 2017, was trained in Chinese [13].

To illustrate the above findings, we use Kaldi to process a piece of audio with several known words, and obtain the intermediate results, including the posterior probability matrix computed by the DNN, the transition-id sequence, the phonemes, and the decoded words. Figure 2 demonstrates the decoded result of "Echo", which contains three phonemes. The red boxes highlight the ids representing the corresponding phonemes, and each phoneme is identified by a sequence of transition-ids, i.e., a phoneme identifier. Table 1 is a segment of the relationship among the phoneme, pdf-id, transition-id, etc. By referring to Table 1, we can obtain the pdf-id sequence corresponding to the decoded transition-id sequence⁵. Hence, any posterior probability matrix demonstrating such a pdf-id sequence should be decoded by Kaldi as eh_B.

4.2 Gradient Descent to Craft Audio

Figure 3 demonstrates the details of our attack approach. Given the original song x(t) and the pure voice audio of the desired command y(t), we use Kaldi to decode them separately. By analyzing the decoding procedures, we can get the DNN output matrix A of the original song (Step 1 in Figure 3) and the phoneme identifiers of the desired command audio (Step 4 in Figure 3). The DNN's output A is a matrix containing the probability of each pdf-id at each frame.
Suppose there are n frames and k pdf-ids, and let a_{i,j} (1 <= i <= n, 1 <= j <= k) be the element at the i-th row and j-th column of A. Then a_{i,j} represents the probability of the j-th pdf-id at frame i. For each frame, we take the most likely pdf-id as the one with the highest probability in that frame, that is, m_i = argmax_j a_{i,j}. Let m = (m_1, m_2, ..., m_n); m represents the sequence of most likely pdf-ids of the original song audio x(t). For simplicity, we use g to represent the function that takes the original audio as input and outputs the sequence of most likely pdf-ids based on the DNN's predictions; that is, g(x(t)) = m.

As shown in Step 5 in Figure 3, we can extract a pdf-id sequence of the command b = (b_1, b_2, ..., b_n), where b_i (1 <= i <= n) represents the highest-probability pdf-id of the command at frame i. To have the original song decoded as the desired command, we need to identify the minimum modification δ(t) to x(t) so that m is the same as, or close to, b. Specifically, we minimize the L1 distance between m and b. As m and b are pdf-id sequences, we call this method the pdf-id sequence matching algorithm.

⁵ For instance, the pdf-id sequence for eh_B should be 6383, 5760, 5760, 5760, 5760, 5760, 5760, 5760, 5760, 5760.

Based on these observations, we construct the following objective function:

    argmin_{δ(t)} || g(x(t) + δ(t)) − b ||_1    (1)

To ensure that the modified audio does not deviate too much from the original one, we optimize the objective function Eq. (1) under the constraint |δ(t)| <= l. Finally, we use gradient descent [43], an iterative optimization algorithm for finding the local minimum of a function, to solve the objective function. Given an initial point, gradient descent follows the direction that reduces the value of the function most quickly.
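The optimization loop behind Eq. (1) can be sketched with a toy, differentiable stand-in for the acoustic model. Everything below is an illustrative assumption, not Kaldi's actual DNN: a fixed random linear layer produces per-frame pdf-id logits, and since the argmax-based L1 objective is non-differentiable, a cross-entropy surrogate toward the target pdf-ids b is minimized instead, with the perturbation clipped to the |δ(t)| <= l constraint after each step.

```python
import numpy as np

rng = np.random.default_rng(0)
n_frames, frame_len, n_pdf = 20, 50, 30

# Toy stand-in for feature extraction + DNN: per-frame logits from a fixed
# random linear layer (an illustrative assumption, not Kaldi's real model).
W = rng.normal(size=(frame_len, n_pdf))

def logits(audio):
    return audio.reshape(n_frames, frame_len) @ W

x = rng.normal(size=n_frames * frame_len)      # the "song" samples
b = rng.integers(0, n_pdf, size=n_frames)      # target pdf-id sequence

def surrogate_loss_and_grad(audio):
    """Cross-entropy toward the target pdf-ids and its gradient w.r.t. samples.

    The true objective ||g(x + d) - b||_1 involves an argmax and is not
    differentiable, so a differentiable surrogate is minimized instead.
    """
    z = logits(audio)
    z = z - z.max(axis=1, keepdims=True)       # stabilized softmax
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    loss = -np.log(p[np.arange(n_frames), b] + 1e-12).sum()
    dz = p
    dz[np.arange(n_frames), b] -= 1.0          # d(loss)/d(logits)
    return loss, (dz @ W.T).reshape(-1)        # chain rule back to the samples

l_bound, lr = 2.0, 0.005                       # perturbation bound and step size
delta = np.zeros_like(x)
losses = []
for _ in range(300):
    loss, grad = surrogate_loss_and_grad(x + delta)
    losses.append(loss)
    delta = np.clip(delta - lr * grad, -l_bound, l_bound)  # enforce |delta| <= l

m = logits(x + delta).argmax(axis=1)           # g(x + delta), ideally equal to b
```

The shape of the loop (follow the gradient, clip to the bound, stop when the value stabilizes) mirrors the procedure described in the text; the surrogate loss and the linear stand-in are the two simplifications.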
By repeating this process until the value becomes stable, the algorithm is able to find a local minimum. In particular, based on our objective function, we revise the song x(t) into x'(t) = x(t) + δ(t) with the aim of making the most likely pdf-ids g(x'(t)) equal or close to b. Therefore, the crafted audio x'(t) can be decoded as the desired command.

To further preserve the fidelity of the original song, one method is to minimize the time duration of the revision. Typically, once the pure command voice audio is generated by a text-to-speech engine, all the phonemes are determined, and so are the phoneme identifiers and b. However, the speed of the speech also determines the number of frames and the number of transition-ids in a phoneme identifier. Intuitively, slow speech always produces repeated frames or transition-ids in a phoneme. Typically, people need six or more frames to realize a phoneme, but most speech recognition systems only need three to four frames to interpret a phoneme. Hence, to introduce minimal revision to the original song, we can analyze b, reduce the number of repeated frames in each phoneme, and obtain a shorter b' = (b_1, b_2, ..., b_q), where q < n.

4.3 Practical Attack over the Air

By feeding the generated adversarial sample directly into Kaldi, the desired command can be decoded correctly. However, playing the sample through a speaker to physically attack an IVC device typically does not work. This is mainly due to the noise introduced by the speaker and the environment, as well as the distortion caused by the receiver of the IVC device.
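The frame reduction that shortens b to b' can be sketched as capping each run of repeated pdf-ids. The cap of four frames below reflects the observation that ASR systems need only three to four frames per phoneme; the exact cap and the function itself are illustrative choices, not the paper's implementation.

```python
from itertools import groupby

def reduce_repeated_frames(b, max_repeat=4):
    """Shorten a pdf-id sequence by capping each run of repeated ids.

    max_repeat=4 follows the observation that recognizers need only three
    to four frames to interpret a phoneme (the exact cap is an assumption).
    """
    out = []
    for pdf_id, run in groupby(b):
        out.extend([pdf_id] * min(len(list(run)), max_repeat))
    return out

# The ten-frame eh_B sequence from footnote 5 shrinks to five frames,
# so less of the song needs to be perturbed.
b = [6383] + [5760] * 9
b_short = reduce_repeated_frames(b)
```

Since q < n, the optimization only has to steer the acoustic model's output over a shorter stretch of frames, reducing the audible revision.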
In this paper, we do not consider the variance of background noise across different environments, e.g., grocery stores, restaurants, offices, etc., for the following reasons: (1) in a very noisy environment like a restaurant or grocery store, even the original voice command y(t) may not be correctly recognized by IVC devices; (2) modeling arbitrarily varying background noise is itself still an open research problem; (3) based on our observation, in a normal environment like a home, office, or lobby, the major impacts on the physical attack come from the electronic noise of the speaker and the distortion of the receiver of the IVC devices, rather than from the background noise.

Figure 3: Steps of attack.

Hence, our idea is to build a noise model that considers the speaker noise, the receiver distortion, as well as generic background noise, and integrate it into the approach in Section 4.2. Specifically, we carefully picked several songs and played them through our speaker in a very quiet room. By comparing the recorded audio (captured by our receiver) with the original one, we can capture the noise. Note that playing "silent" audio does not work, since the electronic noise from speakers may depend on the sound at different frequencies. Therefore, we intend to choose songs that cover more frequencies. Regarding the comparison between two pieces of audio, we have to first manually align them and then compute the difference. We redesign the objective function as shown in Eq. (2):

    argmin_{μ(t)} || g(x(t) + μ(t) + n(t)) − b ||_1    (2)

where μ(t) is the perturbation that we add to the original song, and n(t) is the noise samples that we captured. In this way, we can get the adversarial audio x'(t) = x(t) + μ(t) that can be used to launch the practical attack over the air.

Such a noise model is quite device-dependent.
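The role of n(t) in Eq. (2) can be illustrated with a small sketch: generate a noise track, save it as a "WAV" file, and mix it into every candidate x(t) + μ(t) during optimization so the perturbation must survive noisy playback. Here the captured device noise is approximated by uniform random samples in (-N, N), the device-independent variant described later in the text; the sample rate, bound, and filename are assumptions for illustration, and the WAV round-trip uses Python's stdlib wave module.

```python
import wave
import numpy as np

SR = 16000          # sample rate (an assumption for illustration)
N = 1000            # noise bound on 16-bit samples: n(t) in (-N, N)
DURATION = 2.0      # seconds

def random_noise(num_samples, bound):
    """Device-independent noise n(t): uniform random samples in (-bound, bound)."""
    return np.random.uniform(-bound, bound, num_samples).astype(np.int16)

def write_wav(path, samples, sr=SR):
    """Save mono 16-bit PCM, i.e., a 'WAV' format file representing n(t)."""
    with wave.open(path, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)    # 16-bit samples
        w.setframerate(sr)
        w.writeframes(samples.tobytes())

n_t = random_noise(int(SR * DURATION), N)
write_wav("noise.wav", n_t)   # hypothetical filename

# During optimization, each candidate x(t) + mu(t) is evaluated with the
# noise mixed in, so mu(t) must remain effective under noisy playback:
#     loss = || g(x + mu + n) - b ||_1     (Eq. 2)
```

Because the noise is regenerated rather than recorded from one specific speaker/receiver pair, the resulting perturbation is not tied to a single device, which is the point of the next paragraph.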
Since different speakers and receivers may introduce different noise and distortion when playing or receiving specific audio, x′(t) may only work with the devices we used to capture the noise. To enhance the robustness of x′(t), we introduce random noise, as shown in Eq (3), where the function rand() returns a vector of random numbers in the interval (−N, N), saved as a "WAV" format file to represent n(t):

    n(t) = rand(t), |n(t)| ≤ N.     (3)

Our evaluation results show that this approach makes the adversarial audio x′(t) robust enough for different speakers and receivers.

5 Evaluation

In this section, we present the experimental results of CommanderSong. We evaluated both the WTA and WAA attacks against machine recognition. To evaluate human comprehension, we conducted a survey examining the effects of "hiding" the desired command in the song. We then tested the transferability of the adversarial samples on other ASR platforms, and checked whether CommanderSong can spread through the Internet and radio. Finally, we measured the efficiency in terms of the time needed to generate a CommanderSong. Demos of the attacks are available at https://sites.google.com/view/commandersong/.

5.1 Experiment Setup

The pure voice audio of the desired commands can be generated by any text-to-speech (TTS) engine (e.g., Google text-to-speech [7]) or by recording human voice, as long as it can be correctly recognized by the Kaldi platform. We also randomly downloaded 26 songs from the Internet. To understand the impact of using different types of songs as the carrier, we chose songs from different categories, i.e., popular, rock, rap, and soft music. Regarding the commands to inject, we chose 12 commonly used ones such as "turn on GPS" and "ask Capital One to make a credit card payment", as shown in Table 2.
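Generating the bounded random noise of Eq (3) and saving it as a "WAV" file can be done with the standard library as below. The bound N = 300 (in 16-bit sample units) and the 16 kHz rate are assumptions for illustration; the paper does not specify them.

```python
import wave

import numpy as np

def make_random_noise_wav(path, n_samples, N=300, rate=16000):
    """Generate n(t) = rand(t) with |n(t)| <= N (Eq 3) and save it as a
    16-bit mono WAV file.  N=300 and rate=16000 are assumed values."""
    noise = np.random.randint(-N, N + 1, size=n_samples).astype(np.int16)
    with wave.open(path, "wb") as w:
        w.setnchannels(1)   # mono
        w.setsampwidth(2)   # 16-bit samples
        w.setframerate(rate)
        w.writeframes(noise.tobytes())
    return noise

noise = make_random_noise_wav("noise.wav", 16000)
```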
Regarding the computing environment, one GPU server (1075 MHz GPU with 12 GB memory, and a 512 GB hard drive) was used.

Table 2: WTA attack results.

Command                                                    Success rate (%)  SNR (dB)  Efficiency (frames/hours)
Okay google restart phone now.                             100               18.6      229/1.3
Okay google flashlight on.                                 100               14.7      219/1.3
Okay google read mail.                                     100               15.5      217/1.5
Okay google clear notification.                            100               14        260/1.2
Okay google good night.                                    100               15.6      193/1.3
Okay google airplane mode on.                              100               16.9      219/1.1
Okay google turn on wireless hot spot.                     100               14.7      280/1.6
Okay google read last sms from boss.                       100               15.1      323/1.4
Echo open the front door.                                  100               17.2      193/1.0
Echo turn off the light.                                   100               17.3      347/1.5
Okay google call one one zero one one nine one two zero.   100               14.8      387/1.7
Echo ask capital one to make a credit card payment.        100               15.8      379/1.9

5.2 Effectiveness

WTA Attack. In the WTA attack, we directly feed the generated adversarial songs to Kaldi through its exposed APIs, which accept a raw audio file as input. In particular, we injected each command into each of the 26 downloaded songs using the approach proposed in Section 4.2. In total, we obtained more than 200 adversarial songs in the "WAV" format and sent them directly to Kaldi for recognition. If Kaldi successfully identified the command injected inside, we count the attack as successful.

Table 2 shows the WTA attack results. Every command can be recognized correctly by Kaldi. A success rate of 100% means Kaldi can decode every word in the desired command correctly; the success rate is calculated as the ratio of the number of words successfully decoded to the number of words in the desired command. Note that if a decoded word differs by even one character from the corresponding word in the desired command, we consider that word incorrectly recognized. For each adversarial song, we further calculated the average signal-to-noise ratio (SNR) against the original song, as shown in Table 2.
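The word-level success rate defined above can be computed as follows. This is a sketch: the position-wise alignment of decoded words to command words is an assumption (the paper does not describe its alignment), and `word_success_rate` is a hypothetical helper name.

```python
def word_success_rate(command, decoded):
    """Percentage of command words decoded exactly, compared
    position-wise.  A word differing by even one character counts
    as not recognized."""
    cmd = command.lower().split()
    out = decoded.lower().split()
    correct = sum(1 for i, w in enumerate(cmd)
                  if i < len(out) and out[i] == w)
    return 100.0 * correct / len(cmd)

word_success_rate("echo open the front door",
                  "echo open the front door")   # -> 100.0
word_success_rate("okay google read mail",
                  "okay google red mail")       # -> 75.0
```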
SNR is widely used to quantify the ratio of signal power to noise power, so we use it to measure the distortion of the adversarial sample relative to the original song. We compute SNR(dB) = 10 log_10(P_x(t) / P_δ(t)), where the original song x(t) is the signal and the perturbation δ(t) is the noise. A larger SNR value indicates a smaller perturbation. Based on the results in Table 2, the SNR ranges from 14 to 18.6 dB, indicating that the perturbation of the original song is less than 4%. Therefore, the perturbation should be too slight to be noticed.

WAA Attack. To practically attack Kaldi over the air, the ideal case would be a commercial IVC device implemented on top of Kaldi, against which we could play our adversarial samples. However, we are not aware of any such IVC device, so we simulated a pseudo IVC device based on Kaldi. In particular, the adversarial samples are played by speakers over the air and recorded using the recording functionality of an iPhone 6S, and the recording is sent to the Kaldi API to decode. Overall, the pseudo IVC device consists of the microphone of the iPhone 6S as the audio recorder and the Kaldi system as the decoder.

We conducted the practical WAA attack in a meeting room (16 meters long, 8 meters wide, and 4 meters tall). The songs were played through three different speakers, a JBL Clip 2 portable speaker, an ASUS laptop, and a SENMATE broadcast system [16], to examine the effectiveness of the injected random noise. All of the speakers are easy to obtain and carry. The distance between the speaker and the pseudo IVC device (i.e., the microphone of the iPhone 6S) was set to 1.5 meters. We chose the two commands shown in Table 3 and generated adversarial samples. We then played them over the air using the three speakers and used the iPhone 6S to record the audio, which was sent to Kaldi to decode.
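The SNR computation above can be reproduced directly from the song and the perturbation. The sine waves below are stand-ins for real audio; the 4% power ratio illustrates why the low end of Table 2 sits near 14 dB, since 10 log_10(1/0.04) ≈ 14.

```python
import numpy as np

def snr_db(song, perturbation):
    """SNR(dB) = 10 * log10(P_x / P_delta): the ratio of the mean power
    of the original song x(t) to that of the perturbation delta(t).
    A larger SNR means a smaller perturbation."""
    p_signal = np.mean(np.square(np.asarray(song, dtype=float)))
    p_noise = np.mean(np.square(np.asarray(perturbation, dtype=float)))
    return 10.0 * np.log10(p_signal / p_noise)

t = np.linspace(0.0, 1.0, 48000)
song = np.sin(2.0 * np.pi * 100.0 * t)          # stand-in "song"
delta = 0.2 * np.sin(2.0 * np.pi * 300.0 * t)   # amplitude 0.2 -> 4% power
snr = snr_db(song, delta)                       # ~= 14 dB
```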
Table 3 shows the WAA attack results. For both commands, the JBL speaker outperforms the other two, with a success rate of up to 96%, which might indicate that its sound quality is better. All the SNRs are below 2 dB, indicating a somewhat larger perturbation of the original songs, from the signal's point of view, due to the random noise. Below we evaluate, by conducting a survey, whether this "larger" perturbation is noticeable to humans.

Human comprehension from the survey. To evaluate the effectiveness of hiding the desired command in the song, we conducted a survey on Amazon Mechanical Turk (MTurk) [2], an online marketplace for crowdsourcing intelligence.

Table 3: WAA attack results.

Command                          Speaker             Success rate (%)  SNR (dB)  Efficiency (frames/hours)
Echo ask capital one to make     JBL speaker         90                1.7
a credit card payment.           ASUS laptop         82                1.7       379/2.0
                                 SENMATE broadcast   72                1.7
Okay google call one one zero    JBL speaker         96                1.3
one one nine one two zero.       ASUS laptop         60                1.3       400/1.8
                                 SENMATE broadcast   70                1.3

We recruited 204 individuals to participate in our survey⁶. Each participant was asked to listen to 26 adversarial samples, each lasting about 20 seconds (only about four or five seconds in the middle are crafted to contain the desired command). A series of questions about each audio clip had to be answered, e.g., (1) whether they had heard the original song before; (2) whether they heard anything abnormal compared with a regular song (the four options were no, not sure, noisy, and words different than lyrics); (3) if they chose the noisy option in (2), where they believed the noise came from, while if they chose the words different than lyrics option in (2), they were asked to write down those words and how many times they listened to the song before they could recognize them.

Table 4: Human comprehension of the WTA samples.
Music classification   Listened (%)   Abnormal (%)   Recognized command (%)
Soft music             13             15             0
Rock                   33             28             0
Popular                32             26             0
Rap                    41             23             0

The entire survey lasted about five to six minutes. Each participant was compensated $0.3 for successfully completing the study, provided they passed the attention-check question designed to keep participants focused on the study. Based on our study, 63.7% of the participants are 20–40 years old and 33.3% are 40–60, and 70.6% of them use IVC devices (e.g., Amazon Echo, Google Home, smartphones) every day.

Table 4 shows the results of the human comprehension of our WTA samples. We show the average results for songs belonging to the same category; the detailed results for each individual song appear in Table 7 in the Appendix. Generally, the songs in the soft-music category are the best carriers for the desired command, with as few as 15% of participants noticing any abnormality. None of the participants could recognize any word of the desired command injected in the adversarial samples of any category. Table 5 shows the results of the human comprehension of our WAA samples. On average, 40% of the participants believed the noise was generated by the speaker or sounded like radio, while only 2.2% attributed the noise to the samples themselves. In addition, less than 1% believed that there were words other than the original lyrics. However, none of them successfully identified any word, even after repeating the songs several times.

⁶ The survey does not pose any potential risks to the participants (physical, psychological, social, legal, etc.). The questions in our survey do not involve any confidential information about the participants. We obtained IRB Exempt certificates from our institutes.

5.3 Towards the Transferability

Finally, we assess whether the proposed CommanderSong can be transferred to other ASR platforms.
Transfer from Kaldi to iFLYTEK. We chose the iFLYTEK ASR system as the target of our transfer because of its popularity. As one of the top five ASR systems in the world, it holds 70% of the market in China. Some applications supported by iFLYTEK, their downloads on Google Play, and their numbers of worldwide users are listed in Table 8 in the Appendix. In particular, iFLYTEK Input is a popular mobile voice input method that supports Mandarin, English, and personalized input [12]. iFLYREC is an online service offered by iFLYTEK to convert audio to text [10]. We used them to test the transferability of our WAA attack samples; the success rates of different commands are shown in Table 6.

Table 5: Human comprehension of the WAA samples.

Song name            Listened (%)   Abnormal (%)   Noise-speaker (%)   Noise-song (%)
Did You Need It      15             67             42                  1
Outlaw of Love       11             63             36                  2
The Saltwater Room   27             67             39                  3
Sleepwalker          13             67             41                  0
Underneath           13             68             45                  3
Feeling Good         38             59             36                  4
Average              19.5           65.2           40                  2.2

Table 6: Transferability from Kaldi to iFLYTEK.

Command             iFLYREC (%)   iFLYTEK Input (%)
Airplane mode on.   66            0
Open the door.      100           100
Good night.         100           100

Note that the WAA audio samples are fed directly to iFLYREC to decode. Meanwhile, they are played through a Bose Companion 2 speaker towards iFLYTEK Input running on an LG V20 smartphone, or through the JBL speaker towards iFLYTEK Input running on a Huawei Honor 8, MI Note 3, or iPhone 6S. The adversarial samples containing commands like "open the door" or "good night" achieve great transferability on both platforms. However, the command "airplane mode on" only reaches a 66% success rate on iFLYREC, and 0% on iFLYTEK Input.

Transferability from Kaldi to DeepSpeech. We also tried to transfer CommanderSong from Kaldi to DeepSpeech, an open-source end-to-end ASR system.
We directly fed several adversarial WTA and WAA attack samples to DeepSpeech, but none of them could be decoded correctly. As Carlini et al. have successfully modified arbitrary audio into a command recognizable by DeepSpeech [23], we leveraged their open-source algorithm to examine whether it is possible to generate one adversarial sample effective against both platforms. In this experiment, we started with 10 adversarial samples generated by CommanderSong, from either the WTA or the WAA attack, carrying commands like "Okay google call one one zero one one nine one two zero", "Echo open the front door", and "Echo turn off the light". We applied their algorithm to modify the samples until DeepSpeech could decode the target commands correctly. We then tested the newly generated samples against Kaldi as a WTA attack, and Kaldi could still successfully recognize them. We did not perform the WAA attack, since their algorithm targeting DeepSpeech cannot achieve attacks over the air. These preliminary evaluations on transferability give us an opportunity to understand CommanderSong and to design a systematic transfer approach in the future.

5.4 Automated Spreading

Since our WAA attack samples can be used to launch practical adversarial attacks against ASR systems, we want to explore the potential channels that could be leveraged to impact a large number of victims automatically.

Online sharing. We consider online sharing platforms like YouTube for spreading CommanderSong. We picked one five-second adversarial sample embedded with the command "open the door" and used the Windows Movie Maker software to make a video, since YouTube only supports video uploading. The sample was repeated four times to make the full video about 20 seconds long. We then connected our desktop's audio output to the Bose Companion 2 speaker and installed iFLYTEK Input on the LG V20 smartphone.
In this experiment, the distance between the speaker and the phone could be up to 0.5 meters, and iFLYTEK Input could still decode the command successfully.

Radio broadcasting. In this experiment, we used HackRF One [8], a Software Defined Radio (SDR) device, to broadcast our CommanderSong at FM 103.4 MHz, simulating a radio station. We set up a radio at the corresponding frequency so it could receive and play the CommanderSong. We ran the WeChat⁷ application and enabled iFLYTEK Input on different smartphones, including an iPhone 6S, Huawei Honor 8, and XiaoMi MI Note 3. iFLYTEK Input always successfully recognized the command "open the door" from the audio played by the radio and displayed it on the screen.

5.5 Efficiency

We also evaluated the cost of generating a CommanderSong in terms of the required time. For each command, we recorded the time to inject it into different songs and computed the average. Since the time required also depends on the length of the desired command, we define efficiency as the ratio of the number of frames of the desired command to the required time. Tables 2 and 3 show the efficiency of generating WTA and WAA samples for different commands. Most of the adversarial samples can be generated in less than two hours, and some simple commands like "Echo open the front door" can be done within half an hour. However, we did notice that some special words (such as "GPS" and "airplane") in a command lengthen the generation time. Probably those words are not commonly used in the training process of Kaldi's "ASpIRE model", so generating enough phonemes to represent them is time-consuming.
Furthermore, we find that, for some songs in the rock category such as "Bang bang" and "Roaked", it usually takes longer to generate the adversarial samples for the same command compared with songs in other categories, probably due to their unstable rhythm.

6 Understanding the Attacks

We try to develop a deeper understanding of the attacks, which could potentially help derive defense approaches. We raise some questions and perform further analysis on the attacks.

⁷ WeChat is the most popular instant messaging application in China, with approximately 963,000,000 users all over the world as of June 2017 [15].

Figure 4: SNR impacts on the correlation of the audios and the success rate of the adversarial audios.

In what ways does the song help the attack? We use songs as the carriers of commands to attack ASR systems. Obviously, one benefit of using a song is to prevent listeners from becoming aware of the attack; CommanderSong can also be easily spread through YouTube, radio, TV, etc. Does the song itself help generate the adversarial audio samples? To answer this question, we used a piece of silent audio as the "carrier" to generate a CommanderSong A_cs (WAA attack), and tested its effectiveness. The results show that this A_cs works, which is aligned with our findings: a random song can serve as the "carrier" because a piece of silent audio can be viewed as a special song. However, after listening to this A_cs, we found that it sounds quite similar to the injected command, meaning any user can easily notice it, so it is not the adversarial sample we desire. Note that, in our human subject study, none of the participants recognized any command from the generated CommanderSongs. We assume that some phonemes, or even smaller units, in the original song work together with the injected small perturbations to form the target command. To verify this assumption, we prepared a song A_s and used it to generate the CommanderSong A_cs.
Then we calculate the difference ∆(A_s, A_cs) between them, and try to attack ASR systems using ∆(A_s, A_cs) alone. However, after several rounds of testing, we find that ∆(A_s, A_cs) does not work, which indicates that the pure perturbation we injected cannot be recognized as the target command.

Recall that in Table 5, the songs in the soft-music category are proven to be the best carriers, with the lowest abnormality identified by participants. Based on the findings above, it appears that such songs align better with the phonemes or smaller "units" of the target command and thereby help the attack. This is also the reason why ∆(A_s, A_cs) alone cannot attack successfully: the "units" in the song combined with ∆(A_s, A_cs) together construct the phonemes of the target command.

Figure 5: Explanation of Kaldi and human recognition of the audios.

What is the impact of noise in generating adversarial samples? As mentioned earlier, we build a generic random-noise model to perform the WAA attack over the air. To understand the impact of the noise in generating adversarial samples, we crafted CommanderSongs using noises with different amplitude values. We then observed the differences between the CommanderSong and the original song, the differences between the CommanderSong and the pure command audio, and the success rates of the CommanderSong attacks. To characterize the difference, we leverage Spearman's rank correlation coefficient [46] (Spearman's rho for short) to represent the similarity between two pieces of audio. Spearman's rho is widely used to represent the correlation between two variables, and can be calculated as:

    r(X, Y) = Cov(X, Y) / sqrt(Var[X] Var[Y]),

where X and Y are the MFCC features of the two pieces of audio, Cov(X, Y) is the covariance of X and Y, and Var[X] and Var[Y] are the variances of X and Y, respectively. The results are shown in Figure 4.
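The similarity metric can be sketched as follows. Note that Spearman's rho is the formula in the text applied to the ranks of the two feature vectors; the rank step below assumes distinct values (ties would need average ranks), and the toy vectors stand in for real flattened MFCC features.

```python
import numpy as np

def spearman_rho(x, y):
    """Spearman's rank correlation: apply
        r(X, Y) = Cov(X, Y) / sqrt(Var[X] * Var[Y])
    to the ranks of the two vectors (here, stand-ins for the MFCC
    features of two audio clips).  Assumes no tied values."""
    rx = np.argsort(np.argsort(x)).astype(float)  # ranks 0..n-1
    ry = np.argsort(np.argsort(y)).astype(float)
    cov = np.mean((rx - rx.mean()) * (ry - ry.mean()))
    return cov / np.sqrt(rx.var() * ry.var())

a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
spearman_rho(a, 2 * a + 1)   # monotone transform -> rho = 1
spearman_rho(a, -a)          # reversed order -> rho = -1
```

Because rho depends only on ranks, any monotone rescaling of one clip's features (e.g., volume changes) leaves the similarity score unchanged.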
The x-axis in the figure shows the SNR (in dB) of the noise, and the y-axis gives the correlation. From the figure, we find that the correlation between the CommanderSong and the original song (red line) decreases with SNR. This means the CommanderSong sounds less like the original song as the amplitude of the noise grows, mainly because the original song has to be modified more to yield a CommanderSong robust against the introduced noise. On the contrary, the CommanderSong becomes more similar to the target command audio as the amplitude of the noise increases (i.e., as SNR decreases in the figure, blue line), which means that the CommanderSong sounds more like the target command. The success rate (black dotted line) also increases as SNR decreases. We also note that, when SNR = 4 dB, the success rate can be as high as 88%, while the correlation between the CommanderSong and the original song is still 90%, which indicates high similarity.

Figure 6: Audio turbulence defense.

Figure 5 shows the results from another perspective. Suppose the dark blue circle is the set of audios that can be recognized as commands by ASR systems, while the light blue circle and the red one represent the sets of audios recognized as commands and songs by humans, respectively. At first, the original song is in the red circle, meaning that neither ASR systems nor humans recognize any command inside. The WTA attack slightly modifies the song so that the open-source system Kaldi can recognize the command while humans cannot. As noise is introduced to generate CommanderSongs for WAA attacks, the CommanderSong falls into the light blue area step by step, and in the end is recognized by humans. Therefore, attackers can choose the amplitude of the noise to balance robustness to noise against identifiability by human users.
7 Defense

We propose two approaches to defend against CommanderSong: audio turbulence and audio squeezing. The first defense is effective against WTA but not WAA, while the second works against both attacks.

Audio turbulence. From the evaluation, we observe that noise (e.g., from the speaker or background) decreases the success rate of CommanderSong while having little impact on the recognition of an audio command. So our basic idea is to add noise (referred to as turbulence noise A_n) to the input audio A_I before it is received by the ASR system, and check whether the resultant audio A_I + A_n is interpreted as different words. In particular, as shown in Figure 6, A_I is decoded as text1 by the ASR system. We then add A_n to A_I and let the ASR system extract text2 from A_I + A_n. If text1 ≠ text2, we report that a CommanderSong is detected.

We ran experiments to test the effectiveness of this defense. The target command "open the door" was used to generate a CommanderSong. Figure 7 shows the result. The x-axis shows the SNR (of A_I to A_n), and the y-axis shows the success rate. We found that the success rate of WTA drops dramatically as the SNR decreases. When SNR = 15 dB, WTA almost always fails while A_I can still be successfully recognized, which means this approach works against WTA. However, the success rate of WAA remains very high, mainly because CommanderSongs for WAA are generated using random noise, which makes them robust against turbulence noise.

Figure 7: The results of the audio turbulence defense.

Audio squeezing. The second defense is to reduce the sampling rate of the input audio A_I (akin to squeezing the audio). Instead of adding A_n as in the audio-turbulence defense, we downsample A_I (referred to as D(A_I)). As before, the ASR system decodes A_I and D(A_I), obtaining text1 and text2, respectively. If text1 ≠ text2, a CommanderSong is detected.
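Both detection pipelines can be sketched as below. This is a sketch, not the authors' implementation: `asr_decode` is a hypothetical stand-in for the ASR system (a real deployment would call Kaldi), and the naive decimation omits the low-pass filtering a production resampler would apply.

```python
import numpy as np

def add_turbulence(audio, snr_db=15.0, rng=None):
    """Add turbulence noise A_n to the input A_I at a chosen SNR,
    using SNR(dB) = 10*log10(P_signal / P_noise)."""
    rng = rng or np.random.default_rng()
    p_signal = np.mean(np.square(audio))
    p_noise = p_signal / (10.0 ** (snr_db / 10.0))
    return audio + rng.normal(0.0, np.sqrt(p_noise), size=audio.shape)

def squeeze(audio, ratio=0.7):
    """Downsample A_I, keeping a fraction ratio = 1/M of the samples.
    Naive decimation for illustration only."""
    keep = np.linspace(0, len(audio) - 1, int(len(audio) * ratio)).astype(int)
    return audio[keep]

def detect_commandersong(audio, asr_decode, defense="squeezing"):
    """Decode A_I and its transformed version; flag the input as a
    CommanderSong when the two transcripts disagree (text1 != text2)."""
    variant = squeeze(audio) if defense == "squeezing" else add_turbulence(audio)
    return asr_decode(audio) != asr_decode(variant)

# toy stand-in ASR (hypothetical) and a benign input: no detection
toy_asr = lambda a: "song"
audio = np.sin(np.linspace(0.0, 20.0 * np.pi, 8000))
```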
Similar to the previous experiment, we evaluated the effectiveness of this approach; the results are shown in Figure 8. The x-axis shows the downsampling ratio 1/M, where M is the downsampling (decimation) factor, meaning the original sampling rate is M times the downsampled rate. When 1/M = 0.7 (e.g., if the sampling rate is 8000 samples/second, the downsampled rate is 5600 samples/second), the success rates of WTA and WAA are 0% and 8%, respectively, while A_I can still be recognized successfully at a rate of 91%. This means that audio squeezing is effective against both WTA and WAA.

Figure 8: Audio squeezing defense result.

8 Related Work

Attacks on ASR systems. Prior to our work, many researchers have studied security issues of speech-controllable systems [36, 35, 26, 41, 51, 22, 53, 23]. Kune et al. found a vulnerability of analog sensors and injected bogus voice signals to attack the microphone [36]. Kasmi et al. showed that, by leveraging intentional electromagnetic interference on headset cables, voice commands can be injected and carried by FM signals, which are then received and interpreted by smartphones [35]. Diao et al. demonstrated that, through a permission-bypassing attack on Android smartphones, voice commands can be played by apps with zero permissions [26]. Mukhopadhyay et al. considered voice impersonation attacks to contaminate a voice-based user authentication system [41]: they reconstructed a victim's voice model from the victim's voice data and launched attacks that can bypass voice authentication systems. Different from these attacks, we attack the machine learning models of ASR systems.

Hidden Voice Commands [22] launched both black-box (i.e., inverse MFCC) and white-box (i.e., gradient descent) attacks against ASR systems with GMM-based acoustic models. Different from that work, our target is a DNN-based ASR system.
Recently, Carlini and Wagner showed how to construct targeted audio adversarial examples against DeepSpeech, an end-to-end open-source ASR platform [23]. To perform that attack, the adversary needs to directly upload the adversarial WAV file to the speech recognition system. Our attacks on Kaldi are concurrent with their work, and our attack approaches are independent of theirs. Moreover, our attacks succeed in a more practical setting, in which the adversarial audio is played over the air. The recent DolphinAttack work [53] proposed a completely inaudible voice attack, modulating commands onto ultrasound carriers and exploiting microphone vulnerabilities. As noted by its authors, such an attack can be eliminated by filtering out the ultrasound carrier (e.g., on the iPhone 6 Plus). In contrast, our attack uses songs instead of ultrasound as the carrier, making it harder to defend against.

Adversarial research on machine learning. Besides attacks on speech recognition systems, there has been substantial work on adversarial machine learning examples in the physical world. Kurakin et al. [37] showed that the Inception v3 image classification neural network can be compromised by adversarial images. Brown et al. [21] showed that by adding a universal patch to an image they could fool image classifiers. Evtimov et al. [27] proposed a general algorithm that produces adversarial perturbations on images robust to physical conditions in the real world; they successfully fooled road-sign classifiers into misclassifying a real stop sign. Different from these works, our study targets speech recognition systems.

Defenses against adversarial machine learning. Defending against adversarial attacks is known to be a challenging problem. Existing defenses include adversarial training and defensive distillation.
Adversarial training [39] adds adversarial examples to the model's training set to increase its robustness against them. Defensive distillation [33] trains the model with the class-probability outputs of an earlier model trained on the same task. Both defenses perform a kind of gradient masking [45], which makes it harder for the adversary to compute the gradient direction. He et al. [29] attempted to combine multiple defenses, including feature squeezing and the specialist approach, into a single stronger defense; they argued that defenses should be evaluated against strong attacks and adaptive adversarial examples. Most of these defenses are effective for white-box attacks but not for black-box ones. Binary classification is another simple and effective defense for white-box attacks that requires no modification of the underlying systems: a binary classifier is built to separate adversarial examples from clean data. Like adversarial training and defensive distillation, this defense suffers from limited generalization. In this paper, we propose two novel defenses against the CommanderSong attack.

9 Conclusion

In this paper, we perform practical adversarial attacks on ASR systems by injecting "voice" commands into songs (CommanderSong). To the best of our knowledge, this is the first systematic approach to generating such practical attacks against DNN-based ASR systems. A CommanderSong can make ASR systems execute the command while being played over the air, without users noticing. Our evaluation shows that CommanderSong can be transferred to iFLYTEK, impacting popular apps such as WeChat, Sina Weibo, and JD with billions of users. We also demonstrated that CommanderSong can be spread through YouTube and radio. Two approaches (audio turbulence and audio squeezing) are proposed to defend against CommanderSong.
Acknowledgments

IIE authors are supported in part by the National Key R&D Program of China (No. 2016QY04W0805), NSFC U1536106 and 61728209, the National Top-notch Youth Talents Program of China, the Youth Innovation Promotion Association CAS, and the Beijing Nova Program. The Indiana University author is supported in part by NSF 1408874, 1527141, 1618493 and ARO W911NF1610127. The University of Illinois authors are supported in part by NSF CNS grants 13-30491, 14-08944, and 15-13939.

References

[1] Amazon Alexa. https://developer.amazon.com/alexa.
[2] Amazon Mechanical Turk. https://www.mturk.com.
[3] Apple Siri. https://www.apple.com/ios/siri.
[4] Aspire. https://github.com/kaldi-asr/kaldi/tree/master/egs/aspire.
[5] CMUSphinx. https://cmusphinx.github.io/.
[6] Google Assistant. https://assistant.google.com.
[7] Google Text-to-speech. https://play.google.com/store/apps.
[8] HackRF One. https://greatscottgadgets.com/hackrf/.
[9] HTK. http://htk.eng.cam.ac.uk/.
[10] iFLYREC. https://www.iflyrec.com/.
[11] iFLYTEK. http://www.iflytek.com/en/index.html.
[12] iFLYTEK Input. http://www.iflytek.com/en/mobile/iflyime.html.
[13] Kaldi. http://kaldi-asr.org.
[14] Microsoft Cortana. https://www.microsoft.com/en-us/cortana.
[15] Number of monthly active WeChat users from 2nd quarter 2010 to 2nd quarter 2017 (in millions). https://www.statista.com/statistics/255778/number-of-active-wechat-messenger-accounts/.
[16] SENMATE broadcast. http://www.114pifa.com/p106/34376.html.
[17] Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, Jingliang Bai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Qiang Cheng, Guoliang Chen, et al. Deep Speech 2: End-to-end speech recognition in English and Mandarin. In International Conference on Machine Learning, pages 173–182, 2016.
[18] Kartik Audhkhasi, Brian Kingsbury, Bhuvana Ramabhadran, George Saon, and Michael Picheny.
Building competitive direct acoustics-to-word models for English conversational speech recognition. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018.
[19] Wei Bao, Hong Li, Nan Li, and Wei Jiang. A liveness detection method for face recognition based on optical flow field. In Image Analysis and Signal Processing, 2009 (IASP 2009), International Conference on, pages 233–236. IEEE, 2009.
[20] Battista Biggio, Igino Corona, Davide Maiorca, Blaine Nelson, Nedim Šrndić, Pavel Laskov, Giorgio Giacinto, and Fabio Roli. Evasion attacks against machine learning at test time. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 387–402. Springer, 2013.
[21] Tom B. Brown, Dandelion Mané, Aurko Roy, Martín Abadi, and Justin Gilmer. Adversarial patch. 31st Conference on Neural Information Processing Systems (NIPS), 2017.
[22] Nicholas Carlini, Pratyush Mishra, Tavish Vaidya, Yuankai Zhang, Micah Sherr, Clay Shields, David Wagner, and Wenchao Zhou. Hidden voice commands. In USENIX Security Symposium, pages 513–530, 2016.
[23] Nicholas Carlini and David Wagner. Audio adversarial examples: Targeted attacks on speech-to-text. Deep Learning and Security Workshop, 2018.
[24] Ronan Collobert, Christian Puhrsch, and Gabriel Synnaeve. Wav2Letter: An end-to-end ConvNet-based speech recognition system. arXiv preprint arXiv:1609.03193, 2016.
[25] Nilesh Dalvi, Pedro Domingos, Sumit Sanghai, Deepak Verma, et al. Adversarial classification. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 99–108. ACM, 2004.
[26] Wenrui Diao, Xiangyu Liu, Zhe Zhou, and Kehuan Zhang. Your voice assistant is mine: How to abuse speakers to steal information and control your phone. In Proceedings of the 4th ACM Workshop on Security and Privacy in Smartphones & Mobile Devices, pages 63–74. ACM, 2014.
[27] Ivan Evtimo v , Kevin Eykholt, Earlence Fernandes, T adayoshi K ohno, Bo Li, Atul Prakash, Amir Rahmati, and Dawn Song. Robust physical-world attacks on deep learning models. Computer V ision and P attern Recognition , 2018. [28] Ian J Goodfellow , Jonathon Shlens, and Christian Szegedy . Ex- plaining and harnessing adversarial examples. International Con- fer ence on Learning Representations , 2015. [29] W arren He, James W ei, Xinyun Chen, Nicholas Carlini, and Dawn Song. Adversarial e xample defenses: Ensembles of weak defenses are not strong. USENIX W orkshop on Of fensive T ec hnologies , 2017. [30] Hynek Hermansky . Perceptual linear predictiv e (plp) analysis of speech. the Journal of the Acoustical Society of America , 87(4):1738–1752, 1990. [31] Geoffre y Hinton, Li Deng, Dong Y u, George E Dahl, Abdel- rahman Mohamed, Navdeep Jaitly , Andrew Senior , V incent V an- houcke, Patrick Nguyen, T ara N Sainath, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Pr ocessing Magazine , 29(6):82–97, 2012. [32] Geoffre y Hinton, Li Deng, Dong Y u, George E Dahl, Abdel- rahman Mohamed, Navdeep Jaitly , Andrew Senior , V incent V an- houcke, Patrick Nguyen, T ara N Sainath, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Pr ocessing Magazine , 29(6):82–97, 2012. [33] Geoffre y Hinton, Oriol Vin yals, and Jef f Dean. Distilling the knowledge in a neural network. NIPS Deep Learning W orkshop , 2014. [34] Fumitada Itakura. Line spectrum representation of linear predictor coefficients of speech signals. The Journal of the Acoustical Society of America , 57(S1):S35–S35, 1975. [35] Chaouki Kasmi and Jose Lopes Esteves. Iemi threats for informa- tion security: Remote command injection on modern smartphones. IEEE T ransactions on Electr omagnetic Compatibility , 57(6):1752– 1755, 2015. 
[36] Denis Foo Kune, John Backes, Shane S Clark, Daniel Kramer, Matthew Reynolds, Ke vin Fu, Y ongdae Kim, and W enyuan Xu. Ghost talk: Mitigating emi signal injection attacks against analog sensors. In Security and Privacy (SP), 2013 IEEE Symposium on , pages 145–159. IEEE, 2013. [37] Alexe y Kurakin, Ian Goodfello w , and Samy Bengio. Adversarial examples in the physical world. arXiv pr eprint arXiv:1607.02533 , 2016. [38] Pan Li, Qiang Liu, W entao Zhao, Dongxu W ang, and Siqi W ang. BEBP: an poisoning method against machine learning based idss. arXiv pr eprint arXiv:1803.03965 , 2018. [39] Aleksander Madry , Aleksandar Makelov , Ludwig Schmidt, Dim- itris Tsipras, and Adrian Vladu. T ow ards deep learning models resistant to adv ersarial attacks. International Conference on Learn- ing Repr esentations , 2018. [40] Lindasalwa Muda, Mumtaj Begam, and Irraiv an Elamvazuthi. V oice recognition algorithms using mel frequency cepstral coeffi- cient (mfcc) and dynamic time warping (dtw) techniques. Journal of Computing, V olume 2, Issue 3 , 2010. [41] Dibya Mukhopadhyay , Maliheh Shirvanian, and Nitesh Saxena. All your voices are belong to us: Stealing voices to fool humans and machines. In European Symposium on Resear ch in Computer Security , pages 599–621. Springer , 2015. [42] Douglas OShaughnessy . Automatic speech recognition: History , methods and challenges. P attern Recognition , 41(10):2965–2979, 2008. [43] Nicolas Papernot, Nicholas Carlini, Ian Goodfellow , Reuben Feinman, Fartash Faghri, Alexander Matyasko, Karen Ham- bardzumyan, Y i-Lin Juang, Alexey K urakin, Ryan Sheatsley , et al. clev erhans v2. 0.0: an adversarial machine learning library . arXiv pr eprint arXiv:1610.00768 , 2016. [44] Nicolas Papernot, Patrick McDaniel, and Ian Goodfellow . T rans- ferability in machine learning: from phenomena to black-box at- tacks using adv ersarial samples. arXiv pr eprint arXiv:1605.07277 , 2016. 
[45] Nicolas P apernot, Patrick McDaniel, Ian Goodfello w , Somesh Jha, Z Berkay Celik, and Ananthram Swami. Practical black-box attacks ag ainst machine learning. In Pr oceedings of the 2017 A CM on Asia Confer ence on Computer and Communications Security , pages 506–519. A CM, 2017. [46] W Pirie. Spearman rank correlation coef ficient. Encyclopedia of statistical sciences , 1988. [47] Kanishka Rao, Ha s ¸ im Sak, and Rohit Prabhav alkar . Exploring ar- chitectures, data and units for streaming end-to-end speech recog- nition with rnn-transducer . In Automatic Speech Recognition and Understanding W orkshop (ASR U), 2017 IEEE , pages 193–199. IEEE, 2017. [48] Stephanie A C Schuckers. Spoofing and anti-spoofing measures. Information Security technical r eport , 7(4):56–62, 2002. [49] Christian Sze gedy , W ojciech Zaremba, Ilya Sutskev er , Joan Bruna, Dumitru Erhan, Ian Goodfellow , and Rob Fergus. Intriguing properties of neural networks. arXiv preprint , 2013. [50] Roberto T ronci, Daniele Muntoni, Gianluca Fadda, Maurizio Pili, Nicola Sirena, Gabriele Murgia, Marco Ristori, Sardegna Ricerche, and Fabio Roli. Fusion of multiple clues for photo-attack detec- tion in face recognition systems. In Biometrics (IJCB), 2011 International Joint Confer ence on , pages 1–6. IEEE, 2011. [51] T avish V aidya, Y uankai Zhang, Micah Sherr , and Clay Shields. Cocaine noodles: exploiting the gap between human and machine speech recognition. WOO T , 15:10–11, 2015. [52] W ayne Xiong, Jasha Droppo, Xuedong Huang, Frank Seide, Mike Seltzer , Andreas Stolcke, Dong Y u, and Geoffrey Zweig. The microsoft 2016 conversational speech recognition system. In Acoustics, Speech and Signal Pr ocessing (ICASSP), 2017 IEEE International Confer ence on , pages 5255–5259. IEEE, 2017. [53] Guoming Zhang, Chen Y an, Xiaoyu Ji, Tianchen Zhang, T aimin Zhang, and W enyuan Xu. Dolphinattack: Inaudible voice com- mands. 
Appendix

Table 7: The detailed results of individual songs in the human-comprehension survey for WTA samples. When we checked the survey results from MTurk, we found that the MTurk workers' average familiarity with our songs was not as high as we had expected, so streaming counts from Spotify are also listed in the table to show the popularity of our sample songs. The song *Selling Brick in Street* is not in the Spotify database, so we cannot provide its count.

| Music Classification | Song Name | Spotify Streaming Count | Listened (%) | Abnormal (%) | Recognize Command (%) |
|---|---|---|---|---|---|
| Soft Music | Heart and Soul | 13,749,471 | 15 | 8 | 0 |
| Soft Music | Castle in the Sky | 2,332,348 | 9 | 6 | 0 |
| Soft Music | A Comme Amour | 1,878,899 | 14 | 18 | 0 |
| Soft Music | Mariage D'amour | 337,486 | 17 | 33 | 0 |
| Soft Music | Lotus | 49,443,256 | 11 | 12 | 0 |
| Soft Music | Average | 13,548,292 | 13 | 15 | 0 |
| Rock | Bang Bang | 532,057,658 | 52 | 24 | 0 |
| Rock | Soaked | 29,734 | 13 | 32 | 0 |
| Rock | Gold | 11,614,629 | 14 | 41 | 0 |
| Rock | We Are Never Getting Back Together | 113,806,946 | 66 | 38 | 0 |
| Rock | When Can I See You Again | 26,463,993 | 20 | 9 | 0 |
| Rock | Average | 136,794,562 | 33 | 28 | 0 |
| Popular | Love Story | 109,952,344 | 49 | 24 | 0 |
| Popular | Hello Seattle | 9,850,328 | 29 | 16 | 0 |
| Popular | Good Time | 125,125,693 | 48 | 32 | 0 |
| Popular | To the Sky | 4,860,627 | 27 | 30 | 0 |
| Popular | A Loaded Smile | 658,814 | 8 | 26 | 0 |
| Popular | Average | 50,089,561 | 32 | 26 | 0 |
| Rap | Rap God | 349,754,768 | 43 | 32 | 0 |
| Rap | Let Me Hold You | 311,569,726 | 31 | 15 | 0 |
| Rap | Lose Yourself | 483,937,007 | 75 | 14 | 0 |
| Rap | Remember the Name | 193,564,886 | 48 | 32 | 0 |
| Rap | Selling Brick in Street | N/A | 6 | 24 | 0 |
| Rap | Average | 334,706,597 | 41 | 23 | 0 |

Table 8: Detailed information on some sample applications that use iFLYTEK for voice input, including the number of downloads from Google Play and the total number of users. Since Google services are not accessible in China and Apple App Store information was not collected, the number of users may not correspond to the number of Google Play downloads. As shown in the table, each of these applications has over 0.2 billion users worldwide.

| Application | Usage | Downloads from Google Play | Total Users Worldwide (Billion) |
|---|---|---|---|
| Sina Weibo | Social platform | 11,000,000 | 0.53 |
| JD | Online shopping | 1,000,000 | 0.27 |
| CMbrowser | Search engine | 50,000,000 | 0.64 |
| Ctrip | Travel advice website | 1,000,000 | 0.30 |
| Migu Digital | Voice assistant | 5,000 | 0.46 |
| WeChat | Chatting, social | 100,000,000 | 0.96 |
| iFLYTEK Input | Typing, voice input | 500,000 | 0.5 |
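As a sanity check on the appendix tables, the reported figures can be recomputed from the listed per-row values. The sketch below (values transcribed from the tables above; the rounding choices are our assumption, not stated in the paper) verifies the "Soft Music" averages in Table 7 and the over-0.2-billion-users claim for Table 8:

```python
from statistics import mean

# Table 7, "Soft Music" category: per-song Spotify streaming counts.
soft_music_counts = [13_749_471, 2_332_348, 1_878_899, 337_486, 49_443_256]
print(round(mean(soft_music_counts)))  # 13548292, matching the Average row

# "Listened" and "Abnormal" percentages for the same five songs,
# rounded to the nearest whole percent.
listened = [15, 9, 14, 17, 11]
abnormal = [8, 6, 18, 33, 12]
print(round(mean(listened)), round(mean(abnormal)))  # 13 15

# Table 8: every listed application should exceed 0.2 billion users worldwide.
total_users_billion = [0.53, 0.27, 0.64, 0.30, 0.46, 0.96, 0.5]
print(all(u > 0.2 for u in total_users_billion))  # True
```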