Robust Audio Adversarial Example for a Physical Attack


Authors: Hiromu Yakura, Jun Sakuma

Hiromu Yakura¹,² and Jun Sakuma¹,²
¹University of Tsukuba  ²RIKEN Center for Advanced Intelligence Project
hiromu@mdl.cs.tsukuba.ac.jp, jun@cs.tsukuba.ac.jp

Abstract

We propose a method to generate audio adversarial examples that can attack a state-of-the-art speech recognition model in the physical world. Previous work assumes that the generated adversarial examples are fed directly to the recognition model and therefore cannot perform such a physical attack because of reverberation and noise from playback environments. In contrast, our method obtains robust adversarial examples by simulating the transformations caused by playback or recording in the physical world and incorporating these transformations into the generation process. An evaluation and a listening experiment demonstrated that our adversarial examples are able to attack without being noticed by humans. This result suggests that audio adversarial examples generated by the proposed method may become a real threat.

1 Introduction

In recent years, deep learning has achieved vastly improved accuracy, especially in fields such as image classification and speech recognition, and has come into practical use [LeCun et al., 2015]. On the other hand, deep learning methods are known to be vulnerable to adversarial examples [Szegedy et al., 2014; Goodfellow et al., 2015]. More specifically, an attacker can make deep learning models misclassify examples by intentionally adding a small perturbation to them. Such examples are referred to as adversarial examples.
While many papers have discussed image adversarial examples against image classification models, little research has been done on audio adversarial examples against speech recognition models, even though such models are widely used in commercial applications like Amazon Alexa, Apple Siri, Google Assistant, and Microsoft Cortana, and in devices like Amazon Echo and Google Home. For example, [Carlini and Wagner, 2018] proposed a method to generate audio adversarial examples against DeepSpeech [Hannun et al., 2014], a state-of-the-art speech recognition model. However, this method targets the case in which the waveform of the adversarial example is input directly to the model, as shown in Figure 1(A). In other words, the attack is not feasible when the adversarial example is played by a speaker and recorded by a microphone in the physical world (hereinafter called the over-the-air condition), as shown in Figure 1(B).

The difficulty of such an over-the-air attack can be attributed to the reverberation of the environment and to noise from both the speaker and the microphone. More specifically, in the case of direct input, adversarial examples can be generated by determining a single data point that fools the targeted model, using an optimization algorithm with a clearly described objective. In contrast, under the over-the-air condition, adversarial examples must be robust against unknown environments and equipment.

Considering that audio signals spread through the air, the impact of a physical attack using audio adversarial examples would be larger than that of one using image adversarial examples.
In an attack scenario using an image adversarial example, the adversarial example must be presented explicitly in front of an image sensor of the attack target, e.g., the camera of a self-driving car. In contrast, audio adversarial examples can simultaneously attack numerous targets by spreading via outdoor speakers or radios. If an attacker hijacks the broadcast equipment of a business complex, it becomes possible to attack all the smartphones owned by people inside with a single playback of the audio adversarial example.

In the present paper, we propose a method to generate a robust audio adversarial example that can attack speech recognition models in the physical world. To the best of our knowledge, this is the first approach that succeeds in generating adversarial examples that can attack complex speech recognition models based on recurrent networks, such as DeepSpeech, over the air. Moreover, we believe that our research will contribute to improving the robustness of speech recognition models by training models to discriminate adversarial examples, through a process similar to adversarial training in the image domain [Goodfellow et al., 2015].

Figure 1: Illustration of the proposed attack. [Carlini and Wagner, 2018] assumed that adversarial examples are provided directly to the recognition model. We propose a method that targets an over-the-air condition, which leads to a real threat.

1.1 Related Research

Some studies have proposed methods to generate audio adversarial examples against speech recognition models [Alzantot et al., 2018; Taori et al., 2018; Cissé et al., 2017; Schönherr et al., 2018; Carlini and Wagner, 2018]. These methods are divided into two groups: black-box and white-box settings.
In the black-box setting, in which the attacker can only use a score that represents how close the input audio is to the desired phrase, [Alzantot et al., 2018] proposed a method to attack a speech command classification model [Sainath and Parada, 2015]. This method exploits a genetic algorithm to find an adversarial example that is recognized as a specified command word. Inspired by this method, [Taori et al., 2018] proposed a method to attack DeepSpeech [Hannun et al., 2014] under the black-box setting by combining genetic algorithms and gradient estimation. One limitation of their method is that the length of the phrase the attacker can make the model recognize is restricted to two words at most, even when the obtained adversarial example is input directly. [Cissé et al., 2017] performed an attack on the Google Voice application using adversarial examples generated against DeepSpeech-2 [Amodei et al., 2016]. The aim of their attack was to change recognition results to different words without being noticed by humans. In other words, they could not make the targeted model output desired words, and they concluded that attacking speech recognition models so as to transcribe specified words "seem(s) to be much more challenging." For these reasons, current methods in the black-box setting are not realistic for the attack scenario in the physical world.

In the white-box setting, in which the attacker can access the parameters of the targeted models, [Yuan et al., 2018] proposed a method to attack Kaldi [Povey et al., 2011], a conventional speech recognition model based on a combination of a deep neural network and a hidden Markov model. [Schönherr et al., 2018] extended the method so that the generated adversarial examples are not noticed by humans, using a hiding technique based on psychoacoustics.
Although [Yuan et al., 2018] succeeded in attacking over the air, their method is not applicable to speech recognition models based on recurrent networks, which are becoming more popular and highly functional. For example, Google replaced its conventional model with a recurrent network based model in 2012¹.

In that respect, [Carlini and Wagner, 2018] proposed a white-box method to attack DeepSpeech, a recurrent network based model. However, as mentioned previously, this method succeeds in the case of direct input, but not under the over-the-air condition, because of the reverberation of the environment and noise from both the speaker and the microphone. Thus, the threat of the obtained adversarial example is limited with regard to the attack scenario in the physical world.

¹ https://ai.googleblog.com/2015/08/the-neural-networks-behind-google-voice.html

1.2 Contribution

The contribution of the present paper is two-fold:

• We propose a method to generate audio adversarial examples that can attack speech recognition models based on recurrent networks under the over-the-air condition. Note that such a practical attack is not achievable using the conventional methods described in Section 1.1. We addressed the problem of reverberation and noise in the physical world by simulating them and incorporating the simulated influence into the generation process.

• We show the feasibility of the practical attack using the adversarial examples generated by the proposed method in an evaluation and a listening experiment. Specifically, the generated adversarial examples demonstrated a success rate of 100% for the attack through both speakers and radio broadcasting, although no participants heard the target phrase in the listening experiment.

2 Background

In this section, we briefly introduce adversarial examples and review current speech recognition models.
2.1 Adversarial Example

An adversarial example is defined as follows. Given a trained classification model f : R^n → {1, 2, ..., k} and an input sample x ∈ R^n, an attacker wishes to modify x so that the model recognizes the sample as having a specified label l ∈ {1, 2, ..., k} while the modification does not change the sample significantly:

    find x̃ ∈ R^n  s.t.  f(x̃) = l  ∧  ‖x − x̃‖ ≤ δ    (1)

Here, δ is a parameter that limits the magnitude of the perturbation added to the input sample and is introduced so that humans cannot notice the difference between a legitimate input sample and one modified by an attacker.

Let v = x̃ − x be the perturbation. Then, adversarial examples that satisfy Equation 1 can be found by solving the following optimization problem, in which Loss_f is a loss function that represents how distant the input data are from the given label under the model f:

    argmin_v  Loss_f(x + v, l) + ε‖v‖    (2)

By solving this problem with optimization algorithms, the attacker can obtain an adversarial example. In particular, when f is a differentiable model, such as a regular neural network, and a gradient on v can be calculated, a gradient method such as Adam [Kingma and Ba, 2015] is often used.

2.2 Image Adversarial Example for a Physical Attack

Considering attacks on physical recognition devices (e.g., object recognition in self-driving cars), adversarial examples are given to the model through sensors. In the example of the self-driving car, image adversarial examples are given to the model after being printed on physical materials and photographed by a car-mounted camera. Through such a process, the adversarial examples are transformed and exposed to noise. However, adversarial examples generated by Equation 2 are assumed to be given directly to the model and do not work in such scenarios.
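As a toy illustration of the optimization in Equation 2 (before any physical-world transformations are considered), the sketch below attacks a small linear softmax classifier with plain gradient descent. The model, weights, step size, and penalty weight are illustrative assumptions, not the models or hyperparameters used in the paper.

```python
import numpy as np

# Minimal sketch of Equation 2: find a perturbation v that makes a fixed
# (toy) linear classifier output the target label while keeping ||v|| small.
rng = np.random.default_rng(0)
W = rng.normal(size=(2, 4))          # toy 2-class model: f(x) = argmax(W @ x)
x = rng.normal(size=4)               # legitimate input sample
target = int(np.argmin(W @ x))       # attack target: the currently losing class

v = np.zeros(4)
eps = 0.1                            # weight of the ||v|| penalty
for _ in range(500):
    logits = W @ (x + v)
    p = np.exp(logits - logits.max())
    p /= p.sum()                     # softmax probabilities
    # gradient of cross-entropy loss w.r.t. v, plus the norm-penalty gradient
    grad = W.T @ (p - np.eye(2)[target]) + eps * v / (np.linalg.norm(v) + 1e-8)
    v -= 0.05 * grad

# the perturbed input is now classified as the target label
assert int(np.argmax(W @ (x + v))) == target
```

For a differentiable model like this one, the same loop shape carries over to neural networks; one simply swaps in the network's loss and an optimizer such as Adam.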
In order to address this problem, [Athalye et al., 2018] proposed a method to simulate the transformations caused by printing or taking a picture and to incorporate them into the generation process of image adversarial examples. This method can be represented as follows using a set of transformations T consisting of, e.g., enlargement, reduction, rotation, change in brightness, and addition of noise:

    argmin_v  E_{t∼T} [ Loss_f(t(x + v), l) + ε‖t(x) − t(x + v)‖ ]    (3)

As a result, adversarial examples are generated such that the images work even after being printed and photographed.

2.3 Audio Adversarial Example

As explained in Section 1.1, [Carlini and Wagner, 2018] succeeded in attacking DeepSpeech, a recurrent network based model. Here, the targeted model has time dependency, and the same approach as for image adversarial examples is not applicable. Thus, based on the fact that the targeted model uses the Mel-Frequency Cepstrum Coefficient (MFCC) for feature extraction, they implemented the MFCC calculation in a differentiable manner and optimized the entire waveform using Adam [Kingma and Ba, 2015]. In detail, the perturbation v is obtained for the input sample x and the target phrase l using the loss function of DeepSpeech as follows:

    argmin_v  Loss_f(MFCC(x + v), l) + ε‖v‖    (4)

Here, MFCC(x + v) represents the MFCC extraction from the waveform of x + v. They reported a success rate of 100% for the obtained adversarial examples when inputting the waveforms directly into the recognition model, but did not succeed at all under the over-the-air condition.

To the best of our knowledge, there has been no proposal to generate audio adversarial examples that work under the over-the-air condition while targeting speech recognition models that use a recurrent network.
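The expectation over transformations in Equation 3 can be sketched with the same toy classifier: each gradient step samples a random transformation t and optimizes the loss of the transformed input. The random-gain-plus-noise transformation is a stand-in assumption for the printing/photographing pipeline, the ε-penalty term of Equation 3 is omitted for brevity, and the gradient is taken at the transformed point (the positive chain-rule gain factor is dropped for simplicity).

```python
import numpy as np

# Sketch of Equation 3: one-sample estimates of the expected gradient
# over random transformations t ~ T, using a toy linear classifier.
rng = np.random.default_rng(1)
W = rng.normal(size=(2, 8))
x = rng.normal(size=8)
target = int(np.argmin(W @ x))

def transform(z):
    """Toy t ~ T: random gain plus additive noise."""
    return rng.uniform(0.8, 1.2) * z + 0.05 * rng.normal(size=z.shape)

v = np.zeros(8)
for _ in range(800):
    t_x = transform(x + v)                      # transformed adversarial input
    logits = W @ t_x
    p = np.exp(logits - logits.max())
    p /= p.sum()
    # one-sample estimate of the expected cross-entropy gradient
    v -= 0.05 * (W.T @ (p - np.eye(2)[target]))

# robustness check: the attack should survive fresh random transformations
hits = sum(int(np.argmax(W @ transform(x + v))) == target for _ in range(100))
```

Averaging gradients over sampled transformations is exactly what makes the resulting example robust to the whole family T rather than to a single fixed input.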
3 Proposed Method

In this research, we propose a method to generate a robust adversarial example that can attack DeepSpeech [Hannun et al., 2014] under the over-the-air condition. The basic idea is to incorporate the transformations caused by playback and recording into the generation process, similar to [Athalye et al., 2018]. We introduce three techniques: a band-pass filter, impulse response, and white Gaussian noise.

3.1 Band-pass Filter

Since the audible range of humans is 20 to 20,000 Hz, normal speakers are not made to play sounds outside this range. Moreover, microphones are often made to automatically cut out all but the audible range in order to reduce noise. Therefore, if the obtained perturbation lies outside the audible range, it will be cut during playback and recording and will not function as an adversarial example.

We therefore introduced a band-pass filter in order to explicitly limit the frequency range of the perturbation. Based on empirical observations, we set the band to 1,000 to 4,000 Hz, which exhibited less distortion. Here, the generation process is represented as follows based on Equation 4:

    argmin_v  Loss_f(MFCC(x̃), l) + ε‖v‖
    where x̃ = x + BPF_{1000∼4000Hz}(v)    (5)

In this way, it is expected that the generated adversarial examples will acquire robustness such that they function even when frequency bands outside the audible range are cut by a speaker or a microphone.

3.2 Impulse Response

An impulse response is the reaction obtained when a system is presented with a brief input signal, called an impulse.
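The band-pass constraint BPF_{1000∼4000Hz} in Equation 5 can be sketched as a simple spectral mask: zero out all FFT bins outside 1,000 to 4,000 Hz. This hard mask is an assumption standing in for whatever filter implementation the authors used.

```python
import numpy as np

# Sketch of BPF_{1000~4000Hz}: keep only spectral content inside the band.
def band_pass(v, sample_rate=16000, low=1000, high=4000):
    spectrum = np.fft.rfft(v)
    freqs = np.fft.rfftfreq(len(v), d=1.0 / sample_rate)
    spectrum[(freqs < low) | (freqs > high)] = 0.0   # hard spectral mask
    return np.fft.irfft(spectrum, n=len(v))

# example: a 500 Hz tone (outside the band) is removed,
# while a 2,000 Hz tone (inside the band) passes through unchanged
t = np.arange(16000) / 16000.0
blocked = band_pass(np.sin(2 * np.pi * 500 * t))
passed = band_pass(np.sin(2 * np.pi * 2000 * t))
```

Because only the perturbation v is filtered (the input sample x is left untouched), the optimizer can never hide adversarial energy in frequency bands that a speaker or microphone would discard.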
Based on the fact that impulse responses can reproduce the reverberation of the captured environment by convolution, a method of using impulse responses from various environments in the training of a speech recognition model to enhance its robustness to reverberation has been proposed [Peddinti et al., 2015]. Similarly, we introduced impulse responses into the generation process in order to make the obtained adversarial example robust to reverberation.

In addition, considering the scenario of attacking numerous devices at once via outdoor speakers or radios, we want the obtained adversarial example to work in various environments. Therefore, in the same manner as [Athalye et al., 2018], we take an expectation over impulse responses recorded in diverse environments. Here, Equation 5 is extended like Equation 3, where H is the set of collected impulse responses and Conv_h is the convolution using impulse response h:

    argmin_v  E_{h∼H} [ Loss_f(MFCC(x̃), l) + ε‖v‖ ]
    where x̃ = Conv_h( x + BPF_{1000∼4000Hz}(v) )    (6)

In this way, it is expected that the generated adversarial examples will acquire robustness such that they are not affected by the reverberation produced in the environment in which they are played and recorded.

3.3 White Gaussian Noise

White Gaussian noise is given by N(0, σ²) and is used to emulate the effect of the many random processes that occur in nature. For example, it is used in the evaluation of speech recognition models to measure their robustness against background noise [Hansen and Pellom, 1998]. Consequently, we introduce white Gaussian noise into the generation process in order to make the obtained adversarial example robust to background noise.
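The reverberation operator Conv_h of Equation 6 is plain convolution with a room impulse response. The two-tap echo below is a toy impulse response for illustration; the paper uses 615 responses recorded in real rooms.

```python
import numpy as np

# Sketch of Conv_h: reverberation is simulated by convolving the signal
# with an impulse response h, truncated back to the original length.
def conv_h(signal, h):
    return np.convolve(signal, h)[: len(signal)]

h = np.zeros(800)
h[0] = 1.0           # direct path
h[799] = 0.5         # a single echo, ~50 ms later at 16 kHz

x = np.zeros(1600)
x[0] = 1.0           # unit impulse as the test signal
reverbed = conv_h(x, h)   # the output reproduces h itself: direct path + echo
```

Feeding a unit impulse through the operator recovers the impulse response itself, which is exactly why a recorded impulse response suffices to reproduce a room's reverberation by convolution.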
Here, Equation 6 is extended as follows:

    argmin_v  E_{h∼H, w∼N(0,σ²)} [ Loss_f(MFCC(x̃), l) + ε‖v‖ ]
    where x̃ = Conv_h( x + BPF_{1000∼4000Hz}(v) ) + w    (7)

In this way, it is expected that the generated adversarial examples will acquire robustness such that they are not affected by noise caused by the recording equipment and the environment. Note that white Gaussian noise should also be added before the convolution in order to emulate the thermal noise caused in both the playback and recording devices. However, we added the noise only after the convolution because doing so makes the optimization easier, and Equation 7 was sufficiently robust in our empirical observations.

4 Evaluation

In order to confirm the effectiveness of the proposed method, we conducted evaluation experiments. We played and recorded audio adversarial examples generated by the proposed method and verified whether these adversarial examples are recognized as the target phrases.

4.1 Implementation

We implemented Equation 7 using TensorFlow². Since calculating the expected value of the loss is difficult, we instead evaluated the sample approximation of Equation 7 with respect to a fixed number of impulse responses sampled randomly from H. For optimization, we used Adam [Kingma and Ba, 2015] in the same manner as [Carlini and Wagner, 2018].

² Our full implementation is available at https://github.com/hiromu/robust_audio_ae

Figure 2: Two attack situations of the evaluation: speaker and radio. In the first situation, the adversarial examples were played and recorded by a speaker and a microphone. In the second situation, the adversarial examples were broadcasted using an FM radio.

4.2 Settings

For the input sample x, we prepared two different audio clips of four seconds cut from Cello Suite No. 1 by Bach and To The Sky by Owl City.
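The simulated playback chain inside Equation 7, evaluated by sample approximation as described in Section 4.1, can be sketched as follows. The recognizer (DeepSpeech with its differentiable MFCC front end) is not reproduced here, and the band-pass mask, toy impulse responses, and noise level σ are illustrative assumptions.

```python
import numpy as np

# Sketch of the forward simulation in Equation 7: a candidate perturbation v
# is band-passed, convolved with a randomly sampled impulse response h ~ H,
# and corrupted with white Gaussian noise w ~ N(0, sigma^2), before being
# scored by the recognizer's loss (omitted here).
rng = np.random.default_rng(2)

def band_pass(v, sample_rate=16000, low=1000, high=4000):
    spectrum = np.fft.rfft(v)
    freqs = np.fft.rfftfreq(len(v), d=1.0 / sample_rate)
    spectrum[(freqs < low) | (freqs > high)] = 0.0
    return np.fft.irfft(spectrum, n=len(v))

def simulate_playback(x, v, impulse_responses, sigma=0.01):
    h = impulse_responses[rng.integers(len(impulse_responses))]  # h ~ H
    clean = x + band_pass(v)                                     # Equation 5
    reverbed = np.convolve(clean, h)[: len(clean)]               # Conv_h
    return reverbed + sigma * rng.normal(size=clean.shape)       # + w

x = rng.normal(size=16000)            # stand-in for the 4-second input clip
v = 0.1 * rng.normal(size=16000)      # candidate perturbation
H = [np.r_[1.0, np.zeros(99)],                       # anechoic direct path
     np.r_[1.0, np.zeros(50), 0.4, np.zeros(48)]]    # toy echo
simulated = simulate_playback(x, v, H)
```

Each optimizer step would score one (or a batch of) such simulated recordings with the recognizer's loss, which is the sample approximation of the expectation in Equation 7.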
The first clip is the same as the publicly released samples³ of [Carlini and Wagner, 2018]. The second clip is the same as the publicly released samples⁴ of [Yuan et al., 2018]. The difference between the clips is that the first is an instrumental piece and does not include singing voices, whereas singing voices are included in the second song, by Owl City.

For the target phrase l, we prepared three different cases: "hello world," "open the door⁵," and "ok google⁶." Considering that [Carlini and Wagner, 2018] tested their method with 1,000 phrases randomly chosen from a speech dataset, three phrases may appear insufficient to evaluate the efficiency of our attack. However, unlike the direct attack performed by [Carlini and Wagner, 2018], our evaluation involves a number of playback cycles in the physical world. This means that our experimental evaluation in the over-the-air setting requires actual time for playing back the generated audio adversarial examples. For example, evaluating a single combination of input sample and target phrase requires more than 18 hours in a quiet room without interruption, because it involves playing 500 intermediate examples 10 times each with an interval of several seconds. For this reason, we focused on these three phrases, considering the attack scenarios.

For the set of impulse responses H, we collected 615 impulse responses from various databases [Kinoshita et al., 2013; Nakamura et al., 2000; Jeub et al., 2009; Wen et al., 2006; Härmä, 2001], which are constructed primarily for research on dereverberation.

For the playback and recording, we prepared two different attack situations, as shown in Figure 2, in order to confirm that the attack using the generated adversarial examples is applicable via a wide range of offensive means.
First, we played and recorded the adversarial examples using a speaker and a microphone (JBL CLIP2 / Sony ECM-PCV80U) in a meeting room at a distance of approximately 0.5 meters. We also examined whether the generated adversarial examples could attack through the radio using HackRF One⁷, a Software Defined Radio (SDR) capable of transmitting and receiving radio signals. We broadcasted at 180.0 MHz FM and received the signal with a portable radio (Sony ICF-P36) in the same room, while the playback was recorded by a microphone (Sony ECM-PCV80U). In both cases, we played each adversarial example 10 times to suppress random fluctuation in the physical world and evaluated the recognition results obtained by DeepSpeech.

³ https://nicholas.carlini.com/code/audio_adversarial_examples/
⁴ https://sites.google.com/view/commandersong/
⁵ This phrase is used in [Yuan et al., 2018] to discuss an attack scenario using voice commands.
⁶ This phrase is used as a trigger word of Google Home.

    Input sample | Target phrase  | SNR
    (A) Bach     | hello world    |  9.3 dB
    (B) Bach     | open the door  |  5.3 dB
    (C) Bach     | ok google      |  0.2 dB
    (D) Owl City | hello world    | 11.8 dB
    (E) Owl City | open the door  | 13.4 dB
    (F) Owl City | ok google      |  2.6 dB

Table 1: Details of the generated audio adversarial examples that showed 100% success by both the speaker and the radio, with the maximum value of SNR⁸.

4.3 Metrics

As evaluation metrics for the obtained adversarial examples, we used the signal-to-noise ratio (SNR) of the perturbation, the success rate of the attack, and the edit distance of the recognition results. The SNR is given by 10 log₁₀(P_x / P_v), where P_x = (1/T) Σ_{t=1}^{T} x_t² is the power of the input sample and P_v = (1/T) Σ_{t=1}^{T} v_t² is the power of the perturbation. In other words, a larger SNR corresponds to a smaller perturbation and a smaller likelihood that a human will notice it.
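The SNR definition above can be written directly in a few lines; the 16 kHz sine-wave example is an illustrative assumption.

```python
import numpy as np

# SNR of the perturbation as defined in Section 4.3:
# 10 * log10(P_x / P_v), with P the mean squared amplitude.
def snr_db(x, v):
    p_x = np.mean(x ** 2)      # power of the input sample
    p_v = np.mean(v ** 2)      # power of the perturbation
    return 10.0 * np.log10(p_x / p_v)

# example: a perturbation at one tenth of the signal's amplitude has one
# hundredth of its power, i.e., an SNR of 20 dB
x = np.sin(np.linspace(0.0, 100.0, 16000))
v = 0.1 * x
```

Here `snr_db(x, v)` evaluates to 20 dB, which matches the rule of thumb that every factor of 10 in amplitude is 20 dB.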
The success rate of the attack is the ratio of the number of trials in which DeepSpeech transcribed the recorded adversarial example as the target phrase to the total number of trials. The success rate becomes non-zero only when DeepSpeech transcribes the adversarial example as the target phrase perfectly. Thus, we also introduced the edit distance between the recognition results and the target phrase to confirm the progress of the generation process. The edit distance reveals the progress more precisely, even when the success rate is 0%. Here, the edit distance is defined as the minimum number of operations required to transform one string into the other by inserting, deleting, and replacing one character at a time.

4.4 Results

The progress of the generation process is presented in Figure 3. The figure shows that, as the generation progresses, the SNR decreases and the edit distance of the recognition results to the target phrase also decreases. The detailed results of the generated adversarial examples that showed certain levels of success rate are presented in Table 1.

⁷ https://greatscottgadgets.com/hackrf/
⁸ These audio files are available at https://yumetaro.info/projects/audio-ae/

    Input sample | Target phrase  | SNR     | Speaker (success / edit dist.) | Radio (success / edit dist.)
    (G) Bach     | hello world    | 11.9 dB | 60% / 1.1                      | 50% / 1.3
    (H) Bach     | open the door  |  6.6 dB | 60% / 1.8                      | 60% / 1.8
    (I) Bach     | ok google      |  4.2 dB | 80% / 0.6                      | 70% / 0.9
    (J) Owl City | hello world    | 12.2 dB | 70% / 0.9                      | 50% / 1.5
    (K) Owl City | open the door  | 14.6 dB | 90% / 0.2                      | 100% / 0.0
    (L) Owl City | ok google      |  8.7 dB | 90% / 0.6                      | 70% / 0.9

Table 2: Details of the generated audio adversarial examples that showed at least 50% success by both the speaker and the radio, with the maximum value of SNR⁸.
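The edit distance used in Section 4.3 and reported in the tables above is the standard Levenshtein distance; a minimal dynamic-programming sketch:

```python
# Edit (Levenshtein) distance: the minimum number of single-character
# insertions, deletions, and substitutions needed to turn a into b.
def edit_distance(a, b):
    prev = list(range(len(b) + 1))        # distances for the empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]                         # deleting all i characters of a[:i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # delete from a
                           cur[j - 1] + 1,               # insert into a
                           prev[j - 1] + (ca != cb)))    # substitute (or match)
        prev = cur
    return prev[-1]
```

For example, `edit_distance("hello world", "hello word")` is 1 (one deleted character), so the fractional values in Table 2 are averages of such per-trial distances over the 10 playbacks.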
As shown in Table 1, for all combinations of input sample and target phrase, the proposed method generated adversarial examples that showed 100% success by both the speaker and the radio. On the other hand, the magnitude of the perturbation required to achieve 100% success differed depending on the input sample and the target phrase. For the previous method [Yuan et al., 2018], which targeted Kaldi [Povey et al., 2011] under the over-the-air condition, an SNR of less than 2.0 dB was reported in all cases. In other words, considering that (D) through (F) of Table 1 use the same input sample, the proposed method is able to generate an adversarial example with less perturbation while targeting a more complex speech recognition model.

Table 2 shows the adversarial examples having a success rate of at least 50% by both the speaker and the radio with the maximum value of SNR. We found that much less perturbation was required to achieve a success rate of 50%, as compared to Table 1, in all cases. In other words, an attack using these adversarial examples will succeed once in two attempts and can be a major threat when the attacker tolerates uncertainty in the attack.

In all cases in Table 1 and Table 2, the adversarial examples generated with the song by Owl City have a larger SNR than those generated with Bach's Cello Suite No. 1. This result supports the discussion of [Yuan et al., 2018], whereby some phonemes from singing voices in the input sample work together with the injected small perturbations to form the target phrases. Considering such an effect, we can conclude that the song by Owl City has more phonemes that help form the target phrases than Bach's instrumental piece, and thus requires less perturbation.
We also found that the recognition results of the adversarial examples in Table 2 changed only slightly between the speaker and radio cases. This result suggests that the proposed method makes the generated adversarial examples robust to FM transmission as well. For example, the addition of white Gaussian noise in the proposed method would also counteract the noise caused by FM transmission. Moreover, as mentioned in Section 1, one of the major concerns regarding an audio adversarial example is that it can attack numerous targets simultaneously. In this respect, the success of the attack through the radio might have a significant impact, because such an attack can be made without making victims actively play an adversarial example on their own.

Figure 3: Progress of the generation process of the adversarial examples. As the generation progresses, the SNR and the edit distance in the speaker situation decrease. The detailed results of the highlighted adversarial examples are shown in Table 1 and Table 2.

    Band-pass filter | Impulse response | Gaussian noise | Bach    | Owl City
    –                | –                | –              | –       | –
    ✓                | –                | –              | –       | –
    –                | ✓                | –              | –       | –
    –                | –                | ✓              | –       | –
    ✓                | ✓                | –              | –       | –
    ✓                | –                | ✓              | −4.2 dB | −3.8 dB
    –                | ✓                | ✓              | –       | –
    ✓                | ✓                | ✓              | 9.3 dB⁹ | 11.8 dB⁹

Table 3: Results of switching the presence of each technique. Only the combination of the band-pass filter and white Gaussian noise succeeded in generating an example, though it requires much more perturbation than the case of Table 1.

4.5 Effect of Each Technique

We then investigated the individual effect of the three techniques on the success of the proposed method. In detail, we evaluated the effect of the three techniques by changing their combinations during generation. Once we obtained an adversarial example that was recognized as the target phrase through the speaker in an environment similar to that of Section 4.2, we compared its SNR with Table 1.
Here, we used "hello world" as the target phrase because Table 1 suggests that it is relatively easy to generate.

The results are shown in Table 3. We note that generation without any of the proposed techniques is equivalent to [Carlini and Wagner, 2018]. All cases except the combination of the band-pass filter and white Gaussian noise could not generate adversarial examples that can attack under the over-the-air condition. In addition, the successful combination requires much more perturbation than the case of using all three techniques.

From these results, it is suggested that we can generate adversarial examples that work over the air without the help of impulse responses by using white Gaussian noise, whereas the band-pass filter seems to be essential considering the limitations of physical devices. At the same time, the impulse responses are considered to make adversarial examples robust specifically to reverberation, which results in the reduction of the perturbation, as discussed in Section 3.2.

⁹ These values are from Table 1.

    ok google     | turn off
    open the door | happy birthday
    good night    | call john
    hello world   | airplane mode on

Table 4: List of choices presented to participants in the listening experiments. We chose simple phrases of lengths similar to those of "hello world" or "open the door," concentrating on phrases that are used as voice commands.

5 Listening Experiment

In considering an attack scenario using the generated adversarial examples, whether humans can notice them is important. If an attacker can make intended phrases be recognized without being noticed by humans, then an attack exploiting speech recognition devices will be possible.

For example, [Yuan et al., 2018] conducted listening experiments using Amazon Mechanical Turk (AMT) in the proposal of their attack on Kaldi.
As a result, they reported that only 2.2% of the participants realized that the lyrics had changed from the original songs used as input samples, whereas approximately 65% noticed abnormal noises in the generated adversarial examples. We similarly conducted listening experiments using AMT in order to confirm whether humans notice the attack.

        | Heard anything | Heard a target | With the choices listed in Table 4
        | abnormal       | phrase         | Correct | Incorrect | Not sure
    (A) | 36.0%          | 0.0%           | 4.0%    | 28.0%     | 68.0%
    (B) | 56.0%          | 0.0%           | 4.0%    | 32.0%     | 64.0%
    (C) | 48.0%          | 0.0%           | 4.0%    | 24.0%     | 72.0%
    (D) | 32.0%          | 0.0%           | 4.0%    | 28.0%     | 68.0%
    (E) | 44.0%          | 0.0%           | 8.0%    | 16.0%     | 76.0%
    (F) | 48.0%          | 0.0%           | 0.0%    | 32.0%     | 68.0%

Table 5: Results of the listening experiments for the adversarial examples of Table 1. Although a certain number of participants felt something abnormal, most participants could not hear the target phrases, even when presented with choices.

5.1 Settings

We used the six generated adversarial examples (A) through (F) of Table 1, which were recognized as the target phrases with a success rate of 100%. We conducted an online survey separately for each adversarial example with 25 participants. They listened to the adversarial example three times, and after each listening, we asked one of the following questions:

1. Did you hear anything abnormal? (For affirmative responses, we asked them to write down what they felt.)
2. (With the disclosure that some voice is included) Did you hear any words? (For affirmative responses, we asked them to write down the words.)
3. (With the presentation of the eight phrases in Table 4) Which phrase do you believe was included?

5.2 Results

The results are shown in Table 5. Although a certain number of participants felt something abnormal, no one could hear the target phrases in any case.
In detail, for example, only 32% of the participants felt something abnormal about Table 5(D) and provided comments like, "It was not very clear," "The music seemed a bit fuzzy," and "It sounded like birds in the background." Although Table 5(B) showed the highest rate of participants feeling something abnormal, only comments similar to those for (D), such as, "It was like hearing over a bad Skype connection or phone call," were provided.

For (D) through (F) of Table 5, which were generated from the song by Owl City, we found more comments related to the sound quality, such as, "It sounds highly compressed," compared to the case of Bach's Cello Suite. However, no comment indicated any message or utterance in any case.

Furthermore, even when presented with choices for the target phrases, more than half of the participants responded, "I could not catch anything." In particular, no one chose the correct choice for Table 5(F), even though seven participants chose incorrect choices. Moreover, for the other five adversarial examples, only one or two participants chose the target phrase correctly. Note that these results were obtained under the condition in which we explicitly instructed the participants to listen to the adversarial examples and presented them with choices for the target phrases. Thus, we believe this result does not undermine the attack scenario, which usually assumes a situation in which an attack is even less likely to be noticed.

Based on the above considerations, we conclude that the generated adversarial examples sound like mere noise and are almost unnoticeable to humans, which means they can be a real threat. In addition, the obtained comments suggest directions for future investigation of attack scenarios.
For example, we might be able to use birdsong as the input samples or play the samples through a telephone to make adversarial examples more difficult to notice.

6 Conclusion

In this research, we proposed a method by which to generate audio adversarial examples targeting a state-of-the-art speech recognition model that can attack practically in the physical world. We were able to generate such robust adversarial examples by introducing a band-pass filter, impulse responses, and white Gaussian noise into the generation process in order to simulate the transformations caused by over-the-air playback. In the evaluation, we confirmed that the adversarial examples generated by the proposed method can have smaller perturbations than those of the conventional method, which cannot deal with recurrent networks. Moreover, the results of the listening experiments confirmed that the obtained adversarial examples are almost unnoticeable to humans. To the best of our knowledge, this is the first approach to successfully generate audio adversarial examples in the physical world for speech recognition models that use a recurrent network.

In the future, we would like to examine a detailed attack scenario and possible defense methods regarding the generated audio adversarial examples. We would also like to consider the possibility of realizing a robust speech recognition model using adversarial training, as discussed for image classification [Goodfellow et al., 2015].

Acknowledgments

This work was partly supported by KAKENHI (Grants-in-Aid for Scientific Research) Grant Numbers JP19H04164 and JP18H04099.

References

[Alzantot et al., 2018] Moustafa Alzantot, Bharathan Balaji, and Mani B. Srivastava. Did you hear that? Adversarial examples against automatic speech recognition. In Proceedings of the 2017 NIPS Workshop on Machine Deception, pages 1-6, 2018.

[Amodei et al., 2016] Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, Jingliang Bai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Jingdong Chen, Mike Chrzanowski, Adam Coates, Greg Diamos, Erich Elsen, Jesse Engel, Linxi Fan, Christopher Fougner, Awni Y. Hannun, Billy Jun, Tony Han, Patrick LeGresley, Xiangang Li, Libby Lin, Sharan Narang, Andrew Y. Ng, Sherjil Ozair, Ryan Prenger, Sheng Qian, Jonathan Raiman, Sanjeev Satheesh, David Seetapun, Shubho Sengupta, Chong Wang, Yi Wang, Zhiqian Wang, Bo Xiao, Yan Xie, Dani Yogatama, Jun Zhan, and Zhenyao Zhu. Deep Speech 2: End-to-end speech recognition in English and Mandarin. In Proceedings of the 33rd International Conference on Machine Learning, volume 48, pages 173-182, 2016.

[Athalye et al., 2018] Anish Athalye, Logan Engstrom, Andrew Ilyas, and Kevin Kwok. Synthesizing robust adversarial examples. In Proceedings of the 35th International Conference on Machine Learning, pages 284-293, 2018.

[Carlini and Wagner, 2018] Nicholas Carlini and David A. Wagner. Audio adversarial examples: Targeted attacks on speech-to-text. In Proceedings of the 1st IEEE Workshop on Deep Learning and Security, pages 1-7, 2018.

[Cissé et al., 2017] Moustapha Cissé, Yossi Adi, Natalia Neverova, and Joseph Keshet. Houdini: Fooling deep structured visual and speech recognition models with adversarial examples. In Proceedings of the 31st Annual Conference on Neural Information Processing Systems, pages 6980-6990, 2017.

[Goodfellow et al., 2015] Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In Proceedings of the 3rd International Conference on Learning Representations, pages 1-11, 2015.

[Hannun et al., 2014] Awni Y. Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, and Andrew Y. Ng. Deep Speech: Scaling up end-to-end speech recognition. arXiv, 1412.5567:1-12, 2014.

[Hansen and Pellom, 1998] John H. L. Hansen and Bryan L. Pellom. An effective quality evaluation protocol for speech enhancement algorithms. In Proceedings of the 1998 International Conference on Spoken Language Processing, pages 2819-2822, 1998.

[Härmä, 2001] Aki Härmä. Acoustic measurement data from the varechoic chamber. Technical report, Agere Systems, 2001.

[Jeub et al., 2009] Marco Jeub, Magnus Schafer, and Peter Vary. A binaural room impulse response database for the evaluation of dereverberation algorithms. In Proceedings of the 16th International Conference on Digital Signal Processing, pages 1-5, July 2009.

[Kingma and Ba, 2015] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations, pages 1-13, 2015.

[Kinoshita et al., 2013] Keisuke Kinoshita, Marc Delcroix, Takuya Yoshioka, Tomohiro Nakatani, Armin Sehr, Walter Kellermann, and Roland Maas. The REVERB challenge: A common evaluation framework for dereverberation and recognition of reverberant speech. In Proceedings of the 2013 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pages 1-4, 2013.

[LeCun et al., 2015] Yann LeCun, Yoshua Bengio, and Geoffrey E. Hinton. Deep learning. Nature, 521(7553):436-444, 2015.

[Nakamura et al., 2000] Satoshi Nakamura, Kazuo Hiyane, Futoshi Asano, Takanobu Nishiura, and Takeshi Yamada. Acoustical sound database in real environments for sound scene understanding and hands-free speech recognition. In Proceedings of the 2nd Language Resources and Evaluation Conference, pages 965-968, 2000.

[Peddinti et al., 2015] Vijayaditya Peddinti, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Reverberation robust acoustic modeling using i-vectors with time delay neural networks. In Proceedings of the 16th Annual Conference of the International Speech Communication Association, pages 2440-2444, 2015.

[Povey et al., 2011] Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, Jan Silovsky, Georg Stemmer, and Karel Vesely. The Kaldi speech recognition toolkit. In Proceedings of the 2011 IEEE Workshop on Automatic Speech Recognition and Understanding, pages 1-4, 2011.

[Sainath and Parada, 2015] Tara N. Sainath and Carolina Parada. Convolutional neural networks for small-footprint keyword spotting. In Proceedings of the 16th Annual Conference of the International Speech Communication Association, pages 1478-1482, 2015.

[Schönherr et al., 2018] Lea Schönherr, Katharina Kohls, Steffen Zeiler, Thorsten Holz, and Dorothea Kolossa. Adversarial attacks against automatic speech recognition systems via psychoacoustic hiding. arXiv, 1808.05665:1-18, 2018.

[Szegedy et al., 2014] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian J. Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In Proceedings of the 2nd International Conference on Learning Representations, pages 1-10, 2014.

[Taori et al., 2018] Rohan Taori, Amog Kamsetty, Brenton Chu, and Nikita Vemuri. Targeted adversarial examples for black box audio systems. arXiv, 1805.07820:1-9, 2018.

[Wen et al., 2006] Jimi Y. C. Wen, Nikolay D. Gaubitch, Emanuël A. P. Habets, Tony Myatt, and Patrick A. Naylor. Evaluation of speech dereverberation algorithms using the MARDY database. In Proceedings of the 10th International Workshop on Acoustic Signal Enhancement, pages 1-4, 2006.

[Yuan et al., 2018] Xuejing Yuan, Yuxuan Chen, Yue Zhao, Yunhui Long, Xiaokang Liu, Kai Chen, Shengzhi Zhang, Heqing Huang, Xiaofeng Wang, and Carl A. Gunter. CommanderSong: A systematic approach for practical adversarial voice recognition. In Proceedings of the 27th USENIX Security Symposium, pages 49-64, 2018.
