Protecting Voice Controlled Systems Using Sound Source Identification Based on Acoustic Cues


Authors: Yuan Gong, Christian Poellabauer

Yuan Gong and Christian Poellabauer, Computer Science and Engineering, University of Notre Dame, IN 46556. Email: ygong1@nd.edu, cpoellab@nd.edu

Abstract: Over the last few years, a rapidly increasing number of Internet-of-Things (IoT) systems that adopt voice as the primary user input have emerged. These systems have been shown to be vulnerable to various types of voice spoofing attacks. Existing defense techniques can usually only protect from a specific type of attack or require an additional authentication step that involves another device. Such defense strategies are either not strong enough or lower the usability of the system. Based on the fact that legitimate voice commands should only come from humans rather than a playback device, we propose a novel defense strategy that is able to detect the sound source of a voice command based on its acoustic features. The proposed defense strategy does not require any information other than the voice command itself and can protect a system from multiple types of spoofing attacks. Our proof-of-concept experiments verify the feasibility and effectiveness of this defense strategy.

I. INTRODUCTION

An increasing number of IoT systems rely on voice input as the primary user-machine interface. For example, voice-controlled devices such as Amazon Echo, Google Home, Apple HomePod, and Xiaomi AI allow users to control their smart home appliances, adjust thermostats, activate home security systems, purchase items online, initiate phone calls, and complete many other tasks with ease. In addition, most smartphones are also equipped with smart voice assistants such as Siri, Google Assistant, and Cortana, which provide a convenient and natural user interface to control smartphone functionality or IoT devices. Voice-driven user interfaces allow hands-free and eyes-free operation, where users can interact with a system while focusing their attention elsewhere.

Despite their convenience, voice controlled systems (VCSs) also raise new security concerns due to their vulnerability to voice replay attacks [1], i.e., an attacker can replay a previously recorded voice to make an IoT system perform a specific (malicious) action. Such malicious actions include the opening and unlocking of doors, making unauthorized purchases, controlling sensitive home appliances (e.g., security cameras and thermostats), and transmitting sensitive information. While a simple voice replay attack is relatively easy to detect by a user, and therefore presents only a limited threat, recent studies have pointed out more concerning and effective types of attacks, including self-triggered attacks [2], [3], inaudible attacks [4], [5], and human-imperceptible attacks [6], [7], [8]. These attacks are very different from each other in terms of their implementation, which requires different domain knowledge in areas such as operating systems, signal processing, and machine learning. Some of these attacks are described in [9] and an illustration of a typical attack scenario is shown in Figure 1.

Fig. 1. A voice controlled system (e.g., Google Home, shown in the red rectangle) not only accepts voice commands from humans, but also from playback devices, such as loudspeakers, headphones, and phones. An attacker may take advantage of this by embedding hidden voice commands into online audio or video to maliciously control the VCS. Since legitimate voice commands should only come from a human (rather than a playback device), identifying if the sound source is a human speaker is a possible defense strategy for different types of attack, as long as the malicious command is replayed by an electronic device.
In order to defend against such attacks, multiple defense strategies have been proposed [1], [10], [11], [12]. However, most existing defense technologies can either only defend against one specific kind of attack or require an additional authentication step using another device, which limits the effectiveness and usability of the voice controlled system. For example, AuDroid [10] defends against self-triggered attacks by managing the audio channel authority of the victim device, but it cannot defend against other types of attacks. VAuth [12] guarantees that the voice command is from a user by collecting body-surface vibrations of the user via a wearable device, but the required wearable device (i.e., earbuds, eyeglasses, or necklaces) is inconvenient to the user. Hence, a defense strategy that is robust to multiple types of attacks and minimally impacts the usability of a VCS is highly desirable.

Towards this end, we explore a new defense strategy that identifies and rejects received voice commands that are not from a human speaker, merely by using the acoustic cues of the voice command itself. We find that voice commands from humans and from playback devices can be differentiated based on the differences in their sound production mechanisms. The advantage of this strategy is that it does not require any additional information other than the voice command itself; it therefore does not impact the usability of a VCS, while at the same time being robust to all variants of replay attacks.

The rest of the paper is organized as follows: in Section II, we review and classify state-of-the-art attack techniques faced by current voice controlled systems, arriving at the conclusion that most attacks are actually variants of the replay attack. In Section III, we propose our new defense strategy and compare it with existing defense approaches. In Section IV, we present experimental evaluation results. Finally, we conclude the paper in Section V.

II. ATTACKS ON VOICE CONTROLLED SYSTEMS

In order to develop an effective defense strategy, it is important to have a good understanding of typical attack scenarios and state-of-the-art attack techniques. With the rapidly growing popularity and capabilities of voice-driven IoT systems, the likelihood and potential damage of voice-based attacks also grow very quickly. As discussed in [2], [13], [11], an attack may lead to severe consequences, e.g., a burglar could enter a house by tricking a voice-based smart lock, or an attacker could make unauthorized purchases and credit card charges via a compromised voice-based system. Such attacks can be very simple, but still very difficult or even impossible to detect by humans. Voice attacks can also be hidden within other sounds and embedded into audio and video recordings.
In addition, these attacks can be executed remotely, i.e., the attacker does not have to be physically close to the targeted device; e.g., compromised audio and video recordings can easily be distributed via the Internet. Once a recording is played back by a device such as the loudspeaker of a phone or laptop, the attack can impact VCSs nearby. It is very easy to scale up such attacks, e.g., a hidden malicious audio sample can be embedded into a popular YouTube video or transmitted via broadcast radio and thereby target millions of devices simultaneously.

A fundamental reason for the vulnerability of voice-controlled IoT systems is that they continuously listen to the environment to accept voice commands, providing users with hands-free and eyes-free operation. However, this also provides attackers with an always available voice interface. Several potential points of attack are shown in Figure 2. Although the implementations of existing attack techniques are very different, their goals are the same: to generate a signal that leads a voice controlled system to execute a specific malicious command that the user cannot detect or recognize. In the following sections, we classify representative state-of-the-art attack approaches according to their type of implementation. The attacker performance discussed in this section is taken from the original publications, but note that due to the rapid developments in the area of cloud-based systems, the attacker performance is likely to change quickly over time.

Fig. 2. A typical voice-driven device captures the human voice, converts it into a digital speech signal, and feeds it into a machine learning model. The corresponding command is then executed by the connected IoT devices. Potential points of attack in this scenario include: 1: spoofing the system using previously recorded audio; 2: hacking into the operating system to force the voice-driven software to accept commands erroneously; 3: emitting carefully designed illegitimate analog signals that will be converted into legitimate digital speech signals by the hardware; and 4: using carefully crafted speech adversarial examples to fool the machine learning model.

A. Impersonation Attack

An impersonation attack, i.e., someone other than the authorized user using a VCS maliciously, is the simplest attack and does not require any particular expertise or knowledge. However, this attack cannot be executed remotely and does not scale well. It requires that the attacker is in close proximity of the VCS device, which is a rare attack scenario since these devices are typically placed within a person's home or on the person's body. Therefore, this attack poses only a limited threat to VCSs.

B. Basic Voice Replay Attack

In a voice replay attack, an attacker makes a VCS perform a specific malicious action by replaying a previously recorded voice sample [1], [10], [11]. This attack can be executed remotely, e.g., via the Internet. A shortcoming of the basic voice replay attack is that it is easy to detect and therefore has limited practical impact. Nevertheless, as shown later in this section, voice replay attacks are the basis of other more advanced and dangerous attacks.
C. Operating System Level Attack

Compared to basic voice replay attacks, an operating system (OS) level attack exploits vulnerabilities of the OS to make the attack self-triggered and more imperceptible. Representative examples of this are the A11y attack [3], the GVS-Attack [2], and the approach presented in [13]. In [3], the authors propose a malware that collects a user's voice and then performs a self-replay attack as a background service. In [2], the authors further verify that the built-in microphone and speaker can be used simultaneously and that the use of the speaker does not require user permission on Android devices. They take advantage of this and propose a zero-permission malware, which continuously analyzes the environment and conducts the attack once it finds that no user is nearby. The attack uses the device's built-in speaker to replay a recorded or synthetic speech, which is then accepted as a legitimate command. In [13], the authors propose an interactive attack that can execute multiple-step commands. OS level attacks are usually self-triggered by the victim device and are therefore rather dangerous and practical.

TABLE I. REPRESENTATIVE VOICE ATTACK TECHNIQUES

| Attack Name | Attack Type | Implementation |
|---|---|---|
| GVS Attack [2] | Operating System | Continuously analyze the environment and conduct a voice replay attack using the built-in microphone when the opportunity arises. |
| A11y Attack [3] | Operating System | Collect the voice of a user and perform a self-replay attack as a background service. |
| Monkey Attack [13] | Operating System | Bypass authority management of the OS and perform an interactive voice replay attack to execute more advanced commands. |
| Dolphin Attack [4] | Hardware | Emit an ultrasound signal that can be converted into a legitimate speech digital signal by the MEMS microphone. |
| IEMI Attack [5] | Hardware | Emit an AM-modulated signal that can be converted into a legitimate speech digital signal by the wired microphone-capable headphone. |
| Cocaine Noodles [14] | Machine Learning | Similar to the hidden voice command. |
| Hidden Voice Command [6] | Machine Learning | Mangle malicious voice commands so that they retain enough acoustic features for the ASR system, but become unintelligible to humans. |
| Houdini [15] | Machine Learning | Produce sound that is almost no different from normal speech, but fails to be recognized by both known and unknown ASR systems. |
| Speech Adversarial Example [7] | Machine Learning | Produce sound that is over 98% similar to any given speech, but makes the DNN model fail to recognize the gender, identity, and emotion. |
| Targeted Speech Adversarial Example [8] | Machine Learning | Produce sound that is over 99.9% similar to any given speech, but transcribes as any desired malicious command by the ASR. |

D. Hardware Level Attack

A hardware level attack replays a synthetic non-speech analog signal instead of a human voice. The analog signal is carefully designed according to the characteristics of the hardware (e.g., the analog-digital converter). The signal is inaudible, but can be converted into a legitimate digital speech signal by the hardware. Representative approaches are the Dolphin attack [4] and the IEMI attack [5]. In [4], the authors utilize the non-linearity of a Micro Electro Mechanical Systems (MEMS) microphone over ultrasound and successfully generate inaudible ultrasound signals that can be accepted as legitimate target commands. In [5], the authors take advantage of the fact that a wired microphone-capable headphone can be used as a microphone and an FM antenna simultaneously and demonstrate that it is possible to trigger voice commands remotely by emitting a carefully designed inaudible AM-modulated signal. Hardware level attacks typically need a special signal generator and are typically used to affect mobile VCSs in crowded environments (e.g., an airport).
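To make the modulation principle behind such attacks concrete, the following is a hedged, self-contained sketch, not the actual DolphinAttack implementation: it amplitude-modulates a stand-in voice baseband onto a 25 kHz carrier and models the receiving microphone with a toy quadratic non-linearity followed by a crude low-pass filter, which recovers an audible copy of the command. The sample rate, carrier frequency, non-linearity coefficient, and filter are all illustrative assumptions.

```python
# Illustrative sketch of the AM principle exploited in [4], [5].
import numpy as np

fs = 192_000                      # sample rate high enough to represent the carrier
t = np.arange(0, 1.0, 1 / fs)     # 1 s of signal
fc = 25_000                       # ultrasonic carrier, above the ~20 kHz hearing limit

# Stand-in for a recorded voice command, band-limited to a few kHz.
baseband = 0.5 * np.sin(2 * np.pi * 400 * t) + 0.3 * np.sin(2 * np.pi * 1200 * t)

# Classic AM: (1 + m(t)) * cos(2*pi*fc*t). The emitted signal is inaudible.
emitted = (1.0 + baseband) * np.cos(2 * np.pi * fc * t)

# Toy microphone front end: y = s + a2*s^2 (quadratic non-linearity), then a
# moving-average low-pass. The squared term contains a component proportional
# to the baseband, so an audible copy of the command reappears.
nonlinear = emitted + 0.1 * emitted ** 2
kernel = np.ones(48) / 48         # ~4 kHz-wide smoothing at fs = 192 kHz
recovered = np.convolve(nonlinear, kernel, mode="same")

print("correlation with baseband:", np.corrcoef(recovered, baseband)[0, 1].round(3))
```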
Fig. 3. An illustration of machine learning adversarial examples. Studies have shown that by adding an imperceptibly small, but carefully designed, perturbation, an attack can successfully lead the machine learning model to making a wrong prediction. Such attacks have been used in computer vision (upper graphs) [16] and speech recognition (lower graphs) [7], [8], [15].

E. Machine Learning Level Attack

State-of-the-art voice controlled systems are usually equipped with an automatic speech recognition (ASR) algorithm to convert the digital speech signal to text. Deep neural network (DNN) based algorithms such as Deep Speech [17] can achieve excellent performance with around 95% word recognition rate and hence dominate the field. However, recent studies show that machine learning models, especially DNN based models, are vulnerable to attacks by adversarial examples [16]. That is, machine learning models might misclassify perturbed examples that are only slightly different from correctly classified examples (illustrated in Figure 3 for both image and audio scenarios). In speech, adversarial samples can sound like normal speech, but will actually be recognized as a completely different malicious command by the machine, e.g., an audio file might sound like "hello", but will be recognized as "open the door" by the ASR system.

In recent years, several examples of such attacks have been studied [6], [7], [8], [14], [15], [18]. Cocaine Noodles [14] and Hidden Voice Command [6] are the first efforts to utilize the differences in the way humans and computers recognize speech and to successfully generate adversarial sound examples that are intelligible as a specific command to ASR systems (Google Now and CMU Sphinx), but are not easily understandable by humans. The limitation of the approach in [14], [6] is that the generated audio does not sound like legitimate speech. A user might notice that the malicious sound is an abnormal condition and may take counteractions. More recent efforts [15], [7], [8] take advantage of an intriguing property of DNNs by generating malicious audio that sounds almost completely like normal speech, adopting a mathematical optimization method. The goal of these techniques is to design a minor perturbation of the speech signal that can fool an ASR system. In [8], the authors propose a method that can produce an audio waveform that is less than 0.1% different from a given audio waveform, but will be transcribed as any desired text by DeepSpeech [17]. In [7], the authors demonstrate that a 2% designed distortion of speech can make state-of-the-art DNN models fail to recognize the gender and identity of the speaker. In [15], the authors show that such attacks are transferable to different and unknown ASR models. Such attacks are dangerous, because users do not expect that normal speech samples, such as "hello", could be translated into a malicious command by a VCS.
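As a hedged illustration of the optimization idea behind these attacks, and not a reproduction of [7], [8], or [15], the sketch below applies a fast-gradient-sign-style perturbation to a toy linear "command detector": for a linear score the input gradient is simply the weight vector, so a tiny per-sample nudge along its sign raises the malicious-command score while staying below an audibility budget. Every name, size, and parameter here is hypothetical.

```python
# Illustrative adversarial perturbation against a toy linear classifier.
import numpy as np

rng = np.random.default_rng(0)

n = 16000                          # 1 s of audio at 16 kHz
x = rng.standard_normal(n) * 0.1   # stand-in for a benign waveform ("hello")
w = rng.standard_normal(n) * 0.01  # toy model weights: score > 0 => "open the door"
b = -2.0

def score(sig: np.ndarray) -> float:
    """Toy decision function for the malicious command."""
    return float(sig @ w + b)

# For a linear model the gradient of the score w.r.t. the input is just w;
# nudging x along sign(w) raises the malicious-command score while keeping
# each sample's change below an (assumed) audibility budget eps.
eps = 0.005
x_adv = x + eps * np.sign(w)

print(f"benign score:      {score(x):+.3f}")
print(f"perturbed score:   {score(x_adv):+.3f}")
print(f"max sample change: {np.max(np.abs(x_adv - x)):.4f}")
```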
Table I provides a summary of these attack techniques. One important observation is that all existing attacks (except impersonation) are based on the replay attack. That is, OS level and machine learning level attacks replay a sound into the microphone of the target device, and hardware level attacks replay a specifically designed signal using some signal generator. In other words, the sound source is always another electronic device (e.g., a loudspeaker or signal generator) instead of a human speaker. This same fact makes it possible for such attacks to be performed remotely and at a large scale. However, only spoken commands from a live speaker should be accepted as legitimate, which means that the identity of the sound source could be used to differentiate legitimate from potentially malicious voice commands. That is, if we can determine whether the received signal is from a live speaker or an electronic device, we are able to prevent multiple (including yet unknown) types of VCS attacks. These observations and objectives lead us to the design of a defense strategy that relies on detecting the source of acoustic signals, as presented in this paper.

III. SOUND SOURCE IDENTIFICATION

A. Existing Defense Strategies

Various defense strategies have been proposed to help VCSs defend against specific types of attacks. For example, the work in [10] proposes a solution called AuDroid to manage audio channel authority. By using different security levels for different audio channel usage patterns, AuDroid can resist a voice attack using the device's built-in speaker [2], [13]. However, AuDroid is only robust to such attacks.

Adversarial training [16], i.e., training a machine learning model that can classify legitimate samples and adversaries, is one defense strategy against machine learning level attacks. In [6], the authors train a logistic regression model to classify legitimate voice commands and hidden voice commands, which achieves a 99.8% defense rate. A limitation of adversarial training is that it needs to know the details of the attack technology, and the trained defense model only protects against the corresponding attack. In practice, attackers will not publish their approaches and they can always change the parameters (e.g., the perturbation factor in [7]) to bypass the defense. That is, the defense range of adversarial training is limited and, in general, these defense techniques are able to address only some vulnerabilities.

On the other hand, defense strategies that can resist multiple types of attacks usually require an additional authentication step with the help of another device. In [12], the authors propose VAuth, which collects the body-surface vibration of the user via a wearable device and guarantees that the voice command is from the user. However, the required wearable devices (i.e., earbuds, eyeglasses, and necklaces) may be inconvenient for users.
In [11], the authors propose a virtual security button (VSButton) that leverages Wi-Fi technology to detect indoor human motion; voice commands are only accepted when human motion is detected. The limitation is that voice commands are not necessarily accompanied by a detectable motion. In [1], the authors use a magnetometer to determine whether the source of a voice command is a loudspeaker and reject such commands. However, this approach works only up to 10 cm, which is less than the usual human-device distance. In summary, an additional authentication step (e.g., asking the user to wear a wearable device, requiring that voice commands are provided only when the body is in motion, or speaking very close to the device) does indeed increase the security, but also lowers the usability, which goes against the original design intention of voice controlled systems.

Finally, other efforts [2], [6], [10] mention the possibility of using automatic speaker verification (ASV) systems for defense. However, this is also not strong enough, because an ASV system itself is vulnerable to machine learning adversarial examples [7] and previously recorded user speech [1], [6]. In addition, VCSs are often designed to be used by multiple users, and limiting use to certain users only will impact the usability of a VCS.

B. Sound Source Identification Using Acoustic Cues

Based on the observations in Section II, identifying the sound source can help defend against multiple types of attacks. But adding an authentication step that requires a user to provide additional information may hurt the usability of a VCS. Therefore, we are concerned with the question: can we identify the sound source of a received voice command by merely using information that is embedded in the voice signal itself?

In this work, we explore the possibility of using acoustic features of a voice command to identify whether the producer is a live speaker or a playback device. The motivation of this approach is that the sound production mechanisms of humans and playback devices are different, leading to differences in the frequencies and direction of the output voice signal: e.g., the sound polar diagram of a human is different from that of a playback device [19]; the sound produced by a playback device usually contains effects of unwanted high-pass filtering [20]; and the signal produced by an ultrasound generator contains carrier signal components [4]. These differences may leave cues in the received digital audio signal corresponding to the voice command. Therefore, it is possible that such sound source differences can be modeled using the acoustic features of the received digital audio signal. From the perspective of bionics, we know that humans are intuitively able to distinguish between a live speaker and a playback device by only listening to (but not seeing) the source.
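As a concrete, hedged example of one such cue (the high-pass filtering effect noted above, not a feature the paper itself reports using): the fraction of spectral energy below a few hundred Hz tends to be smaller for speech replayed through small transducers than for live speech. The band edge, the toy signals, and the idea that this single scalar would suffice are all illustrative assumptions.

```python
# Illustrative spectral cue: low-band energy ratio of a waveform.
import numpy as np

def low_band_energy_ratio(wav: np.ndarray, fs: int, cutoff_hz: float = 300.0) -> float:
    """Share of spectral energy below cutoff_hz."""
    spectrum = np.abs(np.fft.rfft(wav)) ** 2
    freqs = np.fft.rfftfreq(len(wav), d=1.0 / fs)
    return float(spectrum[freqs < cutoff_hz].sum() / spectrum.sum())

fs = 16_000
t = np.arange(0, 1.0, 1 / fs)
# Toy signals: live speech keeps its low-frequency content, while the
# "replayed" version has the low end attenuated by the playback device.
live = np.sin(2 * np.pi * 120 * t) + 0.5 * np.sin(2 * np.pi * 900 * t)
replayed = 0.05 * np.sin(2 * np.pi * 120 * t) + 0.5 * np.sin(2 * np.pi * 900 * t)

print("live:    ", round(low_band_energy_ratio(live, fs), 3))
print("replayed:", round(low_band_energy_ratio(replayed, fs), 3))
```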
It is worth mentioning that a similar technology for detecting replay attacks has been studied to protect ASV systems from spoofing [20], [21], [22]. However, ASV attacks and VCS attacks are actually very different. As shown in Figure 4, a typical replay attack can be divided into two phases: the recording phase and the playback phase. In the recording phase, the attacker records or synthesizes a malicious voice command, and during the playback phase, the malicious voice command is transmitted from the playback device to the victim device over the air. ASV attacks and VCS attacks differ during both phases:

1) The Recording Phase: In ASV attack scenarios, an attacker must either record or synthesize (e.g., using voice conversion or cutting and pasting) the victim's voice (i.e., the voice of the authorized user) to be used as a malicious voice command [23]. In both cases, various cues will be left in the malicious command that can be used to detect the attack. In contrast, a VCS typically accepts voice commands from anyone, and the attacker does not have to forge a particular victim's voice. This also means that typically few cues will be left in the malicious voice command.

In ASV attacks, when the victim's voice is being recorded, this typically has to occur either via a telephone or via far-field microphones, both of which will introduce certain levels of channel or background noise. The authors in [24], [25] explore the characteristics of far-field recordings and how to use them to detect an attack. In [26], the authors use channel noise patterns to distinguish between a pre-recorded voice and the voice of a live speaker. Further, in [27], [28], [29], the authors propose a scheme to reject voice that is too similar to ones previously received by the ASV system, because this could indicate a recorded voice. On the other hand, forged voice commands generated using voice conversion or cutting-and-pasting techniques can also be distinguished from genuine voice samples [25], [30].

In contrast, in VCS attacks, faking a victim's voice is not needed, i.e., attackers can simply record their own voice at a close distance and with a high-quality recorder to eliminate background and channel noise in the voice command. Hence, the background and channel noise features can no longer be used to differentiate a fake voice from a real one. Malicious commands are naturally different from the historical records in a VCS; therefore, the approaches in [27], [28], [29] will also fail. The attacker can also synthesize voice commands using a text-to-speech system, without the need of voice conversion or copying and pasting, and consequently the approaches in [25], [30] will also not work. In summary, the existing techniques built to protect ASV systems are not a good fit for the defense needs of a VCS.

2) The Playback Phase: In ASV applications, the microphone is usually positioned very close to the user (i.e., less than 0.5 m). At such distances, some acoustic features can be used to identify the sound source of the speaker; e.g., in [31], [32], the authors use the "pop noise" caused by breathing to identify a live speaker. Other efforts [33], [34], [35] do not explicitly use close-distance features, but the databases they use to develop their defense strategies were recorded at close distances [22], [36], and therefore, these approaches may also implicitly use close-distance features.

Fig. 4. Typical replay attacks include a recording phase and a playback phase. In the recording phase, the attacker records or synthesizes a malicious voice command. In the playback phase, the malicious voice command is transmitted from the playback device to the victim device over the air. Unique aspects of attacks on a VCS (in contrast to an ASV system) are that malicious commands can easily be generated (leaving very few cues in the command itself) during the recording phase, and the transmission distances can be very long during the playback phase.
In contrast, with the help of far-field speech recognition techniques, modern voice controlled systems can typically accept voice commands from rather long distances (i.e., several meters to tens of meters). At such distances, close-distance features cannot be used to distinguish between human speakers and recorded voice; e.g., the pop noise effect quickly disappears over larger distances.

In summary, VCS attack scenarios may leave only very few cues during the recording phase that could help detect a replay attack. Instead, we have to focus on the playback phase, where we have to identify features that discriminate between human and electronic commands, especially when commands are given over larger distances. Modeling the sound production and transmission over long distances with room reverberation is complex, making it difficult to design the required features manually. Therefore, in this work, we first extract a large acoustic feature set and then use machine learning techniques to identify the discriminative features.

Since this is a new direction in protecting against attacks on a VCS, existing datasets are difficult to use, since they either contain recordings made over short distances [37], [36] or they contain non-speech content [38]. Therefore, we collected our own dataset consisting of voice commands produced by both humans and different playback devices, and recorded at various distances from the speaker in the playback phase (the details of this dataset are described in Section IV-C). We further use the COVAREP [39] acoustic feature extraction toolkit, which extracts 74 features per 10 ms frame, and then apply three statistic functions (mean, max, and min) to each feature over the entire voice command sample, which leads to a 222-dimensional feature vector for each voice command sample. We use a support vector machine (SVM) with a radial basis function (RBF) kernel as the machine learning algorithm.
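A minimal sketch of this feature pipeline, with COVAREP's per-frame output replaced by a random stand-in matrix (COVAREP itself is a MATLAB/Octave toolkit): the per-frame descriptors are pooled with mean, max, and min into one 222-dimensional utterance vector. The frame count and the random values are hypothetical.

```python
# Pooling per-frame acoustic descriptors into one utterance-level vector.
import numpy as np

def utterance_vector(frame_feats: np.ndarray) -> np.ndarray:
    """frame_feats: (n_frames, 74) array of per-frame acoustic descriptors
    (COVAREP emits one such row per 10 ms). Returns a (222,) vector."""
    stats = [frame_feats.mean(axis=0),
             frame_feats.max(axis=0),
             frame_feats.min(axis=0)]
    return np.concatenate(stats)        # 3 * 74 = 222 dimensions

# Hypothetical example: a 1.2 s wake word => ~120 frames at a 10 ms hop.
frames = np.random.default_rng(0).standard_normal((120, 74))
vec = utterance_vector(frames)
print(vec.shape)                        # (222,)
```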
There are two considerations that facilitate the task of building a sound source identification system for a VCS using acoustic features. First, the sound source identification can be done in a text-dependent way, i.e., even though a voice command could be any text, it usually needs to start with a fixed wake word (such as "Alexa", "Hey Google", or "Hey Siri"). We therefore only need to identify the source of the fixed wake word, which eliminates the impact of analyzing different spoken texts. Second, a VCS typically runs only on some dedicated devices that use a fixed microphone model; e.g., Alexa runs on Amazon Echo devices, while Google Home runs on Google Home devices. This means that the victim device in Figure 4 is fixed, which eliminates another variable in the playback phase. Otherwise, different microphones may have different sound collection characteristics (e.g., frequency response), which may be confused with differences in playback device characteristics and thereby affect the identification.

Fig. 5. The VCS devices used in our experiments: Amazon Alexa-based Amazon Echo Dot (left) and Google Home Mini (right). The Amazon Echo Dot has 7 microphones, while the Google Home Mini has 2 microphones (the microphone positions are shown with the rectangles).

Fig. 6. The playback devices used in our experiment: Sony SRSX5 loudspeaker (left), Audio Technica ATH-AD700X headphone (middle), and iPod Touch (right).

IV. EXPERIMENTATION

A. Replay Attacks on VCSs

The lack of defense solutions against replay attacks has been reported in multiple previous efforts [11], [13], but due to the rapid advances of cloud-based systems, we first evaluate replay attacks with the goal of verifying whether state-of-the-art VCS devices will reject replayed voice commands, especially when a sensitive operation is requested. We run these experiments with the Amazon Alexa and Google Home devices shown in Figure 5. The user of these devices is an adult male. The replay attack is performed using a synthetic female voice command (implemented using Google Text-to-Speech, i.e., the resulting command will not sound completely natural to humans). The content of the voice command is "Alexa, buy a laptop" and "Hey Google, buy a laptop", with two subsequent "Yes" commands. The voice commands are replayed using a headphone (shown in Figure 6) at a 50 cm distance to the VCS.

Fig. 7. Experiment locations: a typical meeting room (experiment environment 1, left) and a long corridor (experiment environment 2, right). The VCS device/microphone location is indicated with the rectangle. The right picture is taken from the edge of the attack range of the Amazon Echo Dot.

Fig. 8. The floor plan of experiment environment 1 (the 6.2 m by 4.2 m test room to the right). The recording device of the experiments in Section IV-C is placed in the circle. 22 positions are marked with the rectangles (P1-P22). The different directions of the speakers in the experiments in Section IV-C (directions 1-5) are indicated by the top left arrows.

This attack successfully makes Alexa place an order for a $350 laptop, while we found that Google Home currently does not allow purchases over $100. Therefore, we changed the voice command and successfully let Google Home place an order for a $15 set of paper towels. Further tests also showed that both devices will still perform the requested action when the genuine male voice command and the replayed female voice command are alternated in a single chat.

Note that both Alexa and Google Home provide a feature to learn the voice of the user (i.e., "Alexa Your Voice" and "Google Home Voice Match"). In our next experiment, we enable this feature, let the VCS learn the voice of the male user, and then repeat the above described attack. Alexa still accepts the voice command and places the order as before, while Google Home rejects the request because of the voice mismatch.
We then repeat the attack using a pre-recorded voice command of the user and successfully let Google Home place an order. That is, this feature does not provide a strong defense against replay attacks (note that the purpose of the voice learning feature is to provide personalized services rather than to address a security concern). In addition, this feature also affects the usability of legitimate shared use of a VCS.

TABLE II. THE REPLAY ATTACK RANGE OF AMAZON ECHO DOT AND GOOGLE HOME MINI

| VCS Device | Meeting room: Headphone | Meeting room: iPod | Meeting room: Loudspeaker | Corridor: Headphone | Corridor: iPod | Corridor: Loudspeaker |
|---|---|---|---|---|---|---|
| Amazon Echo Dot | P1-P19 | P1-P19 | P1-P22 | 21.0 m | 22.4 m | 28.9 m |
| Google Home Mini | P1-P18 | P1-P18 | P1-P21 | 4.4 m | 4.7 m | 21.0 m |

To conclude, although there are constraints limiting the impact of attacks on VCS devices, state-of-the-art VCS solutions are not able to detect a replayed voice command, which leaves a severe security flaw.

B. Attack Range Analysis

As discussed in Section III-B, for the proposed defense strategy, we need to identify the source of the voice command within the "attack range" of a VCS, i.e., the maximum distance between a playback device and a VCS device. In other words, we need to measure how far away the malicious voice command can come from and still be accepted by the VCS. This attack range depends on three parameters: (1) the playback device itself, which affects the sound production, e.g., sound volume and speaker directivity; (2) the environment, which affects the sound transmission, the background noise, and the room reverberation; and (3) the VCS device itself, which affects the sound collection. In this experiment, we test three playback devices: a Sony SRSX5 portable speaker, an Audio Technica ATH-AD700X headphone, and an iPod Touch. The devices are tested in two environments, a typical meeting room and a long corridor, with two VCS devices (Amazon Echo Dot and Google Home Mini); therefore, we have a total of 12 attack conditions. The playback devices and environments are shown in Figures 6, 7, and 8.

Since all the attacks (with the exception of impersonation) are based on the basic replay attack, in this experiment we measure the attack range of the basic replay attack. The attack range of its variants may be shorter than this range; e.g., the minor perturbations of a machine learning level attack might not be captured by the VCS at larger distances [8]. We perform the attacks at different positions in the environment by replaying the synthetic voice commands "Alexa, what's the weather today?" and "Hey Google, what's the weather today?" via the playback devices, repeating the experiments three times. If the VCS accepts any one of the voice commands and reports the weather, we regard the attack as successful.

As shown in Table II, all three variables have a large impact on the attack range. In experiment environment 1 (meeting room), the attack range of all three playback devices covers the entire room (P1-P18 in Figure 8) for both Alexa and Google Home. We then opened the door of the room and extended the experiments into the neighboring room (left room in Figure 8). Here, we find that the attack range can be even larger (P19-P22). We further repeat the experiment in environment 2 (corridor) and measure the longest straight-line attack distance.
This experiment provided the following findings:

1) The attack range is large. The attack range is larger than expected; e.g., headphones are not designed for replaying sound loudly, but their attack range still covers a typical room. The attack range of a loudspeaker can be over 25 m (in the corridor) and go through a wall (in the meeting room). This means that attacks can be performed over long distances and even from a different room.

2) The attack range depends on the VCS device. The Amazon Echo Dot and the Google Home Mini differ in their abilities to capture sound. That is, the Amazon Echo Dot picks up commands over longer distances and can therefore be attacked over a substantially larger range. This is likely due to the fact that the Amazon Echo Dot has seven microphones, while the Google Home Mini has only two. Such a microphone array is beneficial for far-field speech recognition, but while this improves the usability of a VCS, it also increases the risk of being attacked.

3) The attack range depends on the environment. The attack range is also determined by the environment; e.g., the straight-line attack distance of the headphone/iPod to the Google Home Mini in the corridor is only around 4.5 m, but their straight-line attack distance in the meeting room is larger than 6 m. This is likely due to the fact that sound waves are attenuated differently in an open space (corridor) compared to a closed space (meeting room).

4) The attack range depends on the playback device. While it is unsurprising that the loudspeaker has a greater volume and therefore a larger attack range, the relationship between attack range and type of playback device is more complex than a simple difference in volume. For example, the difference between the headphone's attack range on the Amazon Echo Dot and its attack range on the Google Home Mini is much larger than the corresponding difference for the loudspeaker, indicating that there are other factors at play besides a volume difference. The goal of this work is to exploit such differences to help identify the sound source.

C. Sound Source Identification

Considering that VCS devices are more likely to be placed in a typical room rather than a corridor, we limit the following experiments to the meeting room environment (experiment environment 1). We use an iPod Touch to record the sound, simulating the sound collection of a VCS device, because we cannot directly access the sound stored in a commercial VCS. We refer to this iPod Touch as the "sound collector" in the remainder of this section. The sound collector is placed on one side of the room. We then record voice command samples "Alexa" at 22 different positions, i.e., P1-P22 (shown in Figure 8). The voice command is produced by (1) a loudspeaker, (2) an iPod, (3) a headphone, and (4) a human speaker. The voice commands (which are then replayed by the playback devices) are recorded from a human speaker using a professional recorder (Tascam DR-05) at a close distance and in a quiet environment. This emulates a realistic attack scenario, i.e., it leaves as few cues as possible in the recording phase (the voice command itself is almost exactly the same). At each position, we also record multiple voice command samples that are produced while the speaker faces different directions.
In more detail, for the iPod and headphone, we record voice commands at positions P1-P18 and, at each position, we record three directions (all facing towards the sound collector, i.e., directions 1-3 shown in Figure 8). For the loudspeaker and human speaker, we produce voice commands at positions P1-P22 and, at each position, we use five directions (both facing towards and away from the recorder, i.e., all directions 1-5 shown in Figure 8). This is done because our previous results showed that the loudspeaker has a larger attack range. Overall, after eliminating a small number of samples due to undesired noise, we obtained a dataset consisting of a total of 296 voice command samples produced by the different sound sources (loudspeaker: 106, iPod: 44, headphone: 46, human: 100).

We then normalized each voice command waveform to have the same maximum amplitude of 1, in order to give all voice command samples the same volume. We then extract a 222-dimensional feature vector using the COVAREP toolkit and feed it to the SVM with an RBF kernel as described in Section III-B. We use empirical hyperparameters: the SVM cost C = 1 and the RBF kernel γ = 0.25 (the reciprocal of the number of classes). We then evaluate the results using a ten-fold cross-validation, where the training and testing sets are independent from each other in each round.
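The classifier and evaluation protocol just described can be sketched as follows. The feature matrix below is a random stand-in for the real 296 COVAREP vectors (so the resulting scores are meaningless); the class sizes match the dataset above, and peak normalization is shown on the waveform side.

```python
# SVM (RBF kernel, C = 1, gamma = 0.25) with ten-fold cross-validation.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_predict, StratifiedKFold
from sklearn.metrics import f1_score

def peak_normalize(wav: np.ndarray) -> np.ndarray:
    """Scale a waveform so its maximum absolute amplitude is 1."""
    return wav / np.max(np.abs(wav))

rng = np.random.default_rng(0)
X = rng.standard_normal((296, 222))                  # stand-in for real features
# 0 = loudspeaker (106), 1 = iPod (44), 2 = headphone (46), 3 = human (100)
y = np.array([0] * 106 + [1] * 44 + [2] * 46 + [3] * 100)

clf = SVC(kernel="rbf", C=1.0, gamma=0.25)           # gamma = 1 / #classes
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
pred = cross_val_predict(clf, X, y, cv=cv)           # each sample predicted by a
                                                     # model that never trained on it
print("per-class F1:", f1_score(y, pred, average=None))
```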
The results of the experiment are shown in Figure 9. For the human speaker vs. playback devices classification, i.e., where we consider loudspeaker, iPod, and headphone as a single class, we achieve an F1-score of 0.94 on the playback device class and 0.90 on the human speaker class, leading to an average F1-score of 0.92. We further find that human voice commands are more likely to be misclassified as those reproduced by the iPod and the headphone. That is, if we regard iPod and headphone as a single class and take out the loudspeaker class (since it is rarely misclassified as the other classes), the F1-score is 0.87 on the new iPod-headphone class and 0.90 on the human speaker class, leading to an average F1-score of 0.89. This result tells us that the loudspeaker is very different from the human speaker and even from the other playback devices, and can therefore be easily identified. The iPod and headphone devices are closer to the human speaker, but can still be effectively identified. Interestingly, the headphone and the iPod are difficult to distinguish from each other. All these results verify the feasibility and effectiveness of using acoustic cues to identify the sound source of a voice command.

Fig. 9. The confusion matrix of the sound source classification test.

We further use a feature selection algorithm (which we did not use for learning the model, because it could lead to overfitting on a small dataset) to analyze the discriminative features. The results are shown in Table III, where we provide the combination of features that contribute to the classifier, including the fundamental frequency, Mel-frequency cepstral coefficients (MFCCs), and the harmonic model and phase distortion mean and deviation (HMPDM, HMPDD). This indicates that modeling such a classifier is complex and that the use of a machine learning model is essential.

TABLE III. SELECTED ACOUSTIC FEATURES BY CORRELATION-BASED FEATURE SUBSET SELECTION

| Statistic Function | Features Selected (28) |
|---|---|
| Max | Fundamental frequency, MFCC(3,4,8,10,13) |
| Min | MFCC(1,3,5,7,10,22) |
| Mean | Harmonic structure (H1H2), HMPDM(4,6), HMPDD(0), MFCC(1,3,4,7,8,12,13,14,15), peak slope |
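The correlation-based selection behind Table III can be approximated with the following simplified stand-in: rank each pooled feature by the absolute correlation with the class label and keep the top 28. Full correlation-based feature subset selection also penalizes inter-feature redundancy, which this sketch omits, and the data here are random placeholders.

```python
# Simplified correlation-based feature ranking (stand-in for full CFS).
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((296, 222))      # stand-in for the dataset features
y = rng.integers(0, 4, size=296)         # stand-in labels (4 sound sources)

# Pearson correlation of each feature column with the (numeric) label vector;
# treating class indices as numbers is crude but serves the illustration.
Xc = X - X.mean(axis=0)
yc = y - y.mean()
corr = (Xc * yc[:, None]).sum(axis=0) / (
    np.sqrt((Xc ** 2).sum(axis=0)) * np.sqrt((yc ** 2).sum())
)

top28 = np.argsort(-np.abs(corr))[:28]   # indices of the selected features
print(sorted(top28.tolist()))
```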
Finally, it needs to be mentioned that, while the empirical results are encouraging, the experiments were performed in fixed environments using only three representative playback devices. Therefore, the learned model may lack generalization. In practice, there are infinite options for playback devices and environments. Further, speaker variability should also be considered. As a consequence, an important future step will be to build a larger database containing a variety of different conditions, which can then serve as the basis for the development of more generalized machine learning models.

V. CONCLUSIONS

In this work, we first review state-of-the-art attack technologies and find that all of them (with the exception of the impersonation attack) are based on the replay attack, where the malicious voice command is produced by a playback device. Based on the fact that legitimate voice commands should only come from a human speaker, we then propose a novel defense strategy that uses the acoustic features of a speech signal to identify the sound source of the voice command and only accept the commands coming from a human. Compared to existing defense strategies, the proposed approach has the advantage that it minimally affects the usability of the VCS, while being robust to most types of attacks. Since identifying the sound source of voice commands in a far-field condition has barely been studied before, we first measure the practical attack ranges of modern VCS devices (i.e., Amazon Alexa and Google Home) and then use the results to construct a dataset consisting of both genuine and replayed voice command samples. We then use this dataset to develop a machine learning model that can distinguish the human speaker from the playback devices. Finally, our proof-of-concept experiments verify the feasibility of the proposed approach.

REFERENCES

[1] S. Chen, K. Ren, S. Piao, C. Wang, Q. Wang, J. Weng, L. Su, and A. Mohaisen, "You can hear but you cannot steal: Defending against voice impersonation attacks on smartphones," in 2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS). IEEE, 2017, pp. 183-195.
[2] W. Diao, X. Liu, Z. Zhou, and K. Zhang, "Your voice assistant is mine: How to abuse speakers to steal information and control your phone," in Proc. of the 4th ACM Workshop on Security and Privacy in Smartphones & Mobile Devices. ACM, 2014, pp. 63-74.
[3] Y. Jang, C. Song, S. P. Chung, T. Wang, and W. Lee, "A11y attacks: Exploiting accessibility in operating systems," in Proc. of the 2014 ACM SIGSAC Conference on Computer and Communications Security. ACM, 2014, pp. 103-115.
[4] G. Zhang, C. Yan, X. Ji et al., "DolphinAttack: Inaudible voice commands," in Proc. of the 2017 ACM SIGSAC Conference on Computer and Communications Security. ACM, 2017, pp. 103-117.
[5] C. Kasmi and J. L. Esteves, "IEMI threats for information security: Remote command injection on modern smartphones," IEEE Transactions on Electromagnetic Compatibility, vol. 57, no. 6, pp. 1752-1755, 2015.
[6] N. Carlini, P. Mishra, T. Vaidya, Y. Zhang, M. Sherr, C. Shields, D. Wagner, and W. Zhou, "Hidden voice commands," in USENIX Security Symposium, 2016, pp. 513-530.
[7] Y. Gong and C. Poellabauer, "Crafting adversarial examples for speech paralinguistics applications," arXiv preprint arXiv:1711.03280, 2017.
[8] N. Carlini and D. Wagner, "Audio adversarial examples: Targeted attacks on speech-to-text," arXiv preprint arXiv:1801.01944, 2018.
[9] Y. Gong and C. Poellabauer, "An overview of vulnerabilities of voice controlled systems," arXiv preprint arXiv:1803.09156, 2018.
[10] G. Petracca, Y. Sun, T. Jaeger, and A. Atamli, "AuDroid: Preventing attacks on audio channels in mobile devices," in Proc. of the 31st Annual Computer Security Applications Conference. ACM, 2015, pp. 181-190.
[11] X. Lei, G.-H. Tu, A. X. Liu, C.-Y. Li, and T. Xie, "The insecurity of home digital voice assistants: Amazon Alexa as a case study," arXiv preprint arXiv:1712.03327, 2017.
[12] H. Feng, K. Fawaz, and K. G. Shin, "Continuous authentication for voice assistants," arXiv preprint arXiv:1701.04507, 2017.
[13] E. Alepis and C. Patsakis, "Monkey says, monkey does: Security and privacy on voice assistants," IEEE Access, vol. 5, pp. 17841-17851, 2017.
[14] T. Vaidya, Y. Zhang, M. Sherr, and C. Shields, "Cocaine noodles: Exploiting the gap between human and machine speech recognition," presented at WOOT, vol. 15, pp. 10-11, 2015.
[15] M. Cisse, Y. Adi, N. Neverova, and J. Keshet, "Houdini: Fooling deep structured prediction models," arXiv preprint arXiv:1707.05373, 2017.
[16] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus, "Intriguing properties of neural networks," arXiv preprint arXiv:1312.6199, 2013.
[17] A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, and A. Y. Ng, "Deep Speech: Scaling up end-to-end speech recognition," arXiv preprint arXiv:1412.5567, 2014.
[18] M. Alzantot, B. Balaji, and M. Srivastava, "Did you hear that? Adversarial examples against automatic speech recognition," arXiv preprint arXiv:1801.00554, 2018.
[19] A. Bonellitoro and N. Cacavelos, "Human voice polar pattern measurements: Opera singers and speakers," 2015.
[20] M. Smiatacz, "Playback attack detection: The search for the ultimate set of antispoof features," in International Conference on Computer Recognition Systems. Springer, 2017, pp. 120-129.
[21] Z. Wu, N. Evans, T. Kinnunen, J. Yamagishi, F. Alegre, and H. Li, "Spoofing and countermeasures for speaker verification: A survey," Speech Communication, vol. 66, pp. 130-153, 2015.
[22] T. Kinnunen, M. Sahidullah, H. Delgado, M. Todisco, N. Evans, J. Yamagishi, and K. A. Lee, "The ASVspoof 2017 challenge: Assessing the limits of replay spoofing attack detection," 2017.
[23] D. Mukhopadhyay, M. Shirvanian, and N. Saxena, "All your voices are belong to us: Stealing voices to fool humans and machines," in European Symposium on Research in Computer Security. Springer, 2015, pp. 599-621.
[24] J. Villalba and E. Lleida, "Detecting replay attacks from far-field recordings on speaker verification systems," in European Workshop on Biometrics and Identity Management. Springer, 2011, pp. 274-285.
[25] J. Villalba and E. Lleida, "Preventing replay attacks on speaker verification systems," in 2011 IEEE International Carnahan Conference on Security Technology (ICCST). IEEE, 2011, pp. 1-8.
[26] Z.-F. Wang, G. Wei, and Q.-H. He, "Channel pattern noise based playback attack detection algorithm for speaker recognition," in 2011 International Conference on Machine Learning and Cybernetics (ICMLC), vol. 4. IEEE, 2011, pp. 1708-1713.
[27] W. Shang and M. Stevenson, "Score normalization in playback attack detection," in 2010 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2010, pp. 1678-1681.
[28] W. Shang and M. Stevenson, "A preliminary study of factors affecting the performance of a playback attack detector," in 2008 Canadian Conference on Electrical and Computer Engineering (CCECE). IEEE, 2008, pp. 459-464.
[29] W. Shang and M. Stevenson, "A playback attack detector for speaker verification systems," in 2008 3rd International Symposium on Communications, Control and Signal Processing (ISCCSP). IEEE, 2008, pp. 1144-1149.
[30] M. Todisco, H. Delgado, and N. Evans, "A new feature for automatic speaker verification anti-spoofing: Constant Q cepstral coefficients," in Speaker Odyssey Workshop, Bilbao, Spain, vol. 25, 2016, pp. 249-252.
[31] S. Shiota, F. Villavicencio, J. Yamagishi, N. Ono, I. Echizen, and T. Matsui, "Voice liveness detection for speaker verification based on a tandem single/double-channel pop noise detector," 2016.
[32] S. Shiota, F. Villavicencio, J. Yamagishi, N. Ono, I. Echizen, and T. Matsui, "Voice liveness detection algorithms based on pop noise caused by human breath for automatic speaker verification," in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
[33] P. Korshunov, A. R. Goncalves, R. P. Violato, F. O. Simões, and S. Marcel, "On the use of convolutional neural networks for speech presentation attack detection," in International Conference on Identity, Security and Behavior Analysis, no. EPFL-CONF-233573, 2018.
[34] L. Li, Y. Chen, D. Wang, and T. F. Zheng, "A study on replay attack and anti-spoofing for automatic speaker verification," arXiv preprint arXiv:1706.02101, 2017.
[35] M. Witkowski, S. Kacprzak, P. Zelasko, K. Kowalczyk, and J. Gałka, "Audio replay attack detection using high-frequency features," in Proc. Interspeech 2017, 2017, pp. 27-31.
[36] T. Kinnunen, M. Sahidullah, M. Falcone, L. Costantini, R. G. Hautamäki, D. Thomsen, A. Sarkar, Z.-H. Tan, H. Delgado, M. Todisco et al., "RedDots replayed: A new replay spoofing attack corpus for text-dependent speaker verification research," in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 5395-5399.
[37] H. Delgado, M. Todisco, M. Sahidullah, N. Evans, T. Kinnunen, K. A. Lee, and J. Yamagishi, "ASVspoof 2017 version 2.0: Meta-data analysis and baseline enhancements."
[38] P. Foster, S. Sigtia, S. Krstulovic, J. Barker, and M. D. Plumbley, "CHiME-Home: A dataset for sound source recognition in a domestic environment," in 2015 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). IEEE, 2015, pp. 1-5.
[39] G. Degottex, J. Kane, T. Drugman, T. Raitio, and S. Scherer, "COVAREP: A collaborative voice analysis repository for speech technologies," in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2014, pp. 960-964.
