Who is Real Bob? Adversarial Attacks on Speaker Recognition Systems

Guangke Chen∗†‡, Sen Chen§‖, Lingling Fan§, Xiaoning Du§, Zhe Zhao∗, Fu Song∗¶ and Yang Liu§
∗ShanghaiTech University, †Shanghai Institute of Microsystem and Information Technology, Chinese Academy of Sciences, ‡University of Chinese Academy of Sciences, §Nanyang Technological University, ¶Shanghai Engineering Research Center of Intelligent Vision and Imaging, ‖Co-first Author

Abstract—Speaker recognition (SR) is widely used in our daily life as a biometric authentication or identification mechanism. The popularity of SR brings in serious security concerns, as demonstrated by recent adversarial attacks. However, the impacts of such threats in the practical black-box setting are still open, since current attacks consider the white-box setting only.

In this paper, we conduct the first comprehensive and systematic study of adversarial attacks on SR systems (SRSs) to understand their security weakness in the practical black-box setting. For this purpose, we propose an adversarial attack, named FAKEBOB, to craft adversarial samples. Specifically, we formulate adversarial sample generation as an optimization problem, incorporating the confidence of adversarial samples and the maximal distortion to balance the strength and imperceptibility of adversarial voices. One key contribution is a novel algorithm to estimate the score threshold, a feature of SRSs, and to use it in solving the optimization problem. We demonstrate that FAKEBOB achieves a 99% targeted attack success rate on both open-source and commercial systems. We further demonstrate that FAKEBOB is also effective on both open-source and commercial systems when played over the air in the physical world. Moreover, we have conducted a human study which reveals that it is hard for humans to differentiate the speakers of the original and adversarial voices. Last but not least, we show that four promising defense methods for adversarial attacks from the speech recognition domain become ineffective on SRSs against FAKEBOB, which calls for more effective defense methods. We highlight that our study peeks into the security implications of adversarial attacks on SRSs and realistically fosters improvements to the security robustness of SRSs.

I. INTRODUCTION

Speaker recognition [1] is an automated technique to identify a person from utterances which contain audio characteristics of the speaker. Speaker recognition systems (SRSs) are ubiquitous in our daily life, ranging from biometric authentication [2] and forensic tests [3] to personalized services on smart devices [4]. Machine learning techniques are the mainstream method for implementing SRSs [5]; however, they are vulnerable to adversarial attacks (e.g., [6], [7], [8]). Hence, it is vital to understand the security implications of SRSs under adversarial attacks.

Though the success of adversarial attacks on image recognition systems has been ported to speech recognition systems in both the white-box setting (e.g., [9], [10]) and the black-box setting (e.g., [11], [12]), relatively little research has been done on SRSs. Essentially, the speech signal of an utterance consists of two major parts: the underlying text and the characteristics of the speaker.
To improve performance, speech recognition minimizes speaker-dependent variations to determine the underlying text or command, whereas speaker recognition treats the phonetic variations as extraneous noise to determine the source of the speech signal. Thus, adversarial attacks tailored to speech recognition systems may become ineffective on SRSs.

An adversarial attack on SRSs aims at crafting a sample from a voice uttered by some source speaker, so that it is misclassified as one of the enrolled speakers (untargeted attack) or as a target speaker (targeted attack) by the system under attack, but is still correctly recognized as the source speaker by ordinary users. Though current adversarial attacks on SRSs [13], [14] are promising, they suffer from the following three limitations: (1) They are limited to the white-box setting by assuming the adversary has access to the information of the target SRS; attacks in the more realistic black-box setting are still open. (2) They only consider either the close-set identification task [13], which always classifies an arbitrary voice as one of the enrolled speakers [15], or the speaker verification task [14], which checks whether an input voice is uttered by the unique enrolled speaker [16]. Attacks on the open-set identification task [17], which strictly subsumes both close-set identification and speaker verification, are still open. (3) They do not consider over-the-air attacks, hence it is unclear whether their attacks remain effective when played over the air in the physical world. Therefore, in this work, we investigate the adversarial attack on all three tasks of SRSs in the practical black-box setting, in an attempt to understand the security weakness of SRSs under adversarial attack in practice.

In this work, we focus on the black-box setting, which assumes that the adversary can obtain at most the decision result and the scores of the enrolled speakers for each input voice. Hence, attacks in the black-box setting are more practical yet more challenging than the existing white-box attacks [13], [14]. We emphasize that the scoring and decision-making mechanisms of SRSs differ among recognition tasks [18]. In particular, we consider 40 attack scenarios in total (as shown in Fig. 2), differing in attack type (targeted vs. untargeted), attack channel (API vs. over the air), genders of the source and target speakers, and SR task (cf. § II-B). We demonstrate our attack on 16 representative attack scenarios.

To launch such a practical attack, two technical challenges need to be addressed: (C1) crafting adversarial samples that are as imperceptible as possible in the black-box setting, and (C2) making the attack practical, namely, making adversarial samples effective on an unknown SRS, even when played over the air in the physical world. In this paper, we propose a practical black-box attack, named FAKEBOB, which is able to overcome these challenges. Specifically, we formulate adversarial sample generation as an optimization problem. The optimization objective is parameterized by a confidence parameter and the maximal distortion of the noise amplitude in the L∞ norm to balance the strength and imperceptibility of adversarial voices, instead of using a noise model [10], [19], [20], due to its device- and background-dependency. We also incorporate the score threshold, a key feature of SRSs, into the optimization problem.
To solve the optimization problem, we leverage an efficient gradient estimation algorithm, the natural evolution strategy (NES) [21]. However, even with the estimated gradients, none of the existing gradient-based white-box methods (e.g., [22], [23], [10], [24]) can be directly used to attack SRSs. This is due to the score threshold mechanism, where an attack fails if the predicted score is less than the threshold. To this end, we propose a novel algorithm to estimate the threshold, based on which we leverage the Basic Iterative Method (BIM) [23] with estimated gradients to solve the optimization problem.

We evaluate FAKEBOB for its attacking capabilities on 3 SRSs (i.e., ivector-PLDA [25], GMM-UBM [16] and xvector-PLDA [26]) in Kaldi [27], a popular open-source platform in the research community, and on 2 commercial systems (i.e., Talentedsoft [28] and Microsoft Azure [29]), which are proprietary without any publicly available information about their internal design and implementation, hence completely black-box. We evaluate FAKEBOB using 16 representative attack scenarios (out of 40) based on the following five aspects: (1) effectiveness/efficiency, (2) transferability, (3) practicability, (4) imperceptibility, and (5) robustness. The results show that FAKEBOB achieves a 99% targeted attack success rate (ASR) on all the tasks of the ivector-PLDA, GMM-UBM and xvector-PLDA systems, and a 100% ASR on the commercial system Talentedsoft within 2,500 queries on average (cf. § V-B). To demonstrate the transferability, we conduct a comprehensive evaluation of transferability attacks on the ivector-PLDA, GMM-UBM and xvector-PLDA systems under cross-architecture, cross-dataset, and cross-parameter circumstances, as well as on the commercial system Microsoft Azure. FAKEBOB is able to achieve a 34%-68% transferability (attack success) rate, except for the speaker verification task of Microsoft Azure. The transferability rate can be increased by crafting high-confidence adversarial samples at the cost of larger distortion. To further demonstrate the practicability and imperceptibility, we launch an over-the-air attack in the physical world and also conduct a human study on the Amazon Mechanical Turk platform [30]. The results indicate that FAKEBOB is effective when played over the air in the physical world against both the open-source systems and the open-set identification task of Microsoft Azure (cf. § V-D), and that it is hard for humans to differentiate the speakers of the original and adversarial voices (cf. § V-E). Finally, we study four defense methods that are reported promising in the speech recognition domain: audio squeezing [10], [31], local smoothing [31], quantization [31] and temporal dependency-based detection [31], due to the lack of domain-specific defense solutions for adversarial attacks on SRSs. The results demonstrate that these defense methods have limited effects on FAKEBOB, indicating that FAKEBOB is a practical and powerful adversarial attack on SRSs.

Our study reveals the security weakness of SRSs under black-box adversarial attacks, which could lead to serious security implications.
For instance, the adversary could launch an adversarial attack (e.g., FAKEBOB) to bypass biometric authentication in financial transactions [2], [32] and on smart devices [4], as well as high-security intelligent voice control systems [33], so that follow-up voice command attacks can be launched, e.g., CommanderSong [10] and hidden voice commands [34]. For voice-enabled cars using Dragon Drive [33], the attacker could bypass the voice biometrics using FAKEBOB so that command attacks can be launched to control the car. Even for commercial systems, such a practical black-box adversarial attack poses a significant threat, which calls for more robust SRSs. To shed further light, we discuss potential mitigations and further attacks to understand the arms race in this topic. In summary, our main contributions are:

• To our knowledge, this is the first study of targeted adversarial attacks on SRSs in the black-box setting. Our attack is launched not only by using gradient estimation based methods, but also by incorporating the score threshold into the adversarial sample generation. The proposed algorithm to estimate the score threshold is unique to SRSs.
• Our black-box attack addresses not only the speaker recognition tasks considered by existing white-box attacks, but also the more general task, open-set identification, which has not been considered by previous adversarial attacks.
• Our attack is demonstrated to be effective on the popular open-source systems and the commercial system Talentedsoft, and transferable and practical on the popular open-source systems and the open-set identification task of Microsoft Azure, even when played over the air in the physical world.
• Our attack is robust against four potential defense methods which are reported very promising in the speech recognition domain. Our study reveals the security implications of adversarial attacks on SRSs, which calls for more robust SRSs and more effective domain-specific defense methods.

For more information on FAKEBOB, please refer to our website [35], which includes voice samples and source code.

II. BACKGROUND

In this section, we introduce the preliminaries of speaker recognition systems (SRSs) and the threat model.

A. Speaker Recognition System (SRS)

Speaker recognition is an automated technique that allows machines to recognize a person's identity based on his/her utterances using the characteristics of the speaker. It has been studied actively for four decades [18], and is currently supported by a number of open-source platforms (e.g., Kaldi and MSR Identity [36]) and commercial solutions (e.g., Microsoft Azure, Amazon Alexa [37], Google Home [38], Talentedsoft, and SpeechPro VoiceKey [39]). In addition, NIST has actively organized the Speaker Recognition Evaluation [40] since 1996.

[Fig. 1: Overview of a typical SRS. In the offline phase, background voices pass through Feature Extraction into UBM Construction; in the enrollment phase, the enrolling speaker's voices feed Speaker Model Construction; in the recognition phase, an unknown speaker's voice is scored by the Scoring Module (S) and judged, against a threshold, by the Decision Module (D) to produce the result.]

Overview of SRSs. Fig. 1 shows an overview of a typical SRS, which includes five key modules: Feature Extraction, Universal Background Model (UBM) Construction, Speaker Model Construction, Scoring Module, and Decision Module.
The top part is the offline phase, while the lower two parts constitute the online phase, composed of the speaker enrollment and recognition phases. In the offline phase, a UBM is trained using the acoustic feature vectors extracted from the background voices (i.e., the voice training dataset) by the feature extraction module. The UBM, intended to model the average features of everyone in the dataset, is widely used in state-of-the-art SRSs to enhance robustness and improve efficiency [1]. In the speaker enrollment phase, a speaker model is built for each speaker using the UBM and the feature vectors of the enrolling speaker's voices. During the speaker recognition phase, given an input voice x, the scores S(x) of all the enrolled speakers are computed using the speaker models, and are emitted along with the decision D(x) as the recognition result.

The feature extraction module converts a raw speech signal into acoustic feature vectors carrying characteristics of the signal. Various acoustic feature extraction algorithms have been proposed, such as Mel-Frequency Cepstral Coefficients (MFCC) [41], Spectral Subband Centroid (SSC) [42] and Perceptual Linear Predictive (PLP) [43]. Among them, MFCC is the most popular one in practice [1], [18].

Speaker recognition tasks. There are three common recognition tasks of SRSs: open-set identification (OSI) [17], close-set identification (CSI) [15] and speaker verification (SV) [16]. An OSI system allows multiple speakers to be enrolled during the enrollment phase, forming a speaker group G. For an arbitrary input voice x, the system determines whether x is uttered by one of the enrolled speakers or by none of them, according to the scores of all the enrolled speakers and a preset (score) threshold θ. Formally, suppose the speaker group G has n speakers {1, 2, ..., n}; the decision module outputs D(x):

$$D(x) = \begin{cases} \operatorname{argmax}_{i \in G}\, [S(x)]_i, & \text{if } \max_{i \in G} [S(x)]_i \geq \theta; \\ \text{reject}, & \text{otherwise}, \end{cases}$$

where [S(x)]_i for i ∈ G denotes the score of the voice x for the speaker i. Intuitively, the system classifies the input voice x as the speaker i if and only if the score [S(x)]_i of the speaker i is the largest one among all the enrolled speakers and is not less than the threshold θ. If the largest score is less than θ, the system directly rejects the voice, i.e., it regards the voice as not uttered by any of the enrolled speakers.

CSI and SV systems accomplish similar tasks to the OSI system, but with some special settings. A CSI system never rejects any input voice, i.e., an input will always be classified as one of the enrolled speakers. An SV system has exactly one enrolled speaker and checks whether an input voice is uttered by that enrolled speaker, i.e., it either accepts or rejects.
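To make the three decision modules concrete, the following Python sketch (ours, purely illustrative) implements the rules above, assuming the enrolled speakers' scores S(x) have already been computed:

```python
import numpy as np

def osi_decide(scores, theta):
    """OSI: accept the top-scoring enrolled speaker only if its
    score reaches the preset threshold theta; otherwise reject."""
    i = int(np.argmax(scores))              # argmax_{i in G} [S(x)]_i
    return i if scores[i] >= theta else "reject"

def csi_decide(scores):
    """CSI: never rejects; always returns an enrolled speaker."""
    return int(np.argmax(scores))

def sv_decide(score, theta):
    """SV: a single enrolled speaker; either accept or reject."""
    return "accept" if score >= theta else "reject"
```

Under this reading, CSI is OSI without the threshold test, and SV is OSI with a singleton speaker group, which is exactly how the attacks in § IV are adapted across tasks.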
Text-dependency. SRSs can be either text-dependent, where cooperative speakers are required to utter one of several pre-defined sentences, or text-independent, where the speakers are allowed to speak anything. The former achieves high accuracy on short utterances, but always requires a large number of utterances repeating the same sentence, thus it is only used in the SV task. The latter may require longer utterances to achieve high accuracy, but in practice it is more versatile and can be used in all tasks (cf. [18]). Therefore, in this work, we mainly demonstrate our attack on text-independent SRSs.

SRS implementations. ivector-PLDA [25], [44] is a mainstream method for implementing SRSs in both academia [27], [45], [46] and industry [47], [48]. It achieves state-of-the-art performance on all the speaker recognition tasks [49], [50]. Another family is GMM-UBM based methods, which train a Gaussian mixture model (GMM) [16], [51] as the UBM. Basically, GMM-UBM tends to provide comparable (or higher) accuracy on short utterances [52]. Recently, deep neural networks (DNNs) have come into use in speech [53] and speaker recognition (e.g., xvector-PLDA [26]), where speech recognition aims at determining the underlying text or command of the speech signal. However, the major breakthroughs made by DNN-based methods reside in speech recognition; for speaker recognition, ivector based methods still exhibit state-of-the-art performance [5]. Moreover, DNN-based methods usually rely on a much larger amount of training data, which can greatly increase the computational complexity compared with ivector and GMM based methods [54], and thus are not suitable for offline enrollment on client-side devices. We denote by ivector, GMM, and xvector the ivector-PLDA, GMM-UBM, and xvector-PLDA systems, respectively.

B. Threat Model

We assume that the adversary intends to craft an adversarial sample from a voice uttered by some source speaker, so that it is classified as one of the enrolled speakers (untargeted attack) or as the target speaker (targeted attack) by the SRS under attack, but is still recognized as the source speaker by ordinary users. To deliberately attack the authentication of a target victim, we can compose adversarial voices which mimic the voiceprint of the victim from the perspective of the SRS. The adversary can thereby unlock smartphones [55], log into applications [56], and conduct illegal financial transactions [2]. Under an untargeted attack, we can manipulate voices to mimic the voiceprint of any one of the enrolled speakers. For example, we can bypass voice-based access control such as iFLYTEK [57], where multiple speakers are enrolled. After bypassing the authentication, follow-up hidden voice command attacks (e.g., [10], [34]) can be launched, e.g., on a smart car with Dragon Drive [33]. These attack scenarios are practically feasible, for example, when the victim is not within hearing distance of the adversarial voice, or when the attack voice does not raise the alertness of the victim due to the presence of other voice sources, either humans or loudspeakers.

[Fig. 2: Attack scenarios, spanning attack type (targeted vs. untargeted), gender (intra- vs. inter-gender), attack channel (API vs. over-the-air), target SRS task (OSI, CSI, SV∗) and output (decision + scores vs. decision-only), where ∗ means that targeted and untargeted attacks are the same on the SV task, as an SV system has only one enrolled speaker.]

This paper focuses on the practical black-box setting, where the adversary has access only to the recognition result (decision result and scores) of the target SRS for each test input, but not to the internal configurations or the training/enrollment voices. This black-box setting is feasible in practice, e.g., for the commercial systems Talentedsoft [28], iFLYTEK, SinoVoice [58] and SpeakIn [59]. If the scores are not accessible (e.g., the OSI task of the commercial system Microsoft Azure), we can leverage transferability attacks.
We assume the adversary has some voices of the target speakers with which to build a surrogate model; these voices need not be the enrollment voices. This is also feasible in practice, as one can possibly record speeches of the target speakers. To our knowledge, the targeted black-box setting renders all previous adversarial attacks impractical on SRSs. Indeed, all the adversarial attacks on SRSs are white-box [13], [14], except for the concurrent work [60], which performs only untargeted attacks.

Specifically, in our attack model, we consider five parameters: attack type (targeted vs. untargeted attack), genders of speakers (inter-gender vs. intra-gender), attack channel (API vs. over-the-air), speaker recognition task (OSI vs. CSI vs. SV) and output of the target SRS (decision and scores vs. decision-only), as shown in Fig. 2. Intra-gender (resp. inter-gender) means that the genders of the source and target speakers are the same (resp. different). An API attack assumes that the target SRS (e.g., Talentedsoft) provides an API interface to query, while over-the-air means that the attack is played over the air in the physical world. A decision-only attack means that the target SRS (e.g., Microsoft Azure) only outputs the decision result (i.e., the adversary can obtain the decision result D(x)), but not the scores of the enrolled speakers. Therefore, targeted, inter-gender, over-the-air and decision-only attacks are the most practical yet the most challenging ones.

In summary, by counting all the possible combinations of the parameters in Fig. 2, there are 48 = 2 × 2 × 2 × 3 × 2 attack scenarios. Since targeted and untargeted attacks are the same on the SV task, there are 40 = 48 − 2 × 2 × 2 attack scenarios. However, since demonstrating all 40 attack scenarios would require huge engineering effort, we design our experiments to cover 16 representative attack scenarios (cf. Appendix B).
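The scenario count can be checked mechanically; a small Python snippet (ours, purely illustrative) reproduces the arithmetic:

```python
from itertools import product

attack_types = ["targeted", "untargeted"]
genders      = ["intra-gender", "inter-gender"]
channels     = ["API", "over-the-air"]
tasks        = ["OSI", "CSI", "SV"]
outputs      = ["decision+scores", "decision-only"]

scenarios = set(product(attack_types, genders, channels, tasks, outputs))
# On SV there is a single enrolled speaker, so targeted and untargeted
# attacks coincide: drop the duplicated untargeted SV combinations (2*2*2 = 8).
scenarios -= set(product(["untargeted"], genders, channels, ["SV"], outputs))
print(len(scenarios))  # 48 - 8 = 40
```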
III. METHODOLOGY

In this section, we start with the motivations, then explain the design philosophy of our attack in the black-box setting and the possible defenses, and finally present an overview of our attack.

A. Motivation

The research in this work is motivated by the following questions: (Q1) How to launch an adversarial attack against all the tasks of SRSs in the practical black-box setting? (Q2) Is it feasible to craft robust adversarial voices that are transferable to an unknown SRS under cross-architecture, cross-dataset and cross-parameter circumstances, and to commercial systems, even when played over the air in the physical world? (Q3) Is it possible to craft human-imperceptible adversarial voices that are difficult, or even impossible, for ordinary users to notice? (Q4) If such an attack exists, can it be defended against?

B. Design Philosophy

To address Q1, we investigate existing methods for black-box attacks on image/speech recognition systems, i.e., surrogate models [61], gradient estimation [62], [21] and genetic algorithms [63], [64]. Surrogate model methods are proven to be outperformed by gradient estimation methods [62], hence are excluded. For the other two methods: it is known that natural evolution strategy (NES) based gradient estimation [21] requires far fewer queries than finite difference gradient estimation [62], and particle swarm optimization (PSO) is proven to be more computationally efficient than other genetic algorithms [63], [65]. To this end, we conduct a comparison experiment on an OSI system using NES as the black-box gradient estimation technique and PSO as the genetic algorithm. The result shows that the NES-based gradient estimation method clearly outperforms the PSO-based one (cf. Appendix A). Therefore, we exploit NES-based gradient estimation.

However, even with the estimated gradients, none of the existing gradient based white-box methods (e.g., [22], [23], [66], [67], [10], [20], [19], [24]) can be directly used to attack SRSs. This is due to the threshold θ, which is used in the OSI and SV tasks but not in image/speech recognition. As a result, these methods fail to mislead SRSs when the resulting score is less than θ. To solve this challenge, we incorporate the threshold θ into our adversarial sample generation and propose a novel algorithm to estimate θ in the black-box setting.

Theoretically, the adversarial samples crafted in the above way are effective if directly fed as input to the target SRS via an exposed API. However, to launch a practical attack as in Q2, adversarial samples should be played over the air in the physical world to interact with an SRS that may differ from the SRS on which the adversarial samples were crafted. To address Q2, we increase the strength of adversarial samples and the range of the noise amplitude, instead of using a noise model [10], [19], [20], due to its device- and background-dependency. We have demonstrated that our approach is effective in transferability attacks even when playing over the air in the physical world.

To address Q3, we should consider two aspects of human-imperceptibility. First, the adversarial samples should sound natural to ordinary users. Second, and more importantly, they should sound as if uttered by the same speaker as the original. As a first step towards addressing Q3, we add a constraint on the perturbations using the L∞ norm, which restricts the maximal distortion at each sample point of the audio signal. We also conduct a real human study to illustrate the imperceptibility of our adversarial samples.

To address Q4, we should launch attacks on SRSs equipped with defense methods. However, to our knowledge, no defense solution exists for adversarial attacks on SRSs. Therefore, we use four defense solutions for adversarial attacks on speech recognition systems: audio squeezing [10], [31], local smoothing [31], quantization [31] and temporal dependency detection [31], to defend against our attack.

C. Overview of Our Attack: FAKEBOB

[Fig. 3: Overview of our attack FAKEBOB. The input voice, attack type, strength (κ) and maximal distortion (ε) parameterize an optimization problem with task-specific loss functions; threshold estimation, gradient estimation (NES) and the Basic Iterative Method (BIM) solve it against the target SRS (OSI, CSI or SV), yielding the adversarial voice.]

According to our design philosophy, in this section we present an overview (shown in Fig. 3) of our attack, named FAKEBOB, addressing the two technical challenges (C1) and (C2) mentioned in § I. To address C1, we formulate adversarial sample generation as an optimization problem (cf. § IV-A), for which specific loss functions are defined for the different attack types (i.e., targeted and untargeted) and tasks (i.e., OSI, CSI and SV) of SRSs (cf. § IV-B, § IV-C and § IV-D).
To solve the optimization problem, we propose an approach that leverages a novel algorithm to estimate the threshold, NES to estimate gradients, and the BIM method with the estimated gradients. C2 is addressed by incorporating the maximal distortion (L∞ norm) of the noise amplitude and the strength of adversarial samples into the optimization problem (cf. § IV-A, § IV-B, § IV-C and § IV-D).

IV. OUR ATTACK: FAKEBOB

In this section, we elaborate on the techniques behind FAKEBOB, including the problem formulation and the attacks on OSI, CSI, and SV systems.

A. Problem Formulation

Given an original voice x uttered by some source speaker, the adversary aims at crafting an adversarial voice x̂ = x + δ by finding a perturbation δ such that (1) x̂ is a valid voice [68], (2) δ is as human-imperceptible as possible, and (3) the SRS under attack classifies the voice x̂ as one of the enrolled speakers or as the target speaker.

To guarantee that the adversarial voice x̂ is a valid voice, which depends on the audio file format (e.g., WAV, MP3 and AAC), our attack FAKEBOB first normalizes the amplitude value x(i) of a voice x at each sample point i into the range [−1, 1], then crafts the perturbation δ so that −1 ≤ x̂(i) = x(i) + δ(i) ≤ 1, and finally transforms x̂ back into the audio file format, which is fed to the target SRS. Hereafter, we assume that the range of amplitude values is [−1, 1].

To be as human-imperceptible as possible, our attack FAKEBOB adopts the L∞ norm to measure the similarity between the original and adversarial voices and ensures that the L∞ distance $\|\hat{x}, x\|_\infty := \max_i |\hat{x}(i) - x(i)|$ is less than a given maximal amplitude threshold ε of the perturbation, where i ranges over the sample points of the audio waveform.

To successfully fool the target SRS, we formalize the problem of finding an adversarial voice x̂ for a voice x as the following constrained minimization problem:

$$\operatorname{argmin}_{\delta} f(x + \delta) \quad \text{such that} \quad \|x + \delta, x\|_\infty < \varepsilon \ \text{ and } \ x + \delta \in [-1, 1]^n \tag{1}$$

where f is a loss function. When f is minimized, x + δ is recognized as the target speaker (targeted attack) or as one of the enrolled speakers (untargeted attack). Our formulation is designed to quickly minimize the loss function rather than to minimize the perturbation δ, as done in [22], [23]. Some studies, e.g., [24], [7], formulate the problem to minimize both the loss function and the perturbation.

It remains to define the loss function and an algorithm to solve the optimization problem. In the rest of this section, we mainly address them on the OSI system, then adapt the solution to the CSI and SV systems.
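The two constraints in Eq. (1) amount to a per-sample box; a minimal Python sketch (ours, with illustrative function names):

```python
import numpy as np

def linf_distance(x_adv, x):
    """L-infinity distance: the largest per-sample distortion."""
    return float(np.max(np.abs(x_adv - x)))

def project(x_adv, x, eps):
    """Force a candidate back into the feasible set of Eq. (1):
    within eps of the original x at every sample point, and a
    valid voice with amplitudes in [-1, 1]."""
    return np.clip(np.clip(x_adv, x - eps, x + eps), -1.0, 1.0)
```

The inner clip enforces the ε-ball around the original voice; the outer clip keeps the result a valid voice. This is the same per-sample clipping that BIM uses in § IV-B below.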
B. Attack on OSI Systems

[Fig. 4: Attack on OSI systems. A perturbation is added to an imposter's original voice ("I am Bob. Open the door please!"), which the OSI system would reject, so that the adversarial voice is classified as the enrolled target speaker t.]

As shown in Fig. 4, to attack an OSI system, we want to craft an adversarial voice x̂ starting from a voice x uttered by some source speaker (i.e., D(x) = reject) such that the voice x̂ is classified as the target speaker t ∈ G = {1, ..., n} by the SRS, i.e., D(x̂) = t. We first present the loss function f and then show how to solve the minimization problem.

Loss function f. To launch a successful targeted attack on an OSI system, the following two conditions need to be satisfied simultaneously: the score [S(x)]_t of the target speaker t should be (1) the maximal one among all the enrolled speakers, and (2) not less than the preset threshold θ. Therefore, the loss function f for the target speaker t is defined as follows:

$$f(x) = \max\Big( \big(\max\{\theta,\ \max_{i \in G \setminus \{t\}} [S(x)]_i\} - [S(x)]_t\big),\ -\kappa \Big) \tag{2}$$

where the parameter κ, inspired by [24], controls the strength of adversarial voices: the larger κ is, the more confidently the adversarial voice is recognized as the target speaker t by the SRS. This is validated in § V-C. Our loss function is similar to the one defined in [24], but we additionally incorporate the threshold θ. Considering κ = 0: when $(\max\{\theta, \max_{i \in G \setminus \{t\}} [S(x)]_i\} - [S(x)]_t)$ is minimized, the score [S(x)]_t of the target speaker t is maximized until it exceeds the threshold θ and the scores of all the other enrolled speakers; hence, the system recognizes the voice x as the speaker t. When κ > 0, instead of looking for a voice that just barely changes the recognition result to the speaker t, we require the score [S(x)]_t of the speaker t to be much larger than those of all the other enrolled speakers and the threshold θ.

To launch an untargeted attack, the loss function f is revised as follows:

$$f(x) = \max\{(\theta - \max_{i \in G} [S(x)]_i),\ -\kappa\}. \tag{3}$$

Intuitively, we want to find a perturbation δ such that the largest score of x is at least κ greater than the threshold θ.

Solving the optimization problem. To solve the optimization problem in Eq. (1), we use NES as a gradient estimation technique and employ the BIM method with the estimated gradients to craft adversarial examples. Specifically, the BIM method starts by setting x̂_0 = x and then, on the i-th iteration, computes

$$\hat{x}_i = \text{clip}_{x,\varepsilon}\{\hat{x}_{i-1} - \eta \cdot \text{sign}(\nabla_x f(\hat{x}_{i-1}))\}$$

where η is a hyper-parameter indicating the learning rate, and the function clip_{x,ε}(x̂), inspired by [23], performs per-sample clipping of the voice x̂, so that the result lies in the L∞ ε-neighbourhood of the source voice x and remains a valid voice after being transformed back into the audio file format. Formally,

$$\text{clip}_{x,\varepsilon}(\hat{x}) = \max\{\min\{\hat{x},\ 1,\ x + \varepsilon\},\ -1,\ x - \varepsilon\}.$$

We compute the gradient ∇_x f(x̂_{i−1}) by leveraging NES, which depends only on the recognition result. In detail, on the i-th iteration, we first create m (which must be even) Gaussian noises (u_1, ..., u_m) and add them to x̂_{i−1}, leading to m new voices x̂¹_{i−1}, ..., x̂ᵐ_{i−1}, where x̂ʲ_{i−1} = x̂_{i−1} + σ × u_j and σ is the search variance of NES. Note that u_j = −u_{m+1−j} for j = 1, ..., m/2. Then, we compute the loss values f(x̂¹_{i−1}), ..., f(x̂ᵐ_{i−1}) by querying the target system (m queries). Next, the gradient is approximated by

$$\nabla_x f(\hat{x}_{i-1}) \approx \frac{1}{m \times \sigma} \sum_{j=1}^{m} f(\hat{x}^j_{i-1}) \times u_j.$$

In our experiments, m = 50 and σ = 10⁻³. Finally, we compute sign(∇_x f(x̂_{i−1})), a vector over the domain {−1, 0, 1}, by applying the element-wise sign operation to the estimated gradient vector.

However, the BIM method with the estimated gradients alone is not sufficient to construct adversarial samples in the black-box setting, because the adversary has no access to the threshold θ used in the loss function f.
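Putting the pieces together, here is a compact Python sketch (ours, under the stated hyper-parameters m = 50 and σ = 10⁻³; `scores_fn` is an assumed black-box query returning the enrolled speakers' scores) of the OSI losses in Eqs. (2)-(3), the NES estimator, and the BIM update:

```python
import numpy as np

def loss_targeted(scores, t, theta, kappa=0.0):
    """Eq. (2): targeted OSI loss for target speaker t."""
    others = np.delete(scores, t)
    return max(max(theta, float(others.max())) - float(scores[t]), -kappa)

def loss_untargeted(scores, theta, kappa=0.0):
    """Eq. (3): untargeted OSI loss."""
    return max(theta - float(scores.max()), -kappa)

def nes_gradient(f, x, m=50, sigma=1e-3):
    """NES gradient estimate of the black-box loss f at voice x,
    using m antithetic Gaussian samples (m queries to the SRS)."""
    assert m % 2 == 0
    u = np.random.randn(m // 2, x.size)
    u = np.concatenate([u, -u])                  # u_j = -u_{m+1-j}
    losses = np.array([f(x + sigma * u_j) for u_j in u])
    return (losses[:, None] * u).sum(axis=0) / (m * sigma)

def bim_attack(f, x, eps, eta=1e-3, max_iter=1000):
    """BIM with estimated gradients and per-sample clipping."""
    x_adv = x.copy()
    for _ in range(max_iter):
        if f(x_adv) < 0:        # decision already flipped when kappa = 0;
            break               # keep iterating toward -kappa if kappa > 0
        g = nes_gradient(f, x_adv)
        x_adv = np.clip(np.clip(x_adv - eta * np.sign(g), x - eps, x + eps),
                        -1.0, 1.0)               # clip_{x,eps}
    return x_adv

# Usage (targeted attack on an OSI system with black-box scores_fn):
# f = lambda v: loss_targeted(scores_fn(v), t, theta_hat, kappa=0.0)
# x_adv = bim_attack(f, x, eps=0.002)
```

Note that the threshold fed to the loss must be the estimate produced by Algorithm 1 below, since the true θ is not observable in the black-box setting.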
To solve this problem, we present a novel algorithm for estimating θ.

Estimating the threshold θ. The main technical challenge in estimating the threshold θ is that the estimated threshold θ̂ should be no less than θ in order to launch a successful attack, but should not exceed θ by too much, otherwise the attack cost may become too expensive. Therefore, the goal is to compute a small θ̂ such that θ̂ ≥ θ. To achieve this goal, we propose a novel approach, shown in Algorithm 1.

Algorithm 1: Threshold Estimation Algorithm
Input: the target OSI system with scoring module S and decision module D; an arbitrary voice x such that D(x) = reject
Output: estimated threshold θ̂
1:  θ̂ ← max_{i∈G} [S(x)]_i                          ▷ initial threshold
2:  ∆ ← |θ̂/10|                                      ▷ the search step
3:  x̂ ← x
4:  while True do
5:      θ̂ ← θ̂ + ∆
6:      f′ ← λx. max{θ̂ − max_{i∈G} [S(x)]_i, −κ}    ▷ loss function
7:      while True do
8:          x̂ ← clip_{x,ε}{x̂ − η · sign(∇_x f′(x̂))}  ▷ craft sample using f′
9:          if D(x̂) ≠ reject then                    ▷ max_{i∈G} [S(x̂)]_i ≥ θ
10:             return max_{i∈G} [S(x̂)]_i
11:         if max_{i∈G} [S(x̂)]_i ≥ θ̂ then break

Given an OSI system with the scoring module S and decision module D, and an arbitrary voice x such that D(x) = reject (i.e., x is uttered by an imposter), Algorithm 1 outputs θ̂ such that θ̂ ≥ θ. In detail, Algorithm 1 first computes the maximal score θ̂ = max_{i∈G} [S(x)]_i of the voice x by querying the system (line 1). Since D(x) = reject, we know that θ̂ < θ. At line 2, we initialize the search step ∆ = |θ̂/10|, which is used to refine the estimate θ̂; |θ̂/10| is chosen as a trade-off between the precision of θ̂ and the efficiency of the algorithm. The outer while-loop (lines 4-11) iteratively computes a new candidate θ̂ by adding ∆ to it (line 5) and constructs the function f′ = λx. max{θ̂ − max_{i∈G} [S(x)]_i, −κ} (line 6). f′ is in fact the loss function for the untargeted attack in Eq. (3), with θ replaced by the candidate θ̂; it is used to craft samples in the inner while-loop (lines 7-11). For each candidate θ̂, the inner while-loop iteratively crafts samples x̂ by querying the target system until the target system recognizes x̂ as some enrolled speaker (line 9) or the maximal score of x̂ is no less than θ̂ (line 11). If x̂ is recognized as some enrolled speaker (line 9), then Algorithm 1 terminates and returns the maximal score of x̂ (line 10): since max_{i∈G} [S(x̂)]_i ≥ θ, this is the desired threshold. If the maximal score of x̂ is no less than θ̂ but x̂ is still rejected (line 11), we restart the outer while-loop with a larger candidate. One may notice that Algorithm 1 does not terminate if D(x̂) always equals reject; in our experiments, this never happens (cf. § V). Furthermore, the algorithm estimates a value very close to the actual threshold. Remark that the actual threshold θ, obtained from the open-source SRS, is used only to evaluate the performance of Algorithm 1.
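A direct Python transcription of Algorithm 1 (ours; it reuses `nes_gradient` from the sketch above, and `score_fn`/`decide_fn` stand in for the black-box scoring and decision interfaces):

```python
import numpy as np

def estimate_threshold(score_fn, decide_fn, x, eps, kappa=0.0, eta=1e-3):
    """Algorithm 1: given a voice x with decide_fn(x) == 'reject',
    return an estimate no less than the real threshold theta."""
    theta_hat = float(np.max(score_fn(x)))     # line 1: initial estimate
    delta = abs(theta_hat / 10.0)              # line 2: search step
    x_adv = x.copy()                           # line 3
    while True:                                # lines 4-11: raise candidate
        theta_hat += delta                     # line 5
        f = lambda v: max(theta_hat - float(np.max(score_fn(v))), -kappa)
        while True:                            # lines 7-11: chase candidate
            g = nes_gradient(f, x_adv)
            x_adv = np.clip(np.clip(x_adv - eta * np.sign(g),
                                    x - eps, x + eps), -1.0, 1.0)  # line 8
            top = float(np.max(score_fn(x_adv)))
            if decide_fn(x_adv) != "reject":   # line 9: crossed the real theta
                return top                     # line 10: smallest accepted score
            if top >= theta_hat:               # line 11: candidate reached but
                break                          # still rejected, so raise it
```

As reported in § V-B (Table IV), this estimate lands very close to the actual threshold.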
C. Attack on CSI Systems

A CSI system always classifies an input voice as one of the enrolled speakers. Therefore, we can adapt the attack on OSI systems by ignoring the threshold θ. Specifically, the loss function for a targeted attack on CSI systems with target speaker t ∈ G is defined as:

$$f(x) = \max\Big( \big(\max_{i \in G \setminus \{t\}} [S(x)]_i - [S(x)]_t\big),\ -\kappa \Big).$$

Intuitively, we want to find some small perturbation δ such that the score of the speaker t is the largest one among all the enrolled speakers, and [S(x)]_t is at least κ greater than the second-largest score. Similarly, the loss function for an untargeted attack on CSI systems is defined as:

$$f(x) = \max\{([S(x)]_m - \max_{i \in G \setminus \{m\}} [S(x)]_i),\ -\kappa\}$$

where m denotes the true speaker of the original voice. Intuitively, we want to find some small perturbation δ such that the largest score among the other enrolled speakers is at least κ greater than the score of the speaker m.

D. Attack on SV Systems

An SV system has exactly one enrolled speaker and checks whether an input voice is uttered by that speaker. Thus, we can adapt the attack on OSI systems by assuming the speaker group G is a singleton set. Specifically, the loss function for attacking SV systems is defined as:

$$f(x) = \max\{\theta - S(x),\ -\kappa\}.$$

Intuitively, we want to find a small perturbation δ such that the score of x being recognized as the enrolled speaker is at least κ greater than the threshold θ. We remark that the threshold estimation algorithm for SV systems should be revised by replacing the loss function f′ at line 6 of Algorithm 1 with the following function: f′ = λx. max{θ̂ − S(x), −κ}.
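For completeness, the CSI and SV losses in the same sketch style as before (ours, purely illustrative):

```python
import numpy as np

def loss_csi_targeted(scores, t, kappa=0.0):
    """CSI targeted loss: drive speaker t's score kappa above the rest."""
    others = np.delete(scores, t)
    return max(float(others.max()) - float(scores[t]), -kappa)

def loss_csi_untargeted(scores, m, kappa=0.0):
    """CSI untargeted loss: drive some other speaker kappa above m."""
    others = np.delete(scores, m)
    return max(float(scores[m]) - float(others.max()), -kappa)

def loss_sv(score, theta, kappa=0.0):
    """SV loss: drive the single enrolled score kappa above theta."""
    return max(theta - float(score), -kappa)
```

All three plug into the same NES/BIM loop from § IV-B; only the loss (and, for SV, the threshold estimate) changes.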
V. ATTACK EVALUATION

We evaluate FAKEBOB for its attacking capabilities based on the following five aspects: effectiveness/efficiency, transferability, practicability, imperceptibility, and robustness.

A. Dataset and Experiment Design

Dataset. We mainly use three widely used datasets: VoxCeleb1, VoxCeleb2, and LibriSpeech (cf. Table I). To demonstrate our attack, we target the ivector and GMM systems from the popular open-source platform Kaldi, which has 7,631 stars and 3,418 forks on GitHub [27]. The UBM model is trained using the Train-1 Set as the background voices. The OSI and CSI systems are enrolled with the 5 speakers from the Test Speaker Set, forming a speaker group. Each SV system is enrolled with one of the 5 speakers from the Test Speaker Set, resulting in 5 ivector and 5 GMM SV systems.

TABLE I: Datasets for the experiments
| Dataset | #Speakers | Details |
|---|---|---|
| Train-1 Set | 7,273 | Part of VoxCeleb1 [69] and the whole of VoxCeleb2 [70]; used for training ivector and GMM |
| Train-2 Set | 2,411 | Part of LibriSpeech [71]; used for training system C in the transferability experiments |
| Test Speaker Set | 5 | 5 speakers from LibriSpeech: 3 female and 2 male, 5 voices per speaker, 3-4 seconds per voice |
| Imposter Speaker Set | 4 | Another 4 speakers from LibriSpeech: 2 female and 2 male, 5 voices per speaker, 2-14 seconds per voice |

We conducted the experiments on a server running Ubuntu 16.04 with an Intel Xeon E5-2697 v2 2.70GHz CPU (10 cores) and 377GB of RAM. We set κ = 0, the maximal number of iterations to 1,000, the max/min learning rate η to 1e-3/1e-6, the search variance σ in NES to 1e-3, and the samples per draw m in NES to 50, unless explicitly stated otherwise.

Evaluation metrics. To evaluate our attack, we use the metrics shown in Table II. The signal-to-noise ratio (SNR) is widely used to quantify the level of signal power relative to noise power, so we use it to measure the distortion of the adversarial voices [10]. We compute $\text{SNR(dB)} = 10 \log_{10}(P_x / P_\delta)$, where P_x is the signal power of the original voice x and P_δ is the power of the perturbation δ. A larger SNR value indicates a (relatively) smaller perturbation. To evaluate efficiency, we use two metrics: the number of iterations and the execution time. (Note that the number of queries is the number of iterations multiplied by the samples per draw m in NES, and m = 50 in this work.)

TABLE II: Metrics used in this work
| Metric | Description |
|---|---|
| Attack success rate (ASR) | Proportion of adversarial voices recognized as the target speaker |
| Untargeted success rate (UTR) for CSI | Proportion of adversarial samples not recognized as the source speaker |
| Untargeted success rate (UTR) for OSI | Proportion of adversarial samples not rejected by the target system |
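The SNR metric above is computed directly from the waveforms; a minimal sketch (ours, taking power as the mean squared amplitude):

```python
import numpy as np

def snr_db(x, x_adv):
    """SNR(dB) = 10 * log10(P_x / P_delta): power of the original
    signal over the power of the perturbation delta = x_adv - x."""
    p_x = float(np.mean(x ** 2))
    p_delta = float(np.mean((x_adv - x) ** 2))
    return 10.0 * np.log10(p_x / p_delta)
```

By this formula, an SNR of 31.5 dB means the perturbation power is roughly 1,400 times smaller than the signal power.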
Experiment design. We design five experiments. (1) We evaluate the effectiveness and efficiency on both the open-source systems (i.e., ivector, GMM, and xvector) and the commercial system Talentedsoft; we also evaluate FAKEBOB under intra-gender and inter-gender scenarios, as inter-gender attacks are usually more difficult. (2) We evaluate the transferability by attacking open-source systems with different architectures, training datasets, and parameters, as well as the commercial system Microsoft Azure. (3) We further evaluate the practicability by playing the adversarial voices over the air in the physical world. (4) For human-imperceptibility, we conduct a real human study through the Amazon Mechanical Turk platform (MTurk) [30], a crowdsourcing marketplace for human intelligence. (5) We finally evaluate four defense methods (local smoothing, quantization, audio squeezing, and temporal dependency-based detection) against FAKEBOB.

Recall that we demonstrate our attack on 16 representative attack scenarios out of 40 (cf. § II-B). In particular, we mainly consider targeted attacks, which are much more powerful and challenging than untargeted attacks [9]. Our experiments suffice to understand the other four parameters of the attack model, i.e., inter-gender vs. intra-gender, API vs. over-the-air, OSI vs. CSI vs. SV, and decision-and-scores vs. decision-only. The OSI task can be seen as a combination of the CSI and SV tasks (cf. § II); thus, due to space limitations, we sometimes only report and analyze the results on the OSI task, which is much more challenging and representative than the other two. The missing results can be found in the Appendix.

B. Effectiveness and Efficiency

Target model training. To evaluate the effectiveness and efficiency, we train ivector and GMM systems for the OSI, CSI and SV tasks. The performance of these systems is shown in Table III, where accuracy is defined as usual, the False Acceptance Rate (FAR) is the proportion of voices that are uttered by imposters but accepted by the system [18], the False Rejection Rate (FRR) is the proportion of voices that are uttered by an enrolled speaker but rejected by the system [18], and the Open-set Identification Error Rate (OSIER) is the rate of voices that cannot be correctly classified [17].

TABLE III: Six trained SRSs
| Task | Metric | ivector | GMM |
|---|---|---|---|
| CSI | Accuracy | 99.6% | 99.3% |
| SV | FRR | 1.0% | 5.0% |
| SV | FAR | 11.0% | 10.4% |
| OSI | FRR | 1.0% | 4.2% |
| OSI | FAR | 7.9% | 11.2% |
| OSI | OSIER | 0.2% | 2.8% |

Notice that the threshold θ is 1.45 for ivector and 0.091 for GMM, so that the FAR is close to 10%. Although the parameter θ in the SV and OSI tasks can be tuned using the Equal Error Rate (i.e., FAR equal to FRR), we found that the results for the SV and OSI tasks do not vary much (cf. Table XVII in the Appendix).

Setting. The parameter ε is one of the most critical parameters of our attack. To fine-tune ε, we study the ASR, efficiency and distortion by varying ε from 0.05, 0.01, 0.005, 0.004, 0.003, 0.002, to 0.001 on ivector and GMM for the CSI task. The results are given in Appendix C. With decreasing ε, both the attack cost and the SNR increase, while the ASR decreases. As a trade-off between ASR, efficiency, and distortion, we set ε = 0.002 in this experiment. The target speakers are the speakers from the Test Speaker Set (cf. Table I); the source speakers are the speakers from the Test Speaker Set for CSI, and from the Imposter Speaker Set (cf. Table I) for SV and OSI. We craft 100 adversarial samples using FAKEBOB for each task, of which 40 are intra-gender and 60 are inter-gender for CSI, and 50 are intra-gender and 50 are inter-gender for SV and OSI. Note that, to diversify the experiments, the source speakers of CSI and SV/OSI are designated to be different.

Results. The results are shown in Table V. Since the OSI task is more challenging and representative than the other two, we only analyze the results of the OSI task here.

TABLE V: Experimental results of FAKEBOB with ε = 0.002 (#Iter = number of iterations)
| Task | Setting | System | #Iter | Time (s) | SNR (dB) | ASR (%) |
|---|---|---|---|---|---|---|
| CSI | All | ivector | 124 | 2845 | 30.2 | 99.0 |
| CSI | All | GMM | 40 | 218 | 29.3 | 99.0 |
| CSI | Intra-gender | ivector | 92 | 2115 | 29.3 | 100.0 |
| CSI | Intra-gender | GMM | 25 | 126 | 28.8 | 100.0 |
| CSI | Inter-gender | ivector | 146 | 3340 | 30.8 | 98.0 |
| CSI | Inter-gender | GMM | 50 | 278 | 29.62 | 98.0 |
| SV | All | ivector | 84 | 2014 | 31.6 | 99.0 |
| SV | All | GMM | 39 | 241 | 31.4 | 99.0 |
| SV | Intra-gender | ivector | 31 | 751 | 31.7 | 98.0 |
| SV | Intra-gender | GMM | 30 | 185 | 31.7 | 100.0 |
| SV | Inter-gender | ivector | 135 | 3252 | 31.6 | 100.0 |
| SV | Inter-gender | GMM | 48 | 298 | 31.2 | 98.0 |
| OSI | All | ivector | 86 | 2277 | 31.5 | 99.0 |
| OSI | All | GMM | 38 | 226 | 31.4 | 99.0 |
| OSI | Intra-gender | ivector | 32 | 833 | 31.3 | 98.0 |
| OSI | Intra-gender | GMM | 31 | 178 | 31.5 | 100.0 |
| OSI | Inter-gender | ivector | 140 | 3692 | 31.6 | 100.0 |
| OSI | Inter-gender | GMM | 45 | 274 | 31.2 | 98.0 |

We can observe that FAKEBOB achieves 99.0% ASR for both ivector and GMM. In terms of SNR, the average SNR value is 31.5 dB for ivector and 31.4 dB for GMM, indicating that the perturbation is less than 0.071% and 0.072%, respectively. Furthermore, the average number of iterations and execution time are 86 and 38.0 minutes on ivector, and 38 and 3.8 minutes on GMM, much smaller than on ivector. Due to space limitations, the results of attacking xvector are given in Appendix D, where we observe similar results. These results demonstrate the effectiveness and efficiency of FAKEBOB. We can also observe that inter-gender attacks are much more difficult (more iterations and longer execution time) than intra-gender attacks, due to the difference between male and female voices. Moreover, the ASR of inter-gender attacks is also lower than that of intra-gender attacks. This result unveils that once the gender of the target speaker is known to the attacker, it is much easier to launch an intra-gender attack.

For the evaluation of the threshold estimation algorithm, we report the estimated threshold θ̂ in Table IV for 5 different settings of the threshold. The estimation error is less than 0.03 for ivector and less than 0.003 for GMM. This shows that our algorithm is able to effectively estimate the threshold in less than 13.4 minutes. Note that our attack is black-box; the actual thresholds are accessed only for evaluation.

TABLE IV: Results of threshold estimation
| ivector θ | ivector θ̂ | Time (s) | GMM θ | GMM θ̂ | Time (s) |
|---|---|---|---|---|---|
| 1.45 | 1.47 | 628 | 0.091 | 0.0936 | 157 |
| 1.57 | 1.60 | 671 | 0.094 | 0.0957 | 260 |
| 1.62 | 1.64 | 686 | 0.106 | 0.1072 | 269 |
| 1.73 | 1.75 | 750 | 0.113 | 0.1141 | 289 |
| 1.84 | 1.87 | 804 | 0.119 | 0.1193 | 314 |

Attacking the commercial system Talentedsoft [28]. We also evaluate the effectiveness and efficiency of FAKEBOB on Talentedsoft, developed by the constitutor of the voiceprint recognition industry standard of the Ministry of Public Security (China). We query this online platform via HTTP POST (seen as the exposed API). Since Talentedsoft targets Chinese Mandarin, to test it fairly, we use the Chinese Mandarin voice database aishell-1 [72]. Both the FAR and FRR of Talentedsoft are 0.15%, tested using 20 speakers and 7,176 voices in total, randomly chosen from aishell-1.
We enroll 5 randomly chosen speakers from aishell-1 as target speakers, resulting in 5 SV systems. Each of them is attacked using another 20 randomly chosen speakers, with one randomly chosen voice per speaker. Our attack achieves 100% ASR within 50 iterations (i.e., 2,500 queries) on average. Remark that FAKEBOB is an iteration-based method; we can always set some time interval between iterations or queries so that this number of queries does not place a heavy traffic burden on the server, hence our attack is feasible. This demonstrates the effectiveness and efficiency of FAKEBOB on commercial systems that are completely black-box.
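The remark about spacing out queries can be implemented with a trivial wrapper (ours, purely illustrative; the delay value is a deployment choice):

```python
import time

def throttled(query_fn, delay_s=0.5):
    """Wrap a black-box SRS query with a fixed delay so that the
    roughly 2,500 queries per adversarial voice arrive slowly and
    do not place a heavy traffic burden on the server."""
    def wrapped(x):
        time.sleep(delay_s)
        return query_fn(x)
    return wrapped
```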
C. Transferability

Transferability [7] is the property that some adversarial samples produced to mislead one model (the source system) can also mislead other models (the target systems), even if their architectures, training datasets, or parameters differ.

Setting. To evaluate transferability, we regard the previously built GMM (A) and ivector (B) systems as source systems and build another 8 target systems (denoted by C, ..., J). C, ..., I are ivector systems differing in a key parameter or the training dataset, and J is the xvector system. For details and the performance of these systems, refer to Tables XIV and XV in the Appendix. We denote by X → Y the transferability attack where X is the source system and Y is the target system.

[Fig. 6: Distribution of the transferability attacks over the cross-architecture (e.g., A → B, B → A, A → C), cross-dataset (e.g., A → C, B → C) and cross-parameter (e.g., B → D through B → I) scenarios.]

The distribution of the transferability attacks is shown in Fig. 6 in terms of architecture, training dataset, and key parameters; some attacks belong to multiple scenarios. We set ε = 0.05 and (1) κ = 0.2 (GMM) and κ = 10 (ivector) for the CSI task, (2) κ = 3 (GMM) and κ = 4 (ivector) for the SV task, and (3) κ = 3 (GMM) and κ = 5 (ivector) for the OSI task. Remark that κ differs across architectures and tasks due to their different scoring mechanisms. We fine-tuned the parameter κ for ASR under the max iteration bound of 1,000.

Results. The results of attacking OSI systems are shown in Table VI. All the attacks (except for B → A) achieve 34%-68% ASR and 40%-100% UTR. For B → D, B → E, B → F, B → G, and B → H (all ivector, but differing in one key parameter), FAKEBOB achieves 100% ASR and UTR, indicating that crossing architectures reduces the transferability rate. From A → B and A → C (where A is GMM, and B and C are ivector but differ in training data), crossing datasets also reduces the transferability rate.

TABLE VI: Transferability rate (%) for the OSI task, where S and T denote the source and target systems, with each cell giving ASR/UTR.
| S \ T | A | B | C | D | E | F | G | H | I | J |
|---|---|---|---|---|---|---|---|---|---|---|
| A | — | 62.0/64.0 | 48.0/48.0 | 55.2/56.9 | 68.0/68.0 | 64.0/64.0 | 52.0/54.0 | 68.0/68.0 | 38.0/40.0 | 34.0/42.0 |
| B | 5.0/5.0 | — | 67.5/67.5 | 100.0/100.0 | 100.0/100.0 | 100.0/100.0 | 100.0/100.0 | 100.0/100.0 | 72.5/75.0 | 40.0/41.7 |

The transferability rate of B → A is the lowest one and lower than that of A → B, indicating that transferring from the ivector architecture (B) to GMM (A) is more difficult. Compared with A → C (both cross-dataset and cross-architecture), B → C (cross-dataset only) achieves nearly 20% higher ASR and UTR. This reveals that the larger the difference between the source and target systems, the more difficult the transferability attack. Due to space limitations, the results of attacking the CSI and SV systems are shown in Tables XVI and XVIII in the Appendix; we observe similar results there. The average SNR is similar to that given in Table VII.

To understand how the value of κ influences the transferability rate, we conduct the B → F attack (OSI task) by fixing ε = 0.05 and varying κ from 0.5 to 5.0 with step 0.5. In this experiment, the number of iterations is unlimited. The results are shown in Fig. 5. Both ASR and UTR increase quickly with κ and reach 100% when κ = 4.5. This demonstrates that increasing the value of κ increases the probability of a successful transferability attack.

[Fig. 5: Transferability rate vs. κ. ASR and UTR (%) of the B → F attack increase with κ from 0.5 to 5.0, reaching 100% at κ = 4.5.]

Attacking the commercial system Microsoft Azure [29]. Microsoft Azure is a cloud service platform with the second largest market share in the world. It supports both the SV and OSI tasks via an HTTP REST API. Unlike Talentedsoft, Azure's API only returns the decision (i.e., the predicted speaker) along with 3 confidence levels (i.e., low, normal and high) instead of scores, so we attack this platform via transferability. We enroll 5 speakers from the Test Speaker Set to build an OSI system on Azure (called OSI-Azure for simplicity). Its FAR is 0%, tested with the Imposter Speaker Set. For each target speaker, we randomly select 10 source speakers and 2 voices per source speaker from LibriSpeech, all of which are rejected by OSI-Azure. We set ε = 0.05 and craft 100 adversarial voices on the GMM system, as it produced high transferability rates in the above experiment. The ASR, UTR and SNR are 26.0%, 41.0% and 6.8 dB, respectively. They become 34.0%, 57.0% and 2.2 dB when we increase ε from 0.05 to 0.1.

We also demonstrate FAKEBOB on the SV task of Azure (SV-Azure), which is text-dependent with 10 supported texts. We recruited and asked 2 speakers to read each text 10 times, resulting in 200 voices. For each pair of speaker and text, we randomly select 3 enrollment voices for both GMM and SV-Azure; the FARs of both are 0%. We attack SV-Azure using 200 adversarial samples crafted on GMM (ε = 0.05, κ = 3). However, SV-Azure reports "error, too noisy" instead of "accept" or "reject" for 190 adversarial voices. Among the other 10 voices, one voice is accepted, leading to 10% ASR. To our knowledge, this is the first time that SV-Azure has been successfully attacked. As Azure is proprietary without any publicly available information, it is very difficult to know why SV-Azure outputs "error, too noisy". After comparing the SNR of the 190 voices with that of the other 10 voices (8.8 dB vs. 11.5 dB), we suspect that it checks each input and outputs "error, too noisy" without model classification when the noise of the input is too large. This check makes SV-Azure more challenging to attack, but we infer that it may also reject normal voices when the background is noisy in practice.
D. Practicability via Over-the-Air Attack

To simulate an over-the-air attack in the physical world, we first craft adversarial samples by directly interacting with the API of the system (i.e., over the line), then play and record these adversarial voices via loudspeakers and microphones, and finally send the recorded voices to the system via the API to check their effectiveness. Our experiments are conducted in an indoor room (length, width, and height of 10, 4, and 3.5 meters).

To thoroughly evaluate FAKEBOB, the over-the-air attacks vary in systems, devices (loudspeakers and microphones), the distance between loudspeakers and microphones, and acoustic environments; in total, 26 scenarios are covered. An overview of the different settings is shown in Table XIX in the Appendix. We consider all tasks of ivector and GMM, and OSI-Azure only. We use the same parameters as in Section V-C, as the over-the-air attack is more practical yet more challenging due to the noise introduced by both the air channel and the electronic devices, which may disrupt the perturbations of the adversarial samples. For OSI-Azure, we use the adversarial voices crafted on GMM in Section V-C that successfully transferred to OSI-Azure.

Results of different systems. We use a portable speaker (JBL clip3 [73]) as the loudspeaker and an iPhone 6 Plus (iOS) as the microphone, with a distance of 1 meter between them. We attack all tasks of ivector and GMM, and OSI-Azure, in a relatively quiet environment. The results are shown in Table VII.

TABLE VII: Results of different systems (over the air)
| System | Task | SNR (dB) | Normal voices (%) | Adversarial voices (%) |
|---|---|---|---|---|
| ivector | CSI | 6.6 | Accuracy: 100 | ASR: 80, UTR: 80 |
| ivector | SV | 9.8 | FAR: 0, FRR: 0 | ASR: 76 |
| ivector | OSI | 7.8 | FAR: 4, FRR: 0, OSIER: 0 | ASR: 100, UTR: 100 |
| GMM | CSI | 6.1 | Accuracy: 85 | ASR: 90, UTR: 100 |
| GMM | SV | 7.9 | FAR: 0, FRR: 62 | ASR: 100 |
| GMM | OSI | 8.2 | FAR: 0, FRR: 65, OSIER: 0 | ASR: 100, UTR: 100 |
| Azure | OSI | 6.8 | FAR: 5, FRR: 2, OSIER: 0 | ASR: 70, UTR: 70 |

We can observe that the FRR of GMM SV (resp. OSI) is 62% (resp. 65%), revealing that GMM is less robust than ivector for normal voices. FAKEBOB achieves (1) for the CSI task, 90% ASR (i.e., the system classifies the adversarial voice as the target speaker) and 100% UTR (i.e., the system does not classify the adversarial voice as the source speaker) on GMM, and 80% ASR and 80% UTR on ivector; (2) for the SV task, at least 76% ASR; (3) for the OSI task, 100% ASR on both GMM and ivector; and (4) 70% ASR on the commercial system OSI-Azure. In terms of SNR, the average SNR is no less than 6.1 dB, and up to 9.8 dB on ivector for the SV task, indicating that the power of the signal is 9.5 times greater than that of the noise. Moreover, the SNR is much better than that of the over-the-air attack in CommanderSong [10].

Results of different devices. For loudspeakers, we use 3 common devices: a laptop (DELL), a portable speaker (JBL clip3) and broadcast equipment (Shinco [74]). For microphones, we use the built-in microphones of 2 mobile phones: OPPO (Android) and iPhone 6 Plus (iOS).
Results of different devices. For loudspeakers, we use 3 common devices: a laptop (DELL), a portable speaker (JBL Clip 3), and broadcast equipment (Shinco [74]). For microphones, we use the built-in microphones of 2 mobile phones: an OPPO (Android) and an iPhone 6 Plus (iOS). We evaluate FAKEBOB against the OSI task of ivector with 1 meter distance in a relatively quiet environment. The results are shown in Table VIII.

TABLE VIII: Results of different devices (%), where L and M denote loudspeakers and microphones, respectively.

             M: iPhone 6 Plus (iOS)                M: OPPO (Android)
             Normal voices        Adv. voices      Normal voices        Adv. voices
L            FAR  FRR  OSIER      ASR   UTR        FAR  FRR  OSIER      ASR  UTR
DELL         10   0    0          100   100        13   6    0          78   80
JBL Clip 3   4    0    0          100   100        6    0    0          80   80
Shinco       8    5    0          89    91         14   0    0          75   75

We can observe that for any pair of loudspeaker and microphone, FAKEBOB achieves at least 75% ASR and UTR. When JBL Clip 3 or DELL is the loudspeaker and the iPhone 6 Plus is the microphone, FAKEBOB achieves 100% ASR. For a fixed loudspeaker, the ASR and UTR of attacks using the iPhone 6 Plus are higher (by at least 14% and 16%) than those using the OPPO, possibly because the sound quality of the iPhone 6 Plus is better. These results demonstrate the effectiveness of FAKEBOB on various devices.

Results of different distances. To understand the impact of the distance between loudspeaker and microphone, we vary the distance over 0.25, 0.5, 1, 2, 4 and 8 meters. We attack the OSI task of ivector in a relatively quiet environment, using JBL Clip 3 as the loudspeaker and the iPhone 6 Plus as the microphone. The results are shown in Table IX. We can observe that FAKEBOB achieves 100% ASR and UTR when the distance is no more than 1 meter. At 2 meters, both ASR and UTR drop to 70%; at 4 meters, they drop to 40% and 50%, respectively. Although ASR and UTR drop to 10% at 8 meters, the FRR on normal voices also increases to 32%. This shows the effectiveness of FAKEBOB at different distances.

TABLE IX: Results of different distances (%)

Distance (meter)          0.25  0.5  1    2   4   8
Normal       FAR          4     3    4    6   0   0
voices       FRR          0     0    0    5   10  32
             OSIER        0     0    0    0   0   0
Adversarial  ASR          100   100  100  70  40  10
voices       UTR          100   100  100  70  50  10

Results of different acoustic environments. We attack the OSI task of ivector using the JBL Clip 3 and the iPhone 6 Plus with 1 meter distance. To simulate different acoustic environments, we play different types of noise in the background using the Shinco broadcast equipment. Specifically, we select 5 types of noise from Google AudioSet [75]: white noise, bus noise, restaurant noise, music noise, and absolute music noise. White noise is widespread in nature, while the bus, restaurant and (absolute) music noises are representative of several daily-life scenarios in which FAKEBOB may be launched. For white noise, we vary its volume from 45 dB to 75 dB, while the volumes of the other noises are 60 dB. Both adversarial and normal voices are played at 65 dB on average. The results are shown in Table X. We can observe that FAKEBOB achieves at least 48% ASR and UTR whenever the volume of the background noise is no more than 60 dB, regardless of the noise type. Although both ASR and UTR decrease as the volume of the white noise increases, the FRR on normal voices also increases quickly. This demonstrates the effectiveness of FAKEBOB in different acoustic environments.

TABLE X: Results of different acoustic environments (%)

Environment     Quiet  White   White   White   White   White   Bus     Rest.   Music   Abs. Music
                       (45dB)  (50dB)  (60dB)  (65dB)  (75dB)  (60dB)  (60dB)  (60dB)  (60dB)
Normal  FAR     4      0       6       0       0       10      0       0       0       4
voices  FRR     0      5       12      30      40      97      25      20      10      10
        OSIER   0      0       0       0       0       0       0       0       10      0
Adv.    ASR     100    75      70      57      20      2       50      50      66      48
voices  UTR     100    75      70      60      20      2       50      50      67      48
E. Human-Imperceptibility via a Human Study

To demonstrate the imperceptibility of adversarial samples, we conduct a human study on MTurk [30]. The survey was approved by the Institutional Review Board (IRB) of our institutes.

Setup of the human study. We recruit participants from MTurk and ask them to choose one of two tasks and finish the corresponding questionnaire. We neither reveal the purpose of our study to the participants, nor record personal information of participants such as first language, age and region. Amazon MTurk's Acceptable Use Policy governs permitted and prohibited uses of MTurk, and prohibits bots, scripts or other automated answering tools from completing Human Intelligence Tasks [76]. Thus, we argue that the number of participants can reasonably guarantee the diversity of participants. The two tasks are described as follows.

• Task 1: Clean or Noisy. This task asks participants to tell whether the voice being played is clean or noisy. Specifically, we randomly select 12 original voices and 15 adversarial voices crafted from other original voices, among which 12 adversarial voices are randomly selected from the voices that become non-adversarial (called ineffective) when played over the air, crafted with ε = 0.002 and low confidence, and the other 3 are randomly selected from the voices that remain adversarial (called effective) when played over the air, crafted with ε = 0.1 and high confidence. We ask users to choose whether a voice has any background noise (the three options are clean, noisy, and not sure).

• Task 2: Identify the Speaker. This task asks participants to tell whether the voices in a pair are uttered by the same speaker. Specifically, we randomly select 3 speakers (2 male and 1 female) and randomly choose 1 normal voice per speaker (called the reference voice). Then, for each speaker, we randomly select 3 normal voices, 3 distinct adversarial voices crafted from other normal voices of the same speaker, and 3 normal voices from other speakers. In summary, we build 27 pairs of voices: 9 normal pairs (one reference voice and one normal voice from the same speaker), 9 other pairs (one reference voice and one normal voice from another speaker), and 9 adversarial pairs (one reference voice and one adversarial voice from the same speaker). Among the 9 adversarial pairs, 6 pairs contain adversarial samples that are effective when played over the air, and 3 pairs do not. We ask the participants to tell whether the voices in each pair are uttered by the same speaker (the three options are same, different, and not sure).

To ensure the quality of our questionnaires and the validity of our results, we filter out questionnaires answered at random. In particular, we insert three simple questions into each task as a concentration test: for Task 1, three silent voices; for Task 2, three pairs of voices in which one voice is male and the other female. Only when all of them are answered correctly do we regard a questionnaire as valid; otherwise, we exclude it.

Results of the human study. We received 135 questionnaires for Task 1 and 172 questionnaires for Task 2, of which 27 and 11, respectively, were filtered out because they failed our concentration tests.
Therefore, there are 108 valid questionnaires for Task 1 and 161 valid questionnaires for Task 2. The results of the human study are shown in Fig. 7.

For Task 1, as shown in Fig. 7(a), 10.7% of participants heard noise on normal voices, while 20.2% and 84.8% of participants heard noise on ineffective and effective adversarial voices (when played over the air), respectively. Notably, 78.8% of participants still believe that the ineffective voices are clean. For the effective voices, the 84.8% rate is comparable to that of a recent white-box adversarial attack (83%) tailored to craft imperceptible voices against speech recognition systems [20]. (We are not aware of any other adversarial attack against SRSs that has conducted such a human study.)

For Task 2, which is more interesting (Fig. 7(b)), 86.5% of participants believe that the voices in each other pair are uttered by different speakers, indicating the quality of the collected questionnaires. For the adversarial pairs, 54.6% of participants believe that the voices in each pair are uttered by the same speaker, very close to the 53.7% baseline of the normal pairs, indicating that humans cannot differentiate the speakers of the normal and adversarial voices. The prior work [14] conducted ABX testing on adversarial samples crafted by white-box attacks against SV systems. An ABX test first provides users with two voices A and B, each being either an original (reconstructed) voice or an adversarial voice, then provides a third voice X randomly chosen from {A, B}, and finally asks the users to decide whether X is A or B. The ABX testing of [14] shows that 54% of participants correctly classified the adversarial voices, which is very close to our result. For the adversarial pairs containing ineffective adversarial voices, 64.9% of participants believed that the two voices are from the same speaker, much greater than the 53.7% baseline, and thus even more imperceptible. For the adversarial pairs containing effective adversarial voices, 54.0% of participants could definitely differentiate the speakers, not much larger than the 42.2% baseline of the normal pairs.

Fig. 7: Results of the human study: (a) Task 1, clean or noisy; (b) Task 2, identify the speaker. Here "air" (resp. "non air") denotes voices that are effective (resp. ineffective) for the over-the-air attack.

The results unveil that the adversarial voices crafted by FAKEBOB can make systems misbehave (i.e., decide that the adversarial voice is uttered by the target speaker), while most ineffective adversarial samples are classified as clean and cannot be differentiated by ordinary users, and the results for effective ones are comparable to existing related work. Hence, our attack is reasonably human-imperceptible.

F. Robustness of FAKEBOB against Defense Methods

As mentioned in Section III-B, we study four defense methods: local smoothing, quantization, audio squeezing and temporal dependency detection. Unless explicitly stated otherwise, we evaluate on the OSI task of the GMM system using 100 seed voices. The FRR, FAR, ASR and UTR of the system without defense are 4.2%, 11.2%, 99% and 99%, respectively.
We consider two settings: (S1) crafting adversarial voices on the system without defense and attacking the system with defense, and (S2) directly attacking the system with defense. S1 follows the setting of CommanderSong [10]. An effective defense method should be able to mitigate the perturbation or detect the adversarial voices in S1, so we use the UTR metric there. In S2, an effective defense method should increase the overhead of the attack and decrease the attack success rate, so we use the ASR metric. We set ε = 0.002, a very weak attacker capability; increasing ε would make FAKEBOB more powerful. We found that local smoothing can increase the attack cost but is ineffective in terms of ASR, that audio squeezing is ineffective in terms of both attack cost and ASR, and that the other two methods are not suitable for defending against our attack. Due to space limitations, details are given in Appendix E.

VI. DISCUSSION OF THE POSSIBLE ARMS RACE

This section discusses potential mitigations of our attack and possible advanced attacks.

Mitigation of FAKEBOB. We have demonstrated that four defense methods have limited effect on FAKEBOB, although some of them are reported to be promising in the speech recognition domain. This reveals that more effective defense methods are needed to mitigate FAKEBOB. We discuss several possible defense methods as follows.

Various liveness detection methods have been proposed to detect spoofing attacks on SRSs. Such methods detect attacks by exploiting the different physical characteristics of voices generated by the human speech production system (i.e., lungs, vocal cords, and vocal tract) and by electronic loudspeakers. For instance, Shiota et al. [77] use the pop noise caused by human breath, VoiceLive [78] leverages the time-difference-of-arrival of voices at the receiver, and VoiceGesture [79] leverages the unique articulatory gestures of the user. Adversarial voices also need to be played via loudspeakers, hence liveness detection could possibly be used to detect them. An alternative detection method is to train a detector on adversarial and normal voices. Though promising in the image recognition domain [80], such a detector has a very high false-positive rate and does not improve robustness when the adversary is aware of the defense [81]. Another scheme for mitigating adversarial images is input transformation, such as image bit-depth reduction and JPEG compression [82]. We could mitigate adversarial voices by leveraging analogous input transformations such as bit-depth reduction and MP3 compression. However, Athalye et al. [83] have demonstrated that input transformations on images can easily be circumvented by strong attacks such as Backward Pass Differentiable Approximation. We conjecture that bit-depth reduction and MP3 compression may become ineffective against high-confidence adversarial voices. Finally, one could also improve the security of SRSs by using a text-dependent system and requiring users to read dynamically and randomly generated sentences. By doing so, the adversary has to attack both the speaker recognition and the speech recognition, increasing the attack cost. If the set of phrases to be uttered is relatively small, we could still attack the system by iteratively querying the target system with the voice corresponding to the generated phrase, while our attack will fail when the set of phrases to be uttered is very large or even infinite.
However, this also poses a challenge for the recognition system itself, as the training data may not cover all possible normal phrases and voices. In future work, we will study the above methods [77], [78], [79], [82], [83], [84], [85] against adversarial attacks. We next discuss possible ways of improving adversarial attacks.

Possible advanced attacks. For a system that outputs both the decision and the scores, FAKEBOB can craft adversarial voices by directly interacting with it. However, for a system that only outputs the decision, we have to attack it by leveraging transferability, and when the gap between the source and target systems is large, the transferability rate is limited. One possible way to improve FAKEBOB is to leverage the boundary attack, proposed by Brendel et al. [86] to attack decision-only image recognition systems.

Our human study shows that our attack is reasonably human-imperceptible. However, many of the effective adversarial voices are still noisier than the original voices (human study Task 1), and for some of the effective adversarial voices, ordinary users can tell that the speakers differ (human study Task 2), so there is still room for improving imperceptibility in the future. One possible solution is to build a psychoacoustic model and limit the maximal difference between the spectra of the original and adversarial voices to the masking (hearing) threshold of human perception [87], [20].

VII. RELATED WORK

The security issues of intelligent voice systems have been studied in the literature. In this section, we discuss the work most closely related to attacks on intelligent voice systems, and compare it with FAKEBOB.

Adversarial voice attacks. Gong et al. [13] and Kreuk et al. [14] respectively proposed adversarial voice attacks on SRSs in the white-box setting, by leveraging the Fast Gradient Sign Method (FGSM) [22]. The attack in [13] addresses DNN-based gender recognition, emotion recognition and CSI systems, while the attack in [14] addresses a DNN-based SV system. Compared to them: (1) our attack FAKEBOB is black-box and more practical; (2) FAKEBOB addresses not only the SV and CSI tasks, but also the more general OSI task; (3) we demonstrate our attack on ivector, GMM and DNN-based systems in the popular open-source platform Kaldi; and (4) FAKEBOB is effective on commercial systems, even when played over the air, which was not considered in [13], [14].

In concurrent work, Abdullah et al. [60] proposed a poisoning attack on speaker and speech recognition systems, demonstrated on OSI-Azure. There are three key differences: (1) their attack crafts an adversarial voice from a voice uttered by an enrolled speaker A such that the adversarial voice is neither rejected nor recognized as the speaker A; thus, their attack can choose neither a specific source speaker nor a specific target speaker, and consequently cannot launch targeted attacks or attacks against the SV task, whereas our attack goes beyond theirs; (2) they craft adversarial voices by decomposing and reconstructing an input voice, hence achieve a limited untargeted success rate and cannot be adapted to launch the more interesting and more powerful targeted attacks; and (3) we evaluate over-the-air attacks in the physical world, but they did not.
We cannot compare the performance (i.e., effectiveness and efficiency) of our attack with the three related works above [13], [14], [60], because none of them is publicly available. We are the first to consider the threshold θ in adversarial attacks.

Adversarial attacks on speech recognition systems have also been studied [11], [9], [88]. Carlini et al. [9] attacked DeepSpeech [89] by crafting adversarial voices in the white-box setting, but failed when playing over the air. In the black-box setting, Taori et al. [11] combined a genetic algorithm with finite-difference gradient estimation to craft adversarial voices for DeepSpeech, but achieved a limited success rate under a strict length restriction on the voices. Alzantot et al. [88] presented the first black-box adversarial attack on a CNN-based speech command classification model by exploiting a genetic algorithm. However, due to the differences between speaker recognition and speech recognition, these works are orthogonal to ours and cannot be applied to ivector- and GMM-based SRSs.

Other types of voice attacks. Other types of voice attacks include hidden voice attacks (against both speech and speaker recognition) and spoofing attacks (against speaker recognition). A hidden voice attack aims to embed some information (e.g., a command) into an audio carrier (e.g., music) such that the desired information is recognized by the target system without catching the victim's attention. Abdullah et al. [90] proposed such an attack on speaker and speech recognition systems. There are two key differences: (1) based on characteristics of signal processing and psychoacoustics, their attack perturbs a sample uttered by an enrolled speaker such that it is still correctly classified as the enrolled speaker by the target system but becomes incomprehensible to human listeners, whereas our attack perturbs a sample uttered by an arbitrary speaker such that it is misclassified as a target speaker (targeted attack) or another enrolled speaker (untargeted attack) while the perturbation remains imperceptible to human listeners; their attack thus addresses a different attack scenario from ours; and (2) they did not demonstrate an over-the-air attack on SRSs, and their tool is not available, so it is unclear how effective it is on SRSs. DolphinAttack [91], CommanderSong [10] and the work of Carlini et al. [34] proposed hidden voice attacks on speech recognition systems. Carlini et al. launched both black-box (i.e., inverse MFCC) and white-box (i.e., gradient descent) attacks on GMM-based speech recognition systems. DolphinAttack exploits vulnerabilities of microphones and employs ultrasound as the carrier of commands to craft inaudible voices; however, it can easily be defended against by filtering ultrasound out of the voices. CommanderSong launched white-box attacks by exploiting a gradient descent method to embed commands into songs.

Another type of attack on SRSs is the spoofing attack [92], such as mimicry [93], replay [94], [95], recorder [96], [95], voice synthesis [97], and voice conversion [98], [99], [100], [95] attacks. Different from an adversarial attack [14], [101], a spoofing attack aims at obtaining a voice that is classified as the target speaker by the system and that also sounds like the target speaker to ordinary users. When no one familiar with the victim (including the victim) can hear the attack voice, both spoofing and adversarial attacks can be launched.
However, if someone familiar with the victim (including the victim) can hear the attack voice, he/she may detect a spoofing attack, whereas an adversarial attack could still be launched in this setting, as discussed in Section II-B.

VIII. CONCLUSION

In this paper, we conducted the first comprehensive and systematic study of adversarial attacks on SRSs in the practical black-box setting, by proposing a novel practical adversarial attack, FAKEBOB. FAKEBOB was thoroughly evaluated in 16 attack scenarios and achieves a 99% targeted attack success rate on both open-source and commercial systems. We also demonstrated the transferability of FAKEBOB on Microsoft Azure. When played over the air in the physical world, FAKEBOB remains effective. Our findings reveal the security implications of adversarial attacks for SRSs, calling for more robust defense methods to better secure SRSs against such practical adversarial attacks.

ACKNOWLEDGMENTS

This research was partially supported by National Natural Science Foundation of China (NSFC) grants (No. 61532019 and No. 61761136011), the National Research Foundation (NRF) Singapore, Prime Minister's Office, under its National Cybersecurity R&D Program (Award No. NRF2014NCR-NCR001-30 and No. NRF2018NCR-NCR005-0001), the National Research Foundation (NRF) Singapore, National Satellite of Excellence in Trustworthy Software Systems, under its Cybersecurity R&D Program (Award No. NRF2018NCR-NSOE003-0001), and the National Research Foundation Investigatorship Singapore (Award No. NRF-NRFI06-2020-0001).

REFERENCES

[1] T. Kinnunen and H. Li, "An overview of text-independent speaker recognition: From features to supervectors," Speech Commun., 2010.
[2] TD Bank voiceprint. https://www.tdbank.com/bank/tdvoiceprint.html.
[3] S. Nand, "Forensic and automatic speaker recognition system," IJCEE, 2018.
[4] H. Ren, Y. Song, S. Yang, and F. Situ, "Secure smart home: A voiceprint and internet based authentication system for remote accessing," in ICCSE, 2016.
[5] D. Ribas and E. Vincent, "An improved uncertainty propagation method for robust i-vector based speaker recognition," in ICASSP, 2019.
[6] B. Biggio, I. Corona, D. Maiorca, B. Nelson, N. Srndic, P. Laskov, G. Giacinto, and F. Roli, "Evasion attacks against machine learning at test time," in ECML/PKDD, 2013.
[7] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. J. Goodfellow, and R. Fergus, "Intriguing properties of neural networks," in ICLR, 2014.
[8] Y. Lei, S. Chen, L. Fan, F. Song, and Y. Liu, "Advanced evasion attacks and mitigations on practical ML-based phishing website classifiers," arXiv preprint arXiv:2004.06954, 2020.
[9] N. Carlini and D. Wagner, "Audio adversarial examples: Targeted attacks on speech-to-text," in IEEE S&P Workshops, 2018.
[10] X. Yuan, Y. Chen, Y. Zhao, Y. Long, X. Liu, K. Chen, S. Zhang, H. Huang, X. Wang, and C. A. Gunter, "CommanderSong: A systematic approach for practical adversarial voice recognition," in USENIX Security, 2018.
[11] R. Taori, A. Kamsetty, B. Chu, and N. Vemuri, "Targeted adversarial examples for black box audio systems," in IEEE S&P Workshops, 2019.
[12] S. Khare, R. Aralikatte, and S. Mani, "Adversarial black-box attacks for automatic speech recognition systems using multi-objective genetic optimization," CoRR, vol. abs/1811.01312, 2018.
[13] Y. Gong and C. Poellabauer, "Crafting adversarial examples for speech paralinguistics applications," in DYNAMICS, 2018.
[14] F. Kreuk, Y. Adi, M. Cisse, and J. Keshet, "Fooling end-to-end speaker verification with adversarial examples," in ICASSP, 2018.
[15] T. Liu and S. Guan, "Factor analysis method for text-independent speaker identification," JSW, 2014.
[16] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, "Speaker verification using adapted Gaussian mixture models," Digit. Signal Process., 2000.
[17] J. Fortuna, P. Sivakumaran, A. Ariyaeeinia, and A. Malegaonkar, "Open-set speaker identification using adapted Gaussian mixture models," in INTERSPEECH, 2005.
[18] H. Beigi, Fundamentals of Speaker Recognition. Springer, 2011.
[19] H. Yakura and J. Sakuma, "Robust audio adversarial example for a physical attack," in IJCAI, 2019.
[20] Y. Qin, N. Carlini, G. W. Cottrell, I. J. Goodfellow, and C. Raffel, "Imperceptible, robust, and targeted adversarial examples for automatic speech recognition," in ICML, 2019.
[21] A. Ilyas, L. Engstrom, A. Athalye, and J. Lin, "Black-box adversarial attacks with limited queries and information," in ICML, 2018.
[22] I. J. Goodfellow, J. Shlens, and C. Szegedy, "Explaining and harnessing adversarial examples," in ICLR, 2015.
[23] A. Kurakin, I. J. Goodfellow, and S. Bengio, "Adversarial examples in the physical world," in ICLR, 2017.
[24] N. Carlini and D. Wagner, "Towards evaluating the robustness of neural networks," in IEEE S&P, 2017.
[25] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, "Front-end factor analysis for speaker verification," IEEE Trans. on Audio, Speech, and Language Processing, 2010.
[26] D. Snyder, D. Garcia-Romero, G. Sell, A. McCree, D. Povey, and S. Khudanpur, "Speaker recognition for multi-speaker conversations using x-vectors," in IEEE ICASSP, 2019.
[27] Kaldi. https://github.com/kaldi-asr/kaldi.
[28] Talentedsoft. http://www.talentedsoft.com.
[29] Microsoft Azure. https://azure.microsoft.com.
[30] Amazon Mechanical Turk Platform. https://www.mturk.com.
[31] Z. Yang, B. Li, P. Chen, and D. Song, "Characterizing audio adversarial examples using temporal dependency," in ICLR, 2019.
[32] Citi uses voice prints to authenticate customers quickly and effortlessly. https://www.forbes.com/sites/tomgroenfeldt/2016/06/27/citi-uses-voice-prints-to-authenticate-customers-quickly-and-effortlessly/#7b01dea1109c.
[33] The voice-enabled car of the future. https://tractica.omdia.com/user-interface-technologies/the-voice-enabled-car-of-the-future.
[34] N. Carlini, P. Mishra, T. Vaidya, Y. Zhang, M. Sherr, C. Shields, D. Wagner, and W. Zhou, "Hidden voice commands," in USENIX Security, 2016.
[35] FakeBob. https://sites.google.com/view/fakebob.
[36] MSR Identity. https://www.microsoft.com/en-us/download/details.aspx?id=52279.
[37] Amazon Alexa. https://developer.amazon.com/en-US/alexa.
[38] Google Home. https://store.google.com/product/google_home.
[39] Speechpro. https://speechpro-usa.com.
[40] NIST. National Institute of Standards and Technology speaker recognition evaluation. https://www.nist.gov/itl/iad/mig/speaker-recognition.
[41] L. Muda, M. Begam, and I. Elamvazuthi, "Voice recognition algorithms using mel frequency cepstral coefficient (MFCC) and dynamic time warping (DTW) techniques," Journal of Computing, 2010.
[42] N. P. H. Thian, C. Sanderson, and S. Bengio, "Spectral subband centroids as complementary features for speaker authentication," in ICB, 2004.
[43] H. Hermansky, "Perceptual linear predictive (PLP) analysis of speech," The Journal of the Acoustical Society of America, vol. 87, no. 4, 1990.
[44] M. K. Nandwana, L. Ferrer, M. McLaren, D. Castan, and A. Lawson, "Analysis of critical metadata factors for the calibration of speaker recognition systems," in INTERSPEECH, 2019.
[45] P. S. Nidadavolu, V. Iglesias, J. Villalba, and N. Dehak, "Investigation on neural bandwidth extension of telephone speech for improved speaker recognition," in ICASSP, 2019.
[46] K. A. Lee, Q. Wang, and T. Koshinaka, "The CORAL+ algorithm for unsupervised domain adaptation of PLDA," in ICASSP, 2019.
[47] Tencent VPR. https://cloud.tencent.com/product/vpr.
[48] Fosafer VPR. http://caijing.chinadaily.com.cn/chanye/2018-06/06/content_36337667.htm.
[49] G. Heigold, I. Moreno, S. Bengio, and N. Shazeer, "End-to-end text-dependent speaker verification," in ICASSP, 2016, pp. 5115–5119.
[50] S. Sremath Tirumala and S. R. Shahamiri, "A review on deep learning approaches in speaker identification," in ICSPS, 2016.
[51] D. A. Reynolds and R. C. Rose, "Robust text-independent speaker identification using Gaussian mixture speaker models," IEEE Trans. Speech and Audio Processing, vol. 3, no. 1, pp. 72–83, 1995.
[52] V. Vestman, D. Gowda, M. Sahidullah, P. Alku, and T. Kinnunen, "Speaker recognition from whispered speech: A tutorial survey and an application of time-varying linear prediction," Speech Commun., 2018.
[53] D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, Q. Cheng, G. Chen et al., "Deep speech 2: End-to-end speech recognition in English and Mandarin," in ICML, 2016, pp. 173–182.
[54] C. Li, X. Ma, B. Jiang, X. Li, X. Zhang, X. Liu, Y. Cao, A. Kannan, and Z. Zhu, "Deep speaker: an end-to-end neural speaker embedding system," CoRR, vol. abs/1705.02304, 2017.
[55] "Android app which enables unlock of mobile phone via voice print," http://app.mi.com/details?id=com.jie.lockscreen.
[56] "Social software WeChat adds voiceprint lock login function," https://kf.qq.com/touch/wxappfaq/1208117b2mai141125YZjAra.html.
[57] VPR of iFLYTEK. https://www.xfyun.cn/services/isv.
[58] Sinovoice voice print recognition. http://doc.aicloud.com/sdk5.2.8.
[59] SpeakIn VPR. http://www.speakin.mobi/devPlatform.html.
[60] H. Abdullah, M. S. Rahman, W. Garcia, L. Blue, K. Warren, A. S. Yadav, T. Shrimpton, and P. Traynor, "Hear "no evil", see "kenansville": Efficient and transferable black-box attacks on speech recognition and voice identification systems," CoRR, vol. abs/1910.05262, 2019.
[61] N. Papernot, P. McDaniel, I. Goodfellow, S. Jha, Z. B. Celik, and A. Swami, "Practical black-box attacks against machine learning," in AsiaCCS, 2017, pp. 506–519.
[62] P.-Y. Chen, H. Zhang, Y. Sharma, J. Yi, and C.-J. Hsieh, "ZOO: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models," in AISec, 2017, pp. 15–26.
[63] M. Sharif, S. Bhagavatula, L. Bauer, and M. K. Reiter, "Accessorize to a crime: Real and stealthy attacks on state-of-the-art face recognition," in ACM CCS, 2016, pp. 1528–1540.
[64] M. Alzantot, Y. Sharma, S. Chakraborty, H. Zhang, C. Hsieh, and M. B. Srivastava, "GenAttack: practical black-box attacks with gradient-free optimization," in GECCO, 2019, pp. 1111–1119.
[65] L. M. Rios and N. V. Sahinidis, "Derivative-free optimization: a review of algorithms and comparison of software implementations," Journal of Global Optimization, vol. 56, no. 3, pp. 1247–1293, 2013.
[66] Y. Dong, F. Liao, T. Pang, H. Su, J. Zhu, X. Hu, and J. Li, "Boosting adversarial attacks with momentum," in CVPR, 2018, pp. 9185–9193.
[67] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, "Towards deep learning models resistant to adversarial attacks," in ICLR, 2018.
[68] Y. Duan, Z. Zhao, L. Bu, and F. Song, "Things you may not know about adversarial example: A black-box adversarial image attack," CoRR, vol. abs/1905.07672, 2019.
[69] A. Nagrani, J. S. Chung, and A. Zisserman, "VoxCeleb: a large-scale speaker identification dataset," in INTERSPEECH, 2017.
[70] J. S. Chung, A. Nagrani, and A. Zisserman, "VoxCeleb2: Deep speaker recognition," in INTERSPEECH, 2018.
[71] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," in ICASSP, 2015.
[72] H. Bu, J. Du, X. Na, B. Wu, and H. Zheng, "AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline," in O-COCOSDA, 2017, pp. 1–5.
[73] JBL Clip 3 portable speaker. https://www.jbl.com/bluetooth-speakers/JBL+CLIP+3.html.
[74] Shinco broadcast equipment. https://item.jd.com/5009202.html.
[75] J. F. Gemmeke, D. P. W. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, "Audio Set: An ontology and human-labeled dataset for audio events," in ICASSP, 2017.
[76] Amazon Mechanical Turk acceptable use policy. https://www.mturk.com/acceptable-use-policy.
[77] S. Shiota, F. Villavicencio, J. Yamagishi, N. Ono, I. Echizen, and T. Matsui, "Voice liveness detection algorithms based on pop noise caused by human breath for automatic speaker verification," in INTERSPEECH, 2015.
[78] L. Zhang, S. Tan, J. Yang, and Y. Chen, "VoiceLive: A phoneme localization based liveness detection for voice authentication on smartphones," in ACM CCS, 2016.
[79] L. Zhang, S. Tan, and J. Yang, "Hearing your voice is not enough: An articulatory gesture based liveness detection for voice authentication," in ACM CCS, 2017.
[80] Z. Gong, W. Wang, and W.-S. Ku, "Adversarial and clean data are not twins," arXiv preprint arXiv:1704.04960, 2017.
[81] N. Carlini and D. Wagner, "Adversarial examples are not easily detected: Bypassing ten detection methods," in AISec, 2017.
[82] C. Guo, M. Rana, M. Cisse, and L. Van Der Maaten, "Countering adversarial images using input transformations," arXiv preprint arXiv:1711.00117, 2017.
[83] A. Athalye, N. Carlini, and D. Wagner, "Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples," arXiv preprint arXiv:1802.00420, 2018.
[84] X. Du, X. Xie, Y. Li, L. Ma, Y. Liu, and J. Zhao, "DeepStellar: Model-based quantitative analysis of stateful deep learning systems," in ESEC/FSE, 2019.
[85] X. Zhang, X. Xie, L. Ma, X. Du, Q. Hu, Y. Liu, J. Zhao, and M. Sun, "Towards characterizing adversarial defects of deep learning software from the lens of uncertainty," in ICSE, 2020.
[86] W. Brendel, J. Rauber, and M. Bethge, "Decision-based adversarial attacks: Reliable attacks against black-box machine learning models," arXiv preprint arXiv:1712.04248, 2017.
[87] L. Schönherr, K. Kohls, S. Zeiler, T. Holz, and D. Kolossa, "Adversarial attacks against automatic speech recognition systems via psychoacoustic hiding," in NDSS, 2019.
[88] M. Alzantot, B. Balaji, and M. B. Srivastava, "Did you hear that? Adversarial examples against automatic speech recognition," CoRR, vol. abs/1801.00554, 2018.
[89] A. Y. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, and A. Y. Ng, "Deep speech: Scaling up end-to-end speech recognition," CoRR, vol. abs/1412.5567, 2014.
[90] H. Abdullah, W. Garcia, C. Peeters, P. Traynor, K. R. B. Butler, and J. Wilson, "Practical hidden voice attacks against speech and speaker recognition systems," in NDSS, 2019.
[91] G. Zhang, C. Yan, X. Ji, T. Zhang, T. Zhang, and W. Xu, "DolphinAttack: Inaudible voice commands," in ACM CCS, 2017.
[92] Z. Wu, N. Evans, T. Kinnunen, J. Yamagishi, F. Alegre, and H. Li, "Spoofing and countermeasures for speaker verification: A survey," Speech Commun., 2015.
[93] R. G. Hautamäki, T. Kinnunen, V. Hautamäki, T. Leino, and A.-M. Laukkanen, "I-vectors meet imitators: on vulnerability of speaker verification systems against voice mimicry," in INTERSPEECH, 2013.
[94] Z. Wu, S. Gao, E. S. Cling, and H. Li, "A study on replay attack and anti-spoofing for text-dependent speaker verification," in APSIPA, 2014.
[95] M. Shirvanian, S. Vo, and N. Saxena, "Quantifying the breakability of voice assistants," in PerCom, 2019.
[96] M. Shirvanian and N. Saxena, "Wiretapping via mimicry: Short voice imitation MITM attacks on crypto phones," in ACM CCS, 2014.
[97] P. L. De Leon, M. Pucher, J. Yamagishi, I. Hernaez, and I. Saratxaga, "Evaluation of speaker verification security and detection of HMM-based synthetic speech," IEEE/ACM Trans. Audio, Speech & Language Processing, 2012.
[98] Z. Wu and H. Li, "Voice conversion and spoofing attack on speaker verification systems," in APSIPA, 2013.
[99] D. Mukhopadhyay, M. Shirvanian, and N. Saxena, "All your voices are belong to us: Stealing voices to fool humans and machines," in ESORICS, 2015.
[100] M. Shirvanian, N. Saxena, and D. Mukhopadhyay, "Short voice imitation man-in-the-middle attacks on crypto phones: Defeating humans and machines," Journal of Computer Security, 2018.
[101] S. Chen, M. Xue, L. Fan, S. Hao, L. Xu, H. Zhu, and B. Li, "Automated poisoning attacks and defenses in malware detection systems: An adversarial machine learning approach," Computers & Security, 2018.
[102] R. Eberhart and J. Kennedy, "A new optimizer using particle swarm theory," in MHS, 1995.
[103] J. Sohn, N. Kim, and W. Sung, "A statistical model-based voice activity detection," IEEE Signal Processing Letters, 1999.
[104] A TensorFlow implementation of Baidu's DeepSpeech architecture. https://github.com/mozilla/DeepSpeech.

APPENDIX

A. Comparison of Our FAKEBOB with a PSO-based Method

We compare our attack FAKEBOB with a PSO-based method. We reduce the search for an adversarial sample to an optimization problem (cf. § IV-A), then solve the optimization problem via the PSO algorithm. PSO solves an optimization problem by imitating the behaviour of a swarm of birds [102]. Each particle is a candidate solution, and in each iteration, each particle updates itself via a weighted linear combination of three parts: its inertia, its local best solution, and the global best solution. The associated weights are the initial inertia factor w_init, the final inertia factor w_end, and the acceleration constants c1 and c2. We implement a PSO-based attack following the algorithm of Sharif et al. [63], which was used to fool face recognition systems. After fine-tuning the above hyper-parameters, we run the PSO-based method with 50 particles for a maximum of 35 epochs, with the iteration limit of each epoch set to 30, w_init = 0.9, w_end = 0.1, and c1 = c2 = 1.4961.
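The update rule described above is the standard PSO velocity/position update [102]; the sketch below illustrates one such step under these hyper-parameters. It is a minimal illustration only: the projection of candidates onto the ε-ball and the black-box scoring of candidate voices are omitted, and the linear inertia decay is a common convention we assume rather than a detail stated in the paper.

```python
import numpy as np

def pso_step(x, v, p_best, g_best, w, c1=1.4961, c2=1.4961):
    """One PSO update: inertia plus pulls toward the particle's local
    best solution and the swarm's global best solution."""
    r1, r2 = np.random.rand(*x.shape), np.random.rand(*x.shape)
    v_new = w * v + c1 * r1 * (p_best - x) + c2 * r2 * (g_best - x)
    return x + v_new, v_new

def inertia(epoch, n_epochs, w_init=0.9, w_end=0.1):
    """Linearly decay the inertia weight from w_init to w_end."""
    return w_init - (w_init - w_end) * epoch / max(n_epochs - 1, 1)
```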
The experiment is conducted on the ivector system for the OSI task. The results are shown in Table XI; for comparison purposes, we also report the results of our attack FAKEBOB there.

TABLE XI: Our attack FAKEBOB vs. the PSO-based method, where [S(x0)]_t denotes the initial score of the input voice for the target speaker t, and * denotes that only one adversarial attack succeeded. Each cell gives FAKEBOB / PSO.

Initial score range        #Iteration   Time (s)      SNR (dB)      ASR (%)
−∞ < [S(x0)]_t < +∞        86 / 136     2277 / 2524   31.5 / 31.9   99.0 / 33.0
[S(x0)]_t ≤ −0.5           187 / —      4409 / —      31.4 / —      96.3 / 0.0
−0.5 < [S(x0)]_t ≤ 0 (*)   84 / 72      1947 / 1311   30.5 / 22.8   100.0 / 5.3
0 < [S(x0)]_t ≤ 0.5        61 / 147     1384 / 2715   31.5 / 31.6   100.0 / 17.6
0.5 < [S(x0)]_t ≤ 1        17 / 297     357 / 5517    32.4 / 32.3   100.0 / 60.0
1 < [S(x0)]_t ≤ 1.5        4 / 24       77 / 449      31.8 / 32.2   94.1 / 100.0

Overall, the PSO-based method achieves a 33% targeted attack success rate (ASR), only one third of FAKEBOB's, indicating that FAKEBOB is much more effective than the PSO-based method. In particular, the PSO-based method is less effective for input voices with low initial scores:
• When [S(x0)]_t ≤ −0.5, the PSO-based method fails to launch the attack for all voices.
• When −0.5 < [S(x0)]_t ≤ 0 and 0 < [S(x0)]_t ≤ 0.5, its ASR is very low, i.e., 5.3% and 17.6%, respectively.
In contrast, our attack FAKEBOB is effective regardless of the initial scores of the input voices. In terms of efficiency, FAKEBOB takes fewer iterations and less execution time than the PSO-based method, except for the case −0.5 < [S(x0)]_t ≤ 0, in which the PSO-based method launches a successful attack for one voice only. Moreover, the higher the initial score of the input voice, the more efficient our attack FAKEBOB is compared to the PSO-based method. For instance, when 0.5 < [S(x0)]_t ≤ 1, the number of iterations (resp. execution time) of the PSO-based method is 17 times (resp. 15 times) that of FAKEBOB. In summary, the experimental results demonstrate that our attack FAKEBOB is much more effective and efficient than the PSO-based method.
B. Attack Scenarios

All of the following combinations are evaluated in this work, where D.&S. denotes decision and scores:
• {targeted, untargeted} × {intra-gender, inter-gender} × API × {OSI, CSI, SV} × D.&S.;
• targeted × {OSI, CSI, SV} × API × decision-only;
• targeted × {OSI, CSI, SV} × over-the-air × D.&S.;
• targeted × OSI × over-the-air × decision-only.

C. Results of Tuning the Parameter ε

Table XIII shows the results of tuning the parameter ε on both the ivector and GMM systems for the CSI task. To choose a suitable ε, we need to trade off imperceptibility against attack cost: a smaller ε yields less perturbation (i.e., a higher SNR), but also increases the attack cost (i.e., more iterations, a longer execution time, and a lower success rate). We found 0.002 to be the most suitable value of ε, for two reasons: (1) compared with larger values of ε, the average SNR of adversarial voices at ε = 0.002 is higher, indicating less perturbation, while its success rate is merely 1% lower; (2) ε = 0.001 introduces even less perturbation than ε = 0.002, but its success rate drops to 41% for ivector and 87% for GMM, i.e., 58% and 12% lower than at ε = 0.002. Moreover, the attack cost increases much more sharply when decreasing ε from 0.002 to 0.001 than when decreasing it from 0.003 to 0.002: the number of iterations and the execution time at ε = 0.002 are 1.6 times and 1.4 times those at ε = 0.003, while at ε = 0.001 they are 2.2 times and 2.4 times those at ε = 0.002.

TABLE XIII: Results of tuning ε on the CSI task

        ivector                               GMM
ε       #Iter  Time (s)  SNR (dB)  ASR (%)    #Iter  Time (s)  SNR (dB)  ASR (%)
0.05    18     422       12.0      100        18     91        16.7      100
0.01    23     549       16.2      100        16     81        19.1      100
0.005   44     1099      21.8      100        19     102       22.3      100
0.004   56     1423      23.8      100        21     104       24.0      100
0.003   76     2059      26.3      100        27     124       26.1      100
0.002   124    2845      30.2      99         40     218       29.3      99
0.001   276    6738      36.4      41         106    551       35.7      87
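Since ε bounds the L∞ norm of the perturbation, every candidate adversarial voice must be projected back into the ε-ball around the original waveform after each update. A minimal sketch of this projection, in the style of BIM [23] (the function name and the [-1, 1] waveform range are our assumptions, not details from the paper):

```python
import numpy as np

def project_linf(x_adv, x_orig, eps):
    """Keep the perturbation within the L-infinity ball of radius eps,
    and keep samples in the valid waveform range [-1, 1]."""
    x_adv = np.clip(x_adv, x_orig - eps, x_orig + eps)
    return np.clip(x_adv, -1.0, 1.0)

# Smaller eps means a higher SNR (less perturbation) but more iterations
# and a lower success rate, as Table XIII shows.
```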
D. Experimental Results of FAKEBOB on the xvector System

We demonstrate the effectiveness and efficiency of FAKEBOB against a state-of-the-art DNN-based SRS [26], called the xvector system, in which the xvector embedding is extracted by a DNN. We use the pre-trained xvector model from the SITW recipe of Kaldi and construct OSI, CSI and SV systems, using the same settings as in Section V-B. The baseline performance of the resulting systems is shown in Column J of Table XV. We also conduct untargeted attacks against these systems. The results are shown in Table XII: our attack achieves 100% ASR, indicating that FAKEBOB is effective and efficient against DNN-based SRSs as well.

TABLE XII: Experimental results of FAKEBOB on the xvector system

       All: Targeted              All: Untargeted            Intra-gender: Targeted     Inter-gender: Targeted
Task   #Iter Time(s) SNR  ASR     #Iter Time(s) SNR  ASR     #Iter Time(s) SNR  ASR     #Iter Time(s) SNR  ASR
CSI    117   575     30.1 100.0   73    499     29.6 100.0   89    444     29.3 100.0   135   662     30.7 100.0
SV     92    702     31.8 100.0   –     –       –    –       44    340     31.9 100.0   136   1035    31.7 100.0
OSI    95    995     32.0 100.0   26    171     31.5 100.0   51    601     32.0 100.0   138   1380    32.0 100.0

TABLE XIV: Details of the source and target systems for transferability attacks, where DF denotes the dimension of the feature, FL/FS denotes the frame length/frame step, #GC denotes the number of Gaussian components, DV denotes the dimension of the ivector (xvector), and xvector is a DNN-based SRS from [26].

System ID     A            B            C            D            E            F            G            H            I            J
Architecture  GMM          ivector      ivector      ivector      ivector      ivector      ivector      ivector      ivector      xvector
Training set  Train-1 Set  Train-1 Set  Train-2 Set  Train-1 Set  Train-1 Set  Train-1 Set  Train-1 Set  Train-1 Set  Train-1 Set  Train-1 Set
Feature       MFCC         MFCC         MFCC         PLP          MFCC         MFCC         MFCC         MFCC         PLP          MFCC
DF            24×3         24×3         24×3         24×3         13×3         24×3         24×3         24×3         13×3         30
FL/FS (ms)    25/10        25/10        25/10        25/10        25/10        50/10        25/10        25/10        50/10        25/10
#GC           2048         2048         2048         2048         2048         2048         1024         2048         1024         –
DV            –            400          400          400          400          400          400          600          600          512

TABLE XV: The performance of the target systems C, ..., J

Task  Metric    C      D      E      F      G      H      I      J
CSI   Accuracy  99.8%  99.4%  99.2%  99.8%  99.6%  99.8%  99.2%  99.2%
SV    FAR       10.0%  9.8%   9.4%   10.0%  11.2%  9.8%   10.4%  10.2%
      FRR       1.2%   0.6%   1.6%   1.2%   0.8%   1.0%   2.2%   0.8%
OSI   FAR       9.1%   8.8%   10.9%  9.2%   8.5%   8.1%   11.0%  7.7%
      FRR       1.4%   0.6%   1.6%   1.4%   1.2%   0.8%   2.2%   0.8%
      OSIER     0.0%   0.2%   0.2%   0.0%   0.2%   0.0%   0.4%   0.2%

TABLE XVI: Results of the transferability attack for the CSI task (%), where S and T denote the source and target systems. Each cell gives ASR/UTR.

S\T   A          B          C          D            E            F            G          H            I          J
A     —          76.9/76.9  89.7/89.7  64.1/71.8    87.2/89.7    84.6/84.6    76.9/87.2  76.9/84.6    48.7/69.2  28.2/38.5
B     30.7/88.0  —          93.3/96.0  100.0/100.0  100.0/100.0  100.0/100.0  88.0/89.3  100.0/100.0  73.3/80.0  25.3/38.7

TABLE XVIII: Results of the transferability attack for the SV task (ASR, %), where S and T denote the source and target systems.

S\T   A    B     C     D      E      F      G      H      I     J
A     —    57.9  49.1  54.4   64.9   61.4   52.6   66.7   36.8  33.3
B     5.0  —     67.5  100.0  100.0  100.0  100.0  100.0  80.0  38.3

TABLE XVII: Results of FAKEBOB when θ is tuned based on the Equal Error Rate. The Equal Error Rate and corresponding threshold θ for ivector (resp. GMM) are 2.2% and 1.75 (resp. 5.8% and 0.103), and ε = 0.002.

        ivector                               GMM
Task    #Iter  Time (s)  SNR (dB)  ASR (%)    #Iter  Time (s)  SNR (dB)  ASR (%)
SV      120    2297      31.7      99.0       46     273       31.4      99.0
OSI     125    2786      32.1      99.0       54     334       31.9      99.0

E. Robustness of FAKEBOB against Defense Methods

Local smoothing. This defense mitigates attacks by applying a mean, median or Gaussian filter to the waveform of a voice. Based on the results in [31], we use the median filter. A median filter with kernel size k (which must be odd) replaces each audio sample x_i by the median of the k values [x_{i-(k-1)/2}, ..., x_i, ..., x_{i+(k-1)/2}] (a sketch is given below).
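This defense can be reproduced with an off-the-shelf median filter; the SciPy-based sketch below is our own illustration rather than the authors' code.

```python
import numpy as np
from scipy.signal import medfilt

def smooth_voice(waveform, k=7):
    """Local smoothing defense: median filter with (odd) kernel size k."""
    return medfilt(np.asarray(waveform, dtype=float), kernel_size=k)

# Larger k removes more of the adversarial perturbation, but also
# distorts normal voices and raises the FRR (cf. Fig. 8a).
```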
In S1, we vary k from 1 to 19 with step 2. The results are shown in Fig. 8a. We can see that this defense is ineffective against high-confidence (hc) adversarial voices. For low-confidence (lc) adversarial voices, although the UTR drops from 99% to nearly 0%, the minimal FRR on normal voices increases to 35%, significantly larger than the 4.2% baseline. We also tested the median filter with k = 3 on ivector: the FRR on normal voices only increases by 7%, so ivector appears to be more robust than GMM. In S2, we fix k = 7 as [31] did. The results are shown in Fig. 9a. Although the median filter slightly increases the attack cost, FAKEBOB quickly achieves 90% ASR within a max iteration bound of 250, against a baseline of 90; solving the remaining few voices (9%) requires a max iteration bound of 15,000. Although ivector is more robust than GMM, a similar result is observed there (cf. Fig. 9b). We conclude that local smoothing (at least the median filter) can increase the attack cost, but is ineffective in terms of ASR.

Fig. 8: Results of the median filter and audio squeezing in S1, where UTR-lc denotes the UTR of low-confidence adversarial voices (κ = 0) and UTR-hc denotes the UTR of high-confidence adversarial voices (0 < κ < 5). (a) Median filter; (b) audio squeezing.

Fig. 9: Attack cost under the median filter (k = 7) and audio squeezing (τ = 0.5) in S2. (a) GMM system; (b) ivector system.

TABLE XIX: Settings of the over-the-air attacks, where "x meter(s) (y dB)" means that when the microphone is kept x meters away from the loudspeaker, the average volume of the voices reaches y dB, and "white noise (z dB)" means the acoustic environment is degraded by a white-noise generator playing at z dB.

• Different systems: GMM OSI/CSI/SV, ivector OSI/CSI/SV, Azure OSI; loudspeaker: JBL Clip 3 portable speaker; microphone: iPhone 6 Plus (iOS); distance: 1 meter (65 dB); environment: relatively quiet.
• Different devices: ivector OSI; loudspeakers: DELL laptop, JBL Clip 3 portable speaker, Shinco broadcast equipment; microphones: iPhone 6 Plus (iOS), OPPO (Android); distance: 1 meter (65 dB); environment: relatively quiet.
• Different distances: ivector OSI; loudspeaker: JBL Clip 3 portable speaker; microphone: iPhone 6 Plus (iOS); distances: 0.25 m (70 dB), 0.5 m (68 dB), 1 m (65 dB), 2 m (62 dB), 4 m (60 dB), 8 m (55 dB); environment: relatively quiet.
• Different acoustic environments: ivector OSI; loudspeaker: JBL Clip 3 portable speaker; microphone: iPhone 6 Plus (iOS); distance: 1 meter (65 dB); environments: white noise (45/50/60/65/75 dB), bus noise (60 dB), restaurant noise (60 dB), music noise (60 dB), absolute music noise (60 dB).

Audio squeezing. This defense down-samples voices and applies signal recovery to disrupt perturbations (a sketch is given below). In S1, we vary τ (the ratio between the new and the original sampling frequency) from 0.1 to 1.0, the same as [10]. The results are shown in Fig. 8b. We can observe that when τ = 0.9, (1) the FRR on normal voices is 6%, close to the 4.2% baseline; (2) the UTR of the low-confidence adversarial voices is 17%, well below the 99% baseline; (3) however, the UTR of the high-confidence adversarial voices is the same as the baseline. In S2, we fix τ = 0.5 as [31] did. The results are shown in Fig. 9a and Fig. 9b. Unexpectedly, this defense decreases the overhead of the attack and increases the ASR: for instance, FAKEBOB achieves 100% ASR within a max iteration bound of 200 on the defended system, while it only achieves 99% ASR even with a max iteration bound of 16,000 on the unsecured system. This is possibly because audio squeezing (τ = 0.5) sacrifices the performance of the SRS itself. We conclude that audio squeezing is ineffective against FAKEBOB in terms of both attack cost and ASR.
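Audio squeezing, as evaluated here, amounts to down-sampling by the ratio τ and then recovering the original length; the SciPy-based sketch below is our own reading of "down-sampling plus signal recovery", not the authors' code.

```python
import numpy as np
from scipy.signal import resample

def audio_squeeze(waveform, tau=0.5):
    """Down-sample to tau * original rate, then recover the original
    length; high-frequency content (and, ideally, the adversarial
    perturbation) is discarded in the process."""
    x = np.asarray(waveform, dtype=float)
    squeezed = resample(x, int(len(x) * tau))  # down-sample
    return resample(squeezed, len(x))          # recover the length
```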
Quantization. This defense rounds the amplitude of each sample point of a voice to the nearest integer multiple of a factor q in order to mitigate the perturbation. In S1, we vary q over 128, 256, 512 and 1024, as [31] did. However, the system did not output any result for either adversarial or normal voices. An in-depth analysis reveals that all frames of the voices are regarded as unvoiced frames by the Voice Activity Detection (VAD) [103] component. This demonstrates that quantization is not suitable for defending against FAKEBOB; consequently, we do not consider S2.

Temporal dependency detection. For a given voice v, suppose a speech-to-text system produces the text t(v). Given a parameter 0 ≤ k ≤ 1, let v_k (resp. t_k) denote the prefix covering the first k portion of the voice v (resp. the text t). Temporal dependency detection uses the distance between the texts t(v)_k and t(v_k) to determine whether v is an adversarial voice, as this distance is greater for adversarial voices than for normal voices (a sketch of the check is given at the end of this appendix). We use this method to check adversarial voices crafted by FAKEBOB, with k = 4/5 and the Character Error Rate distance metric, the best configuration in [31]. We do not test other values of k, as the result does not vary much, as noted in [31]. We use Baidu's DeepSpeech model as the speech-to-text system, as implemented by Mozilla on GitHub [104] with more than 13k stars.

Fig. 10: ROC curves of temporal dependency detection, with AUC 46.7% for low-confidence (lc) and 50.5% for high-confidence (hc) adversarial samples.

Fig. 10 shows the ROC curves of this method when distinguishing low-confidence and high-confidence adversarial samples. It obtains a 50% true positive rate at about a 50% false positive rate, and the AUC values are 46.7% and 50.5%, close to random guessing, indicating that it fails to detect our adversarial samples. This is because FAKEBOB does not alter the transcription of the voices, so the temporal dependency is preserved. Consequently, we do not consider S2.
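For completeness, a minimal sketch of the temporal dependency check follows; the transcribe argument stands in for a speech-to-text system such as DeepSpeech [104], and the decision threshold is a placeholder we introduce for illustration, not a value from [31].

```python
def character_error_rate(ref, hyp):
    """Edit (Levenshtein) distance between strings, normalized by len(ref)."""
    d = [[i + j if i * j == 0 else 0 for j in range(len(hyp) + 1)]
         for i in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            d[i][j] = min(d[i-1][j] + 1,                       # deletion
                          d[i][j-1] + 1,                       # insertion
                          d[i-1][j-1] + (ref[i-1] != hyp[j-1]))  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

def is_adversarial(voice, transcribe, k=0.8, threshold=0.5):
    """Flag a voice whose k-prefix transcription t(v_k) disagrees with
    the k-prefix of the full transcription t(v)_k (temporal dependency
    detection [31]); threshold is a hypothetical cut-off."""
    t_full = transcribe(voice)                            # t(v)
    t_prefix = transcribe(voice[: int(len(voice) * k)])   # t(v_k)
    t_full_k = t_full[: int(len(t_full) * k)]             # t(v)_k
    return character_error_rate(t_full_k, t_prefix) > threshold
```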
