A Spoofing Benchmark for the 2018 Voice Conversion Challenge: Leveraging from Spoofing Countermeasures for Speech Artifact Assessment



Tomi Kinnunen(1), Jaime Lorenzo-Trueba(2), Junichi Yamagishi(2,3), Tomoki Toda(4), Daisuke Saito(5), Fernando Villavicencio(6), Zhenhua Ling(7)

(1) University of Eastern Finland, Joensuu, Finland; (2) National Institute of Informatics, Tokyo, Japan; (3) University of Edinburgh, UK; (4) Nagoya University, Nagoya, Japan; (5) University of Tokyo, Tokyo, Japan; (6) ObEN, Pasadena, USA; (7) University of Science and Technology of China, Hefei, China

vcc2018@vc-challenge.org

Abstract

Voice conversion (VC) aims at converting speaker characteristics without altering content. Due to training data limitations and modeling imperfections, it is difficult to achieve believable speaker mimicry without introducing processing artifacts; performance assessment of VC, therefore, usually involves both speaker similarity and quality evaluation by a human panel. As a time-consuming, expensive, and non-reproducible process, this hinders rapid prototyping of new VC technology. We address artifact assessment using an alternative, objective approach that leverages prior work on spoofing countermeasures (CMs) for automatic speaker verification. There, CMs are used for rejecting 'fake' inputs such as replayed, synthetic or converted speech, but their potential for automatic speech artifact assessment remains unknown. This study serves to fill that gap. As a supplement to the subjective results for the 2018 Voice Conversion Challenge (VCC'18) data, we configure a standard constant-Q cepstral coefficient CM to quantify the extent of processing artifacts. The equal error rate (EER) of the CM, a confusability index of VC samples with real human speech, serves as our artifact measure.
Two clusters of VCC'18 entries are identified: low-quality ones with detectable artifacts (low EERs), and higher-quality ones with fewer artifacts. None of the VCC'18 systems, however, is perfect: all EERs are below 30% (the 'ideal' value would be 50%). Our preliminary findings suggest the potential of CMs outside their original application, as a supplemental optimization and benchmarking tool to enhance VC technology.

[Note: This is a corrected version of the Odyssey 2018 publication with the same title and author list. A bug (partial training-test overlap in bona fide trials) was identified afterwards, leading to underestimated EERs on the VCC'18 base and VCC'18 (challenge entries) data. The overlapped trials were removed, and the affected Tables (1, 2, 3 and 4), Figures (1 and 2) and their descriptions in the text were updated. The overall trends and conclusions remain unchanged. Date of the document: September 5, 2018.]

1. Introduction

Voice conversion (VC) [1, 2] aims at converting speaker characteristics without altering the speech content. Typical uses of VC technology include applications in the entertainment industry, such as customizing artificial voices for audio-books and games. In such applications, the VC samples are optimized for human listeners. In the recent past, thanks to technological advances in both VC and automatic speaker verification (ASV) technology, VC also finds frequent use in assessing ASV system vulnerability against intentional circumvention (spoofing) [3]. In this case, the VC samples are prepared for the ASV system. In both cases, the goal is for the VC samples to make the primary observer (either a human or a machine) believe they are observing a certain targeted speaker that is different from the source speaker. But human perception and machine perception are different, and in the case of ASV and its spoofing, machine perception is more relevant.
Even though VC technology itself has evolved steadily over the years [4, 5, 6], the evaluation methods of VC are more varied compared to tasks such as ASV or automatic speech recognition. The primary evaluation methods are perceptual tests, since the target of the above applications is normally human perception. In addition, log-spectral distortion and cepstral distortion (between converted and target utterances) are also used as supplementary information. At this moment, we lack universally adopted objective performance measures. Furthermore, there was no standard database until recently.

Given this situation, [7] launched the Voice Conversion Challenge (VCC) series in 2016, with a follow-up in 2018 organized by the authors of this study [8]. The primary methodology for the evaluation of VC systems, including the VCC, is a perceptual test of various VC systems trained on a common corpus. The perceptual test usually involves both speaker similarity and quality evaluation by a human panel, since it is difficult to achieve convincing speaker transformation without introducing processing artifacts due to training data limitations and modeling imperfections. The VCC series provides results of large-scale perceptual tests that compare many different types of VC systems on a comparable basis. This is helpful towards understanding human perception and the optimization strategies employed by listeners. On the other hand, listening test results do not directly reflect spoofing capability, and we do not know how the results of the perceptual test are related to machine perception, that is, ASV and its spoofing.

With the above motivations in mind, the present study accompanies [8], which provides details of the 2018 challenge data, analysis of the submitted systems and the perceptual results. We provide supplemental objective quality results on the degree of artifacts that each of the submitted systems exhibits.
Even though our experiments are framed in the context of the latest VCC'18 challenge (http://www.vc-challenge.org/, data available at http://dx.doi.org/10.7488/ds/2337 since April 10, 2018), our contribution is that of a novel objective speech artifact assessment leveraging the rapidly emerging topic of spoofing countermeasures. In the context of ASV, spoofing refers to intentional circumvention of the ASV system to obtain illegitimate access as another targeted user [9], VC technology being a representative example. ASV vulnerability to spoofing has been known for about two decades [10] but has gained momentum only relatively recently, with increased interest towards ASV deployment for user authentication, as well as the availability of common evaluation resources [11, 12] that enable meaningful comparisons of different spoofing countermeasures.

There has been continued research towards generalized spoofing countermeasures that detect spoofing attacks more accurately. As a result, several advanced front-end [13, 14, 15, 16] and machine learning [17] oriented techniques have been developed for the task of detecting the presence of a spoofing attack in a given audio segment. The task is framed as a hypothesis testing problem, with bona fide (legitimate human speech) as the null hypothesis and spoof as the alternative hypothesis. The exact definition of the latter depends on the type of spoofing attack (e.g. VC or replay attack).

If the detector is carefully optimized and the probability distributions of the bona fide and spoof classes are sufficiently distinct, one can detect the attacks. But if the spoofed samples resemble real human speech too closely, the detector can mistakenly classify them as bona fide speech. Therefore, the number of errors made by a spoofing countermeasure for a given batch of test files is associated with how closely the spoof samples resemble the bona fide samples.
Specifically, the spoofing countermeasure gauges the amount of speech artifacts that only the spoof samples have (regardless of whether the artifacts are audible to a human or not) and tells us how close the spoof samples are to the bona fide samples. We therefore hypothesize that this is useful for the automatic assessment of speech artifacts produced by the VC process and may be used as one of the objective performance measures. Accordingly, this study compares the performance of the spoofing countermeasure with the subjective quality evaluation results obtained in VCC'18 and investigates how they are related to each other.

2. Subjective quality of converted voices

Suppose we have a batch of $N$ source speaker utterances $\mathcal{X} = \{X_1, \ldots, X_N\}$ processed through $S$ voice conversion systems $s = 1, \ldots, S$. We denote the utterances converted by system $s$ by $\mathcal{Y}^s = \{Y_1^s, \ldots, Y_N^s\}$ and use $\mathcal{Y} = \bigcup_{s=1}^{S} \mathcal{Y}^s$ to denote all the $N \cdot S$ conversions. For the purpose of comparing the alternative VC systems in terms of speech quality, the samples $\mathcal{Y}$ are collectively listened to by a cohort of human observers, $\mathcal{O} = \{O_1, \ldots, O_L\}$, each of whom outputs an opinion score for a subset of samples indexed by $\alpha_i$ for observer $i$. Note that the observers do not necessarily listen to the same utterances, nor even to the same number of samples. The process of obtaining the opinion score of observer $i$ for utterance $Y \in \mathcal{Y}$ can be thought of as evaluating an abstract, non-deterministic and possibly time-varying function $h(Y, O_i)$ that models the human listening mechanism of $O_i$. It depends on many factors, such as the listener's life experience and concentration, the listening environment, the audio equipment used, and familiarity with the language. We do not have access to the internals of $h(\cdot, O_i)$ but only to its observed output, in this study the standard 5-point rating scale ranging from 1 (lowest quality) to 5 (highest quality).
Because of random variation in the outputs, caused by differences in $h(\cdot, O_i)$, one represents the results of the listening panel in an averaged form. The well-known population summary measure, mean opinion score (MOS), is computed for system $s$ by

\[ \mathrm{MOS}_s = \frac{1}{L_s} \sum_{n=1}^{N} \sum_{i=1}^{L} h(Y_n^s, O_i), \]

where $L_s$ is the total number of opinion scores obtained for the samples of system $s$, and where we assign a dummy value $h(Y, O_i) = 0$ if listener $i$ did not rate $Y$. The higher the MOS value, the higher the quality of the samples of system $s$. The definition of 'quality' is itself subjective: no instruction on what high or low quality means is normally given.

3. Proposed objective artifact assessment using spoofing countermeasures

Here we want to construct a model that automatically scores the amount of speech artifacts. The artifacts may be audible or inaudible (hence, the aim of the measure is not to approximate the subjective quality judgement).

3.1. Obtaining machine scores

With objective artifact estimation, the problem setup is the same as above: given $\mathcal{Y}^s$, we want to obtain a single numerical value, similar to MOS, that relates to the degree of artifacts produced by a VC system $s$. To this end, we replace the abstract human observer $h(Y, O_i)$ by a machine observer, $m(Y, \theta)$, represented by some model parameters $\theta$. There are several differences between $h$ and $m$. First, unlike $h$, evaluating $m$ is deterministic and time-invariant; in other words, it yields the same output when repeated on the same sample, and does not depend on the time of invocation. Second, unlike $h$, where one usually constrains the outputs to be quantized to a small set of ordinal values, we allow the range of $m$ to be the entire real line $\mathbb{R}$; the scale of the output value is arbitrary but, similar to $h$, higher numerical values in relative terms indicate higher speech quality as judged by the observer $\theta$. To flesh out this idea, in this work $m(\cdot, \theta)$ takes the form of a likelihood ratio detector.
Likelihood ratios arise naturally from the Bayes theorem and serve as the starting point for making statistically optimal decisions. For a given input utterance $Y \in \mathcal{Y}$, we compute a log-likelihood ratio (LLR) score,

\[ \ell(Y \mid \theta) = \log \frac{p(Y \mid H_0)}{p(Y \mid H_1)} = \log \frac{p(Y \mid \theta_{\mathrm{nat}})}{p(Y \mid \theta_{\mathrm{artif}})}, \tag{1} \]

where $\theta = (\theta_{\mathrm{nat}}, \theta_{\mathrm{artif}})$. The null hypothesis $H_0$, modeled through $\theta_{\mathrm{nat}}$, is that $Y$ represents natural human speech without artifacts. The alternative hypothesis $H_1$, modeled using $\theta_{\mathrm{artif}}$, states that $Y$ originates from artificial speech generation (such as voice conversion or speech synthesis). Higher numerical values are therefore associated with speech that appears more 'human-like', lacking vocoding artifacts or other problems that VC systems tend to generate.

To train $\theta_{\mathrm{nat}}$, we gather a large collection of natural human utterances and train the model from the pooled data; similarly, to train $\theta_{\mathrm{artif}}$, we gather a representative collection of artificial speech samples (such as samples from several state-of-the-art VC systems, available prior to evaluating a new VC system). In this work, we use a Gaussian mixture model (GMM) for each hypothesis, trained through the expectation-maximization (EM) algorithm. Though more advanced spoofing countermeasure back-ends are available, GMMs produced good results for the ASVspoof '15 challenge, which also consisted of high-quality clean samples as here, with the benefit of simplicity.

3.2. Error rate of the CM as an objective quality measure

The log-likelihood ratio in (1) outputs a number for a single utterance $Y$. How, then, do we obtain a summary value for all the samples of a given VC system $s$? While it might be appealing to simply average $\ell(Y_n \mid \theta)$ over the samples $Y_n \in \mathcal{Y}^s$, similar to MOS, this is not advisable. Unlike the opinion score, $\ell$ is unbounded and is therefore more difficult to interpret.
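To make the two-model setup of Eq. (1) concrete, the following sketch trains two diagonal-covariance GMMs with EM and scores an utterance by its average per-frame LLR. It is a minimal illustration using scikit-learn and random stand-in feature matrices; the actual system uses CQCC features pooled from the corpora described later, and all sizes and names here are assumptions for demonstration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical pooled training features: rows are D-dimensional frames
# (D = 4 here for illustration; the CQCC setup used later is 90-dimensional).
rng = np.random.default_rng(0)
feats_natural = rng.normal(loc=0.0, scale=1.0, size=(2000, 4))
feats_artifact = rng.normal(loc=0.7, scale=1.2, size=(2000, 4))

# One diagonal-covariance GMM per hypothesis, as in Eq. (1); the component
# count C is a tuning parameter balancing over- and under-fitting.
gmm_nat = GaussianMixture(n_components=8, covariance_type="diag",
                          random_state=0).fit(feats_natural)
gmm_art = GaussianMixture(n_components=8, covariance_type="diag",
                          random_state=0).fit(feats_artifact)

def llr_score(utterance_feats):
    """Average per-frame log p(y | theta_nat) - log p(y | theta_artif)."""
    return float(np.mean(gmm_nat.score_samples(utterance_feats)
                         - gmm_art.score_samples(utterance_feats)))

# Higher scores indicate more 'human-like' speech to the detector.
test_nat = rng.normal(0.0, 1.0, size=(300, 4))
test_art = rng.normal(0.7, 1.2, size=(300, 4))
print(llr_score(test_nat), llr_score(test_art))
```

On these synthetic data the natural-sounding utterance receives the higher score, mirroring the intended behavior of the detector.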
Further, the model $\theta$ is an imperfect version of reality and cannot possibly produce meaningful LLR scores for all human speech and all spoofing attacks. As a result, the scale of $\ell$ is arbitrary and depends on various modeling choices (including that of feature extraction). The LLRs across databases, VC systems and converted utterances are not on a commensurable scale. In the nomenclature of the ASV literature, we might say $\ell$ is not well-calibrated [18].

Hence, it is better to adopt a summary measure that is more intuitive and comparable across different evaluation environments. Our proposal is the error rate of the detector, as a measure of its ability to differentiate authentic human utterances from those generated by voice conversion. Our philosophy is as follows. If the samples from a VC system $S_1$ manage to fool a given artificial speech detector more often than samples from another competitive VC system $S_2$, we can say the samples of $S_1$ appear more human-like in the eyes (or ears) of the artificial speech detector. Utterances having fewer processing artifacts are more difficult to discriminate from human speech, giving rise to a higher error rate of the detector.

Even though our inspiration for this proposal originates from work in ASV anti-spoofing, the viewpoint is now switched from defender to attacker. In ASV anti-spoofing, one keeps improving spoofing countermeasures so that they are more accurate in detecting advanced speech synthesis and voice conversion attacks (the lower the error rate, the better). But now we consider the anti-spoofing system to be fixed, with the goal of improving the performance of voice conversion systems: the higher the error rate, the better.

As for the actual error rate measurement, we adopt the standard metric of equal error rate (EER) used extensively in ASV, anti-spoofing and biometrics research.
The output scores (estimated LLRs) produce two different types of errors, false alarms and misses, which are traded off against each other. Here, the false alarm (false acceptance) rate (FAR) is the proportion of artificial speech samples that the detector falsely accepts as bona fide (human) samples. The miss (or false rejection) rate (MR), in turn, is the proportion of falsely rejected bona fide samples. FAR and MR are, respectively, decreasing and increasing functions of the detection threshold. The EER, then, is the unique error rate corresponding to the threshold at which FAR and MR equal each other. As the detection task involves only two classes, the chance level is an EER of 50%. This would be our 'ideal' value for a successful, artifact-free VC system. (Technically it is possible to do worse than the coin-flipping rate, for instance by swapping the two model likelihoods in (1); EERs much larger than 50% usually suggest an implementation bug in the detector and are not interesting from the perspective of evaluation.) One might therefore optionally report a scaled version, $\mathrm{EER}\%/10$, to give rise to a continuous version of a 5-point 'opinion score' as judged by the machine observer.

3.3. Choice of the countermeasure model (1)

To implement the artifact LLR detector of (1), we represent speech utterances by a sequence of short-term spectral features, $Y = \{y_1, \ldots, y_N\}$ with $y_n \in \mathbb{R}^D$, $D$ being the feature dimensionality. At the training stage, we use the pooled feature vectors from each class to independently train two Gaussian mixture models (GMMs), $\theta_{\mathrm{nat}}$ and $\theta_{\mathrm{artif}}$, using the standard expectation-maximization (EM) algorithm. We use diagonal covariance matrices and treat the number of Gaussian components, $C$, as a tuning parameter that adjusts the balance between over- and under-fitting.
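The EER described in Sec. 3.2 can be estimated from detector scores by sweeping a decision threshold until the FAR and MR curves cross. Below is a minimal threshold-sweep sketch on synthetic scores (evaluation toolkits typically compute the EER more carefully from the ROC convex hull; this simplified version is for illustration only):

```python
import numpy as np

def equal_error_rate(bona_fide_scores, spoof_scores):
    """EER from detector scores, where higher = more human-like.

    FAR(t): fraction of spoof samples with score >= t (falsely accepted).
    MR(t):  fraction of bona fide samples with score < t (falsely rejected).
    FAR decreases and MR increases with t; the EER sits at their crossing.
    """
    thresholds = np.sort(np.concatenate([bona_fide_scores, spoof_scores]))
    far = np.array([(spoof_scores >= t).mean() for t in thresholds])
    mr = np.array([(bona_fide_scores < t).mean() for t in thresholds])
    idx = np.argmin(np.abs(far - mr))
    return (far[idx] + mr[idx]) / 2

rng = np.random.default_rng(1)
# Well-separated score distributions give a low EER (artifacts easy to spot);
# heavily overlapping ones push the EER toward the 50% chance level.
eer_easy = equal_error_rate(rng.normal(2.0, 1.0, 1000),
                            rng.normal(-2.0, 1.0, 1000))
eer_hard = equal_error_rate(rng.normal(0.1, 1.0, 1000),
                            rng.normal(-0.1, 1.0, 1000))
print(eer_easy, eer_hard)  # easy setup: low EER; hard setup: near chance

# Optional scaled 'machine opinion score' from Sec. 3.2: EER% / 10 (0-5 scale).
machine_mos = eer_hard * 100 / 10
```

From the VC developer's viewpoint, a higher EER (closer to 50%) is better, since it means the detector can no longer tell the converted samples from bona fide speech.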
The choice of short-term feature representation is critically important in the context of spoofing countermeasures [13]. Findings from the first ASVspoof 2015 challenge [11] highlighted the importance of spectral and temporal details for the task of discriminating real and spoofed speech. Specifically, conventional MFCCs, a low-resolution, low-frequency-focused feature set that has enjoyed de facto audio representation status for almost 40 years, are a suboptimal choice. Discriminating human speech from synthetic or converted speech and identifying the artifacts seems to require a more detailed time-frequency representation. The winning system of the ASVspoof '15 challenge [16] used MFCCs in combination with cochlear filter cepstral coefficients and instantaneous frequency. Later, [14] introduced a single feature set, constant-Q cepstral coefficients (CQCCs), based on the constant-Q transform [19]. It led to the lowest reported EERs on the ASVspoof '15 corpus at the time. Substantial follow-up work (e.g. [15]) has improved feature extractors even further. In this work we use CQCCs due to their high reported detection accuracy, simplicity, and widespread adoption by the research community. We use the open-source CQCC implementation provided to the second ASVspoof challenge participants (http://www.asvspoof.org/data2017/baseline_CM.zip), and similarly another public toolkit to train the GMMs [20].

3.4. Data-related considerations to enable fair evaluation

Besides the specification of the front-end features, another key consideration is the choice of training and development data for the countermeasure. Despite the high accuracy of the spoofing countermeasure front-ends listed above, they are notoriously sensitive to training-test mismatch (cross-corpus performance) [21], additive noise [22] and channel/bandwidth mismatch [23].
In short, countermeasures are easy to overfit to specific data, leading to potentially arbitrarily bad results on different test data. This makes the selection of data, and the optimization process of the countermeasure parameters, important.

As challenge organizers, our responsibility is to avoid favoring any specific VC system and to provide as unbiased an assessment of the systems as we possibly can. To this end, we aim to optimize our countermeasure with the following requirements in mind.

1. Stability across datasets. The countermeasure should show stable enough results when executed on different corpora, so that one can trust the result to be less dependent on the specifics of the VCC'18 data.

2. Detection of state-of-the-art voice conversion. We should aim at detecting the current state-of-the-art, or otherwise previously known, VC attacks with sufficient accuracy.

3. No tweaking using participant submissions. We should not optimize our proposed measure with feedback from the VCC'18 evaluation entries. That is, we should not look at the error rates and use the submitted samples to enrich training sets. Instead, we should fix the data and parameters to the best-known values using data that precedes the VCC'18 participant entries.

The last requirement might be less obvious to a reader from the voice conversion field. It reflects the viewpoint of the spoofing countermeasure as a security gate in real-world deployment: one does not know the attacks (here, voice conversion samples) in advance but has to use one's best knowledge to prepare the countermeasure using attacks available beforehand. In the standard automatic speaker verification (ASV) evaluation benchmarks conducted by the National Institute of Standards and Technology (NIST), evaluation participants are similarly expected to process the new evaluation samples completely blindly; they are not allowed to interact with the evaluation samples in any manner (such as listening to them, or using them to make modeling decisions), for the same reason. We apply the same principle in our role as challenge evaluator, to give all the submitted VCC'18 systems an equal opportunity to break down our countermeasure.

Table 1: Considered datasets for the construction of spoofing countermeasures for objective quality assessment of the VCC'18 samples. The datasets differ in the number of speakers and the diversity of spoofing attacks. The ASVspoof '15 data represented the state of the art of SS and VC in 2014-2015, while VCC'16 contains more modern attacks. ASVspoof: Automatic Speaker Verification Spoofing and Countermeasures Challenge; VCC: Voice Conversion Challenge; SS: speech synthesis; VC: voice conversion; STRAIGHT: Speech Transformation and Representation using Adaptive Interpolation of weiGHTed spectrum; LPC: linear predictive coding.

                       ASVspoof '15  ASVspoof '15  ASVspoof '15       VCC'16          VCC'18 base
                       train         dev           eval
  Types of attacks     SS and VC     SS and VC     SS and VC          VC              VC
  Waveform generation  STRAIGHT,     STRAIGHT      STRAIGHT, diphone  STRAIGHT, LPC,  Waveform
                       MLSA                        concatenation      Ahocoder        filtering
  # Spoofing attacks   5             10            10                 18              1 (VCC'18 basel.)
  # Spoof files        12,625        49,875        184,000            24,300          2,240
  # Speakers (♂ + ♀)   25 (10 + 15)  35 (15 + 20)  46 (20 + 26)       10 (5 + 5)      8 (4 + 4)
  # Human files        3,750         3,497         9,404              535             464

4. Experimental data

4.1. The Voice Conversion Challenge 2018

The 2018 Voice Conversion Challenge (VCC'18) is a follow-up to the VCC series kicked off in 2016 [7]. It features the task of speaker identity conversion. A detailed description of the challenge data, rules, analysis of the submitted systems and extensive perceptual results are provided in another paper [8]; only the key facts are repeated here for completeness.
The VCC'18 challenge is designed to promote the development of both parallel and nonparallel VC methods; the parallel (Hub) task contains source-target training utterances with matched speech content, while in the nonparallel (Spoke) task the contents differ. Both tasks contain the same target speaker data, but the source speakers are different. The Hub task formed the core (required) task for the registrants, whereas participation in the Spoke task was optional. The participants were provided with training and development data and were asked to submit their converted audio files for previously unseen source utterances.

Both the VCC'16 and the VCC'18 data are based on the DAPS (Data And Production Speech) dataset [24], which includes native US English speakers recorded in a professional setting. The source and target speakers across the two challenges are all disjoint. The number of test sentences is 35, and the participants were asked to submit converted voices for a total of 16 source-target speaker pairs in both the Spoke and Hub tasks. The results were evaluated subjectively using crowd-sourcing. A total of 267 unique crowdworkers perceptually evaluated both naturalness and speaker similarity. The former was rated from 1 (completely unnatural) to 5 (completely natural), while a 4-point scale was used for the latter ("Same, absolutely sure", "Same, not sure", "Different, not sure", "Different, absolutely sure"). The trials consisted of comparisons of VC samples with either the source speaker or the target speaker.

4.2. Data for countermeasure development

With the above considerations in mind, we involve data from several audio collections that contain both synthetic speech and converted voice samples. The datasets are summarized in Table 1. As for the ASVspoof 2015 collection (train, dev, eval), documented in detail in [11], we follow the protocol files provided with the corpus (https://datashare.is.ed.ac.uk/handle/10283/853).
The VCC'16 data, in turn, consists of the samples of the first edition of the VCC series [7]. Specifically, it contains the participant submissions from 17 different systems plus 1 baseline system, along with samples from 10 speakers (three source females, two source males, two target females, three target males). The last dataset, denoted VCC'18 base, contains the VCC'18 baseline as its only attack (specifically, samples of the VCC'18 baseline system). The human trials contain speech from 8 speakers in the VCC'18 challenge. The four source speakers used in the Hub task of VCC'18 overlap with the VCC'16 data and are excluded from the trials. Again, no submitted VCC'18 system is used for training the countermeasures.

5. Results

5.1. CQCC-GMM countermeasure optimization

Given the sampling rate mismatch between the prior corpora (16 kHz for ASVspoof '15 and VCC'16) and the new VCC'18 data (22.05 kHz), we downsample the latter to 16 kHz. In our first experiment, we study the performance of the CQCC-GMM detector for different selections of training and test data, the primary goal being the selection of our main training data. For this first experiment, we fix the CQCC configuration to the default setting of the ASVspoof '17 challenge baseline (29 base CQCCs plus the zeroth (energy) coefficient, with deltas and double deltas, giving 90-dimensional features). We study the CQCC configurations both without any feature normalization and with cepstral mean and variance normalization (CMVN); the latter was included since it might be helpful in suppressing convolutive mismatch across datasets. Convolutive bias could originate from speaker, recording media or vocoder differences, and might be reducible through feature normalization techniques. In our implementation, we use utterance-level CMVN to obtain zero-mean, unit-variance features per file. We do not apply speech activity detection. The number of Gaussian components is set to 32.

Table 2: Equal error rate (EER, %) for intra-corpus (ASVspoof '15) and cross-corpus artifact detection experiments; the lower, the better. Number of Gaussians 32; 29 CQCCs and energy coefficient with deltas and double deltas. The EERs are averages of attack-specific EERs.

         Train     Test     CQCC (raw)  CQCC (CMVN)
  (i)    TRAIN'15  DEV'15   0.38        1.99
  (ii)   TRAIN'15  EVAL'15  1.83        2.21
  (iii)  TRAIN'15  VCC'16   35.04       34.26
  (iv)   ALL'15    VCC'16   34.86       31.52
  (v)    VCC'16    DEV'15   26.18       33.34
  (vi)   VCC'16    EVAL'15  21.48       34.75

Table 3: Equal error rate (EER, %) for CQCC optimization; the lower, the better. Number of Gaussians 32, training data VCC'16. Lines (i) to (vii) are based on 29 CQCCs without the energy coefficient, while (viii) includes the zeroth coefficient. The EERs are averages of attack-specific EERs.

                            DEV'15                   VCC'18 base
         Front-end          (raw)    (CMVN)          (raw)    (CMVN)
  (i)    stat               46.14    32.33           28.50    32.67
  (ii)   ∆                  12.11    31.20           32.45    33.15
  (iii)  ∆²                 13.58    29.39           30.96    31.74
  (iv)   ∆, ∆²              7.73     18.93           30.30    26.75
  (v)    stat, ∆            39.73    36.86           27.54    35.07
  (vi)   stat, ∆²           34.46    31.65           24.36    31.14
  (vii)  stat, ∆, ∆²        32.09    34.28           25.01    31.04
  (viii) z, stat, ∆, ∆²     26.18    33.34           24.15    27.91

[Figure 1: EER (%) on the VCC'18 base data as a function of the number of Gaussians (4 to 2048), for the ∆+∆², stat+∆² and z+stat+∆+∆² feature configurations. Training data is VCC'16.]

The results for the ASVspoof '15 and VCC'16 data are shown in Table 2. The first two rows correspond to the standard protocols of the ASVspoof '15 corpus and reflect intra-corpus performance. The results are in line with the published literature.
First, the error rates are remarkably low, demonstrating the potential of the CQCC-GMM countermeasure. Second, the error rate on the evaluation part is higher, due to the presence of one unknown attack (S10). CMVN systematically degrades performance.

The last four rows of Table 2 show the cross-corpus performance. As expected, the error rates are now far higher. The training set ALL'15 indicated on line (iv) was obtained by pooling the train, dev and eval files of ASVspoof '15 into one large training set. Comparing experiments (iii) and (iv), the larger training set gives no substantial boost for the unnormalized features (relative decrease of 0.5% in EER), though with some improvement (8% relative decrease) for the normalized features. CMVN is clearly helpful in (iv) only. Comparing experiments (v) and (vi) to (iii) and (iv) for the unnormalized features, the results are not symmetric regarding the roles of the training and test corpus. Even though the VCC'16 data is much smaller than ASVspoof '15 in terms of speaker and file count, its voice conversion samples (attacks) are perhaps acoustically more diverse.

Based on the results of Table 2, we fix VCC'16 as our training data for the remainder of the experiments. The next experiment concerns the impact of delta features, which have been noted to be accurate in detecting vocoded speech. Table 3 shows the results for the ASVspoof '15 dev trials (the more difficult of dev and eval) and for VCC'18 base. The results are shown again for raw and CMVN-processed features.

For the ASVspoof '15 dev trials, the outstanding front-end consists of just deltas and double deltas without feature normalization. The results highlight the usefulness of the dynamic features, and the unhelpfulness of the static coefficients: the first line, consisting of static coefficients only, gives performance close to the chance level of 50% EER.
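The dynamic (delta) features and the utterance-level CMVN compared in Table 3 can be sketched as follows. This is a generic regression-based delta and a simple per-file mean/variance normalization, not the exact implementation of the CQCC toolkit; the feature matrix is a random stand-in for the 29 static CQCCs.

```python
import numpy as np

def deltas(feats, width=2):
    """Regression-based delta features over a +/- width frame window."""
    n = len(feats)
    pad = np.pad(feats, ((width, width), (0, 0)), mode="edge")
    num = sum(k * (pad[width + k:n + width + k] - pad[width - k:n + width - k])
              for k in range(1, width + 1))
    den = 2 * sum(k * k for k in range(1, width + 1))
    return num / den

def cmvn(feats, eps=1e-8):
    """Utterance-level cepstral mean and variance normalization (per file)."""
    return (feats - feats.mean(axis=0)) / (feats.std(axis=0) + eps)

# Stand-in static features: 200 frames of 29 'CQCC' coefficients.
static = np.random.default_rng(2).normal(size=(200, 29))
d1 = deltas(static)            # delta
d2 = deltas(d1)                # double delta
features = np.hstack([d1, d2])  # the delta + double-delta configuration (58-dim)
features_norm = cmvn(features)  # optional per-utterance CMVN
```

Both operations reduce convolutive (channel-like) bias, which is one plausible reason the table shows them interacting: CMVN helps the static coefficients but adds little on top of the deltas.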
Comparing experiment (ii) to (v), experiment (iii) to (vi), and experiment (iv) to (vii), inclusion of the static coefficients systematically degrades performance. It is also noteworthy that while the plain delta features are degraded by CMVN, CMVN boosts the performance of the static coefficients in cases (i), (v) and (vi). This might be partly explained by noting that both CMVN and deltas help in reducing convolutive mismatch.

For the VCC'18 base data, CMVN systematically degrades performance; the only exception is the configuration consisting of deltas and double deltas only. The globally best setups for both trial sets are obtained without CMVN. Comparing the best configurations from each trial set, 7.73% for the ASVspoof'15 dev and 24.15% for VCC'18 base, the latter appears harder. This suggests that the state-of-the-art voice conversion attack (the VCC'18 baseline) is challenging for the CQCC-GMM countermeasure.

In our last parameter-tweaking experiment, we fine-tune the number of Gaussians (which was 32 until now for computational reasons) using just the VCC'18 base data (training with VCC'16). Based on the results of Table 3, we select three representative feature setups: the one that produced good results on the ASVspoof'15 dev data, and two good setups (stat + ∆² and z + stat + ∆ + ∆²) for the VCC'18 base data. The results displayed in Fig. 1 indicate improved performance with a larger number of Gaussians, as expected. The performance might be slightly improved by increasing the number of Gaussians further; for resource reasons, we stopped at 2048. Concerning the front-end setup, the full configuration containing static, delta, double-delta and zeroth coefficients yields the best results.

Table 4: Equal error rates (EER %) of the CQCC-GMM spoofing countermeasure for the VCC'18 entries on the Hub task. Here "B01" denotes the VCC'18 baseline system. The two countermeasures considered use deltas and double deltas (∆ + ∆²) of 29 CQCCs, and 29 CQCCs plus the zeroth coefficient along with deltas and double deltas (All feat.). Training data: VCC'16; number of Gaussians: 2048. The higher the EER, the better the VC system in terms of quality (fewer processing artifacts). ?: information unavailable to the authors.

Sys.  Waveform generation       ∆+∆²    All feat.
B01   Waveform filtering        29.77   25.55
D01   World direct wave mod.     8.43   18.75
D02   World                     43.94   48.95
D03   STRAIGHT                   1.58   16.07
D04   World                     15.74   15.03
D05   World                      4.03   16.63
N03   ?                          5.80   16.47
N04   World                      3.74   16.26
N05   SuperVP                   41.42   32.23
N06   World                      2.72   14.55
N07   World                      4.12   16.75
N08   Waveform filtering        37.20   23.84
N09   Ahocoder                   1.24   13.22
N10   Wavenet                    4.64   15.63
N11   STRAIGHT                   0.49    1.28
N12   Waveform filtering        40.98   23.48
N13   World                     10.69   17.42
N14   Waveform filtering        41.58   25.95
N15   World                      7.24   17.18
N16   Ahocoder                   1.18   11.17
N17   Wavenet                   15.19   15.48
N18   Griffin-Lim               32.35   19.43
N19   World                     11.07   21.97
N20   World                      3.54   15.23

5.2. Results for the VCC'18 samples

We now fix our countermeasure parameters to compare the VCC'18 submissions. From Table 3, we see that CMVN helps only in 3 (out of 16) cases, so we decide not to include it. Based on Table 3, ∆ + ∆² is a reasonable choice of features: it yields a clearly outstanding result on the difficult cross-corpus experiment with ASVspoof'15 dev and, with an optimized number of Gaussians (Fig. 1), works reasonably well also for the VCC'18 base data. Additionally, we include the full feature setup consisting of base coefficients (including energy) with deltas and double deltas, as this yielded the best overall results in detecting the baseline samples. Based on Fig. 1, we fix the number of Gaussians to 2048. The results for both of the selected feature setups on the VCC'18 Hub task are shown in Table 4.
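The countermeasure pipeline behind these numbers scores each trial by a log-likelihood ratio between a GMM trained on human speech and a GMM trained on converted speech, and the per-system EER is read off the resulting score distributions. A minimal sketch under stated assumptions: synthetic two-dimensional features stand in for CQCC frames, and only 4 Gaussians are used instead of the paper's 2048, for speed.

```python
# Sketch of a two-class GMM countermeasure with EER scoring.
# Feature data here is synthetic; it is NOT the paper's CQCC front-end.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
human_train = rng.normal(0.0, 1.0, size=(2000, 2))   # stand-in human frames
spoof_train = rng.normal(1.5, 1.0, size=(2000, 2))   # stand-in converted frames

gmm_human = GaussianMixture(n_components=4, random_state=0).fit(human_train)
gmm_spoof = GaussianMixture(n_components=4, random_state=0).fit(spoof_train)

def llr_score(feats):
    """Average per-frame log-likelihood ratio: human model vs. spoof model."""
    return gmm_human.score(feats) - gmm_spoof.score(feats)

def eer(genuine_scores, spoof_scores):
    """Equal error rate: operating point where miss rate equals false alarm rate."""
    scores = np.concatenate([genuine_scores, spoof_scores])
    labels = np.concatenate([np.ones_like(genuine_scores),
                             np.zeros_like(spoof_scores)])
    order = np.argsort(scores)
    labels = labels[order]
    frr = np.cumsum(labels) / labels.sum()              # genuine below threshold
    far = 1.0 - np.cumsum(1 - labels) / (1 - labels).sum()  # spoof above threshold
    idx = np.argmin(np.abs(frr - far))
    return 0.5 * (frr[idx] + far[idx])

# One utterance-level score per trial (100 frames each).
genuine = np.array([llr_score(rng.normal(0.0, 1.0, (100, 2))) for _ in range(50)])
spoofed = np.array([llr_score(rng.normal(1.5, 1.0, (100, 2))) for _ in range(50)])
print(f"EER = {100 * eer(genuine, spoofed):.1f}%")
```

With well-separated synthetic classes the EER is low (easy detection); a high-quality VC system would push the spoof score distribution toward the genuine one and the EER toward 50%, which is exactly how Table 4 ranks the entries.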
For ease of interpretation, we show the waveform generation method of each submission entry. For the ∆ + ∆² configuration, the waveform filtering, SuperVP and Griffin-Lim based waveform generation methods were judged as VC methods with relatively fewer artifacts compared to the STRAIGHT, World, and Ahocoder vocoders. This is reasonable, since these methods were proposed to alleviate the issues of minimum-phase vocoders. One surprising result may be that although N10 was evaluated as the best VC by human listeners (about 4.1 MOS), our method detected its artifacts easily, with an EER as low as 4.6%. N10 does not use any of the deterministic vocoders above; instead, it uses µ-law quantized waveforms [25]. The µ-law quantization may cause obvious artifacts (although they are inaudible to humans, and hence the samples were rated well by human listeners).

One interesting exception to our expectations is system D02. For the ∆ + ∆² configuration, D02 achieved the highest EER even though it used a known vocoder. According to the listening test [8], the D02 samples sound very similar to the source speakers and highly dissimilar to the target speaker. This suggests that D02 applied little modification to the source speaker's waveforms and hence likely has fewer processing artifacts.

We can also see that the EERs are sensitive to the choice of acoustic features. Using the full feature configuration, which was optimized to detect the VCC'18 baseline samples, the waveform filtering, SuperVP and Griffin-Lim based waveform generation methods still obtain high EERs, but there are also classic systems, such as N19, that obtain relatively high EERs despite being known technology. We suspect that the countermeasures are overfit and incapable of generalizing beyond the training data. It would be important to develop more stable and robust models.
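The µ-law quantization argument for N10 can be illustrated directly. The sketch below is our own illustration (not the N10 system): an 8-bit µ-law round trip, as used in WaveNet-style models, leaves a small but systematic quantization error on the waveform that a countermeasure could learn to exploit.

```python
# Illustration: 8-bit mu-law companding and the residual it leaves.
import numpy as np

MU = 255.0  # standard 8-bit mu-law constant

def mulaw_encode(x):
    """Compress a waveform in [-1, 1] and quantize it to 256 levels."""
    compressed = np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)
    return np.round((compressed + 1) / 2 * MU).astype(np.uint8)

def mulaw_decode(q):
    """Expand back to [-1, 1]; the rounding error is not recoverable."""
    compressed = 2 * q.astype(np.float64) / MU - 1
    return np.sign(compressed) * np.expm1(np.abs(compressed) * np.log1p(MU)) / MU

# A 440 Hz tone at 16 kHz: the round-trip error is nonzero everywhere
# the signal is quantized, even though it is typically inaudible.
x = 0.8 * np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
err = x - mulaw_decode(mulaw_encode(x))
print(f"max abs quantization error: {np.abs(err).max():.4f}")
```

The error is deterministic given the input, which is consistent with the observation that it can be modeled statistically by a countermeasure while remaining below the threshold of audibility.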
Having said that, the majority of the VC systems do not achieve EERs close to the chance level (50%) regardless of the selected features, indicating that all the VC methods produced artifacts to some extent and that there remains plenty of margin for improving the quality of the samples.

Finally, Fig. 2 displays a scatter plot of the EER of the ∆ + ∆² countermeasure against the mean opinion score (MOS) from the perceptual experiments detailed in [8]. We do not observe a strong association between the two; rather, they are complementary to each other. This is not entirely surprising, remembering that our objective measure covers both audible and inaudible artifacts, whereas listeners evaluated audible naturalness subjectively. As expected, human perception and machine perception differ.

6. Conclusion

We have proposed the use of a spoofing countermeasure for objective artifact assessment of converted voices as a supplement to the VCC'18 challenge results. Our approach is reference-free and text-independent in the sense that it requires access neither to the original source speaker waveform nor to any text transcripts. It assigns a single EER number to a batch of converted speech utterances from a single VC system, and can therefore be used to compare different VC systems in terms of their artifacts.

Although the tested countermeasure, utilizing a CQCC front-end and GMM back-end, was found to be sensitive to the choices of model parameters and acoustic features, our results indicate clear potential of spoofing countermeasure scores as a convenient and complementary tool for automatically assessing the amount of audible and inaudible speech artifacts.

The obtained results are in reasonable agreement with the types of waveform generation methods used by the VC systems. Our results indicate that the waveform filtering, SuperVP and Griffin-Lim methods have relatively fewer artifacts.
Since these methods are not included in the current ASVspoof datasets, we argue that it is important to create new spoofing materials based on them and to train more robust anti-spoofing countermeasures in practice.

We also revealed that the perceptually convincing VC samples based on WaveNet [25] in the VCC'18 have detectable artifacts. This implies that the current best VC samples may fool human ears but not necessarily the CM systems. No system yet fools both humans and CM systems perfectly.

While our study serves as a proof of concept, we foresee several possible future directions. First, it would be interesting to compare the performance to standard objective artifact measures used in assessing speech codecs and speech enhancement methods, and to other spoofing countermeasure front-ends besides CQCCs. Some alternative features may provide a stronger correlation with the MOS scores. Second, given the obvious issue of selecting and enriching the training data with the latest VC techniques, it might be relevant to consider one-class approaches that require human training speech only. Even if such approaches have had only moderate success in anti-spoofing [26], it would be interesting to revisit them in the context of artifact assessment.

7. Acknowledgements

We are grateful to iFlytek Ltd. for sponsoring the evaluation of the VCC 2018. This work was partially supported by MEXT KAKENHI Grant Numbers 15H01686, 16H06302, 17H04687 and 17H06101, and by the Academy of Finland (project no. 309629).

8. References

[1] Seyed Hamidreza Mohammadi and Alexander Kain, "An overview of voice conversion systems," Speech Communication, vol. 88, pp. 65–82, 2017.

[2] Yannis Stylianou, "Voice transformation: A survey," in Proc. ICASSP 2009, Taipei, Taiwan, 2009, pp. 3585–3588.
[3] Tomi Kinnunen, Zhizheng Wu, Kong-Aik Lee, Filip Sedlak, Engsiong Chng, and Haizhou Li, "Vulnerability of speaker verification systems against voice conversion spoofing attacks: The case of telephone speech," in Proc. ICASSP 2012, Kyoto, Japan, 2012, pp. 4401–4404.

[4] Tomoki Toda, Alan W. Black, and Keiichi Tokuda, "Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 8, pp. 2222–2235, 2007.

[5] Yuki Saito, Shinnosuke Takamichi, and Hiroshi Saruwatari, "Voice conversion using input-to-output highway networks," IEICE Transactions on Information and Systems, vol. E100.D, no. 8, pp. 1925–1928, 2017.

[6] Kazuhiro Kobayashi, Tomoki Hayashi, Akira Tamamori, and Tomoki Toda, "Statistical voice conversion with WaveNet-based waveform generation," in Proc. Interspeech 2017, 2017, pp. 1138–1142.

[7] Tomoki Toda, Ling-Hui Chen, Daisuke Saito, Fernando Villavicencio, Mirjam Wester, Zhizheng Wu, and Junichi Yamagishi, "The Voice Conversion Challenge 2016," in Proc. Interspeech 2016, San Francisco, CA, USA, 2016, pp. 1632–1636.

[8] Jaime Lorenzo-Trueba, Junichi Yamagishi, Tomoki Toda, Daisuke Saito, Fernando Villavicencio, Tomi Kinnunen, and Zhenhua Ling, "The Voice Conversion Challenge 2018: Promoting development of parallel and nonparallel methods," in Proc. Odyssey 2018 (accepted), 2018.

[9] Zhizheng Wu, Nicholas W. D. Evans, Tomi Kinnunen, Junichi Yamagishi, Federico Alegre, and Haizhou Li, "Spoofing and countermeasures for speaker verification: A survey," Speech Communication, vol. 66, pp. 130–153, 2015.

[10] Bryan L. Pellom and John H. L. Hansen, "An experimental study of speaker verification sensitivity to computer voice-altered imposters," in Proc. ICASSP 1999, Phoenix, AZ, USA, 1999, pp. 837–840.

[11] Zhizheng Wu, Junichi Yamagishi, Tomi Kinnunen, Cemal Hanilçi, Md. Sahidullah, Aleksandr Sizov, Nicholas W. D. Evans, and Massimiliano Todisco, "ASVspoof: The automatic speaker verification spoofing and countermeasures challenge," IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 4, pp. 588–604, 2017.

[12] Pavel Korshunov, Sébastien Marcel, Hannah Muckenhirn, Andre R. Goncalves, A. G. Souza Mello, Ricardo P. Velloso Violato, Flávio O. Simões, M. U. Neto, Marcus de Assis Angeloni, José Augusto Stuchi, Heinrich Dinkel, Nanxin Chen, Yanmin Qian, Dipjyoti Paul, Goutam Saha, and Md. Sahidullah, "Overview of BTAS 2016 speaker anti-spoofing competition," in Proc. IEEE BTAS 2016, Niagara Falls, NY, USA, 2016, pp. 1–6.

[13] Md. Sahidullah, Tomi Kinnunen, and Cemal Hanilçi, "A comparison of features for synthetic speech detection," in Proc. Interspeech 2015, Dresden, Germany, 2015, pp. 2087–2091.

[14] Massimiliano Todisco, Héctor Delgado, and Nicholas Evans, "A new feature for automatic speaker verification anti-spoofing: Constant Q cepstral coefficients," in Proc. Odyssey 2016, pp. 283–290.

[15] Kaavya Sriskandaraja, Vidhyasaharan Sethu, Eliathamby Ambikairajah, and Haizhou Li, "Front-end for antispoofing countermeasures in speaker verification: Scattering spectral decomposition," IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 4, pp. 632–643, 2017.
[Figure 2: Scatter plot of objective vs. subjective quality of the VCC'18 challenge entries for the HUB and SPOKE tasks. The vertical axis represents the mean opinion score (MOS) of subjective quality ratings of the samples of each challenge entry, while the horizontal axis represents the equal error rate (EER, %) of the spoofing countermeasure optimized to discriminate human from artificial speech. The higher the EER value, the more confused the countermeasure is in telling the converted samples apart from authentic human speech, implying higher quality of the samples. The 'ideal' values would be MOS = 5.0 and EER = 50%.]

[16] Tanvina B. Patel and Hemant A. Patil, "Combining evidences from mel cepstral, cochlear filter cepstral and instantaneous frequency features for detection of natural vs. spoofed speech," in Proc. Interspeech 2015, Dresden, Germany, 2015, pp. 2062–2066.

[17] Galina Lavrentyeva, Sergey Novoselov, Egor Malykh, Alexander Kozlov, Oleg Kudashev, and Vadim Shchemelinin, "Audio replay attack detection with deep learning frameworks," in Proc. Interspeech 2017, 2017, pp. 82–86.

[18] David A. van Leeuwen and Niko Brümmer, "The distribution of calibrated likelihood-ratios in speaker recognition," in Proc. Interspeech 2013, Lyon, France, 2013, pp. 1619–1623.

[19] Judith Brown, "Calculation of a constant Q spectral transform," Journal of the Acoustical Society of America, vol. 89, no. 1, pp. 425–434, 1991.
[20] Seyed Omid Sadjadi, Malcolm Slaney, and Larry Heck, "MSR Identity Toolbox: A MATLAB toolbox for speaker recognition research (v1.0)," https://www.microsoft.com/en-us/download/confirmation.aspx?id=52279, 2013.

[21] Pavel Korshunov and Sébastien Marcel, "Impact of score fusion on voice biometrics and presentation attack detection in cross-database evaluations," IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 4, pp. 695–705, 2017.

[22] Cemal Hanilçi, Tomi Kinnunen, Md. Sahidullah, and Aleksandr Sizov, "Spoofing detection goes noisy: An analysis of synthetic speech detection in the presence of additive noise," Speech Communication, vol. 85, pp. 83–97, 2016.

[23] Héctor Delgado, Massimiliano Todisco, Nicholas W. D. Evans, Md. Sahidullah, Wei Ming Liu, Federico Alegre, Tomi Kinnunen, and Benoit G. B. Fauve, "Impact of bandwidth and channel variation on presentation attack detection for speaker verification," in Proc. BIOSIG 2017, Darmstadt, Germany, 2017, pp. 1–6.

[24] Gautham J. Mysore, "Can we automatically transform speech recorded on common consumer devices in real-world environments into professional production quality speech? A dataset, insights, and challenges," IEEE Signal Processing Letters, vol. 22, no. 8, pp. 1006–1010, 2015.

[25] Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu, "WaveNet: A generative model for raw audio," arXiv pre-print, 2016.

[26] Federico Alegre, Asmaa Amehraye, and Nicholas W. D. Evans, "A one-class classification approach to generalised speaker verification spoofing countermeasures using local binary patterns," in Proc. IEEE BTAS 2013, Arlington, VA, USA, 2013, pp. 1–8.
