A Surprising Density of Illusionable Natural Speech

Melody Y. Guan, Department of Computer Science, Stanford University, Stanford, CA 94305. mguan@stanford.edu
Gregory Valiant, Department of Computer Science, Stanford University, Stanford, CA 94305. gvaliant@stanford.edu

Abstract

Recent work on adversarial examples has demonstrated that most natural inputs can be perturbed to fool even state-of-the-art machine learning systems. But does this happen for humans as well? In this work, we investigate: what fraction of natural instances of speech can be turned into "illusions" which either alter humans' perception or result in different people having significantly different perceptions? We first consider the McGurk effect, the phenomenon by which adding a carefully chosen video clip to the audio channel affects the viewer's perception of what is said [17]. We obtain empirical estimates that a significant fraction of both words and sentences occurring in natural speech have some susceptibility to this effect. We also learn models for predicting McGurk illusionability. Finally, we demonstrate that the Yanny or Laurel auditory illusion [24] is not an isolated occurrence by generating several very different new instances. We believe that the surprising density of illusionable natural speech warrants further investigation, from the perspectives of both security and cognitive science.

1 Introduction

A growing body of work on adversarial examples has identified that for machine-learning (ML) systems that operate on high-dimensional data, for nearly every natural input there exists a small perturbation of the point that will be misclassified by the system, posing a threat to its deployment in certain critical settings [3, 7, 10, 13, 16, 23, 22]. More broadly, the susceptibility of ML systems to adversarial examples has prompted a re-examination of whether current ML systems are truly learning or if they are assemblages of tricks that are effective yet brittle and easily fooled [25]. Implicit in this line of reasoning is the assumption that instances of "real" learning, such as human cognition, yield extremely robust systems. Indeed, at least in computer vision, human perception is regarded as the gold standard for robustness to adversarial examples.

Evidently, humans can be fooled by a variety of illusions, whether optical, auditory, or other, and there is a long line of research from the cognitive science and psychology communities investigating these [12]. In general, however, these illusions are viewed as isolated examples that do not arise frequently, and which are far from the instances encountered in everyday life.

In this work, we attempt to understand how susceptible humans' perceptual systems for natural speech are to carefully designed "adversarial attacks." We investigate the density of certain classes of illusion, that is, the fraction of natural language utterances whose comprehension can be affected by the illusion. Our study centers around the McGurk effect, the well-studied phenomenon by which the perception of what we hear can be influenced by what we see [17]. A prototypical example is that the audio of the phoneme "baa," accompanied by a video of someone mouthing "vaa," can be perceived as "vaa" or "gaa" (Figure 1).
This effect persists even when the subject is aware of the setup, though the strength of the effect varies significantly across people and languages and with factors such as age, gender, and disorders [2, 8, 14, 18, 20, 21, 27, 28, 31].

Figure 1: Illustration of the McGurk effect. For some phoneme pairs, when the speaker visibly mouths phoneme A but the auditory stimulus is actually the phoneme B, listeners tend to perceive a phoneme C ≠ B.

A significant density of illusionable instances for humans might present similar types of security risks as adversarial examples do for ML systems. Auditory signals such as public service announcements, instructions sent to first responders, etc., could be targeted by a malicious agent. Given only access to a screen within eyesight of the intended victims, the agent might be able to significantly obfuscate or alter the message perceived by those who see the screen (even peripherally).

1.1 Related work

Illusionable instances for humans are similar to adversarial examples for ML systems. Strictly speaking, however, our investigation of the density of natural language for which McGurk illusions can be created is not the human analog of adversarial examples. The adversarial examples for ML systems are datapoints that are misclassified despite being extremely similar to a typical datapoint (that is correctly classified). Our illusions of misdubbed audio are not extremely close to any typically encountered input, since our McGurk samples have auditory signals corresponding to one phoneme/word and visual signals corresponding to another. Also, there is a compelling argument for why the McGurk confusion occurs, namely that human speech perception is bimodal (audio-visual) in nature when lip reading is available [4, 29].

To the best of our knowledge, prior to our work, there has been little systematic investigation of the extent to which the McGurk effect, or other types of illusions, can be made dense in the set of instances encountered in everyday life. The closest work is Elsayed et al. [9], where the authors demonstrate that some adversarial examples for computer vision systems also fool humans when humans are given less than a tenth of a second to view the image. However, some of these examples seem less satisfying because the perturbation acts as a pixel-space interpolation between the original image and the "incorrect" class. This results in images that are visually borderline between two classes and, as such, do not provide a sense of illusion to the viewer. In general, researchers have not probed the robustness of human perception with the same tools, intent, or perspective with which the security community is currently interrogating the robustness of ML systems.

2 Problem setup

For the McGurk effect, we attempt an illusion for a language token (e.g. phoneme, word, sentence) x by creating a video where an audio stream of x is visually dubbed over by a person saying x′ ≠ x. We stress that the audio portion of the illusion is not modified and corresponds to a person saying x. The illusion f(x′, x) affects a listener if they perceive what is being said to be y ≠ x when they watch the illusory video, whereas they perceive x if they either listen to the audio stream without watching the video or watch the original unaltered video, depending on the specification. We call a token illusionable if an illusion can be made for the token that affects the perception of a significant fraction of people.
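The verbal definitions above can be stated compactly. The following is a minimal formalization sketch, not notation fixed by the paper: P_L(·) denotes what listener L perceives, D is the distribution over natural-language tokens, and the threshold τ for a "significant fraction" is our placeholder.

```latex
% Illusion for token x: audio of x dubbed with video of x' (x' != x).
% Listener L is affected if the dub changes an otherwise correct percept:
\[
\text{affected}_L(x, x') \;\iff\; P_L\!\big(f(x', x)\big) \neq x
\quad\text{while}\quad P_L(x) = x .
\]
% x is illusionable if some dub affects a significant fraction of listeners:
\[
\text{illusionable}(x) \;\iff\; \exists\, x' \neq x :\;
\Pr_L\!\big[\text{affected}_L(x, x')\big] \;\geq\; \tau .
\]
% The quantity the paper estimates is the density of such tokens:
\[
\text{density} \;=\; \Pr_{x \sim D}\big[\text{illusionable}(x)\big].
\]
```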
In Section 3, we analyze the extent to which the McGurk effect can be used to create illusions for phonemes, words, and sentences, and analyze the fraction of natural language that is susceptible to such illusionability. We thereby obtain a lower bound on the density of illusionable natural speech.

We find that 1) a significant fraction of words that occur in everyday speech can be turned into McGurk-style illusions, 2) such illusions persist when embedded within the context of natural sentences, and in fact affect a significant fraction of natural sentences, and 3) the illusionability of words and sentences can be predicted using features from natural language modeling.

3 McGurk experiments

3.1 Phoneme-level experiments

We began by determining which phoneme sounds can be paired with video dubs of other phonemes to effect a perceived phoneme that is different from the actual sound. We created McGurk videos, spoken by a single speaker, for all vowel pairs preceded by the consonant /n/ as well as for all consonant pairs followed by the vowel /a/. There are 20 vowel phonemes and 24 consonant phonemes in American English, although /dʒ/ and /ʒ/ are redundant for our purposes.[1] Based on labels provided by 10 individuals, we found that although vowels were not easily confused, there are a number of illusionable consonants. We note that the illusionable phoneme pairs depend on both the speaker and listener identities.

Table 1: Illusionable phonemes and effects based on preliminary phoneme-pair testing. Where a number of lip movements were available to affect a phoneme, the most effective one is listed.

Phoneme | Lip Movement | Perceived Sound
/b/     | /w/          | /v/, /f/, /p/
/ð/     | /b/          | /b/
/f/     | /z/          | /θ/, /t/, /b/
/m/     | /ð/          | /nθ/, /n/, /ml/
/p/     | /t/          | /t/, /k/
/v/     | /b/          | /b/
/d/     | /v/          | /v/, /t/
/l/     | /v/          | /v/
/θ/     | /v/          | /d/, /k/, /t/, /f/
/w/     | /l/          | /l/

Given Table 1 of illusionable phonemes, the goal was then to understand whether these could be leveraged within words or sentences, and if so, the fraction of natural speech that is susceptible.

3.2 Word-level experiments

We sampled 200 unique words (listed in Table 2) from the 10,000 most common words in the Project Gutenberg novels, in proportion to their frequency in the corpus. The 10k words collectively have a prevalence of 80.6% in the corpus. Of the 200 sampled words, 147 (73.5%) contained phonemes that our preliminary phoneme study suggested might be illusionable. For these 147 words, we paired audio clips spoken by the speaker with illusory video dubs of the speaker saying the words with appropriately switched-out phonemes. We tested these videos on 20 naive test subjects who did not participate in the preliminary study. Each subject watched half of the words and listened without video to the other half of the words, and was given the instructions: "Write down what you hear. What you hear may or may not be sensical. Also write down if a clip sounds unclear to you. Note that a clip may sound nonsensical but clear." Subjects were allowed up to three plays of each clip.

We found that watching the illusory videos led to an average miscomprehension rate of 24.8%, a relative 148% increase from the baseline of listening to the audio alone (Table 3). The illusory videos also made people less confident about their correct answers, with an additional 5.1% of words being heard correctly but unclearly, compared to 2.1% for audio only.
For 17% of the 200 words, the illusory videos increased the error rates by more than 30% above the audio-only baseline.

[1] Refer to www.macmillanenglish.com/pronunciation/phonemic-chart-in-american-english/ for a list of the phoneme symbols of the International Phonetic Alphabet that are used in American English.

Table 2: The 200 unique words sampled from the Project Gutenberg novel corpus. The 147 words for which an illusory video was created are listed on top. Ordering is otherwise alphabetical.

Illusion attempted: about addressed all also and anyone arms away bad be been before behind besides blind bought box brothers but by call called calling came child close coming could days dead did die direction done else end even everything far features fell few fighting fly for formed from game gathered gave general generally god good half hands happened hath have him himself idea information july large let letter life like list made many mass may me meet men months more Mrs my myself never nothing of off old one open opinion ordinary other outside passion perhaps please plenty point possessed present put questions roof said save seized shall sharp ship should slow some speech still successful summer terms than that the their them themselves there they things though time top upon used very waited was water we went what when which will wisdom with working world would wounded

No attempt: a act air an any are as at change city country eyes go going hair has he heart her higher his house i in into is it its king know nature new no not now on or our out rest saw see seen she sorrow strange take talking to turn who writing your

Table 3: Test results for word-level McGurk illusions among the 147 words predicted to be illusionable. Shown are average error rates for watching the illusory video vs. listening to the audio only, as well as the percentage of words that are correctly identified but sound ambiguous to the listener.

                                  | Audio Only | Illusory Video | Relative Increase
Words Incorrectly Identified      | 10.0%      | 24.8%          | +148%
Words Correct with Low Confidence | 2.1%       | 5.1%           | +144%

3.2.1 Prediction model for word-level illusionability

To create a predictive model for word-level illusionability, we used illusionable phonemes enriched with positional information as features. Explicitly, for each of the 10 illusionable phonemes, we created three features from the phoneme being in an initial position (being the first phoneme of the word), a medial position (having phonemes come before and after), or a final position (being the last phoneme of the word). We then represented each word with a binary bag-of-words model [15], giving each of the 30 phoneme-with-phonetic-context features a value of 1 if present in the word and 0 otherwise. We performed ridge regression on these features with a constant term. We searched for the optimal l2 regularization constant among the values [0.1, 1, 10, 100] and picked the best one based on training-set performance. The train:test split was in the proportion 85%:15% and was randomly chosen for each trial. Across 10k randomized trials, we obtain average training and test set correlations of 91.1 ± 0.6% and 44.6 ± 28.9%, respectively. Our final model achieves an out-of-sample correlation of 57% between predicted and observed illusionabilities.

Here, the observed illusionability of the words is calculated as the difference between the accuracy of watchers of the illusory videos and the accuracy of listeners, where "accuracy" is defined as the fraction of respondents who were correct. For each word, the predicted illusionability is calculated by doing inference on that word using the averaged regression coefficients from the regression trials in which the word was not in the training set.

Our predicted illusionability is also calibrated, in the sense that for the words predicted to have an illusionability <0.1, the mean empirical illusionability is 0.04; for words with predicted illusionability in the interval [0.1, 0.2], the mean empirical illusionability is 0.14; for predicted illusionability in [0.2, 0.3], the mean observed is 0.27; and for predicted illusionability >0.3, the mean observed is 0.50. Figure 2 visually depicts the match between the observed and predicted word illusionabilities.

Figure 2: Predicted word illusionability closely matches observed word illusionability, with an out-of-sample correlation of 57%. The words are sorted by increasing predicted word illusionability (and the observed illusionability of each word was not used in calculating the prediction of that word).
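The feature construction and regression described above are simple enough to sketch end to end. The following is a minimal illustration, not the authors' code: the phoneme transcriptions and illusionability scores are placeholder toy data, and scikit-learn's Ridge stands in for whichever ridge implementation was actually used.

```python
# Sketch of the word-level illusionability model: 30 binary
# phoneme-position features + ridge regression. Toy data only.
import numpy as np
from sklearn.linear_model import Ridge

ILLUSIONABLE = ["b", "D", "f", "m", "p", "v", "d", "l", "T", "w"]  # Table 1 (ASCII stand-ins)
POSITIONS = ["initial", "medial", "final"]
FEATURES = [(ph, pos) for ph in ILLUSIONABLE for pos in POSITIONS]  # 30 features

def featurize(phonemes):
    """Binary vector: 1 if an illusionable phoneme occurs in that position."""
    x = np.zeros(len(FEATURES))
    for i, ph in enumerate(phonemes):
        pos = "initial" if i == 0 else "final" if i == len(phonemes) - 1 else "medial"
        if ph in ILLUSIONABLE:
            x[FEATURES.index((ph, pos))] = 1.0
    return x

# Hypothetical (phoneme sequence, observed illusionability) examples.
data = [(["b", "a", "d"], 0.35), (["f", "l", "a", "j"], 0.20), (["k", "I", "N"], 0.02)]
X = np.stack([featurize(ph) for ph, _ in data])
y = np.array([score for _, score in data])

best_model, best_score = None, -np.inf
for alpha in [0.1, 1, 10, 100]:            # the paper's l2 grid
    model = Ridge(alpha=alpha).fit(X, y)   # fit_intercept=True: the constant term
    score = model.score(X, y)              # selected on training performance
    if score > best_score:
        best_model, best_score = model, score

print(best_model.predict(featurize(["b", "a", "d"]).reshape(1, -1)))
```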
3.3 Sentence-level experiments

We set up the following experiment on naturally occurring sentences. We randomly sampled 300 sentences of lengths 4-8 words inclusive from the novel Little Women [1] from the Project Gutenberg corpus. From this reduced sample, we selected and perturbed 32 sentences that we expected to be illusionable (listed in Table 4). With the speaker, we prepared two formats of each sentence: original video (with original audio) and illusory video (with original audio). We then evaluated the perception of these on 1306 naive test subjects on Amazon Mechanical Turk. The Turkers were shown videos for six randomly selected sentences, three illusory and three original, and were given the prompt: "Press any key to begin video [index #] of 6. Watch the whole video, and then you will be prompted to write down what the speaker said." Only one viewing of any clip was allowed, to simulate the natural setting of observing a live audio/video stream. Each Turker was limited to six videos to reduce respondent fatigue. Turkers were also asked to report their level of confidence in what they heard on a scale from no uncertainty (0%) to complete uncertainty (100%). One hundred and twelve Turkers (8.6%) did not adhere to the prompt, writing unrelated responses, and their results were omitted from the analysis.

We found that watching the illusory videos led to an average miscomprehension rate of 32.8%, a relative 145% increase from the baseline of watching the original videos (Table 5). The illusory videos also made people less confident about their correct answers. Turkers who correctly identified the audio message in an illusory video self-reported an average uncertainty of 42.9%, a relative 123% higher than the average reported by the Turkers who correctly understood the original videos. Examples of mistakes made by viewers of the illusory videos are shown in Table 6. Overall, we found that for 11.5% of the 200 sampled sentences (23 out of the 30 videos we created), the illusory videos increased the error rates by more than 10%.
Table 4: The sentences randomly sampled from Little Women for which we made illusory videos. Those that had observed illusionability >0.1 are listed on top.

Illusionable: I'm not too proud to beg for Father. / No need of that. / I'd like to wear them, Mother, can I? / It's no harm, Jo. / Now do be still, and stop bothering. / Well, I like that! / How many did you have out? / Open the door, Amy! / Nonsense, that's of no use. / I am glad of that! / Of course that settled it. / I can't bear saints. / Serves me right for trying to be fine. / Of course I am! / There's gratitude for you! / That's my good girl. / Capital boys, aren't they? / You've got me, anyhow. / Brown, that is, sometimes. / I'll tell you some day. / What do you know about him? / He won't do that. / The plan of going over was not forgotten.

Not illusionable: On the way get these things. / I'm glad of it. / Aren't you going with him? / Then I'll tell you. / That was the question. / That's why I do it. / I hate to have odd gloves! / We don't know him! / What good times they had, to be sure.

Table 5: Test results for sentence-level McGurk illusions, showing average error rates for watching the illusory video vs. watching the original video, and average self-reported uncertainty for correctly identified sentences.

                                 | Original Video | Illusory Video | Relative Increase
Sentences Incorrectly Identified | 13.4%          | 32.8%          | +145%
Uncertainty of Correct Sentences | 19.4%          | 42.9%          | +121%

Table 6: Sample verbatim mistakes for sentence-level McGurk illusions. Spelling mistakes were ignored for accuracy calculations. Sample illusory videos are provided in the supplementary files.

Serves me right for trying to be fine. -> "Serves thee right for trying to be thine" / "Serve you right for trying to be kind" / "serbes ye right shine by rhine"
Of course that settled it. -> "up course bat saddle it" / "Of course Max settled it."
I can't bear saints. -> "I can't wear skates" / "I can't hear saints." / "I can't spare sink"
How many did you have out? -> "How many did you knock out?" / "How many did you help out"
I'll tell you some day. -> "I'll tell you Sunday." / "Ill tell you something"
I'm not too proud to beg for Father. -> "I'm not too proud to beg or bother" / "I want the crowd to beg for father" / "Ine not too proud to beg for father"
Well, I like that! -> "Well I like pets" / "Well, I like baths"
Now do be still, and stop bothering. -> "now do we still end stop bothering" / "And do we still end sauce watering?"
Of course I am! -> "Of course I an" / "Of course I can" / "Of course I anth."
You've got me, anyhow. -> "you got knee anyhow" / "You got the anyhow."
There's gratitude for you! -> "Bears gratitude for you." / "Pairs gratitudes or you." / "Bears gratitude lore you"

3.3.1 Comparing word-level and sentence-level illusionabilities

We obtained a sentence-level illusionability prediction model with an out-of-sample correlation of 33% between predicted and observed illusionabilities. Here, the observed illusionability of the sentences was calculated as the difference between the accuracy of watchers of the illusory videos and the accuracy of watchers of the original videos, where "accuracy" is defined as the fraction of respondents who were correct. We obtained predicted illusionabilities by simply using the maximum word-illusionability prediction amongst the words in each sentence, with word predictions obtained from the word-level model.
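This naive max-over-words rule is a one-liner. Below is a minimal sketch; predict_word_illusionability is a hypothetical helper assumed to wrap the word-level ridge model from the earlier sketch.

```python
# Sketch of the paper's naive sentence-level predictor: the predicted
# illusionability of a sentence is the maximum predicted illusionability
# among its words.
def predict_sentence_illusionability(sentence, predict_word_illusionability):
    words = sentence.lower().strip(".!?").split()
    return max(predict_word_illusionability(w) for w in words)

# Example with a stub word model (toy scores, not paper data):
stub = {"serves": 0.4, "me": 0.3, "right": 0.1}.get
print(predict_sentence_illusionability("Serves me right", lambda w: stub(w, 0.0)))  # 0.4
```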
We attempted to improve our sentence-level predictive model by incorporating how likely the words appear under a natural language distribution, considering three classes of words: the words in the sentence for which no illusion was attempted, the words for which an illusion was attempted, and the potentially perceived words for those for which an illusion was attempted. We used log word frequencies obtained from the top 36.7k most common words in the Project Gutenberg corpus.[2] This approach could not attain better out-of-sample correlations than the naive method, which implies that context is important for sentence-level illusionability and that more complex language models should be used.

Finally, comparing word-level and sentence-level McGurk illusionabilities in natural speech, we observe that the former is significantly higher. A greater McGurk effect at the word level is to be expected: sentences provide context with which the viewer can fill in confusions and misunderstandings. Furthermore, when watching a sentence video compared to a short word video, the viewer's attention is more likely to stray, both from the visual component of the video, which evidently reduces the McGurk effect, and from the audio component, which likely prompts the viewer to rely even more heavily on context. Nevertheless, there remains a significant amount of illusionability at the sentence level.

4 Future Directions

This work is an initial step towards exploring the density of illusionable phenomena for humans. There are many natural directions for future work. In the vein of further understanding McGurk-style illusions, it seems worth building more accurate predictive models for sentence-level effects, and further investigating the security risks posed by McGurk illusions. For example, one concrete next step would be to implement a system that takes an audio input and outputs a video dub resulting in significant misunderstanding. Such a system would need to combine a high-quality speech-to-video-synthesis system [30, 32] with a fleshed-out language model and McGurk prediction model.

There is also the question of how to guard against "attacks" on human perception. For example, in the case of the McGurk effect, how can one rephrase a passage of text in such a way that the meaning is unchanged, but the rephrased text is significantly more robust to McGurk-style manipulations? The central question in this direction is what fraction of natural language can be made robust without significantly changing the semantics.

A better understanding of when and why certain human perception systems are nonrobust can also be applied to make ML systems more robust. In particular, neural networks have been found to be susceptible to adversarial examples in automatic speech recognition [26, 5] and to the McGurk effect [19], and a rudimentary approach to making language robust to the latter problem would be to use a reduced vocabulary that avoids words that score highly in our word-level illusionability prediction model. Relatedly, at the interface of cognitive science and adversarial examples, there has been work suggesting that humans can anticipate when or how machines will misclassify, including for adversarial examples [6, 11, 33]. More broadly, as the tools for probing the weaknesses of ML systems develop further, it seems like a natural time to reexamine the supposed robustness of human perception.
We anticipate unexpected findings. To provide one example, we summarize some preliminary results on audio-only illusions.

4.1 Auditory Illusions

An audio clip of the word "Laurel" gained widespread attention in 2018, with coverage by notable news outlets such as The New York Times and Time. Roughly half of listeners perceive "Laurel" and the other half perceive "Yanny" or similar-sounding words, with high confidence on both sides [24]. One of the reasons the public was intrigued is that examples of such phenomena are viewed as rare, isolated instances. In a preliminary attempt to investigate the density of such phenomena, we identified five additional distinct examples (Table 7). The supplementary files include 10 versions of one of these examples, where listeners tend to perceive either "worlds" or "yikes." Across the different audio clips, one should be able to hear both interpretations. The threshold for switching from one interpretation to the other differs from person to person.

[2] https://en.wiktionary.org/wiki/Wiktionary:Frequency_lists#Project_Gutenberg

Table 7: Test results for Yanny/Laurel-style illusions. Each row displays a cluster of similar-sounding reported words, with counts (N). Em-dashes indicate unintelligible responses.

Word (Slowdown)  | Perceived Sound         | N
worlds (1.5x)    | worlds                  | 5
                 | yikes/yites             | 4
                 | nights/lights           | 6
bologna (1.7x)   | bologna                 | 2
                 | alarming                | 2
                 | alarmy/ayarmy/ignore me | 3
                 | uoomi/ayomi/wyoming     | 3
                 | anomi/anolli/amomi      | 3
                 | —                       | 2
growing (1.3x)   | growing                 | 8
                 | pearling                | 3
                 | curling                 | 3
                 | crowing                 | 1
potent (1.7x)    | potent/poatin/poden     | 4
                 | pogie/bowie/po-ee       | 3
                 | pone/paam/paan          | 3
                 | power/poder/pair        | 3
                 | tana                    | 1
                 | —                       | 2
prologue (1.9x)  | prologue                | 2
                 | prelude/pro-why/pinelog | 3
                 | kayak/kayank/kayan      | 6
                 | turnip/tienap/tarzan    | 3
                 | —                       | 1

These examples were generated by examining 5000 words and selecting the 50 whose spectrograms contain a balance of high- and low-frequency components that most closely matched those of the word "Laurel". Each audio file corresponded to the Google Cloud Text-to-Speech API synthesis of a word, after low frequencies were damped and the audio was slowed 1.3-1.9x. After listening to these top 50 candidates, we evaluated the most promising five on a set of 15 individuals (3 female, 12 male, age range 22-33). We found multiple distributional modes of perception for all five audio clips. For example, a clip of "worlds" with the low frequencies damped and slowed down 1.5x was perceived by five listeners as "worlds", four as "yikes/yites", and six as "nights/lights". While these experiments do not demonstrate a density of such examples with respect to the set of all words, and it is unlikely that illusory audio tracks in this style can be created for the majority of words, they illustrate that even the surprising Yanny or Laurel phenomenon is not an isolated occurrence. It remains to be seen how dense such phenomena can be, given the right sort of subtle audio manipulation.
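The audio transformation described above (damping low frequencies, then slowing the clip) is straightforward to approximate with standard signal-processing tools. Below is a minimal sketch, not the authors' pipeline: it assumes a hypothetical mono 16-bit WAV called input.wav, uses a Butterworth high-pass filter as one plausible way to damp low frequencies, and slows the audio by naive resampling; the cutoff and slowdown values are illustrative.

```python
# Sketch: damp low frequencies and slow a clip, in the spirit of the
# Yanny/Laurel-style manipulation described above.
import numpy as np
from scipy.io import wavfile
from scipy.signal import butter, sosfilt, resample

rate, audio = wavfile.read("input.wav")  # hypothetical mono 16-bit file
audio = audio.astype(np.float64)

# High-pass Butterworth filter: attenuates energy below `cutoff_hz`,
# one plausible way to "damp low frequencies".
cutoff_hz = 700.0
sos = butter(N=6, Wn=cutoff_hz, btype="highpass", fs=rate, output="sos")
filtered = sosfilt(sos, audio)

# Slow down by stretching the waveform and playing at the original rate.
# Note: naive resampling also lowers pitch, unlike a phase vocoder.
slowdown = 1.5
slowed = resample(filtered, int(len(filtered) * slowdown))

out = np.clip(slowed, -32768, 32767).astype(np.int16)
wavfile.write("output.wav", rate, out)
```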
5 Conclusion

Our work suggests that for a significant fraction of natural speech, human perception can be altered by subtle, learnable perturbations. This is an initial step towards exploring the density of illusionable phenomena for humans, and towards examining the extent to which human perception may be vulnerable to security risks like those that adversarial examples present for ML systems. We hope our work inspires future investigations into the discovery, generation, and quantification of multimodal and unimodal audiovisual and auditory illusions for humans. There exist many open research questions on when and why humans are susceptible to various types of illusions, how to model the illusionability of natural language, and how natural language can be made more robust to illusory perturbations. Additionally, we hope such investigations inform our interpretations of the strengths and weaknesses of current ML systems. Finally, there is the possibility that some vulnerability to carefully crafted adversarial examples may be inherent to all complex learning systems that interact with high-dimensional inputs in an environment with limited data; any thorough investigation of this question must also probe the human cognitive system.

6 Acknowledgements

This research was supported in part by NSF award AF:1813049, an ONR Young Investigator award (N00014-18-1-2295), and a grant from Stanford's Institute for Human-Centered AI. The authors would like to thank Jean Betterton, Shivam Garg, Noah Goodman, Kelvin Guu, Michelle Lee, Percy Liang, Aleksander Makelov, Jacob Plachta, Jacob Steinhardt, Jim Terry, and Alexander Whatley for useful feedback on the work. The research was done under Stanford IRB Protocol 46430.

References

[1] Louisa May Alcott. Little Women (1868-9). Roberts Brothers, 1994.
[2] Mireille Bastien-Toniazzo, Aurelie Stroumza, and Christian Cave. Audio-visual perception and integration in developmental dyslexia: An exploratory study using the McGurk effect. Current Psychology Letters. Behaviour, Brain & Cognition, 25(3), 2010.
[3] Vahid Behzadan and Arslan Munir. Vulnerability of deep reinforcement learning to policy induction attacks. In International Conference on Machine Learning and Data Mining in Pattern Recognition, pages 262-275. Springer, 2017.
[4] Julien Besle, Alexandra Fort, Claude Delpuech, and Marie-H Giard. Bimodal speech: early suppressive visual effects in human auditory cortex.
[5] Nicholas Carlini and David Wagner. Audio adversarial examples: Targeted attacks on speech-to-text. In 2018 IEEE Security and Privacy Workshops (SPW), pages 1-7. IEEE, 2018.
[6] Arjun Chandrasekaran, Deshraj Yadav, Prithvijit Chattopadhyay, Viraj Prabhu, and Devi Parikh. It takes two to tango: Towards theory of AI's mind. arXiv preprint arXiv:1704.00717, 2017.
[7] Pin-Yu Chen, Yash Sharma, Huan Zhang, Jinfeng Yi, and Cho-Jui Hsieh. EAD: Elastic-net attacks to deep neural networks via adversarial examples. Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[8] Xavier Delbeuck, Fabienne Collette, and Martial Van der Linden. Is Alzheimer's disease a disconnection syndrome? Evidence from a crossmodal audio-visual illusory experiment. Neuropsychologia, 45(14):3315-3323, 2007.
[9] Gamaleldin F Elsayed, Shreya Shankar, Brian Cheung, Nicolas Papernot, Alex Kurakin, Ian Goodfellow, and Jascha Sohl-Dickstein. Adversarial examples that fool both human and computer vision. Advances in Neural Information Processing Systems, 2018.
[10] Ivan Evtimov, Kevin Eykholt, Earlence Fernandes, Tadayoshi Kohno, Bo Li, Atul Prakash, Amir Rahmati, and Dawn Song. Robust physical-world attacks on machine learning models. IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[11] Samuel M Harding, Prashanth Rajivan, Bennett I Bertenthal, and Cleotilde Gonzalez. Human decisions on targeted and non-targeted adversarial samples. Proc. of the 40th Annual Conference of the Cognitive Science Society, pages 451-456, 2018.
[12] James M Hillis, Marc O Ernst, Martin S Banks, and Michael S Landy. Combining sensory information: mandatory fusion within, but not between, senses. Science, 298(5598):1627-1630, 2002.
[13] Sandy Huang, Nicolas Papernot, Ian Goodfellow, Yan Duan, and Pieter Abbeel. Adversarial attacks on neural network policies. International Conference on Learning Representations, 2017.
[14] Julia R Irwin, DH Whalen, and Carol A Fowler. A sex difference in visual influence on heard speech. Perception & Psychophysics, 68(4):582-592, 2006.
[15] Thorsten Joachims. Text categorization with support vector machines: Learning with many relevant features. In European Conference on Machine Learning, pages 137-142. Springer, 1998.
[16] Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533, 2016.
[17] Harry McGurk and John MacDonald. Hearing lips and seeing voices. Nature, 264(5588):746, 1976.
[18] Elizabeth A Mongillo, Julia R Irwin, DH Whalen, Cheryl Klaiman, Alice S Carter, and Robert T Schultz. Audiovisual processing in children with and without autism spectrum disorders. Journal of Autism and Developmental Disorders, 38(7):1349-1358, 2008.
[19] Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and Andrew Y Ng. Multimodal deep learning. In Proceedings of the 28th International Conference on Machine Learning, pages 689-696, 2011.
[20] Linda W Norrix, Elena Plante, and Rebecca Vance. Auditory-visual speech integration by adults with and without language-learning disabilities. Journal of Communication Disorders, 39(1):22-36, 2006.
[21] Linda W Norrix, Elena Plante, Rebecca Vance, and Carol A Boliek. Auditory-visual integration for speech by children with and without specific language impairment. Journal of Speech, Language, and Hearing Research, 50(6):1639-1651, 2007.
[22] Nicolas Papernot, Patrick McDaniel, Arunesh Sinha, and Michael Wellman. Towards the science of security and privacy in machine learning. arXiv preprint, 2016.
[23] Nicolas Papernot, Patrick McDaniel, Ian Goodfellow, Somesh Jha, Z Berkay Celik, and Ananthram Swami. Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, pages 506-519, 2017.
[24] Daniel Pressnitzer, Jackson Graves, Claire Chambers, Vincent De Gardelle, and Paul Egré. Auditory perception: Laurel and Yanny together at last. Current Biology, 28(13):R739-R741, 2018.
[25] Laasya Samhita and Hans J Gross. The "Clever Hans phenomenon" revisited. Communicative & Integrative Biology, 6(6):e27122, 2013.
[26] Lea Schönherr, Katharina Kohls, Steffen Zeiler, Thorsten Holz, and Dorothea Kolossa. Adversarial attacks against automatic speech recognition systems via psychoacoustic hiding. arXiv preprint, 2018.
[27] Kaoru Sekiyama. Cultural and linguistic factors in audiovisual speech processing: The McGurk effect in Chinese subjects. Perception & Psychophysics, 59(1):73-80, 1997.
[28] Kaoru Sekiyama, Takahiro Soshi, and Shinichi Sakamoto. Enhanced audiovisual integration with aging in speech perception: a heightened McGurk effect in older adults. Frontiers in Psychology, 5:323, 2014.
[29] Quentin Summerfield. Lipreading and audio-visual speech perception. Phil. Trans. R. Soc. Lond. B, 335(1273):71-78, 1992.
[30] Supasorn Suwajanakorn, Steven M Seitz, and Ira Kemelmacher-Shlizerman. Synthesizing Obama: learning lip sync from audio. ACM Transactions on Graphics, 36(4):95, 2017.
[31] Kathleen M Youse, Kathleen M Cienkowski, and Carl A Coelho. Auditory-visual speech perception in an adult with aphasia. Brain Injury, 18(8):825-834, 2004.
[32] Egor Zakharov, Aliaksandra Shysheya, Egor Burkov, and Victor Lempitsky. Few-shot adversarial learning of realistic neural talking head models. arXiv preprint arXiv:1905.08233, 2019.
[33] Zhenglong Zhou and Chaz Firestone. Humans can decipher adversarial images. Nature Communications, 10(1):1334, 2019.
