Vision as an Interlingua: Learning Multilingual Semantic Embeddings of Untranscribed Speech

Authors: David Harwath, Galen Chuang, and James Glass
Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139, USA

ABSTRACT

In this paper, we explore the learning of neural network embeddings for natural images and speech waveforms describing the content of those images. These embeddings are learned directly from the waveforms without the use of linguistic transcriptions or conventional speech recognition technology. While prior work has investigated this setting in the monolingual case using English speech data, this work represents the first effort to apply these techniques to languages beyond English. Using spoken captions collected in English and Hindi, we show that the same model architecture can be successfully applied to both languages. Further, we demonstrate that training a multilingual model simultaneously on both languages offers improved performance over the monolingual models. Finally, we show that these models are capable of performing semantic cross-lingual speech-to-speech retrieval.

Index Terms — Vision and language, unsupervised speech processing, cross-lingual speech retrieval

1. INTRODUCTION

The majority of humans acquire the ability to communicate through spoken natural language before they even learn to read and write. Many people even learn to speak in multiple languages, simply by growing up in a multilingual environment. While strongly supervised Automatic Speech Recognition (ASR), Machine Translation (MT), and Natural Language Processing (NLP) algorithms are revolutionizing the world, they still rely on expertly curated training data following a rigid annotation scheme. These datasets are costly to create, and are a bottleneck preventing technologies such as ASR and MT from finding application to all 7,000 human languages spoken worldwide [1].

Recently, researchers have investigated models of spoken language that can be trained in a weakly-supervised fashion by augmenting the raw speech data with multimodal context [2, 3, 4, 5, 6, 7, 8]. Rather than learning a mapping between speech audio and text, these models learn associations between the speech audio and visual images. For example, such a model is capable of learning to associate an instance of the spoken word "bridge" with images of bridges. What makes these models compelling is the fact that their training data does not need to be annotated or transcribed; simply recording people talking about images is sufficient. Although these models have been trained on English speech, it is reasonable to assume that they would work well on any other spoken language, since they make no use of linguistic annotation and have no language-specific inductive bias. Here, we demonstrate that this is indeed the case by training speech-to-image and image-to-speech retrieval models in both English and Hindi with the same architecture. We then train multilingual models that share a common visual component, with the goal of using the visual domain as an interlingua or "Rosetta Stone" that provides the two languages with a common grounding. We show that the audio-visual retrieval performance of a multilingual model exceeds that of the monolingual models, suggesting that the shared visual context allows for cross-pollination between the representations.
Finally, we use these models to directly perform cross-lingual audio-to-audio retrieval, which we believe could be a promising direction for future work exploring visually grounded speech-to-speech translation without the need for text transcriptions or directly parallel corpora.

2. RELATION TO PRIOR WORK

Unsupervised Speech Processing. State-of-the-art ASR systems are close to reaching human parity within certain domains [9], but this comes at an enormous resource cost in terms of text transcriptions for acoustic model training, phonetic lexicons, and large text corpora for language model training. These resources exist for only a small number of languages, and so a growing body of work has focused on processing speech audio with little-to-no supervision or annotation. The difficulty faced in this paradigm is the enormous variability of the speech signal: background and environmental noise, microphones and recording equipment, and speaker characteristics (gender, age, accent, vocal tract shape, mood/emotion, etc.) are all manifested in the acoustic signal, making it difficult to learn invariant representations of linguistic units without strong guidance from labels. The dominant approaches cast the problem as joint segmentation and clustering of the speech signal into linguistic units at various granularities. Segmental Dynamic Time Warping (S-DTW) [10, 11, 12] attempts to discover repetitions of the same words in a collection of untranscribed acoustic data by finding repeated regions of high acoustic similarity. Other approaches use Bayesian generative models at multiple levels of linguistic abstraction [13, 14, 15]. Neural network models have also been used to learn acoustic feature representations that are more robust to undesirable variation [16, 17, 18, 19].

Vision and Language. Modeling correspondences between vision and language is a rapidly growing field at the intersection of computer vision, natural language processing, and speech processing. Most existing work has focused on still-frame images paired with text. Some efforts have studied correspondence matching between categorical abstractions, such as words and objects [20, 21, 22, 23, 24, 25]. Recently, interest in caption generation has grown, popularized by [26, 27, 28]. New problems within the intersection of language and vision continue to be introduced, such as object discovery via multimodal dialog [29], visual question answering [30], and text-to-image generation [31]. Other work has studied joint representation learning for images and speech audio in the absence of text data. The first major effort in this vein was [32], but until recently little progress was made in this direction, as the text-and-vision approaches have remained dominant. In [3], embedding models for images and audio captions were shown to be capable of performing semantic retrieval tasks, and more recent works have studied word and object discovery [4] and keyword spotting [8]. Other work has analyzed these models and provided evidence that linguistic abstractions such as phones and words emerge in their internal representations [4, 5, 6, 7].

Machine Translation. Automatically translating text from one language into another is a well-established problem. At first dominated by statistical methods combining count-based translation and language models [33], the current paradigm relies upon deep neural network models [34].
New ideas continue to be introduced, including models that take advantage of shared visual context [35], but the majority of MT research has focused on the text-to-text case. Recent work has moved beyond that paradigm by implementing translation between speech audio in the source language and written text in the target language [36, 37, 38]. However, this work still relies upon expert-crafted transcriptions, and would still require a text-to-speech post-processing module for speech-to-speech translation.

3. MODELS

We assume that our data takes the form of a collection of $N$ triples, $(I_i, A^E_i, A^H_i)$, where $I_i$ is the $i$th image, $A^E_i$ is the acoustic waveform of the English caption describing the image, and $A^H_i$ is the acoustic waveform of the Hindi caption describing the same image. We consider a mapping $F(I_i, A^E_i, A^H_i) \mapsto (e^I_i, e^E_i, e^H_i)$, where $e^I_i, e^E_i, e^H_i \in \mathbb{R}^d$; in other words, a mapping of the image and acoustic captions to vectors in a high-dimensional space. Within this space, our hope is that visual-linguistic semantics are manifested as arithmetic relationships between vectors. We implement this mapping with a set of convolutional neural networks (CNNs) similar in architecture to those presented in [3, 4]. We utilize three networks: one for the image, one for the English caption, and one for the Hindi caption (Figure 1).

[Fig. 1: Depiction of models and training scheme. Only one impostor point is shown for visual clarity.]

The image network is formed by taking all layers up to conv5 from the pre-trained VGG16 network [39]. For a 224x224 pixel RGB input image, the output of the network at this point is a downsampled image of width 14 and height 14, with 512 feature channels. We need a means of transforming this tensor into a vector $e^I$ of dimension $d$ (2048 in our experiments), so we apply a linear 3x3 convolution with $d$ filters, followed by global mean pooling. Each input image is first resized so that its smallest dimension is 256 pixels, then a random 224x224 pixel crop is chosen (for training; at test time, the center crop is taken), and finally each pixel is mean and variance normalized according to the off-the-shelf VGG mean and variance computed over ImageNet [40].

The audio architecture we use is the same as the one presented in [4], but with the addition of a BatchNorm [41] layer at the very front of the network, enabling us to do away with any data-space mean and variance normalization; our inputs are simply raw log mel filterbank energies. Our data pre-processing follows [4], where each waveform is represented by a series of 25ms frames with a 10ms shift, and each frame is represented by a vector of 40 log mel filterbank energies. For the sake of enforcing a uniform tensor size within minibatches, the resulting spectrograms are finally truncated or zero-padded to 1024 frames (approx. 10 seconds).
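To make the pipeline concrete, the following is a minimal sketch of the audio frontend and the image and audio branches described above, written in PyTorch with librosa for the filterbank computation. Only the VGG16-up-to-conv5 trunk, the linear 3x3 convolution to $d = 2048$ channels with global mean pooling, the 40-dimensional log mel inputs, the input BatchNorm layer, and the 1024-frame truncation/padding come from the text; the exact audio layer stack, the 16 kHz sampling rate, and all names are our own illustrative assumptions, not the authors' released code.

```python
import librosa
import numpy as np
import torch
import torch.nn as nn
import torchvision.models as models

D = 2048           # shared embedding dimension d (Sec. 3)
N_MELS = 40        # log mel filterbank energies per frame
MAX_FRAMES = 1024  # approx. 10 s at a 10 ms frame shift

def log_mel_spectrogram(wav, sr=16000):
    """25 ms windows, 10 ms shift, 40 log mel energies per frame,
    truncated or zero-padded to 1024 frames. The 16 kHz rate and the
    log floor are assumptions; the paper does not state them."""
    mel = librosa.feature.melspectrogram(
        y=wav, sr=sr, n_fft=int(0.025 * sr),
        hop_length=int(0.010 * sr), n_mels=N_MELS)
    logmel = np.log(mel + 1e-8)                      # (40, T)
    pad = MAX_FRAMES - logmel.shape[1]
    if pad < 0:
        logmel = logmel[:, :MAX_FRAMES]              # truncate long captions
    else:
        logmel = np.pad(logmel, ((0, 0), (0, pad)))  # zero-pad short ones
    return torch.from_numpy(logmel).float()

class ImageBranch(nn.Module):
    """VGG16 up to conv5 (14x14x512 for a 224x224 input), then a linear
    3x3 convolution with d filters and global mean pooling."""
    def __init__(self, d=D):
        super().__init__()
        vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
        self.trunk = vgg.features[:-1]    # drop the final max-pool
        self.embed = nn.Conv2d(512, d, kernel_size=3, padding=1)
    def forward(self, img):               # img: (B, 3, 224, 224), VGG-normalized
        x = self.embed(self.trunk(img))   # (B, d, 14, 14); no nonlinearity
        return x.mean(dim=(2, 3))         # global mean pool -> (B, d)

class AudioBranch(nn.Module):
    """BatchNorm input layer followed by a caption CNN in the spirit of [4];
    the layer stack below is an illustrative stand-in, not the exact one."""
    def __init__(self, d=D):
        super().__init__()
        self.net = nn.Sequential(
            nn.BatchNorm2d(1),            # replaces data-space normalization
            nn.Conv2d(1, 128, (N_MELS, 1)), nn.ReLU(),  # collapse frequency axis
            nn.Conv2d(128, 256, (1, 11), stride=(1, 2), padding=(0, 5)), nn.ReLU(),
            nn.Conv2d(256, d, (1, 11), stride=(1, 2), padding=(0, 5)), nn.ReLU(),
        )
    def forward(self, spec):              # spec: (B, 1, 40, 1024) log mel energies
        x = self.net(spec)                # (B, d, 1, T')
        return x.mean(dim=(2, 3))         # global mean pool -> (B, d)
```

In this setup, a forward pass of one minibatch through the three branches (separate AudioBranch instances for English and Hindi) produces the embedding triple $(e^I, e^E, e^H)$ consumed by the ranking objectives of Section 4.2.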
4. EXPERIMENTS

4.1. Experimental Data

We make use of the Places 205 [42] English audio caption dataset, the collection of which was described in detail in [3], as well as a similar dataset of Hindi audio captions that we collected via Amazon Mechanical Turk. We use a subset of 85,480 images from the Places data for which we have both an English caption and a Hindi caption. We divide this dataset into a training set of 84,480 image/caption triplets and a validation set of 1,000 triplets. All captions are completely free-form and unprompted. Turkers are simply shown an image from the Places data and asked to describe the salient objects in several sentences. The English captions contain on average approx. 19.3 words and have an average duration of approx. 9.5 seconds, while the Hindi captions contain an average of 20.4 words and have an average duration of 11.4 seconds.

4.2. Model Training Procedure

The objective functions we use to train our models are all based upon a margin ranking criterion [43]:

$$\text{rank}(a, p, i) = \max(0, \eta - s(a, p) + s(a, i)) \qquad (1)$$

where $a$ is the anchor vector, $p$ is a vector "paired" with the anchor vector, $i$ is an "imposter" vector, $s(\cdot, \cdot)$ denotes a similarity function, and $\eta$ is the margin hyperparameter. For an $(a, p, i)$ triplet, the loss is zero when the similarity between $a$ and $p$ is at least $\eta$ greater than the similarity between $a$ and $i$; otherwise, a loss proportional to the margin violation is incurred. This objective function therefore encourages the anchor and its paired vector to be "close together," and the anchor to be "far away" from the imposter. In all of our experiments, we fix $\eta = 1$ and let $s(x, y) = x^\top y$.

Given that we have images, English captions, and Hindi captions, we can apply the margin ranking criterion to their neural embedding vectors in 6 different ways: each input type can serve as either the anchor point, or as the paired and imposter points. For example, an image embedding may serve as the anchor point, its associated English caption would be the paired point, and an unrelated English caption for some other image would be the imposter point. We can even form composite objective functions by performing multiple kinds of ranking simultaneously. We consider the several training scenarios listed in Table 1. In each scenario, $\leftrightarrow$ denotes a bidirectional application of the ranking loss function to every tuple within a minibatch of size $B$; e.g., "English $\leftrightarrow$ Image" indicates that the terms $\sum_{j=1}^{B} \text{rank}(e^I_j, e^E_j, e^E_k)$ and $\sum_{j=1}^{B} \text{rank}(e^E_j, e^I_j, e^I_l)$ are added to the overall loss, where $k \neq j$ and $l \neq j$ are randomly sampled indices within a minibatch. This is similar to the criteria used in [44] for multilingual image/text retrieval, except that we randomly sample only a single imposter per $(a, p)$ pair.

We trained all models with stochastic gradient descent using a batch size of 128 images with their corresponding captions. All models except the audio-to-audio (no image) model were trained with the same learning rate of 0.001, decreased by a factor of 10 every 30 epochs. The audio-to-audio network used an initial learning rate of 0.01, which resulted in instability for the other scenarios. We divided training into two "rounds" of 90 epochs (for a total of 180 epochs), where the learning rate is reset back to its initial value starting at epoch 91 and then allowed to decay again. We found that this schedule achieved better performance than a single round of 90 epochs, especially for the training scenarios involving simultaneous audio/image and audio/audio retrieval.
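As a concrete reference for this objective, the following sketch (function names ours) implements the sampled bidirectional margin ranking term with $\eta = 1$ and $s(x, y) = x^\top y$, drawing a single random imposter per $(a, p)$ pair as described above; the composite example at the end anticipates the 5x E↔H weighting reported in Section 4.3.

```python
import torch

def rank_loss(a, p, i, eta=1.0):
    """Eq. (1), batched: max(0, eta - s(a,p) + s(a,i)) with s(x,y) = x^T y.
    a, p, i: (B, d) anchor, paired, and imposter embeddings."""
    s_ap = (a * p).sum(dim=1)
    s_ai = (a * i).sum(dim=1)
    return torch.clamp(eta - s_ap + s_ai, min=0).sum()

def bidirectional_rank_loss(e_x, e_y):
    """One '<->' term, e.g. English <-> Image: for each of the B pairs
    (e_x[j], e_y[j]), sample a single random imposter index != j and
    apply the ranking loss in both directions (requires B >= 2)."""
    B = e_x.size(0)
    k = (torch.arange(B) + torch.randint(1, B, (B,))) % B  # k != j
    l = (torch.arange(B) + torch.randint(1, B, (B,))) % B  # l != j
    return rank_loss(e_x, e_y, e_y[k]) + rank_loss(e_y, e_x, e_x[l])

# Example composite objective H<->E<->I<->H, with the E<->H term
# weighted 5x as Section 4.3 reports was helpful:
def composite_loss(e_img, e_eng, e_hin):
    return (bidirectional_rank_loss(e_eng, e_img)
            + bidirectional_rank_loss(e_hin, e_img)
            + 5.0 * bidirectional_rank_loss(e_eng, e_hin))
```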
4.3. Audio-Visual and Audio-Audio Retrieval

For evaluation, we assume a library $L$ of $M$ target vectors, $L = \{t_1, t_2, \ldots, t_M\}$. Assume we are given a query vector $q$ which is known to be associated with some $t$, but we do not know which; our goal is to retrieve this target from $L$. Given a similarity function $s(q, t)$ (we use $s(q, t) = q^\top t$), we rank all of the target vectors by their similarity to $q$ and retrieve the top scoring 1, 5, and 10. If the correct target vector is retrieved, a hit is counted; otherwise, we count the result as a miss. With a set of query vectors covering all of $L$ (a set of $M$ vectors containing a $q$ for every $t$), we compute recall scores over the entire query set (a short code sketch of this computation is given below). Recall that the five training scenarios consider 6 distinct pairwise directions of ranking; for example, we can consider the case in which an English caption is the query and its associated image is the target, or the case in which a Hindi caption is the query and the English caption associated with the same image is the target. We apply the retrieval task to those same directions, and for each model report the relevant recall at the top 1, 5, and 10 results.

Retrieval recall scores for each training scenario are displayed in Table 1. We found that a small amount of relative weighting was necessary for the H↔E↔I↔H loss function in order to prevent the training from completely favoring audio/image or audio/audio ranking over the other; weighting the E↔H ranking loss 5 times higher than the E↔I and H↔I losses produced good results. In all cases, the model trained with the H↔E↔I↔H loss function is the top performer by a significant margin. This suggests that the additional constraint offered by having two separate linguistic accounts of an image's visual semantics can improve the learned representations, even across languages. However, the fact that the E↔I↔H model offered only marginal improvements over the E↔I and H↔I models suggests that to take advantage of this additional constraint, it is necessary to enforce semantic similarity between the captions associated with a given image.

Table 1: Summary of retrieval recall scores (R@1/R@5/R@10) for all models. "English caption" is abbreviated as E, "Hindi caption" as H, and "Image" as I. All models were trained with two rounds of 90 epochs, though in all cases they converged before epoch 180. Even though the E↔I↔H configuration is not specifically trained for the English/Hindi audio-to-audio retrieval tasks, we perform the evaluation anyway for the sake of comparison. Random chance recall scores are R@1=.001, R@5=.005, R@10=.01.

Model       E→I             I→E             H→I             I→H             E→H             H→E
E↔I         .065/.236/.367  .086/.222/.343  -               -               -               -
H↔I         -               -               .061/.185/.303  .064/.186/.277  -               -
E↔H         -               -               -               -               .011/.042/.075  .013/.059/.104
E↔I↔H       .062/.248/.360  .077/.247/.350  .066/.205/.307  .078/.208/.306  .005/.012/.018  .004/.016/.027
H↔E↔I↔H     .083/.282/.424  .080/.252/.365  .080/.250/.356  .074/.235/.354  .034/.114/.182  .033/.121/.203

Perhaps most interesting are our results on cross-lingual speech-to-speech retrieval. We were surprised to find that the E↔H model was able to work at all, given that the retrieval was performed directly at the acoustic level without any linguistic supervision. Even more surprising was the finding that the addition of visual context by the H↔E↔I↔H model approximately doubled the audio-to-audio recall scores across the board, as compared to the E↔H model. This suggests that the information contained within the visual modality provides a strong semantic grounding signal that can act as an "interlingua" for cross-lingual learning.
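As referenced above, here is a minimal sketch of the recall-at-k computation behind Table 1; the function name and the row-aligned layout of the query and target matrices are our own conventions.

```python
import torch

def recall_at_k(queries, targets, ks=(1, 5, 10)):
    """queries: (M, d) embeddings; targets: (M, d) embeddings, where
    queries[i] is associated with targets[i]. Rank every target by
    dot-product similarity to each query and count a hit when the
    true target appears among the top k."""
    sims = queries @ targets.t()                   # (M, M) similarity matrix
    ranked = sims.argsort(dim=1, descending=True)  # target indices, best first
    truth = torch.arange(queries.size(0)).unsqueeze(1)
    hits = (ranked == truth)                       # one True per row
    return {k: hits[:, :k].any(dim=1).float().mean().item() for k in ks}

# e.g. English-caption-to-image retrieval on the 1,000-item validation set:
# recall_at_k(e_eng, e_img)  ->  {1: ..., 5: ..., 10: ...}
```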
To give an example of our model's capabilities, we show the text transcriptions of three randomly selected Hindi captions, along with the transcriptions of their top-1 retrieved English captions using the H↔E↔I↔H model. The English result is denoted by "E:", and the approximate translation of the query from Hindi to English is denoted by "HT:". Note that the model has no knowledge of any of the ASR text.

HT: "There is big beautiful house. There is a garden in front of the house. There is a slender road"
E: "A small house with a stone chimney and a porch"

HT: "This is a picture next to the seashore. Two beautiful girls are laying on the sand, talking to each other"
E: "A sandy beach and the entrance to the ocean the detail in the sky is very vivid"

HT: "There are many windmills on the green grass"
E: "There is a large windmill in a field"

To examine whether individual word translations are indeed being learned, we removed the final pooling layer from the acoustic networks and computed the matrix product of the outputs for the Hindi and English captions associated with the same image. An example of this is shown in Figure 5, which seems to indicate that the model is learning approximately word-level translations directly between the speech waveforms. We plan to perform a more objective analysis of this phenomenon in future work.

[Fig. 5: Similarity matrix between unpooled embeddings of Hindi and English captions. Transcripts are time-aligned along the axes. Regions of high similarity (red) reflect alignments between the speech signals, which correspond to reasonable translations of the underlying words (bottom left).]
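The analysis behind Figure 5 amounts to a matrix product between the unpooled, frame-level outputs of the two acoustic networks. A minimal sketch, reusing the hypothetical AudioBranch from the Section 3 sketch (whose .net attribute exposes the activations before global mean pooling):

```python
import matplotlib.pyplot as plt

def caption_similarity_matrix(hindi_net, english_net, spec_hindi, spec_english):
    """Figure 5 analysis: matrix product of the frame-level (pre-pooling)
    embeddings of a Hindi and an English caption of the same image.
    hindi_net and english_net are assumed to be AudioBranch instances
    from the earlier sketch; specs are (1, 1, 40, 1024) log mel inputs."""
    h = hindi_net.net(spec_hindi).squeeze(2).squeeze(0)      # (d, T_h)
    e = english_net.net(spec_english).squeeze(2).squeeze(0)  # (d, T_e)
    sim = h.t() @ e                                          # (T_h, T_e)
    # In practice the zero-padded tail of each caption should be trimmed first.
    plt.imshow(sim.detach().numpy(), cmap='jet')             # red = high similarity
    plt.xlabel('English caption frames')
    plt.ylabel('Hindi caption frames')
    plt.show()
    return sim
```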
5. CONCLUSION

In this paper, we applied neural models to associate visual images with raw speech audio in both English and Hindi. These models learned cross-modal, cross-lingual semantics directly at the signal level without any form of ASR or linguistic annotation. We demonstrated that multilingual variants of these models can outperform their monolingual counterparts for speech/image association, and also provided evidence that a shared visual context can dramatically improve the ability of the models to learn cross-lingual semantics. We also provided anecdotal evidence that meaningful word-level translations are being implicitly learned, which we plan to investigate further. We believe that our approach is a promising early step towards speech-to-speech translation models that would not require any form of annotation beyond asking speakers to provide narrations of images, videos, etc. in their native language.

6. REFERENCES

[1] M. P. Lewis, G. F. Simon, and C. D. Fennig, Ethnologue: Languages of the World, Nineteenth edition, SIL International. Online version: http://www.ethnologue.com, 2016.
[2] D. Harwath and J. Glass, "Deep multimodal semantic embeddings for speech and images," in ASRU, 2015.
[3] D. Harwath, A. Torralba, and J. R. Glass, "Unsupervised learning of spoken language with visual context," in NIPS, 2016.
[4] D. Harwath and J. Glass, "Learning word-like units from joint audio-visual analysis," in ACL, 2017.
[5] J. Drexler and J. Glass, "Analysis of audio-visual features for unsupervised speech recognition," in Grounded Language Understanding Workshop, 2017.
[6] G. Chrupala, L. Gelderloos, and A. Alishahi, "Representations of language in a model of visually grounded speech signal," in ACL, 2017.
[7] A. Alishahi, M. Barking, and G. Chrupala, "Encoding of phonology in a recurrent neural model of grounded speech," in CoNLL, 2017.
[8] H. Kamper, A. Jansen, and S. Goldwater, "A segmental framework for fully-unsupervised large-vocabulary speech recognition," Computer Speech and Language, 2017.
[9] W. Xiong, J. Droppo, X. Huang, F. Seide, M. Seltzer, A. Stolcke, D. Yu, and G. Zweig, "Achieving human parity in conversational speech recognition," CoRR, vol. abs/1610.05256, 2016.
[10] A. Park and J. Glass, "Unsupervised pattern discovery in speech," IEEE Transactions on Audio, Speech, and Language Processing, vol. 16, no. 1, pp. 186–197, 2008.
[11] A. Jansen, K. Church, and H. Hermansky, "Towards spoken term discovery at scale with zero resources," in Interspeech, 2010.
[12] A. Jansen and B. Van Durme, "Efficient spoken term discovery using randomized algorithms," in ASRU, 2011.
[13] C. Lee and J. Glass, "A nonparametric Bayesian approach to acoustic model discovery," in ACL, 2012.
[14] L. Ondel, L. Burget, and J. Cernocky, "Variational inference for acoustic unit discovery," in 5th Workshop on Spoken Language Technology for Under-resourced Languages, 2016.
[15] H. Kamper, A. Jansen, and S. Goldwater, "Unsupervised word segmentation and lexicon discovery using acoustic word embeddings," IEEE/ACM Trans. Audio, Speech and Lang. Proc., vol. 24, no. 4, pp. 669–679, Apr. 2016.
[16] Y. Zhang, R. Salakhutdinov, H.-A. Chang, and J. Glass, "Resource configurable spoken query detection using deep Boltzmann machines," in ICASSP, 2012.
[17] D. Renshaw, H. Kamper, A. Jansen, and S. Goldwater, "A comparison of neural network methods for unsupervised representation learning on the zero resource speech challenge," in Interspeech, 2015.
[18] H. Kamper, M. Elsner, A. Jansen, and S. Goldwater, "Unsupervised neural network based feature extraction using weak top-down constraints," in ICASSP, 2015.
[19] R. Thiolliere, E. Dunbar, G. Synnaeve, M. Versteegh, and E. Dupoux, "A hybrid dynamic time warping-deep neural network architecture for unsupervised acoustic modeling," in Interspeech, 2015.
[20] K. Barnard, P. Duygulu, D. Forsyth, N. DeFreitas, D. M. Blei, and M. I. Jordan, "Matching words and pictures," Journal of Machine Learning Research, 2003.
[21] R. Socher and F. Li, "Connecting modalities: Semi-supervised segmentation and annotation of images using unaligned text corpora," in CVPR, 2010.
[22] C. Matuszek, N. Fitzgerald, L. Zettlemoyer, L. Bo, and D. Fox, "A joint model of language and perception for grounded attribute learning," in ICML, 2012.
[23] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, M. Ranzato, and T. Mikolov, "DeViSE: A deep visual-semantic embedding model," in NIPS, 2013.
[24] D. Lin, S. Fidler, C. Kong, and R. Urtasun, "Visual semantic search: Retrieving videos via complex textual queries," in CVPR, 2014.
[25] C. Kong, D. Lin, M. Bansal, R. Urtasun, and S. Fidler, "What are you talking about? Text-to-image coreference," in CVPR, 2014.
[26] A. Karpathy and F. Li, "Deep visual-semantic alignments for generating image descriptions," in CVPR, 2015.
[27] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, "Show and tell: A neural image caption generator," in CVPR, 2015.
[28] H. Fang, S. Gupta, F. Iandola, S. Rupesh, L. Deng, P. Dollar, J. Gao, X. He, M. Mitchell, J. C. Platt, C. L. Zitnick, and G. Zweig, "From captions to visual concepts and back," in CVPR, 2015.
[29] H. de Vries, F. Strub, S. Chandar, O. Pietquin, H. Larochelle, and A. C. Courville, "GuessWhat?! Visual object discovery through multi-modal dialogue," in CVPR, 2017.
[30] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh, "VQA: Visual question answering," in ICCV, 2015.
[31] S. E. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee, "Generative adversarial text to image synthesis," CoRR, vol. abs/1605.05396, 2016.
[32] D. Roy, "Grounded spoken language acquisition: Experiments in word learning," IEEE Transactions on Multimedia, 2003.
[33] P. Koehn, F. Och, and D. Marcu, "Statistical phrase-based translation," in NAACL, 2003.
[34] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," in ICLR, 2015.
[35] L. Specia, S. Frank, K. Sima'an, and D. Elliott, "A shared task on multimodal machine translation and crosslingual image description," in ACL Conf. on Machine Translation, 2016.
[36] R. Weiss, J. Chorowski, N. Jaitly, Y. Wu, and Z. Chen, "Sequence-to-sequence models can directly translate foreign speech," in Interspeech, 2017.
[37] L. Duong, A. Anastasopoulos, D. Chiang, S. Bird, and T. Cohn, "An attentional model for speech translation without transcription," in NAACL, 2016.
[38] S. Bansal, H. Kamper, A. Lopez, and S. Goldwater, "Towards speech-to-text translation without speech recognition," in EACL, 2017.
[39] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," CoRR, vol. abs/1409.1556, 2014.
[40] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and F. Li, "ImageNet: A large-scale hierarchical image database," in CVPR, 2009.
[41] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in ICML, 2015.
[42] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva, "Learning deep features for scene recognition using places database," in NIPS, 2014.
[43] J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah, "Signature verification using a "siamese" time delay neural network," in Advances in Neural Information Processing Systems 6, J. D. Cowan, G. Tesauro, and J. Alspector, Eds., pp. 737–744, Morgan-Kaufmann, 1994.
[44] S. Gella, R. Sennrich, F. Keller, and M. Lapata, "Image pivoting for learning multilingual multimodal representations," in EMNLP, 2017.
