On the difficulty of a distributional semantics of spoken language
Grzegorz Chrupała, Tilburg University, g.chrupala@uvt.nl
Lieke Gelderloos, Tilburg University, l.j.gelderloos@uvt.nl
Ákos Kádár, Tilburg University, a.kadar@uvt.nl
Afra Alishahi, Tilburg University, a.alishahi@uvt.nl

Abstract

In the domain of unsupervised learning most work on speech has focused on discovering low-level constructs such as phoneme inventories or word-like units. In contrast, for written language there is a large body of work on unsupervised induction of semantic representations of words, whole sentences and longer texts. In this study we examine the challenges of adapting these approaches from written to spoken language. We conjecture that unsupervised learning of the semantics of spoken language becomes feasible if we abstract from the surface variability. We simulate this setting with a dataset of utterances spoken by a realistic but uniform synthetic voice. We evaluate two simple unsupervised models which, to varying degrees of success, learn semantic representations of speech fragments. Finally we present inconclusive results on human speech, and discuss the challenges inherent in learning distributional semantic representations on unrestricted natural spoken language.

1 Introduction

In the realm of NLP for written language, unsupervised approaches to inducing semantic representations of words have a long pedigree and a history of substantial success (Landauer et al., 1998; Blei et al., 2003; Mikolov et al., 2013b). The core idea behind these models is to build word representations that can predict their surrounding context. In search of similarly generic and versatile representations of whole sentences, various composition operators have been applied to word representations (e.g. Socher et al., 2013; Kalchbrenner et al., 2014; Kim, 2014; Zhao et al., 2015). Alternatively, sentence representations are induced via the objective of predicting the surrounding sentences (e.g. Le and Mikolov, 2014; Kiros et al., 2015; Arora et al., 2016; Jernite et al., 2017; Logeswaran and Lee, 2018). Such representations capture aspects of the meaning of the encoded sentences, which can be used in a variety of tasks such as semantic entailment or text understanding.

In the case of spoken language, unsupervised methods usually focus on discovering relatively low-level constructs such as phoneme inventories or word-like units. This is mainly because the key insight from distributional semantics, that "you shall know the word by the company it keeps" (Firth, 1957), is hopelessly confounded in the case of spoken language. In text, two words are considered semantically similar if they co-occur with similar neighbors. However, speech segments which occur in the same utterance or situation often have many features in common besides similar meaning, such as being uttered by the same speaker or being accompanied by similar ambient noise.

In this study we show that if we can abstract away from speaker and background noise, we can effectively capture semantic characteristics of spoken utterances in an unsupervised way. We present SegMatch, a model trained to match segments of the same utterance. SegMatch utterance encodings are compared to those of Audio2vec, which is trained to decode the context that surrounds an utterance.
To investigate whether our representations capture semantics, we evaluate on speech and vision datasets where photographic images are paired with spoken descriptions. Our experiments show that for a single synthetic voice, a simple model trained only on image captions can capture pairwise similarities that correlate with those in the visual space. Furthermore we discuss the factors preventing effective learning in datasets with multiple human speakers: these include confounds between semantic and situational factors as well as artifacts in the datasets.

2 Related work

Studies of unsupervised learning from speech typically aim to discover the phonemic or lexical building blocks of the language signal. Park and Glass (2008) show that words and phrase units in continuous speech can be discovered using algorithms based on dynamic time warping. van den Oord et al. (2017) introduce a Vector Quantised-Variational AutoEncoder model, in which a convolutional encoder trained on raw audio data gives discrete encodings that are closely related to phonemes. Recently several unsupervised speech recognition methods were proposed that segment speech and cluster the resulting word-like segments (Kamper et al., 2017a) or encode them into segment embeddings containing phonetic information (Wang et al., 2018). Scharenborg et al. (2018) show that word and phrase units arise as a by-product in end-to-end tasks such as speech-to-speech translation. In the current work, the aim is to extract semantic, rather than word-form, information directly from speech.

Semantic information encoded in speech is used in studies that ground speech to the visual context. Datasets of images paired with spoken captions can be used to train multimodal models that extract visually salient semantic information from speech, without access to textual information (Harwath and Glass, 2015; Harwath et al., 2016; Kamper et al., 2017b; Chrupała et al., 2017; Alishahi et al., 2017; Harwath and Glass, 2017). This form of semantic supervision, through contextual information from another modality, has its limits: it can only help to learn to understand speech describing the here and now.

On the other hand, the success of word embeddings derived from distributional semantic principles has shown how rich the semantic information within the structure of language itself is. Semantic representations of words obtained through Latent Semantic Analysis have proven to closely resemble human semantic knowledge (Blei et al., 2003; Landauer et al., 1998). Word2vec models produce semantically rich word embeddings by learning to predict the surrounding words in text (Mikolov et al., 2013a,b), and this principle is extended to sentences in the Skip-thought model (Kiros et al., 2015) and several subsequent works (Arora et al., 2016; Jernite et al., 2017; Logeswaran and Lee, 2018). In the realm of spoken language, the sequence-to-sequence Audio2vec model of Chung and Glass (2017) learns semantic embeddings for audio segments corresponding to words, by predicting the audio segments around them. Chung and Glass (2018) further experiment with this model and rename it to Speech2vec. Chen et al. (2018) train semantic word embeddings from word-segmented speech as part of their method of training an ASR system from non-aligned speech and text. These works are closely related to our current study, but crucially, unlike them we do not assume that speech is already segmented into discrete words.
3 Models

3.1 Encoder

All the models in this section use the same encoder architecture. The encoder is loosely based on the architecture of Chrupała et al. (2017), i.e. it consists of a 1-dimensional convolutional layer which subsamples the input, followed by a stack of recurrent layers, followed by a self-attention operator. Unlike Chrupała et al. (2017) we use GRU layers (Chung et al., 2014) instead of RHN layers (Zilly et al., 2017), and do not implement residual connections. These modifications are made in order to exploit the fast native CUDNN implementation of a GRU stack and thus speed up experimentation in this exploratory stage of our research. The encoder Enc is defined as follows:

Enc(x) = unit(Attn(GRU_\ell(Conv_{s,d,z}(x))))   (1)

where Conv is a convolutional layer with length s, d channels, and stride z, GRU_\ell is a stack of \ell GRU layers, Attn is self-attention and unit is L2 normalization. The self-attention operator computes a weighted sum of the RNN activations at all timesteps:

Attn(x) = \sum_t \alpha_t x_t   (2)

where the weights \alpha_t are determined by an MLP with learned parameters U and W, and passed through the timewise softmax function:

\alpha_t = \frac{\exp(U \tanh(W x_t))}{\sum_{t'} \exp(U \tanh(W x_{t'}))}   (3)

3.2 Audio2vec

Firstly we define a model inspired by Chung and Glass (2017) which uses the multilayer GRU encoder described above, and a single-layer GRU decoder, conditioned on the output of the encoder. The model of Chung and Glass (2017) works on word-segmented speech: the encoder encodes the middle word of a five-word sequence, and the decoder decodes each of the surrounding words. Similarly, the Skip-thought model of Kiros et al. (2015) works with a sequence of three sentences, encoding the middle one and decoding the previous and next one. In our fully unsupervised setup we do not have access to word segmentation, and thus our Audio2vec models work with arbitrary speech segments. We split each utterance into three equal-sized chunks: the model encodes the middle one, and decodes the first and third one. The decoder predicts the MFCC features at time t+1 based on the state of the hidden layer at time t. From reading Chung and Glass (2017) it is not clear whether in addition to the hidden state their decoder also receives the MFCC frame at t as input. We thus implemented two versions, one with and one without this input.

Audio2vec-C. The decoder receives the output of the encoder as the initial state of the hidden layer, and the frame at t as input as it predicts the next frame at t+1:

\hat{x}^{first}_{t+1} = F h_t   (4)
h_t = gru(h_{t-1}, x^{first}_t)   (5)
h_0 = Enc(x^{middle})   (6)

where x^{first}_t are the MFCC features of the previous chunk at time t, \hat{x}^{first}_{t+1} are the predicted features at the next time step, F is a learned projection matrix, gru(\cdot, \cdot) is a single step of the GRU recurrence, and x^{middle} is the sequence of the MFCC features of the input. The decoder for the third chunk x^{third} is defined in the same way.

Audio2vec-U. The decoder receives the output of the encoder as the input at each time step, but does not have access to the frame at t:

\hat{x}^{first}_{t+1} = F h_t   (7)
h_t = gru(h_{t-1}, Enc(x^{middle}))   (8)

In this version h_0 is a learned parameter. There are two separate decoders: i.e. the weights of the decoder for the first chunk and for the third chunk are not shared. For both versions of Audio2vec the loss function is the Mean Squared Error.
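For concreteness, the following is a minimal PyTorch sketch of the encoder of Eq. 1-3 and of the Audio2vec-C decoder of Eq. 4-6, using the hyperparameters listed in Section 4.3. The class names, the batch-first tensor layout, and the omission of padding masks are our own assumptions rather than details given in the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Attention(nn.Module):
    # Self-attention pooling (Eq. 2-3): alpha_t = softmax_t(U tanh(W x_t)),
    # output = sum_t alpha_t x_t.
    def __init__(self, dim, hidden):
        super().__init__()
        self.W = nn.Linear(dim, hidden, bias=False)
        self.U = nn.Linear(hidden, 1, bias=False)

    def forward(self, x):                                          # x: (batch, time, dim)
        alpha = F.softmax(self.U(torch.tanh(self.W(x))), dim=1)    # softmax over time
        return (alpha * x).sum(dim=1)                              # (batch, dim)

class SpeechEncoder(nn.Module):
    # Eq. 1: conv subsampler -> GRU stack -> attention pooling -> L2 normalization.
    def __init__(self, n_mfcc=13, channels=64, kernel=6, stride=3,
                 gru_dim=512, gru_layers=5, attn_hidden=512):
        super().__init__()
        self.conv = nn.Conv1d(n_mfcc, channels, kernel_size=kernel, stride=stride)
        self.gru = nn.GRU(channels, gru_dim, num_layers=gru_layers, batch_first=True)
        self.attn = Attention(gru_dim, attn_hidden)

    def forward(self, mfcc):                                       # mfcc: (batch, time, n_mfcc)
        x = self.conv(mfcc.transpose(1, 2)).transpose(1, 2)        # subsample along time
        x, _ = self.gru(x)
        return F.normalize(self.attn(x), p=2, dim=1)               # unit-length embedding

class Audio2vecCDecoder(nn.Module):
    # One of the two decoders of Audio2vec-C (Eq. 4-6): a single GRU initialized with
    # the encoder output; at each step it consumes the ground-truth MFCC frame and
    # predicts the next one through the projection F.
    def __init__(self, n_mfcc=13, hidden=512):
        super().__init__()
        self.gru = nn.GRU(n_mfcc, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, n_mfcc, bias=False)          # matrix F in Eq. 4

    def forward(self, enc, frames):        # enc: (batch, hidden); frames: (batch, T, n_mfcc)
        out, _ = self.gru(frames, enc.unsqueeze(0))                # h_0 = Enc(x^middle)
        return self.proj(out)                                      # predictions for t+1

# Hypothetical training step for the first chunk, with MSE loss (Section 3.2):
#   enc = SpeechEncoder()(middle_chunk)
#   pred = Audio2vecCDecoder()(enc, first_chunk[:, :-1])
#   loss = F.mse_loss(pred, first_chunk[:, 1:])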
3.3 SegMatch

This model also works with segments of utterances: we split each utterance approximately in half, while erasing a short portion in the center in order to prevent the model from finding trivial solutions based on matching local patterns at the edges of the segments. The encoder is as described above. After encoding the segments, we project the initial and final segments via separate learned projection matrices:

b = B Enc(x_{0:m})   (9)
e = E Enc(x_{m+k:n})   (10)

where x_{0:n} is the sequence of MFCC frames for an utterance, k is the size of the erased segment, Enc(\cdot) is the encoder, and B and E are the projection matrices for the beginning and end segment respectively. That is, there is a single shared encoder for both types of speech segments (beginning and end), but the projections are separate. There is no decoding; rather, the model learns to match encoded segments from the same utterance and distinguish them from encoded segments from different utterances within the same mini-batch. The loss function is similar to the one used for matching spoken utterances to images in Chrupała et al. (2017), with the difference that here we are matching utterance segments to each other:

L = \sum_{(b,e)} \Big( \sum_{b'} \max[0, \alpha + d(b,e) - d(b',e)] + \sum_{e'} \max[0, \alpha + d(b,e) - d(b,e')] \Big)   (11)

where (b, e) are beginning and end segments from the same utterance, (b', e) and (b, e') are beginning and end segments from two different utterances within a batch, and d(\cdot, \cdot) is the cosine distance between encoded segments. The loss function thus attempts to make the cosine distance between encodings of matching segments smaller, by a margin, than the distance between encodings of mismatching segment pairs.

Note that the specific way we segment speech is not a crucial component of either of the models: it is mostly driven by the fact that we run our experiments on speech and vision datasets, where speech consists of isolated utterances. For data consisting of longer narratives, or dialogs, we could use different segmentation schemes.

4 Experimental setup

4.1 Datasets

In order to facilitate evaluation of the semantic aspect of the learned representations, we work with speech and vision datasets, which couple photographic images with their spoken descriptions. Thanks to the structure of these data we can use the evaluation metrics detailed in Section 4.2.

Synthetically Spoken COCO. This dataset was created by Chrupała et al. (2017), based on the original COCO dataset (Lin et al., 2014), using the Google TTS API. The captions are spoken by a single synthetic voice, which is realistic but simpler than human speakers, lacking variability and ambient noise. There are 300,000 images, each with five captions. Five thousand images each are held out for validation and test.

Flickr8k Audio Caption Corpus. This dataset (Harwath and Glass, 2015) contains the captions in the original Flickr8K corpus (Hodosh et al., 2013) read aloud by crowdworkers. There are 8,000 images, each with five descriptions. One thousand images are held out for validation, and another one thousand for the test set.

Places Audio Caption Corpus. This dataset was collected by Harwath et al. (2016) using crowdworkers. Here each image is described by a single spontaneously spoken caption. There are 214,585 training images, and 1,000 validation images (there are no separate test data).
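The batched form of the loss in Eq. 11 can be computed from a single pairwise cosine-distance matrix, analogous to common implementations of the image-caption retrieval loss it is modelled on. The sketch below is our own reading of the equation; the function name and the masked-diagonal formulation are assumptions, not the authors' code.

import torch

def segmatch_loss(b, e, margin=0.2):
    # b, e: (batch, dim) L2-normalized encodings of beginning and end segments,
    # where b[i] and e[i] come from the same utterance.
    # Cosine distance matrix: dist[i, j] = d(b_i, e_j); since the encodings are
    # unit length, cosine similarity is just the dot product.
    dist = 1.0 - b @ e.t()                                   # (batch, batch)
    pos = dist.diag().unsqueeze(1)                           # d(b, e) for matching pairs
    # Hinge terms of Eq. 11: vary the end segment e' (rows) and the beginning b' (columns).
    cost_e = torch.clamp(margin + pos - dist, min=0.0)       # alpha + d(b,e) - d(b,e')
    cost_b = torch.clamp(margin + pos.t() - dist, min=0.0)   # alpha + d(b,e) - d(b',e)
    # Exclude the matching pairs themselves from the sums.
    eye = torch.eye(dist.size(0), dtype=torch.bool, device=dist.device)
    return cost_e.masked_fill(eye, 0).sum() + cost_b.masked_fill(eye, 0).sum()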
4.2 Evaluation metrics

We evaluate the quality of the learned semantic speech representations according to the following criteria.

Paraphrase retrieval. For the Synthetically Spoken COCO dataset as well as for the Flickr8k Audio Caption Corpus each image is described by five independent spoken captions. Thus captions describing the same image are effectively paraphrases of each other. This structure of the data allows us to use a paraphrase retrieval task as a measure of the semantic quality of the learned speech embeddings. We encode each of the spoken utterances in the validation data, and rank the others according to cosine similarity. We then measure (a) the median rank of the top-ranked paraphrase, and (b) recall@K: the proportion of paraphrases among the K top-ranked utterances, for K ∈ {1, 5, 10}.

Representational similarity to image space. Representational similarity analysis (RSA) is a way of evaluating how pairwise similarities between objects are correlated in two object representation spaces (Kriegeskorte et al., 2008). Here we compare cosine similarities among encoded utterances with cosine similarities among vector representations of images. Specifically, we create two pairwise N × N similarity matrices: (a) among encoded utterances from the validation data, and (b) among the images corresponding to each utterance in (a). Note that since there are five descriptions per image, each image is replicated five times in matrix (b). We then take the upper triangulars of these matrices (excluding the diagonal) and compute Pearson's correlation coefficient between them. The image features for this evaluation are obtained from the final fully connected layer of VGG-16 (Simonyan and Zisserman, 2014) pre-trained on ImageNet (Russakovsky et al., 2014) and consist of 4096 dimensions.

4.3 Settings

We preprocess the audio by extracting 12-dimensional mel-frequency cepstral coefficients (MFCC) plus the log of the total energy. We use 25 millisecond windows, sampled every 10 milliseconds. Audio2vec and SegMatch models are trained for a maximum of 15 epochs with Adam, with learning rate 0.0002 and gradient clipping at 2.0. SegMatch uses margin α = 0.2. The encoder GRU has 5 layers of 512 units. The convolutional layer has 64 channels, kernel size 6 and stride 3. The hidden layer of the attention MLP has 512 units. The GRU of the Audio2vec decoder has 512 hidden units; the size of the output of the projections B and E in SegMatch is also 512 units. For SegMatch the size of the erased center portion of the utterance is 30 frames. We apply early stopping and report all results of each model after the epoch for which it scored best on recall@10. When applying SegMatch to human data, each mini-batch includes utterances spoken only by one speaker: this is in order to discourage the model from encoding speaker-specific features.

5 Results

5.1 Synthetic speech

Table 1 shows the evaluation results on synthetic speech. Representations learned by Audio2vec and SegMatch are compared to the performance of random vectors, mean MFCC vectors, as well as visually supervised representations (VGS, the model from Chrupała et al. (2017)). Audio2vec works better than chance and mean MFCC on paraphrase retrieval, but does not correlate with the visual space. SegMatch works much better than Audio2vec according to both criteria. It does not come close to VGS on paraphrase retrieval, but it correlates with the visual modality even better.

              Recall@10 (%)   Median rank   RSA image
VGS                      27             6         0.4
SegMatch                 10            37         0.5
Audio2vec-U               5           105         0.0
Audio2vec-C               2           647         0.0
Mean MFCC                 1         1,414         0.0
Chance                    0         3,955         0.0

Table 1: Results on Synthetically Spoken COCO. The row labeled VGS is the visually supervised model from Chrupała et al. (2017).
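As a concrete reading of the RSA metric of Section 4.2, the sketch below computes the correlation between the two pairwise similarity structures. It assumes the utterance encodings and VGG-16 image features are available as NumPy arrays; the function name is ours.

import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import pearsonr

def rsa_image_space(utt_emb, img_feat):
    # utt_emb: (N, d_u) encoded validation utterances.
    # img_feat: (N, d_i) image features for the image each utterance describes
    # (each image appears once per caption, i.e. five times).
    # pdist with the cosine metric returns 1 - cosine similarity for every pair,
    # i.e. exactly the upper triangle (diagonal excluded) of the N x N matrix;
    # Pearson's r is unaffected by this affine change from similarity to distance.
    utt_pairs = pdist(utt_emb, metric="cosine")
    img_pairs = pdist(img_feat, metric="cosine")
    r, _ = pearsonr(utt_pairs, img_pairs)
    return r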
5.2 Human speech

Places. This dataset only features a single caption per image and thus we only evaluate according to RSA: with both SegMatch and Audio2vec we found the correlations to be zero.

Flickr8K. Initial experiments with Flickr8K were similarly unsuccessful. Analysis of the learned SegMatch representations revealed that, in spite of partitioning the data by speaker for training, speaker identity can be decoded from them.

Enforcing speaker invariance. We thus implemented a version of SegMatch where an auxiliary speaker classifier is connected to the encoder via a gradient reversal operator (Ganin and Lempitsky, 2015). This architecture optimizes the main loss, while at the same time pushing the encoder to remove information about speaker identity from the representation it outputs. In preliminary experiments we saw that this addition was able to prevent speaker identity from being encoded in the representations during the first few epochs of training. Evaluating this speaker-invariant representation gave contradictory results, shown in Table 2: very good scores on paraphrase retrieval, but zero correlation with visual space.

              Recall@10 (%)   Median rank   RSA image
VGS                      15            17         0.2
SegMatch                 12            17         0.0
Mean MFCC                 0           711         0.0

Table 2: Results on Flickr8K. The row labeled VGS is the visually supervised model from Chrupała et al. (2017).

Further analysis showed that there seems to be an artifact in the Flickr8K data whereby spoken captions belonging to consecutively numbered images share some characteristics, even though the images do not. As a side effect, this causes captions belonging to the same image to also share some features, independent of their semantic content, leading to high paraphrasing scores. The artifact may be due to changes in the data collection procedure which affected some aspect of the captions in ways that correlate with their sequential ordering in the dataset. If we treat the image ID number as a regression target, and the first two principal components of the SegMatch representation of one of its captions as the predictors, we can account for about 12% of the holdout variance in IDs using a non-linear model (either K-Nearest Neighbors or Random Forest). This effect disappears if we arbitrarily relabel images.
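For reference, gradient reversal in the sense of Ganin and Lempitsky (2015) can be implemented as an autograd function that is the identity on the forward pass and negates (and scales) the gradient on the backward pass. The sketch below is a generic PyTorch version; the scaling factor lamb and the composition with a speaker classifier are our own assumptions rather than details reported above.

import torch

class GradReverse(torch.autograd.Function):
    # Identity in the forward pass; multiplies the incoming gradient by -lamb in the
    # backward pass, so the encoder below it is trained to make the auxiliary speaker
    # classifier fail while the classifier itself is trained to succeed.

    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lamb * grad_output, None

def grad_reverse(x, lamb=1.0):
    return GradReverse.apply(x, lamb)

# Hypothetical usage: the speaker classifier sees the reversed encoder output, and its
# cross-entropy loss is added to the SegMatch loss.
#   speaker_logits = speaker_mlp(grad_reverse(encoder(mfcc)))
#   loss = segmatch_loss(b, e) + torch.nn.functional.cross_entropy(speaker_logits, speaker_ids)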
6 Conclusion

For synthetic speech the SegMatch approach to inducing utterance embeddings shows very promising performance. Likewise, previous work has shown some success with word-segmented speech. There remain challenges in carrying over these results to natural, unsegmented speech. Word segmentation is a highly non-trivial research problem in itself, and the variability of spoken language is a serious and intractable confounding factor. Even when controlling for speaker identity, there are still superficial features of the speech signal which make it easy for the model to ignore the semantic content. Some of these may be due to artifacts in datasets, and thus care is needed when evaluating unsupervised models of spoken language: for example, the use of multiple evaluation criteria may help spot spurious results. In spite of these challenges, in future work we want to further explore the effectiveness of enforcing desired invariances via auxiliary classifiers with gradient reversal.

References

Afra Alishahi, Marie Barking, and Grzegorz Chrupała. 2017. Encoding of phonology in a recurrent neural model of grounded speech. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 368-378. Association for Computational Linguistics.

Sanjeev Arora, Yingyu Liang, and Tengyu Ma. 2016. A simple but tough-to-beat baseline for sentence embeddings. In ICLR.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993-1022.

Yi-Chen Chen, Chia-Hao Shen, Sung-Feng Huang, and Hung-yi Lee. 2018. Towards unsupervised automatic speech recognition trained by unaligned speech and text only. arXiv preprint arXiv:1803.10952.

Grzegorz Chrupała, Lieke Gelderloos, and Afra Alishahi. 2017. Representations of language in a model of visually grounded speech signal. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics.

Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. In NIPS 2014 Deep Learning and Representation Learning Workshop.

Yu-An Chung and James Glass. 2017. Learning word embeddings from speech. In NIPS ML4Audio Workshop.

Yu-An Chung and James Glass. 2018. Speech2vec: A sequence-to-sequence framework for learning word embeddings from speech. arXiv preprint arXiv:1803.08976.

John Rupert Firth. 1957. A synopsis of linguistic theory 1930-1955, volume 1952-59. The Philological Society.

Yaroslav Ganin and Victor Lempitsky. 2015. Unsupervised domain adaptation by backpropagation. In Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 1180-1189, Lille, France. PMLR.

David Harwath and James Glass. 2015. Deep multimodal semantic embeddings for speech and images. In IEEE Automatic Speech Recognition and Understanding Workshop.

David Harwath and James R. Glass. 2017. Learning word-like units from joint audio-visual analysis. arXiv preprint arXiv:1701.07481.

David Harwath, Antonio Torralba, and James Glass. 2016. Unsupervised learning of spoken language with visual context. In Advances in Neural Information Processing Systems, pages 1858-1866.

Micah Hodosh, Peter Young, and Julia Hockenmaier. 2013. Framing image description as a ranking task: Data, models and evaluation metrics. Journal of Artificial Intelligence Research, 47:853-899.

Yacine Jernite, Samuel R. Bowman, and David Sontag. 2017. Discourse-based objectives for fast unsupervised sentence representation learning. arXiv preprint arXiv:1705.00557.

Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. 2014. A convolutional neural network for modelling sentences. arXiv preprint arXiv:1404.2188.

Herman Kamper, Aren Jansen, and Sharon Goldwater. 2017a. A segmental framework for fully-unsupervised large-vocabulary speech recognition. Computer Speech & Language, 46:154-174.

Herman Kamper, Shane Settle, Gregory Shakhnarovich, and Karen Livescu. 2017b. Visually grounded learning of keyword prediction from untranscribed speech. In Proc. Interspeech 2017, pages 3677-3681.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.
Ryan Kiros, Yukun Zhu, Ruslan R. Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-thought vectors. In Advances in Neural Information Processing Systems, pages 3276-3284.

Nikolaus Kriegeskorte, Marieke Mur, and Peter A. Bandettini. 2008. Representational similarity analysis - connecting the branches of systems neuroscience. Frontiers in Systems Neuroscience, 2:4.

Thomas K. Landauer, Peter W. Foltz, and Darrell Laham. 1998. An introduction to latent semantic analysis. Discourse Processes, 25(2-3):259-284.

Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In International Conference on Machine Learning, pages 1188-1196.

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In Computer Vision - ECCV 2014, pages 740-755. Springer.

Lajanugen Logeswaran and Honglak Lee. 2018. An efficient framework for learning sentence representations. arXiv preprint arXiv:1803.02893.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111-3119.

Aäron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. 2017. Neural discrete representation learning. CoRR, abs/1711.00937.

Alex S. Park and James R. Glass. 2008. Unsupervised pattern discovery in speech. IEEE Transactions on Audio, Speech, and Language Processing, 16(1):186-197.

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. 2014. ImageNet large scale visual recognition challenge.

Odette Scharenborg, Laurent Besacier, Alan Black, Mark Hasegawa-Johnson, Florian Metze, Graham Neubig, Sebastian Stüker, Pierre Godard, Markus Müller, Lucas Ondel, et al. 2018. Linguistic unit discovery from multi-modal inputs in unwritten languages: Summary of the Speaking Rosetta JSALT 2017 workshop. arXiv preprint arXiv:1802.05092.

Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631-1642.

Yu-Hsuan Wang, Hung-yi Lee, and Lin-shan Lee. 2018. Segmental audio word2vec: Representing utterances as sequences of vectors with applications in spoken term detection. In 2018 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings.

Han Zhao, Zhengdong Lu, and Pascal Poupart. 2015. Self-adaptive hierarchical sentence model. In IJCAI, pages 4069-4076.

Julian Georg Zilly, Rupesh Kumar Srivastava, Jan Koutník, and Jürgen Schmidhuber. 2017. Recurrent highway networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 4189-4198, International Convention Centre, Sydney, Australia. PMLR.