Audio-Linguistic Embeddings for Spoken Sentences

A UDIO-LINGUISTIC EMBEDDINGS FOR SPOKEN SENTENCES Albert Haque 1 Michelle Guo 1 Prateek V erma 2 Li F ei-F ei 1 1 Department of Computer Science, Stanford Uni versity 2 Center for Computer Research in Music and Acoustics, Stanford Uni versity ABSTRA CT W e propose spoken sentence embeddings which capture both acoustic and linguistic content. While existing works op- erate at the character , phoneme, or word lev el, our method learns long-term dependencies by modeling speech at the sen- tence le vel. F ormulated as an audio-linguistic multitask learn- ing problem, our encoder-decoder model simultaneously re- constructs acoustic and natural language features from audio. Our results show that spoken sentence embeddings outper- form phoneme and word-lev el baselines on speech recogni- tion and emotion recognition tasks. Ablation studies show that our embeddings can better model high-lev el acoustic con- cepts while retaining linguistic content. Overall, our work illustrates the viability of generic, multi-modal sentence em- beddings for spoken language understanding. 1. INTR ODUCTION During verbal communication, humans process multiple words sequentially , often waiting for a full sentence to be completed. Y et, many written and spoken language systems depend on individual character and word-le vel representa- tions either implicitly or explicitly . This lack of sentence- lev el context can make it difﬁcult to understand sentences containing conjunctions [1], negations [2], and vocal pitch variations [3]. In this work, we in vestigate embeddings in the domain of spoken language processing and propose spoken sentence embeddings, capable of modeling both acoustic and linguistic content in a single latent code. Machine representation of words dates back to ASCII. This one-hot representation encodes each character using a mixture of dummy or indicator v ariables. While this was slowly extended to words, the lar ge v ocabulary size of lan- guages made it dif ﬁcult. Learned, or distributed, word vector representations [4] replaced one-hot encodings. These word vectors are able to capture semantic information including context from neighboring words. Even today , the community continues to build better contextual word embeddings such as ELMo [5], ULMFit [6], and BER T [7]. W ord, phoneme, and grapheme embeddings like Speech2V ec [8] and Char2W av [9] hav e also been proposed for speceh, following techniques from natural language understanding. While word-le vel embeddings are promising, they are of- ten insufﬁcient for speech-related tasks for se veral reasons. First, word and phoneme embeddings capture a narrow tem- poral context, often a few hundred milliseconds at most. As a result, these embeddings cannot capture long-term depen- dencies required for higher-le vel reasoning (e.g., paragraph or song-lev el understanding). Almost all of the systems for speech recognition focus on the correctness of local context (e.g., letters, words, and phonemes) rather than overall se- mantics. Second, for speech recognition, an external language model is often used to correct character and word-lev el pre- dictions. This requires the addition of complex, multiple hy- pothesis generation methods [10]. Sentence-lev el embeddings offer adv antages over word and character embeddings. A sentence-lev el embedding can capture latent factors across words. This is directly useful for higher-le vel audio tasks such as emotion recognition, prosody modeling, and musical style analysis. Furthermore, most ex- ternal language models operate at the sentence-lev el. By ha v- ing a single sentence-lev el embedding, the embedding can capture both acoustic and linguistic content at longer contex- tual window sizes – thus alle viating the need for an external language model entirely , by learning the temporal structure. Contributions. In this work, our contributions are tw o- fold. First, we propose moving from phoneme, character , and word-le vel representations to sentence-lev el understanding by learning spoken sentence embeddings. Second, we design this embedding to capture both linguistic and acoustic con- tent in order to learn latent codes which can be applicable to a variety of speech and language tasks. W e verify the quality of the embedding in our ablation studies, where we assess the generality of sentence-le vel embeddings when used for auto- matic speech recognition and emotion classiﬁcation. W e be- liev e this work will inspire future work in speech processing, semantic understanding, and multi-modal transfer learning. 2. METHOD Our method allows us to learn spoken sentence embeddings that capture both acoustic and linguistic content. In this sec- tion, we discuss (i) how we handle long sequences with a tem- poral con volutional network [11] and (ii) how to learn audio- linguistic content under a multitask learning framew ork [12]. 2.1. T emporal Con volutional Network (TCN) Our goal is to learn a spoken sentence embedding, which can be used for a variety of speech tasks. Recurrent models are often the default starting point for sequence modeling tasks [11]. For most applications, the state-of-the-art approach to start with is very often a recurrent model. This is evide nt in machine translation [13, 14], automatic speech recognition [15, 16, 17], and speech synthesis tasks [18, 19, 9]. Howe ver , recurrent models such as recurrent neural net- works (RNNs) are notoriously difﬁcult to train [20]. For years, machine learning researchers hav e tried to mak e RNNs easier to train through novel training strategies [21] and ar - chitectures [22]. In [11], the authors sho w that fully conv olu- tional networks can outperform recurrent netw orks, without the training complexities [11]. In addition, they can better capture long term dependencies [23] required for such tasks. Motiv ated by these ﬁndings, in this work, we opt for a fully con volutional sequence model (Figure 1). Similar to W aveNet [23], we use a temporal conv olutional network (TCN) [11]. While we use the TCN in this work, any causal model will sufﬁce (e.g., T ransformer [24]). Causal Con volutions. T o begin, we introduce some quick notation. The sequence modeling task is deﬁned as follows. Gi ven an input sequence x of length T , we have x = x 1 , ..., x T , where each x t is an observation for timestep t . Suppose we wish to make a prediction y t at each timestep, then we hav e y = y 1 , ..., y T . The causal constraint states that when predicting y t , it should depend only on past ob- servations x

Original Paper

Loading high-quality paper...

Comments & Academic Discussion

Loading comments...

Original Paper

Related Papers

Comments & Academic Discussion

Leave a Comment