EFFICIENT KEYWORD SPOTTING USING DILATED CONVOLUTIONS AND GATING

Alice Coucke, Mohammed Chlieh, Thibault Gisselbrecht, David Leroy, Mathieu Poumeyrol, Thibaut Lavril
Snips, Paris, France

ABSTRACT

We explore the application of end-to-end stateless temporal modeling to small-footprint keyword spotting, as opposed to recurrent networks that model long-term temporal dependencies using internal states. We propose a model inspired by the recent success of dilated convolutions in sequence modeling applications, allowing us to train deeper architectures in resource-constrained configurations. Gated activations and residual connections are also added, following a configuration similar to WaveNet. In addition, we apply a custom target labeling that back-propagates loss from specific frames of interest, yielding higher accuracy while only requiring detection of the end of the keyword. Our experimental results show that our model outperforms a max-pooling loss trained recurrent neural network using LSTM cells, with a significant decrease in false rejection rate. The underlying dataset – "Hey Snips" utterances recorded by over 2.2K different speakers – has been made publicly available to establish an open reference for wake-word detection.

Index Terms – end-to-end keyword spotting, wake-word detection, dilated convolution, open dataset

1. INTRODUCTION

Keyword spotting (KWS) aims at detecting a pre-defined keyword or set of keywords in a continuous stream of audio. In particular, wake-word detection is an increasingly important application of KWS, used to initiate an interaction with a voice interface. In practice, such systems run on low-resource devices and listen continuously for a specific wake word. An effective on-device KWS therefore requires real-time response and high accuracy for a good user experience, while limiting memory footprint and computational cost.
Traditional approaches to keyword spotting involve Hidden Markov Models (HMMs) for modeling both keyword and background [1, 2, 3]. In recent years, Deep Neural Networks (DNNs) have proven to yield efficient small-footprint solutions, as shown first by the fully-connected networks introduced in [4]. More advanced architectures have been successfully applied to KWS, such as Convolutional Neural Networks (CNNs) exploiting local dependencies [5, 6]. They have demonstrated efficiency in terms of inference speed and computational cost, but fail at capturing large patterns with reasonably small models. Recent works have suggested RNN-based keyword spotting using LSTM cells that can leverage longer temporal contexts through gating mechanisms and internal states [7, 8, 9]. However, because RNNs may suffer from state saturation when facing continuous input streams [10], their internal state needs to be periodically reset.

In this work we focus on end-to-end stateless temporal modeling, which can take advantage of a large context while limiting computation and avoiding saturation issues. By end-to-end model, we mean a straightforward model with a binary target that does not require a precise phoneme alignment beforehand. We explore an architecture based on a stack of dilated convolution layers, effectively operating on a broader scale than standard convolutions while limiting model size. We further improve our solution with gated activations and residual skip-connections, inspired by the WaveNet-style architecture explored previously for text-to-speech [11] and voice activity detection [10], but to our knowledge never applied to KWS. In [12], the authors explore Deep Residual Networks (ResNets) for KWS. ResNets differ from WaveNet models in that they do not leverage skip-connections and gating, and they apply convolution kernels in the frequency domain, drastically increasing the computational cost.
In addition, the long-term dependency our model can capture is exploited by implementing a custom "end-of-keyword" target labeling, increasing the accuracy of our model. A max-pooling loss trained LSTM initialized with a cross-entropy pre-trained network is chosen as a baseline, as it is one of the most effective models taking advantage of longer temporal contexts [8]. The rest of the paper is organized in two main parts. Section 2 describes the different components of our model as well as our labeling. Section 3 focuses on the experimental setup and the performance results obtained on the publicly available "Hey Snips" dataset (https://research.snips.ai/datasets/keyword-spotting).

2. MODEL IMPLEMENTATION

2.1. System description

The acoustic features are 20-dimensional log-Mel filterbank energies (LFBEs), extracted from the input audio every 10 ms over a window of 25 ms. A binary target is used; see Section 2.4 for more details about labeling. During decoding, the system computes smoothed posteriors by averaging the output of a sliding context window containing w_smooth frames, a parameter chosen after experimental tuning. End-to-end models such as the one presented here do not require any post-processing step besides smoothing, as opposed to multi-class models such as [4, 5]. Indeed, the system triggers when the smoothed keyword posterior exceeds a pre-defined threshold.

2.2. Neural network architecture

WaveNet was initially proposed in [11] as a generative model for speech synthesis and other audio generation tasks. It consists of stacked causal convolution layers wrapped in a residual block with gated activation units, as depicted in Figure 1.

Fig. 1: WaveNet architecture [11].

2.2.1. Dilated causal convolutions

Standard convolutional networks cannot capture long temporal patterns with reasonably small models due to the increase in computational cost yielded by larger receptive fields.
Dilated convolutions skip some input values so that the convolution kernel is applied over an area larger than its own size. The network therefore operates on a larger scale, without the downside of increasing the number of parameters. The receptive field r of a network made of stacked convolutions reads:

    r = Σ_i d_i (s_i − 1),

where d_i refers to the dilation rate (d_i = 1 for normal convolutions) and s_i to the filter size of the i-th layer. Additionally, causal convolution kernels ensure a causal ordering of input frames: the prediction emitted at time t only depends on previous time stamps. This reduces the latency at inference time.

2.2.2. Gated activations and residual connections

As mentioned in [11], gated activation units – a combination of tanh and sigmoid activations controlling the propagation of information to the next layer – prove to model audio signals efficiently. Residual learning strategies such as skip connections are also introduced to speed up convergence and address the issue of vanishing gradients posed by the training of deeper models. Each layer yields two outputs: one is directly fed to the next layer as usual, but the second one skips it. All skip-connection outputs are then summed into the final output of the network. A large temporal dependency can therefore be achieved by stacking multiple dilated convolution layers. By inserting residual connections between each layer, we are able to train a network of 24 layers on a relatively small amount of data, which corresponds to a receptive field of 182 frames or 1.83 s.

Fig. 2: Dilated convolution layers with an exponential dilation rate of 1, 2, 4, 8 and filter size of 2. Blue nodes are input frame vectors, orange nodes are cached intermediate vectors used for streaming inference, green nodes are output vectors which are actually computed.
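As a sanity check on the receptive-field formula above, the configuration reported in Section 3.2 (an initial causal convolution followed by 24 dilated layers, all with filter size 3 and dilation rates cycling through 1, 2, 4, 8) can be plugged in directly. This is an illustrative sketch, not the authors' code:

```python
def receptive_field(dilations, filter_sizes):
    # r = sum_i d_i * (s_i - 1), the formula from Section 2.2.1
    return sum(d * (s - 1) for d, s in zip(dilations, filter_sizes))

# Initial causal convolution (dilation 1) followed by 24 dilated layers,
# all with filter size 3; dilation rates repeat the sequence 1, 2, 4, 8.
dilations = [1] + [1, 2, 4, 8] * 6
filter_sizes = [3] * len(dilations)

print(receptive_field(dilations, filter_sizes))  # 182 frames, i.e. ~1.8 s at a 10 ms hop
```

Doubling the dilation rate at each layer makes the receptive field grow exponentially with depth, while the parameter count grows only linearly.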
The importance of gating and residual connections is analyzed in Section 3.3.2.

2.3. Streaming inference

In addition to reducing the model size, dilated convolutions allow the network to run in a streaming fashion during inference, drastically reducing the computational cost. When receiving a new input frame, the corresponding posteriors are recovered using previous computations, kept in memory for efficiency purposes as described in Figure 2. This cached implementation reduces the number of Floating Point Operations per Second (FLOPS) to a level suiting production requirements.

2.4. End-of-keyword labeling

Our approach consists in associating a target 1 to frames within a given time interval Δt before and after the end of the keyword. The optimal value for Δt is tuned on the dev set. Additionally, a masking scheme is applied, discarding background frames outside of the labeling window in positive samples.

                        Train     Dev    Test
Hey Snips  utterances    5876    2504    2588
           speakers      1179     516     520
           max/speaker     10      10      10
Negative   utterances   45344   20321   20821
           speakers      3330    1474    1469
           max/speaker     30      30      30

Table 1: Dataset statistics.

A traditional labeling approach, however, associates a target 1 to all frames aligned with the keyword. In this configuration, the model has a tendency to trigger as soon as the keyword starts, whether or not the sample contains the full keyword. One advantage of our approach is that the network will trigger near the end of the keyword, once it has seen enough context. Moreover, our labeling does not need any phoneme alignment, but only the end of the keyword, which is easily obtained with a VAD system (only needed for labeling, not used for inference). Furthermore, thanks to masking, the precise boundaries of the labeling window are not learned, making the network more robust to labeling imprecisions.
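The end-of-keyword labeling and masking scheme of Section 2.4 can be sketched as follows; the function name and frame-index conventions are illustrative assumptions, not the paper's implementation:

```python
def end_of_keyword_labels(num_frames, keyword_end, delta_frames):
    """Assign target 1 to frames within +/- delta_frames of the keyword end;
    mask out (exclude from the loss) all other frames of a positive sample.
    Hypothetical sketch: frame indices and window conventions are assumptions."""
    targets, loss_mask = [], []
    for t in range(num_frames):
        in_window = abs(t - keyword_end) <= delta_frames
        targets.append(1 if in_window else 0)
        loss_mask.append(1 if in_window else 0)  # background frames are masked
    return targets, loss_mask

# Delta t = 160 ms corresponds to 15 frames on either side at a 10 ms hop
# (the value tuned on the dev set, Section 3.2).
targets, loss_mask = end_of_keyword_labels(num_frames=200, keyword_end=120, delta_frames=15)
print(sum(loss_mask))  # 31 frames contribute to the loss
```

For negative samples, all frames would keep target 0 with no masking, so the network still sees plenty of background during training.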
The relative importance of end-of-keyword labeling and masking is analyzed in Section 3.3.2.

3. EXPERIMENTS

3.1. Open dataset

The proposed approach is evaluated on a crowdsourced close-talk dataset. The chosen keyword is "Hey Snips", pronounced with no pause between the two words. The dataset contains a large variety of English accents and recording environments. Around 11K wake-word utterances and 86.5K (~96 hours) negative examples have been recorded; see Table 1 for more details. Note that negative samples have been recorded in the same conditions as wake-word utterances, and therefore arise from the same domain (speaker, hardware, environment, etc.). This prevents the model from discerning the two classes based on domain-dependent acoustic features.

Positive data has been cleaned by automatically removing samples of extreme duration and samples with repeated occurrences of the wake word. The positive dev and test sets have been manually cleaned to discard any mispronunciations of the wake word (e.g. "Hi Snips" or "Hey Snaips"), leaving the training set untouched. Noisy conditions are simulated by augmenting samples with music and noise background audio from Musan [13]. The positive dev and test datasets are augmented at a 5 dB signal-to-noise ratio (SNR). The full dataset and its metadata are available for research purposes at https://research.snips.ai/datasets/keyword-spotting.

Although some keyword spotting datasets are freely available, such as the Speech Commands dataset [14] for voice command classification, there is no equivalent in the specific field of wake-word detection. By establishing an open reference for wake-word detection, we hope to promote transparency and reproducibility in a highly competitive field where datasets are often kept private.

3.2. Experimental setup

The network consists of an initial causal convolution layer (filter size 3) and 24 layers of gated dilated convolutions (filter size 3). The 24 dilation rates are a repeating sequence {1, 2, 4, 8, 1, 2, 4, 8, ...}. Residual connections are created between each layer, and skip connections are accumulated at each layer and eventually fed to a DNN followed by a softmax for classification, as depicted in Figure 1. We used projection layers of size 16 for residual connections and of size 32 for skip connections. The optimal duration of the end-of-keyword labeling interval, as defined in Section 2.4, is Δt = 160 ms (15 frames before and 15 frames after the end of the keyword). The posteriors are smoothed over a sliding context window of w_smooth = 30 frames, also tuned on the dev set.

The main baseline model is an LSTM trained with a max-pooling based loss, initialized with a cross-entropy pre-trained network, as it is another example of an end-to-end temporal model [8]. The idea of the max-pooling loss is to teach the network to fire at its highest-confidence time by back-propagating loss from the most informative keyword frame, the one with the maximum posterior for the corresponding keyword. More specifically, the network is a single layer of unidirectional LSTM with 128 memory blocks and a projection layer of dimension 64, following a configuration similar to [8] but matching the number of parameters of the proposed architecture (see Section 3.3.1). 10 frames in the past and 10 frames in the future are stacked to the input frame. Standard frame labeling is applied, but with the frame masking strategy described in Section 2.4. The authors of [8] mention back-propagating loss only from the last few frames, but report that the LSTM network performed poorly in that setting. The same smoothing strategy is applied, over a window of w_smooth = 8 frames, after tuning on dev data.
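The posterior smoothing and thresholding described in Section 2.1 amount to a moving average over the last w_smooth frame posteriors, compared against a trigger threshold. A minimal streaming sketch follows; the 0.5 threshold is a placeholder (the paper sweeps the operating threshold), and the function is an assumption, not the authors' decoder:

```python
from collections import deque

def first_trigger(posteriors, w_smooth=30, threshold=0.5):
    """Return the first frame index at which the smoothed keyword posterior
    exceeds the threshold, or None if the system never triggers."""
    window = deque(maxlen=w_smooth)  # sliding context window of raw posteriors
    for t, p in enumerate(posteriors):
        window.append(p)              # old frames fall out automatically
        if sum(window) / len(window) > threshold:
            return t
    return None
```

Because it consumes one posterior at a time and keeps only w_smooth values in memory, this decoding step fits the streaming-inference setting of Section 2.3.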
For comparison, we also add as a CNN variant the base architecture trad-fpool3 from [5], a multi-class model with 4 output labels ("hey", "sni", "ps", and background). Among those proposed in [5], this is the architecture with the lowest number of FLOPS while having a number of parameters similar to the two other models studied here (see Section 3.3.1).

The Adam optimization method is used for the three models, with a learning rate of 10^-3 for the proposed architecture, 10^-4 for the CNN, and 5·10^-5 for the LSTM baseline. Additionally, gradient norm clipping to 10 is applied. A scaled uniform distribution for initialization [15] (or "Xavier" initialization) yielded the best performance for the three models. We also note that the LSTM network is much more sensitive to the chosen initialization scheme.

Model     Params   FLOPS   FRR clean   FRR noisy
WaveNet    222 K    22 M      0.12        1.60
LSTM       257 K    26 M      2.09       11.21
CNN        244 K   172 M      2.51       13.18

Table 2: Number of parameters, multiplications per second, and false rejection rate in percent on clean (FRR clean) and 5 dB SNR noisy (FRR noisy) positive samples, at 0.5 false alarms per hour.

3.3. Results

3.3.1. System performance

The performance of the three models is first measured by observing the False Rejection Rate (FRR) on clean and noisy (5 dB SNR) positive samples at the operating threshold of 0.5 False Alarms per Hour (FAH) computed on the collected negative data. Hyper-parameters are tuned on the dev set and results are reported on the test set. Table 2 displays these quantities as well as the number of parameters and multiplications per second performed during inference. The proposed architecture yields a lower FRR than the LSTM (resp. CNN) baseline, with a 94% (resp. 95%) decrease in clean conditions and an 86% (resp. 88%) decrease in noisy conditions. The number of parameters is similar for the three architectures, but the number of FLOPS is higher by an order of magnitude for the CNN baseline, while resulting in a poorer FRR in a noisy environment. Figure 3 provides the Detection Error Tradeoff (DET) curves and shows that the WaveNet model also outperforms the baselines over a whole range of triggering thresholds.

Fig. 3: DET curves for the proposed architecture (green) compared to the LSTM (dotted yellow) and CNN (dashed blue) baselines in clean (a) and noisy (b, 5 dB SNR) environments.

3.3.2. Ablation analysis

To assess the relative importance of some characteristics of the proposed architecture, we study the difference in FRR observed once each of them is removed separately, all things being equal.

                   FRR clean   FRR noisy
Default labeling     +0.36       +1.33
No masking           +0.28       +0.46
No gating            +0.24       +2.57

Table 3: Variation in FRR (absolute) for the proposed architecture when removing different characteristics separately, all things being equal.

Table 3 shows that the end-of-keyword labeling is particularly helpful in improving the FRR at a fixed FAH, especially in noisy conditions. Masking background frames in positive samples also helps, but to a lesser extent. Similarly to what is observed in [10], gating contributes to improving the FRR, especially in noisy conditions. We finally observed that removing either residual or skip connections separately has little effect on the performance. However, we could not properly train the proposed model without any of these connections. This seems to confirm that implementing at least one bypassing strategy is key to constructing deeper network architectures.

4. CONCLUSION

This paper introduces an end-to-end stateless modeling approach for keyword spotting, based on dilated convolutions coupled with residual connections and gating, encouraged by the success of the WaveNet architecture in audio generation tasks [11, 10]. Additionally, a custom frame labeling is applied, associating a target 1 to frames located within a small time interval around the end of the keyword. The proposed architecture is compared against an LSTM baseline similar to the one proposed in [8]. Because of their binary targets, neither the proposed model nor the LSTM baseline requires any phoneme alignment or post-processing besides posterior smoothing. We also added a multi-class CNN baseline [5] for comparison. We have shown that the presented WaveNet model significantly reduces the false rejection rate at a fixed false alarm rate of 0.5 per hour, in both clean and noisy environments, on a crowdsourced dataset made publicly available for research purposes. The proposed model seems to be very efficient in the specific domain defined by this dataset, and future work will focus on domain adaptation in terms of recording hardware, accents, or far-field settings, so that it can be deployed easily in new environments.

5. ACKNOWLEDGEMENTS

We thank Oleksandr Olgashko for his contribution in developing the training framework. We are grateful to the crowd of contributors who recorded the dataset. We are indebted to the users of the Snips Voice Platform for their valuable feedback.

6. REFERENCES

[1] Richard C. Rose and Douglas B. Paul, "A hidden Markov model based keyword recognition system," in Proceedings of ICASSP. IEEE, 1990, pp. 129–132.

[2] Jay G. Wilpon, Lawrence R. Rabiner, C.-H. Lee, and E. R. Goldman, "Automatic recognition of keywords in unconstrained speech using hidden Markov models," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 38, no. 11, pp. 1870–1878, 1990.

[3] J. G. Wilpon, L. G. Miller, and P. Modi, "Improvements and applications for key word recognition using hidden Markov modeling techniques," in Proceedings of ICASSP. IEEE, 1991, pp. 309–312.

[4] Guoguo Chen, Carolina Parada, and Georg Heigold, "Small-footprint keyword spotting using deep neural networks," in Proceedings of ICASSP. IEEE, 2014, pp. 4087–4091.

[5] Tara N. Sainath and Carolina Parada, "Convolutional neural networks for small-footprint keyword spotting," in Sixteenth Annual Conference of the International Speech Communication Association, 2015.

[6] Yundong Zhang, Naveen Suda, Liangzhen Lai, and Vikas Chandra, "Hello edge: Keyword spotting on microcontrollers," arXiv preprint arXiv:1711.07128, 2017.

[7] Santiago Fernández, Alex Graves, and Jürgen Schmidhuber, "An application of recurrent neural networks to discriminative keyword spotting," in International Conference on Artificial Neural Networks. Springer, 2007, pp. 220–229.

[8] Ming Sun, Anirudh Raju, George Tucker, Sankaran Panchapagesan, Gengshen Fu, Arindam Mandal, Spyros Matsoukas, Nikko Strom, and Shiv Vitaladevuni, "Max-pooling loss training of long short-term memory networks for small-footprint keyword spotting," in Spoken Language Technology Workshop (SLT). IEEE, 2016, pp. 474–480.

[9] Pallavi Baljekar, Jill Fain Lehman, and Rita Singh, "Online word-spotting in continuous speech with recurrent neural networks," in Spoken Language Technology Workshop (SLT). IEEE, 2014, pp. 536–541.
[10] Shuo-Yiin Chang, Bo Li, Gabor Simko, Tara N. Sainath, Anshuman Tripathi, Aäron van den Oord, and Oriol Vinyals, "Temporal modeling using dilated convolution and gating for voice-activity-detection," in Proceedings of ICASSP. IEEE, 2018, pp. 5549–5553.

[11] Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W. Senior, and Koray Kavukcuoglu, "WaveNet: A generative model for raw audio," in SSW, 2016, p. 125.

[12] Raphael Tang and Jimmy Lin, "Deep residual learning for small-footprint keyword spotting," arXiv preprint arXiv:1710.10361, 2017.

[13] David Snyder, Guoguo Chen, and Daniel Povey, "MUSAN: A music, speech, and noise corpus," arXiv preprint arXiv:1510.08484, 2015.

[14] Pete Warden, "Speech Commands: A dataset for limited-vocabulary speech recognition," arXiv preprint arXiv:1804.03209, 2018.

[15] Xavier Glorot and Yoshua Bengio, "Understanding the difficulty of training deep feedforward neural networks," in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 2010, pp. 249–256.