Temporal Convolution for Real-time Keyword Spotting on Mobile Devices

Seungwoo Choi*, Seokjun Seo*, Beomjun Shin*, Hyeongmin Byun, Martin Kersner, Beomsu Kim, Dongyoung Kim†, Sungjoo Ha†
Hyperconnect, Seoul, South Korea
{seungwoo.choi, seokjun.seo, beomjun.shin, hyeongmin.byun}@hpcnt.com
{martin.kersner, beomsu.kim, dongyoung.kim, shurain}@hpcnt.com
(* Equal contributions, listed in alphabetical order. † Shared corresponding authors.)

Abstract

Keyword spotting (KWS) plays a critical role in enabling speech-based user interactions on smart devices. Recent developments in the field of deep learning have led to wide adoption of convolutional neural networks (CNNs) in KWS systems due to their exceptional accuracy and robustness. The main challenge faced by KWS systems is the trade-off between high accuracy and low latency. Unfortunately, there has been little quantitative analysis of the actual latency of KWS models on mobile devices. This is especially concerning since conventional convolution-based KWS approaches are known to require a large number of operations to attain an adequate level of performance. In this paper, we propose a temporal convolution for real-time KWS on mobile devices. Unlike most 2D convolution-based KWS approaches, which require a deep architecture to fully capture both low- and high-frequency domains, we exploit temporal convolutions with a compact ResNet architecture. On the Google Speech Commands Dataset, we achieve a more than 385x speedup on Google Pixel 1 and surpass the accuracy of the state-of-the-art model. In addition, we release the implementation of the proposed and baseline models, including an end-to-end pipeline for training models and evaluating them on mobile devices.

Index Terms: keyword spotting, real-time, convolutional neural network, temporal convolution, mobile device

1. Introduction

Keyword spotting (KWS) aims to detect pre-defined keywords in a stream of audio signals. It is widely used for hands-free control of mobile applications. Since its use commonly concentrates on recognizing wake-up words (e.g., "Hey Siri" [1], "Alexa" [2, 3], and "Okay Google" [4]) or distinguishing common commands (e.g., "yes" or "no") on mobile devices, the response of KWS should be both immediate and accurate. However, it is challenging to implement fast and accurate KWS models that meet the real-time constraint on mobile devices with restricted hardware resources.

Recently, with the success of deep learning in a variety of cognitive tasks, neural-network-based approaches have become popular for KWS [5, 6, 7, 8, 9, 10]. In particular, KWS studies based on convolutional neural networks (CNNs) show remarkable accuracy [6, 7, 8]. Most CNN-based KWS approaches receive features such as mel-frequency cepstral coefficients (MFCC) as a 2D input to a convolutional network. Even though such CNN-based KWS approaches offer reliable accuracy, they demand considerable computation to meet a performance requirement. In addition, inference time on mobile devices has not been analyzed quantitatively; instead, indirect metrics have been used as a proxy for latency. Zhang et al. [7] presented the total number of multiplications and additions performed by the whole network. Tang and Lin [8] reported the number of multiplications of their network as a surrogate for inference speed.
Unfortunately, it has been pointed out that the number of operations, such as additions and multiplications, is only an indirect alternative to a direct metric such as latency [11, 12, 13]. Neglected memory access costs and the varying degrees of operator optimization across platforms are potential sources of this discrepancy. Thus, we focus on the measurement of actual latency on mobile devices.

In this paper, we propose a temporal convolutional neural network for real-time KWS on mobile devices, denoted as TC-ResNet. We apply temporal convolution, i.e., 1D convolution along the temporal dimension, and treat MFCC features as input channels. The proposed model utilizes the advantages of temporal convolution to enhance the accuracy and reduce the latency of mobile models for KWS. Our contributions are as follows:

• We propose TC-ResNet, a fast and accurate convolutional neural network for real-time KWS on mobile devices. According to our experiments on Google Pixel 1, the proposed model shows a 385x speedup and a 0.3%p increase in accuracy compared to the state-of-the-art CNN-based KWS model on the Google Speech Commands Dataset [14].
• We release our models for KWS (source code available at https://github.com/hyperconnect/TC-ResNet) and implementations of the state-of-the-art CNN-based KWS models [6, 7, 8], together with a complete benchmark tool to evaluate the models on mobile devices.
• We empirically demonstrate that temporal convolution is indeed responsible for the reduced computation and increased accuracy compared to 2D convolutions in KWS on mobile devices.

2. Network Architecture

2.1. Temporal Convolution for KWS

Figure 1 is a simplified example illustrating the difference between 2D convolution and temporal convolution for KWS approaches that use MFCC as input data. Assuming that the stride is one and zero padding is applied to match the input and output resolutions, given an input X ∈ R^{w×h×c} and a weight W ∈ R^{k_w×k_h×c×c'}, a 2D convolution outputs Y ∈ R^{w×h×c'}. MFCC is widely used to transform raw audio into a time-frequency representation I ∈ R^{t×f}, where t represents the time axis (x-axis in Figure 1a) and f denotes the feature axis extracted from the frequency domain (y-axis in Figure 1a). Most previous studies [7, 8] use an input tensor X ∈ R^{w×h×c} where w = t, h = f (or vice versa), and c = 1 (X_2d ∈ R^{t×f×1} in Figure 1b).

CNNs are known to perform a successive transformation of low-level features into higher-level concepts. However, since modern CNNs commonly use small kernels, it is difficult to capture informative features from both low and high frequencies with a relatively shallow network (the colored box in Figure 1b covers only a limited range of frequencies). Assuming one naively stacks n convolutional layers of 3×3 weights with a stride of one, the receptive field of the network grows only up to 2n + 1. We can mitigate this problem by increasing the stride or adopting pooling, attention, or recurrent units. However, even with these methods, many models still require a large number of operations and have a hard time running in real time on mobile devices.

In order to implement a fast and accurate model for real-time KWS, we reshape the input from X_2d in Figure 1b to X_1d in Figure 1c. Our main idea is to treat the per-frame MFCC as time-series data, rather than as an intensity or grayscale image, which is a more natural way to interpret audio. We consider I as one-dimensional sequential data whose features at each time frame have dimension f. In other words, rather than transforming I to X_2d ∈ R^{t×f×1}, we set h = 1 and c = f, which results in X_1d ∈ R^{t×1×f}, and feed it as input to a temporal convolution (Figure 1c). The advantages of the proposed method are as follows:

Large receptive field over audio features. In the proposed method, all lower-level features always participate in forming the higher-level features of the next layer. The network thus takes advantage of informative features in lower layers (the colored box in Figure 1c covers the whole range of frequencies), avoiding the need to stack many layers to form higher-level features. This enables us to achieve better performance even with a small number of layers.

Small footprint and low computational complexity. With the proposed method, the feature map shrinks in size if we keep the number of parameters the same, as illustrated in Figures 1b and 1c. Assuming that the conventional 2D convolution, W_2d ∈ R^{3×3×1×c}, and the proposed temporal convolution, W_1d ∈ R^{3×1×f×c'}, have the same number of parameters (i.e., c' = 3c/f), the temporal convolution requires fewer computations than the 2D convolution (② is smaller than ① in Figure 1). In addition, the output feature map (i.e., the input feature map of the next layer) of the temporal convolution, Y_1d ∈ R^{t×1×c'}, is smaller than that of the 2D convolution, Y_2d ∈ R^{t×f×c}. The decrease in feature map size leads to a dramatic reduction of the computational burden and footprint in the following layers, which is key to implementing fast KWS.

[Figure 1: A simplified example illustrating the difference between 2D convolution and temporal convolution. (a) MFCC. (b) 2D convolution for conventional CNN-based KWS approaches. (c) Proposed temporal convolution. Note that the parameters of the conventional 2D convolution and those of the temporal convolution have the same size in this example, with t = 98, f = 40, c = 160, and c' = 12.]
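To make the reshaping concrete, the following is a minimal sketch (our own illustration, not the released implementation) of the two input views in TensorFlow/Keras. The tensor sizes follow the Figure 1 example (t = 98, f = 40, c = 160, c' = 12); the variable names and the use of tf.keras layers are our assumptions.

```python
import tensorflow as tf

t, f = 98, 40                       # time frames and MFCC coefficients (Figure 1 example)
mfcc = tf.random.normal([1, t, f])  # stand-in batch of MFCC features, shape (N, t, f)

# Conventional 2D-convolution view: treat MFCC as a one-channel image of shape (t, f, 1).
x_2d = tf.reshape(mfcc, [-1, t, f, 1])
conv_2d = tf.keras.layers.Conv2D(filters=160, kernel_size=(3, 3),
                                 padding="same", use_bias=False)
y_2d = conv_2d(x_2d)                # shape (N, 98, 40, 160)

# Temporal-convolution view: fold the f MFCC coefficients into channels, shape (t, 1, f).
x_1d = tf.reshape(mfcc, [-1, t, 1, f])
conv_1d = tf.keras.layers.Conv2D(filters=12, kernel_size=(3, 1),
                                 padding="same", use_bias=False)
y_1d = conv_1d(x_1d)                # shape (N, 98, 1, 12)

# Both layers hold the same number of weights (3*3*1*160 = 3*1*40*12 = 1440, i.e. c' = 3c/f),
# but the temporal convolution produces a far smaller feature map for the next layer.
print(conv_2d.count_params(), conv_1d.count_params(), y_2d.shape, y_1d.shape)
```

In this toy setting both layers apply 1,440 multiply-accumulates per output position, but the 2D convolution does so at 98×40 positions while the temporal convolution does so at only 98 positions, roughly a factor of f fewer operations, mirroring the ① versus ② comparison in Figure 1.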
2.2. TC-ResNet Architecture

We adopt ResNet [15], one of the most widely used CNN architectures, but use m×1 kernels (m = 3 for the first layer and m = 9 for the other layers) rather than 3×3 kernels (Figure 2). None of the convolution layers and fully connected layers have biases, and each batch normalization layer [16] has trainable parameters for scaling and shifting. The identity shortcuts can be used directly when the input and the output have matching dimensions (Figure 2a); otherwise, we use an extra conv-BN-ReLU block to match the dimensions (Figure 2b). Tang and Lin [8] also adopted a residual network, but they did not employ temporal convolution and used a conventional 3×3 kernel. In addition, they replaced strided convolutions with dilated convolutions of stride one. Instead, we employ temporal convolutions to increase the effective receptive field and follow the original ResNet implementation for the other layers, adopting strided convolutions and excluding dilated convolutions.

We select TC-ResNet8 (Figure 2c), which has three residual blocks and {16, 24, 32, 48} channels for each layer including the first convolution layer, as our base model. TC-ResNet14 (Figure 2d) expands the network by incorporating twice as many residual blocks as TC-ResNet8.

We introduce a width multiplier [17] (k in Figures 2c and 2d) to increase (or decrease) the number of channels at each layer, thereby providing flexibility in selecting the right model capacity for given constraints. For example, in TC-ResNet8, a width multiplier of 1.5 expands the model to have {24, 36, 48, 72} channels, respectively. We denote such a model by appending a multiplier suffix, e.g., TC-ResNet8-1.5. TC-ResNet14-1.5 is created in the same manner.

[Figure 2: The building block (denoted Block) of TC-ResNet when (a) stride = 1 and (b) stride = 2. (c) Architecture of TC-ResNet8 and (d) TC-ResNet14, which use ResNet8 and ResNet14 as the backbone CNN, respectively. BN and FC denote batch normalization and fully connected layer. 's', 'c', and 'k' indicate stride, channel size, and width multiplier, respectively.]
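As a rough illustration of the block structure described above, here is a sketch of a TC-ResNet residual block in tf.keras. It is our own reconstruction from the description (9×1 bias-free convolutions, batch normalization with scale and shift, and a 1×1 conv-BN-ReLU shortcut when the dimensions change), not the released implementation, and the function name is ours.

```python
import tensorflow as tf
from tensorflow.keras import layers

def tc_resnet_block(x, channels, stride):
    """Residual block sketch following Figure 2: two 9x1 temporal convolutions
    without biases, batch norm after each, and a shortcut path."""
    y = layers.Conv2D(channels, (9, 1), strides=(stride, 1),
                      padding="same", use_bias=False)(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(channels, (9, 1), strides=(1, 1),
                      padding="same", use_bias=False)(y)
    y = layers.BatchNormalization()(y)

    if stride == 1 and x.shape[-1] == channels:
        shortcut = x                               # identity shortcut (Figure 2a)
    else:
        shortcut = layers.Conv2D(channels, (1, 1), strides=(stride, 1),
                                 padding="same", use_bias=False)(x)
        shortcut = layers.BatchNormalization()(shortcut)
        shortcut = layers.ReLU()(shortcut)         # conv-BN-ReLU shortcut (Figure 2b)

    return layers.ReLU()(layers.Add()([y, shortcut]))

# Example: MFCC input reshaped to (time, 1, features), the first 3x1 convolution,
# followed by one stride-2 block with 24 channels (width multiplier k = 1).
inputs = tf.keras.Input(shape=(98, 1, 40))
x = layers.Conv2D(16, (3, 1), padding="same", use_bias=False)(inputs)
x = tc_resnet_block(x, channels=24, stride=2)
model = tf.keras.Model(inputs, x)
```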
3. Experimental Framework

3.1. Experimental Setup

Dataset. We evaluated the proposed models and the baselines [6, 8, 7] on the Google Speech Commands Dataset [14]. The dataset contains 64,727 one-second-long utterance files, each recorded and labeled with one of 30 target categories. Following Google's implementation [14], we distinguish 12 classes: "yes", "no", "up", "down", "left", "right", "on", "off", "stop", "go", silence, and unknown. Using the SHA-1 hashed names of the audio files, we split the dataset into training, validation, and test sets with 80%, 10%, and 10% of the data, respectively.

Data augmentation and preprocessing. We followed Google's preprocessing procedure, which applies random shift and noise injection to the training data. First, to generate background noise, we randomly sample and crop the background noises provided in the dataset and multiply them by a random coefficient sampled from a uniform distribution U(0, 0.1). The audio file is decoded to a float tensor and shifted by s seconds with zero padding, where s is sampled from U(-0.1, 0.1). It is then blended with the background noise. The raw audio is decomposed into a sequence of frames following the settings of a previous study [8], with a window length of 30 ms and a stride of 10 ms for feature extraction. We use 40 MFCC features for each frame and stack them over the time axis.
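A minimal sketch of this augmentation step on a 1-second waveform is shown below, written with NumPy rather than the actual pipeline code. The 16 kHz sample rate, array names, and helper structure are our assumptions; only the U(0, 0.1) noise coefficient and the U(-0.1, 0.1) s shift come from the description above.

```python
import numpy as np

def augment(waveform, noise, sample_rate=16000, rng=np.random.default_rng()):
    """Randomly time-shift a 1-second utterance and blend in background noise,
    following the U(-0.1, 0.1) s shift and U(0, 0.1) noise coefficient above."""
    n = len(waveform)

    # Random shift with zero padding.
    shift = int(rng.uniform(-0.1, 0.1) * sample_rate)
    shifted = np.zeros_like(waveform)
    if shift >= 0:
        shifted[shift:] = waveform[:n - shift]
    else:
        shifted[:n + shift] = waveform[-shift:]

    # Randomly crop a 1-second chunk of background noise and scale it.
    start = rng.integers(0, len(noise) - n + 1)
    coeff = rng.uniform(0.0, 0.1)
    return shifted + coeff * noise[start:start + n]
```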
Training. We trained and evaluated the models using TensorFlow [18]. We use a weight decay of 0.001 and dropout with a probability of 0.5 to alleviate overfitting. Stochastic gradient descent is used with a momentum of 0.9 and a mini-batch size of 100. Models are trained from scratch for 30k iterations. The learning rate starts at 0.1 and is divided by 10 every 10k iterations. We employ early stopping [19] on the validation split.

Evaluation. We use accuracy as the main metric to evaluate how well a model performs. We trained each model 15 times and report its average performance. Receiver operating characteristic (ROC) curves, whose x-axis is the false alarm rate and y-axis is the false reject rate, are plotted to compare different models. To extend the ROC curve to multiple classes, we perform micro-averaging over the classes in each experiment and then vertically average over the experiments for the final plot.
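The micro-averaged false-alarm/false-reject curve described above can be computed roughly as in the sketch below, using scikit-learn (our choice of library, not necessarily the paper's). Class scores and one-hot labels are flattened across all classes before computing a single curve per experiment; curves from repeated runs would then be vertically averaged for the final plot, and the area under the curve corresponds to the AUC values reported in Figure 3.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

def far_frr_curve(y_true_onehot, y_score):
    """Micro-averaged ROC in KWS form: false alarm rate vs. false reject rate.

    y_true_onehot, y_score: arrays of shape (num_examples, num_classes).
    """
    fpr, tpr, _ = roc_curve(y_true_onehot.ravel(), y_score.ravel())
    far = fpr            # false alarm rate (false positives)
    frr = 1.0 - tpr      # false reject rate (false negatives)
    return far, frr, auc(far, frr)

# Toy usage with random scores for a 12-class problem.
rng = np.random.default_rng(0)
labels = np.eye(12)[rng.integers(0, 12, size=1000)]
scores = rng.random((1000, 12))
far, frr, area = far_frr_curve(labels, scores)
```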
We report the number of operations and parameters in a way that faithfully reflects the real-world environment for mobile deployment. Unlike previous works, which reported only part of the computation, such as the number of multiply operations [8] or the number of multiplications and additions in the matrix-multiplication operations alone [7], we report FLOPs [20], computed by the TensorFlow profiling tool [21], and the number of all parameters instead of only the trainable parameters reported by previous studies [8].

Inference speed can be estimated from FLOPs, but it is well known that FLOPs are not always proportional to speed. Therefore, we also measure inference time on a mobile device using the TensorFlow Lite Android benchmark tool [22]. We measured inference time on a Google Pixel 1 and forced the model to execute on a single little core in order to emulate the always-on nature of KWS. The benchmark program measures the inference time 50 times for each model and reports the average. Note that the inference time is measured from the first layer of the model that receives MFCC as input, to focus on the performance of the model itself.

3.2. Baseline Implementations

We carefully selected baselines and verified the advantages of the proposed models in terms of accuracy, number of parameters, FLOPs, and inference time on mobile devices. The baseline models are:

• CNN-1 and CNN-2 [6]. We followed the implementation of [7], where the window size is 40 ms and the stride is 20 ms, using 40 MFCC features. CNN-1 and CNN-2 represent cnn-trad-fpool3 and cnn-one-fstride4 in [6], respectively.
• DS-CNN-S, DS-CNN-M, and DS-CNN-L [7]. DS-CNN utilizes depthwise convolutions and aims to achieve the best accuracy when memory and computation resources are constrained. We followed the implementation of [7], which uses a 40 ms window size with a 20 ms stride and only 10 MFCCs to reduce the number of operations. DS-CNN-S, DS-CNN-M, and DS-CNN-L represent the small-, medium-, and large-size models, respectively.
• Res8, Res8-Narrow, Res15, and Res15-Narrow [8]. The Res variants employ a residual architecture for keyword spotting. The number following Res (e.g., 8 and 15) denotes the number of layers, and the -Narrow suffix indicates a reduced number of channels. Res15 has shown the best accuracy on the Google Speech Commands Dataset among CNN-based KWS studies. The window size is 30 ms, the stride is 10 ms, and the MFCC feature size is 40.

We release our end-to-end pipeline codebase for training, evaluating, and benchmarking the baseline models together with the proposed models. It consists of TensorFlow implementations of the models, scripts to convert the models into TensorFlow Lite models that can run on mobile devices, and the pre-built TensorFlow Lite Android benchmark tool.
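For reference, converting a trained Keras model to a TensorFlow Lite flatbuffer, the format the Android benchmark tool consumes, can look like the sketch below. This uses the current TensorFlow 2 converter API purely as an illustration with a stand-in model; the released pipeline targets TensorFlow/TFLite 1.13-era tooling and its exact conversion scripts may differ.

```python
import tensorflow as tf

# Stand-in model with the MFCC-shaped input (time, 1, features); in practice this
# would be a trained TC-ResNet variant produced by the training pipeline.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(98, 1, 40)),
    tf.keras.layers.Conv2D(12, (3, 1), padding="same", use_bias=False),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(12, activation="softmax"),
])

converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()
with open("tc_resnet_standin.tflite", "wb") as f:
    f.write(tflite_model)

# The resulting .tflite file is what the TensorFlow Lite Android benchmark
# tool [22] consumes when measuring on-device inference time.
```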
4. Experimental Results

4.1. Google Speech Commands Dataset

Table 1 shows the experimental results. Utilizing the advantages of temporal convolutions, we dramatically improve the inference time measured on a mobile device while achieving better accuracy than the baseline KWS models. TC-ResNet8 achieves a 29x speedup and a 5.4%p accuracy improvement compared to CNN-1, and improves accuracy by 11.5%p while maintaining a latency comparable to CNN-2. Since DS-CNN is designed for resource-constrained environments, it shows better accuracy than the naive CNN models without using a large number of computations. However, TC-ResNet8 achieves 1.5x / 4.7x / 15.3x speedups and improves accuracy by 1.7%p / 1.2%p / 0.7%p compared to DS-CNN-S / DS-CNN-M / DS-CNN-L, respectively. In addition, the proposed models show better accuracy and speed than the Res variants, which show the best accuracy among the baselines. TC-ResNet8 achieves a 385x speedup while improving accuracy by 0.3%p compared to the deep and complex Res baseline, Res15. Compared to a slimmer Res baseline, Res8-Narrow, the proposed TC-ResNet8 achieves a 43x speedup while improving accuracy by 6%p. Note that our wider and deeper models (e.g., TC-ResNet8-1.5, TC-ResNet14, and TC-ResNet14-1.5) achieve better accuracy at the expense of inference speed.

Table 1: Comparison of the baseline models and the proposed models. Accuracy numbers marked with * are taken from the corresponding papers. The best result (accuracy and latency) among the different approaches is displayed in bold.

Model | Acc. (%) | Time (ms) | FLOPs | Params
CNN-1 | 90.7* | 32 | 76.1M | 524K
CNN-2 | 84.6* | 1.2 | 1.5M | 148K
DS-CNN-S | 94.4* | 1.6 | 5.4M | 24K
DS-CNN-M | 94.9* | 5.2 | 19.8M | 140K
DS-CNN-L | 95.4* | 16.8 | 56.9M | 420K
Res8-Narrow | 90.1* | 47 | 143.2M | 20K
Res8 | 94.1* | 174 | 795.3M | 111K
Res15-Narrow | 94.0* | 107 | 348.7M | 43K
Res15 | 95.8* | 424 | 1950.0M | 239K
TC-ResNet8 | 96.1 | 1.1 | 3.0M | 66K
TC-ResNet8-1.5 | 96.2 | 2.8 | 6.6M | 145K
TC-ResNet14 | 96.2 | 2.5 | 6.1M | 137K
TC-ResNet14-1.5 | 96.6 | 5.7 | 13.4M | 305K

[Figure 3: ROC curves for selected models with corresponding AUC values: CNN-1 (AUC 5.22e-03), DS-CNN-L (AUC 1.68e-03), Res15 (AUC 1.13e-03), TC-ResNet14-1.5 (AUC 9.02e-04).]

We also plot the ROC curves of the models that achieve the best accuracy among their variants: CNN-1, DS-CNN-L, Res15, and TC-ResNet14-1.5. As presented in Figure 3, TC-ResNet14-1.5 is less likely to miss target keywords than the other baselines, assuming the number of incorrectly detected keywords is the same. A small area under the curve (AUC) means that the model would miss fewer target keywords on average across various false alarm rates. TC-ResNet14-1.5 shows the smallest AUC, which is critical for a good user experience with a KWS system.

4.2. Impact of Temporal Convolution

We have demonstrated that the proposed method effectively improves both accuracy and inference speed compared to the baseline models, which treat the feature map as a 2D image. We further explore the impact of temporal convolution by comparing variants of TC-ResNet8, named 2D-ResNet8 and 2D-ResNet8-Pool, which adopt a similar network architecture and the same number of parameters but use 2D convolutions.

Table 2: Comparison of the TC-ResNet variants 2D-ResNet8 and 2D-ResNet8-Pool, which use 2D convolutions while retaining the architecture and the number of parameters of TC-ResNet8.

Model | Acc. (%) | Time (ms) | FLOPs | Params
2D-ResNet8 | 96.1 | 10.1 | 35.8M | 66K
2D-ResNet8-Pool | 94.9 | 3.5 | 4.0M | 66K

We designed 2D-ResNet8, whose architecture is identical to TC-ResNet8 except for the use of 3×3 2D convolutions. 2D-ResNet8 (Table 2) shows comparable accuracy but is 9.2x slower than TC-ResNet8 (Table 1). TC-ResNet8-1.5 surpasses 2D-ResNet8 while using fewer computational resources.

We also demonstrate that the use of temporal convolution is superior to other methods of reducing the number of operations in CNNs, such as applying a pooling layer. In order to reduce the number of operations while minimizing the accuracy loss, CNN-1, Res8, and Res8-Narrow adopt average pooling at an early stage, specifically right after the first convolution layer. We inserted an average pooling layer, with both window size and stride set to 4, after the first convolution layer of 2D-ResNet8 and named the result 2D-ResNet8-Pool. 2D-ResNet8-Pool improves inference time with the same number of parameters; however, it loses 1.2%p accuracy and is still 3.2x slower than TC-ResNet8.

5. Related Works

Recently, there has been wide adoption of CNNs in KWS. Sainath et al. [6] proposed small-footprint CNN models for KWS. Zhang et al. [7] searched for and evaluated proper neural network architectures within memory and computation constraints. Tang and Lin [8] exploited a residual architecture and dilated convolutions to achieve further improvements in accuracy while preserving compact models. In previous studies [6, 7, 8], it has been common to use 2D convolutions on inputs with time-frequency representations. However, there has been an increase in the use of 1D convolutions in the acoustics and speech domain [23, 24]. Unlike previous studies [23, 24], our work applies 1D convolution along the temporal axis of time-frequency representations instead of convolving along the frequency axis or processing raw audio signals.

6. Conclusion

In this investigation, we aimed to implement fast and accurate models for real-time KWS on mobile devices. We measured inference speed on a mobile device, Google Pixel 1, and provided a quantitative analysis of conventional convolution-based KWS models and our models based on temporal convolutions. Our proposed model achieved a 385x speedup while improving accuracy by 0.3%p compared to the state-of-the-art model. Through an ablation study, we demonstrated that temporal convolution is indeed responsible for the dramatic speedup while improving the accuracy of the model. Further studies analyzing the efficacy of temporal convolutions for a diverse set of network architectures would be worthwhile.

7. References
[1] S. Sigtia, R. Haynes, H. Richards, E. Marchi, and J. Bridle, "Efficient voice trigger detection for low resource hardware," in Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), 2018.
[2] M. Sun, D. Snyder, Y. Gao, V. Nagaraja, M. Rodehorst, S. Panchapagesan, N. Strom, S. Matsoukas, and S. Vitaladevuni, "Compressed time delay neural network for small-footprint keyword spotting," in Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), 2017.
[3] G. Tucker, M. Wu, M. Sun, S. Panchapagesan, G. Fu, and S. Vitaladevuni, "Model compression applied to small-footprint keyword spotting," in Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), 2016.
[4] G. Chen, C. Parada, and G. Heigold, "Small-footprint keyword spotting using deep neural networks," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014.
[5] Z. Wang, X. Li, and J. Zhou, "Small-footprint keyword spotting using deep neural network and connectionist temporal classifier," arXiv preprint arXiv:1709.03665, 2017.
[6] T. N. Sainath and C. Parada, "Convolutional neural networks for small-footprint keyword spotting," in Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), 2015.
[7] Y. Zhang, N. Suda, L. Lai, and V. Chandra, "Hello edge: Keyword spotting on microcontrollers," arXiv preprint, 2017.
[8] R. Tang and J. Lin, "Deep residual learning for small-footprint keyword spotting," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018.
[9] D. C. de Andrade, S. Leo, M. L. D. S. Viana, and C. Bernkopf, "A neural attention model for speech command recognition," arXiv preprint arXiv:1808.08929, 2018.
[10] S. Ö. Arik, M. Kliegl, R. Child, J. Hestness, A. Gibiansky, C. Fougner, R. Prenger, and A. Coates, "Convolutional recurrent neural networks for small-footprint keyword spotting," in Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), 2017.
[11] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, "MobileNetV2: Inverted residuals and linear bottlenecks," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[12] M. Tan, B. Chen, R. Pang, V. Vasudevan, and Q. V. Le, "MnasNet: Platform-aware neural architecture search for mobile," arXiv preprint arXiv:1807.11626, 2018.
[13] N. Ma, X. Zhang, H.-T. Zheng, and J. Sun, "ShuffleNet V2: Practical guidelines for efficient CNN architecture design," in Proceedings of the European Conference on Computer Vision (ECCV), 2018.
[14] P. Warden. (2017, August) Launching the Speech Commands Dataset. [Online]. Available: https://ai.googleblog.com/2017/08/launching-speech-commands-dataset.html
[15] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[16] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in Proceedings of the International Conference on Machine Learning (ICML), 2015.
[17] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, "MobileNets: Efficient convolutional neural networks for mobile vision applications," arXiv preprint arXiv:1704.04861, 2017.
[18] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al., "TensorFlow: A system for large-scale machine learning," in Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2016.
[19] L. Prechelt, "Early stopping - but when?" in Neural Networks: Tricks of the Trade. Springer, 1998, pp. 55-69.
[20] S. Arik, H. Jun, and G. Diamos, "Fast spectrogram inversion using multi-head convolutional neural networks," arXiv preprint arXiv:1808.06719, 2018.
[21] TensorFlow Profiler and Advisor. [Online]. Available: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/profiler/README.md
[22] TFLite Model Benchmark Tool. [Online]. Available: https://github.com/tensorflow/tensorflow/tree/r1.13/tensorflow/lite/tools/benchmark/
[23] H. Lim, J. Park, K. Lee, and Y. Han, "Rare sound event detection using 1D convolutional recurrent neural networks," in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop, 2017.
[24] K. Choi, G. Fazekas, M. Sandler, and K. Cho, "Convolutional recurrent neural networks for music classification," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017.
