Interpretable Convolutional Filters with SincNet
Authors: Mirco Ravanelli (Mila, Université de Montréal), Yoshua Bengio (Mila, Université de Montréal, CIFAR Fellow)
Abstract

Deep learning is currently playing a crucial role toward higher levels of artificial intelligence. This paradigm allows neural networks to learn complex and abstract representations that are progressively obtained by combining simpler ones. Nevertheless, the internal "black-box" representations automatically discovered by current neural architectures often suffer from a lack of interpretability, making the study of explainable machine learning techniques of primary interest. This paper summarizes our recent efforts to develop a more interpretable neural model for directly processing speech from the raw waveform. In particular, we propose SincNet, a novel Convolutional Neural Network (CNN) that encourages the first layer to discover more meaningful filters by exploiting parametrized sinc functions. In contrast to standard CNNs, which learn all the elements of each filter, only the low and high cutoff frequencies of band-pass filters are directly learned from data. This inductive bias offers a very compact way to derive a customized filter-bank front-end that only depends on a few parameters with a clear physical meaning. Our experiments, conducted on both speaker and speech recognition, show that the proposed architecture converges faster, performs better, and is more interpretable than standard CNNs.

1 Introduction

Deep learning has recently contributed to achieving unprecedented performance levels in numerous tasks, mainly thanks to the progressive maturation of supervised learning techniques [1]. The increased discrimination power of modern neural networks, however, is often obtained at the cost of a reduced interpretability of the model.
Modern end-to-end systems, whose popularity is increasing in many fields such as speech recognition [2, 3, 4], often discover "black-box" internal representations that make sense for the machine but are arguably difficult to interpret by humans. The remarkable sensitivity of current neural networks to adversarial examples [5], for instance, not only highlights how superficial the discovered representations can be but also raises crucial concerns about our ability to really interpret neural models. Such a lack of interpretability can be a major bottleneck for the development of future deep learning techniques. Having more meaningful insights into the logic behind network predictions and errors, in fact, can help us to better trust, understand, and diagnose our models, eventually guiding our efforts toward more robust deep learning. In recent years, growing interest has thus been devoted to the development of interpretable machine learning [6, 7], as witnessed by the numerous works in the field, ranging from visualization [8, 9] and diagnosis of DNNs [10] to explanatory graphs [11] and explainable models [12], just to name a few.

Interpretability is a major concern for audio and speech applications as well [13]. CNNs and Recurrent Neural Networks (RNNs) are the most popular architectures currently used in speech and speaker recognition [2]. RNNs can be employed to capture the temporal evolution of the speech signal [14, 15, 16, 17], while CNNs, thanks to their weight sharing, local filters, and pooling, are normally employed to extract robust and invariant representations [18].

32nd Conference on Neural Information Processing Systems (NIPS 2018) IRASL workshop, Montréal, Canada.

Even though standard hand-crafted features such as FBANK and Mel-Frequency Cepstral Coefficients (MFCC) are still
employed in many state-of-the-art systems [19, 20, 21], directly feeding a CNN with spectrogram bins [22, 23, 24] or even with raw audio samples [25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37] is an approach of increasing popularity. Engineered features, in fact, were originally designed from perceptual evidence, and there is no guarantee that such representations are optimal for all speech-related tasks. Standard features, for instance, smooth the speech spectrum, possibly hindering the extraction of crucial narrow-band speaker characteristics such as pitch and formants. Conversely, directly processing the raw waveform allows the network to learn low-level representations that are possibly more customized to each specific task. The downside of raw speech processing lies in the possible lack of interpretability of the filter bank learned in the first convolutional layer. In our view, this layer is arguably the most critical part of current waveform-based CNNs. It deals with high-dimensional inputs and is also more affected by vanishing gradient problems, especially when employing very deep architectures. As will be discussed in this paper, the filters learned by CNNs often take noisy and incongruous multi-band shapes, especially when few training samples are available. These filters certainly make some sense for the neural network, but they do not appeal to human intuition, nor do they appear to lead to an efficient representation of the speech signal.

To help CNNs discover more meaningful filters, this work proposes adding some constraints on their shape. Compared to standard CNNs, where the filter-bank characteristics depend on several parameters (each element of the filter vector is directly learned), SincNet convolves the waveform with a set of parametrized sinc functions that implement band-pass filters [38].
The low and high cutoff frequencies are the only parameters of the filter learned from data. This solution still offers considerable flexibility but forces the network to focus on high-level tunable parameters that have a clear physical meaning.

Our experimental validation considers both speaker and speech recognition tasks. Speaker recognition is carried out on the TIMIT [39] and Librispeech [40] datasets under challenging but realistic conditions, characterized by minimal training data (i.e., 12-15 seconds for each speaker) and short test sentences (lasting from 2 to 6 seconds). With the purpose of validating SincNet in both clean and noisy conditions, speech recognition experiments are conducted on both the TIMIT and DIRHA datasets [41, 42]. Results show that the proposed SincNet converges faster, achieves better performance, and is more interpretable than a more standard CNN.

The remainder of the paper is organized as follows. The SincNet architecture is described in Sec. 2. Sec. 3 discusses the relation to prior work. The experimental activity on both speaker and speech recognition is outlined in Sec. 4. Finally, Sec. 5 discusses our conclusions.

2 The SincNet Architecture

The first layer of a standard CNN performs a set of time-domain convolutions between the input waveform and some Finite Impulse Response (FIR) filters [43]. Each convolution is defined as follows¹:

y[n] = x[n] * h[n] = Σ_{l=0}^{L-1} x[l] · h[n−l]    (1)

where x[n] is a chunk of the speech signal, h[n] is the filter of length L, and y[n] is the filtered output. In standard CNNs, all the L elements (taps) of each filter are learned from data. Conversely, the proposed SincNet (depicted in Fig.
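As a quick sanity check of Eq. (1), the direct-form FIR convolution can be sketched in a few lines of NumPy (a minimal illustration, not part of the original paper; the function name is ours):

```python
import numpy as np

def fir_convolve(x, h):
    """Direct-form FIR filtering: y[n] = sum_l x[l] * h[n-l] (Eq. 1)."""
    L = len(h)
    y = np.zeros(len(x))
    for n in range(len(x)):
        # only indices with 0 <= l <= n and 0 <= n-l <= L-1 contribute
        for l in range(max(0, n - L + 1), n + 1):
            y[n] += x[l] * h[n - l]
    return y

x = np.array([1.0, 2.0, 3.0, 4.0])
h = np.array([0.5, 0.5])   # simple 2-tap moving-average filter
y = fir_convolve(x, h)
# agrees with np.convolve(x, h) truncated to len(x) samples
```

In a CNN first layer, h would be a learned weight vector; SincNet replaces it with the parametrized function g of Eq. (2) below.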
1) performs the convolution with a predefined function g that depends on only a few learnable parameters θ, as highlighted in the following equation:

y[n] = x[n] * g[n, θ]    (2)

A reasonable choice, inspired by standard filtering in digital signal processing, is to define g such that a filter-bank composed of rectangular bandpass filters is employed. In the frequency domain, the magnitude of a generic bandpass filter can be written as the difference between two low-pass filters:

G[f, f1, f2] = rect(f / (2 f2)) − rect(f / (2 f1)),    (3)

where f1 and f2 are the learned low and high cutoff frequencies, and rect(·) is the rectangular function in the magnitude frequency domain². After returning to the time domain (using the inverse Fourier transform [43]), the reference function g becomes:

g[n, f1, f2] = 2 f2 sinc(2π f2 n) − 2 f1 sinc(2π f1 n),    (4)

where the sinc function is defined as sinc(x) = sin(x)/x. The cutoff frequencies can be initialized randomly in the range [0, fs/2], where fs represents the sampling frequency of the input signal. As an alternative, filters can be initialized with the cutoff frequencies of the mel-scale filter-bank, which has the advantage of directly allocating more filters in the lower part of the spectrum, where crucial speech information is located.

¹ Most deep learning toolkits actually compute correlation rather than convolution. The obtained flipped (mirrored) filters do not affect the results.

[Figure 1: Architecture of SincNet. The speech waveform is processed by the sinc-based convolution, followed by pooling, layer normalization, leaky ReLU, dropout, further CNN/DNN layers, and a softmax speaker classifier.]
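A minimal sketch of the band-pass construction in Eq. (4) in plain NumPy (function names are ours; the released SincNet code implements this as a differentiable PyTorch layer):

```python
import numpy as np

def sinc(x):
    # sinc(x) = sin(x)/x, with the removable singularity sinc(0) = 1
    return np.where(x == 0, 1.0, np.sin(x) / np.where(x == 0, 1.0, x))

def sinc_bandpass(f1, f2, L, fs):
    """Time-domain band-pass filter of Eq. (4).

    f1, f2: low/high cutoffs in Hz; L: filter length (odd); fs: sampling rate.
    Cutoffs are normalized by fs so that n is measured in samples.
    """
    n = np.arange(-(L // 2), L // 2 + 1)   # symmetric support around n = 0
    f1n, f2n = f1 / fs, f2 / fs            # normalized cutoff frequencies
    return 2 * f2n * sinc(2 * np.pi * f2n * n) - 2 * f1n * sinc(2 * np.pi * f1n * n)

h = sinc_bandpass(300.0, 3000.0, 101, 16000)
# |FFT(h)| is close to 1 inside [300, 3000] Hz and close to 0 far outside,
# with ripples due to truncation (mitigated by windowing, Eq. 7)
```

Note the filter is symmetric around its center tap, which is the linear-phase property discussed below.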
To ensure f1 ≥ 0 and f2 ≥ f1, the previous equation is actually fed with the following parameters:

f1^abs = |f1|    (5)
f2^abs = f1^abs + |f2 − f1|    (6)

Note that no bound has been imposed to force f2 to be smaller than the Nyquist frequency, since we observed that this constraint is naturally fulfilled during training. Moreover, the gain of each filter is not learned at this level. This parameter is managed by the subsequent layers, which can easily attribute more or less importance to each filter output.

An ideal bandpass filter (i.e., a filter whose passband is perfectly flat and whose stopband attenuation is infinite) requires an infinite number of elements L. Any truncation of g thus inevitably leads to an approximation of the ideal filter, characterized by ripples in the passband and limited attenuation in the stopband. A popular solution to mitigate this issue is windowing [43]. Windowing is performed by multiplying the truncated function g with a window function w, which aims to smooth out the abrupt discontinuities at the ends of g:

g_w[n, f1, f2] = g[n, f1, f2] · w[n].    (7)

This paper uses the popular Hamming window [44], defined as follows:

w[n] = 0.54 − 0.46 · cos(2πn / L).    (8)

The Hamming window is particularly suitable for achieving high frequency selectivity [44].

² The phase of the rect(·) function is considered to be linear.

[Figure 2: Examples of filters learned by a standard CNN and by the proposed SincNet (using the Librispeech corpus on a speaker-id task). The first row reports the filters in the time domain, while the second one shows their magnitude frequency response.]
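The cutoff reparameterization of Eqs. (5)-(6) and the Hamming window of Eq. (8) can be sketched as follows (an illustrative NumPy sketch with our own function names):

```python
import numpy as np

def constrain_cutoffs(f1, f2):
    """Map unconstrained parameters to valid cutoffs (Eqs. 5-6):
    f1_abs >= 0 and f2_abs >= f1_abs hold by construction."""
    f1_abs = abs(f1)
    f2_abs = f1_abs + abs(f2 - f1)
    return f1_abs, f2_abs

def hamming(L):
    """Hamming window of Eq. (8)."""
    n = np.arange(L)
    return 0.54 - 0.46 * np.cos(2 * np.pi * n / L)

# Even a degenerate parameter pair (negative low cutoff, swapped order)
# is mapped to a valid band:
f1, f2 = constrain_cutoffs(-200.0, 50.0)
# f1 = 200.0, f2 = 200.0 + 250.0 = 450.0
```

Because the mapping uses only absolute values and a sum, gradients can flow through it, so the unconstrained cutoffs remain trainable by SGD.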
However, results not reported here reveal no significant performance difference when adopting other functions, such as Hann, Blackman, and Kaiser windows. Note also that the filters g are symmetric and thus do not introduce any phase distortion. Due to the symmetry, the filters can be computed efficiently by considering one side of the filter and inheriting the results for the other half.

All operations involved in SincNet are fully differentiable, and the cutoff frequencies of the filters can be jointly optimized with the other CNN parameters using Stochastic Gradient Descent (SGD) or other gradient-based optimization routines. As shown in Fig. 1, a standard CNN pipeline (pooling, normalization, activations, dropout) can be employed after the first sinc-based convolution. Multiple standard convolutional, fully-connected, or recurrent layers [15, 16, 17, 45] can then be stacked together to finally perform a classification with a softmax classifier.

Fig. 2 shows some examples of filters learned by a standard CNN and by the proposed SincNet for a speaker identification task trained on Librispeech (the frequency response is plotted between 0 and 4 kHz). As observed in the figures, the standard CNN does not always learn filters with a well-defined frequency response. In some cases, the frequency response looks noisy (see the first CNN filter), while in others it assumes multi-band shapes (see the third CNN filter). SincNet, instead, is specifically designed to implement rectangular bandpass filters, leading to a more meaningful filter-bank.

2.1 Model properties

The proposed SincNet has some remarkable properties:

• Fast Convergence: SincNet forces the network to focus only on the filter parameters with a major impact on performance. The proposed approach actually implements a natural inductive bias, utilizing knowledge about the filter shape (similar to feature extraction methods generally deployed on this task) while retaining flexibility to adapt to data.
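The symmetry shortcut mentioned above, computing one side of the filter and mirroring it, can be illustrated as (a sketch; the helper name is ours):

```python
import numpy as np

def symmetric_filter(half):
    """Build a linear-phase (symmetric) FIR filter from one half.

    `half` holds the taps from the center outward (center tap first);
    the other side is obtained by mirroring, so only about half the
    taps ever need to be computed.
    """
    return np.concatenate([half[:0:-1], half])  # mirror all but the center tap

center_to_edge = np.array([1.0, 0.6, 0.2])  # taps at offsets 0, 1, 2 from center
h = symmetric_filter(center_to_edge)
# h = [0.2, 0.6, 1.0, 0.6, 0.2], symmetric around the center
```

Symmetric (linear-phase) filters delay all frequencies equally, which is why SincNet introduces no phase distortion.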
This prior knowledge makes learning the filter characteristics much easier, helping SincNet to converge significantly faster to a better solution. Fig. 3 shows the learning curves of SincNet and CNN obtained on a speaker-id task. These results are achieved on the TIMIT dataset and highlight a faster decrease of the Frame Error Rate (FER%) when SincNet is used. Moreover, SincNet converges to better performance, leading to a FER of 33.0% against a FER of 37.7% achieved with the CNN baseline.

[Figure 3: Frame Error Rate (%) obtained on speaker-id with the TIMIT corpus (using held-out data).]

[Figure 4: Cumulative frequency response of SincNet and CNN filters on speaker-id.]

• Few Parameters: SincNet drastically reduces the number of parameters in the first convolutional layer. For instance, if we consider a layer composed of F filters of length L, a standard CNN employs F · L parameters, against the 2F considered by SincNet. If F = 80 and L = 100, we employ 8k parameters for the CNN and only 160 for SincNet. Moreover, if we double the filter length L, a standard CNN doubles its parameter count (e.g., we go from 8k to 16k), while SincNet's parameter count is unchanged (only two parameters are employed for each filter, regardless of its length L). This offers the possibility to derive very selective filters with many taps without actually adding parameters to the optimization problem. Moreover, the compactness of the SincNet architecture makes it suitable in the few-sample regime.

• Interpretability: The SincNet feature maps obtained in the first convolutional layer are definitely more interpretable and human-readable than those of other approaches. The filter bank, in fact, only depends on parameters with a clear physical meaning. Fig.
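The parameter-count comparison above is simple arithmetic, sketched here for concreteness (the helper name is ours):

```python
def first_layer_params(F, L, sincnet=False):
    """Parameters in the first convolutional layer:
    a standard CNN learns all F*L taps; SincNet learns 2 cutoffs per filter."""
    return 2 * F if sincnet else F * L

F, L = 80, 100
cnn = first_layer_params(F, L)                # 80 * 100 = 8000
sinc = first_layer_params(F, L, True)         # 2 * 80   = 160
cnn_2L = first_layer_params(F, 2 * L)         # doubling L doubles the CNN: 16000
sinc_2L = first_layer_params(F, 2 * L, True)  # SincNet unchanged: 160
```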
4, for instance, shows the cumulative frequency response of the filters learned by SincNet and CNN on a speaker-id task. The cumulative frequency response is obtained by summing up all the discovered filters and is useful to highlight which frequency bands are covered by the learned filters. Interestingly, there are three main peaks that clearly stand out from the SincNet plot (see the red line in the figure). The first one corresponds to the pitch region (the average pitch is 133 Hz for a male and 234 Hz for a female). The second peak (approximately located at 500 Hz) mainly captures first formants, whose average value over the various English vowels is indeed 500 Hz. Finally, the third peak (ranging from 900 to 1400 Hz) captures some important second formants, such as the second formant of the vowel /a/, which is located on average at 1100 Hz. This filter-bank configuration indicates that SincNet has successfully adapted its characteristics to address speaker identification. Conversely, the standard CNN does not exhibit such a meaningful pattern: the CNN filters tend to correctly focus on the lower part of the spectrum, but peaks tuned on first and second formants do not clearly appear. As one can observe from Fig. 4, the CNN curve stands above the SincNet one. SincNet, in fact, learns filters that are, on average, more selective than the CNN ones, possibly better capturing narrow-band speaker cues.

Fig. 5 shows the cumulative frequency response of a CNN and SincNet obtained on a noisy speech recognition task. In this experiment, we have artificially corrupted TIMIT with a significant quantity of noise in the band between 2.0 and 2.5 kHz (see the spectrogram) and we have analyzed how fast the two architectures learn to avoid such a useless band.
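The cumulative frequency response used in this analysis can be computed straightforwardly: sum the magnitude responses of all filters in the bank and normalize. A NumPy sketch (variable and function names are ours):

```python
import numpy as np

def cumulative_response(filters, n_fft=512):
    """Sum of magnitude frequency responses over a bank of FIR filters,
    normalized to a maximum of 1 (as in the normalized plots of Fig. 4)."""
    mags = np.abs(np.fft.rfft(filters, n=n_fft, axis=1))  # shape (F, n_fft//2 + 1)
    total = mags.sum(axis=0)
    return total / total.max()

# toy bank of 8 random 3-tap filters, just to exercise the function
rng = np.random.default_rng(0)
bank = rng.standard_normal((8, 3))
resp = cumulative_response(bank)
# resp peaks where the bank concentrates its energy
```

Applied to a learned filter bank, peaks in `resp` reveal the frequency bands the network cares about, which is how the pitch/formant peaks above were identified.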
The second row of sub-figures compares the CNN and the SincNet at a very early training stage (i.e., after having processed only one hour of speech in the first epoch), while the last row shows the cumulative frequency responses after completing the training. From the figures, it emerges that both CNN and SincNet have correctly learned to avoid the corrupted band by the end of training, as highlighted by the holes between 2.0 and 2.5 kHz in the cumulative frequency responses. SincNet, however, learns to avoid such a noisy band much earlier. In the second row of sub-figures, in fact, SincNet shows a visible valley in the cumulative spectrum even after processing only one hour of speech, while the CNN has only learned to give more importance to the lower part of the spectrum.

[Figure 5: Cumulative frequency responses obtained on a speech recognition task trained with a noisy version of TIMIT. As shown in the spectrogram, noise has been artificially added into the band 2.0-2.5 kHz. Both the CNN and SincNet learn to avoid the noisy band, but SincNet learns it much faster, after processing only one hour of speech.]

3 Related Work

Several works have recently explored the use of low-level speech representations to process audio and speech with CNNs. Most prior attempts exploit magnitude spectrogram features [22, 23, 24, 46, 47, 48].
Although spectrograms retain more information than standard hand-crafted features, their design still requires careful tuning of some crucial hyper-parameters, such as the duration, overlap, and typology of the frame window, as well as the number of frequency bins. For this reason, a more recent trend is to learn directly from raw waveforms, thus completely avoiding any feature extraction step. This approach has shown promise in speech [25, 26, 27, 28, 29], including emotion tasks [30], speaker recognition [35], spoofing detection [34], and speech synthesis [31, 32].

Similar to SincNet, some previous works have proposed adding constraints on the CNN filters, for instance forcing them to work on specific bands [46, 47]. Differently from the proposed approach, the latter works operate on spectrogram features and still learn all the L elements of the CNN filters. An idea related to the proposed method has recently been explored in [48], where a set of parameterized Gaussian filters is employed. This approach operates on the spectrogram domain, while SincNet directly considers the raw waveform in the time domain. Similarly to our work, in [49] the convolutional filters are initialized with a predefined filter shape. However, rather than focusing on cutoff frequencies only, all the basic taps of the FIR filters are still learned.

Some valuable works have recently proposed theoretical and experimental frameworks to analyze CNNs [50, 51]. In particular, [52, 35, 53] feed a standard CNN with raw audio samples and analyze the filters learned in the first layer on both speech recognition and speaker identification tasks. The authors highlight some interesting properties that emerged from analyzing the cumulative frequency response and propose a spectral dictionary interpretation of the learned filters.
Similarly to our findings, the latter works noticed that the filters tend to focus more on the lower part of the spectrum and can sometimes highlight peaks that likely correspond to the fundamental frequency. In this work, we argue that all of these interesting properties can be observed more clearly and at an earlier training stage with SincNet.

This paper extends our previous studies on SincNet [38]. To the best of our knowledge, this paper is the first to show the effectiveness of the proposed SincNet in a speech recognition application. Moreover, this work not only considers standard close-talking speech recognition, but also extends the validation of SincNet to distant-talking speech recognition [54, 55, 56].

4 Results

The proposed SincNet has been evaluated on both speech and speaker recognition using different corpora. This work considers a challenging but realistic speaker recognition scenario: for all the adopted corpora, we only employed 12-15 seconds of training material for each speaker, and we tested the system performance on short sentences lasting from 2 to 6 seconds. In the spirit of reproducible research, we release the code of SincNet for speaker identification³ and speech recognition⁴ (under the PyTorch-Kaldi project [57]). More details on the adopted datasets as well as on the SincNet and baseline setups can be found in the appendix.

4.1 Speaker Recognition

Table 1 reports the Classification Error Rates (CER%) achieved on a speaker-id task. The table shows that SincNet outperforms the other systems on both the TIMIT (462 speakers) and Librispeech (2484 speakers) datasets. The gap with a standard CNN fed by the raw waveform is larger on TIMIT, confirming the effectiveness of SincNet when few training data are available. Although this gap is reduced when Librispeech is used, we still observe a 4% relative improvement that is also obtained with faster convergence (1200 vs. 1800 epochs).
Standard FBANKs provide results comparable to SincNet only on TIMIT, but are significantly worse than our architecture when using Librispeech. With few training data, the network cannot discover filters that are much better than FBANKs, but with more data a customized filter-bank is learned and exploited to improve the performance.

Table 2 extends our validation to speaker verification, reporting the Equal Error Rate (EER%) achieved with Librispeech. All DNN models show promising performance, leading to an EER lower than 1% in all cases. The table also highlights that SincNet outperforms the other models, showing a relative performance improvement of about 11% over the standard CNN model. Note that the speaker verification system is derived from the speaker-id neural network using the d-vector technique. The d-vector [19, 24] is extracted from the last hidden layer of the speaker-id network. A speaker-dependent d-vector is computed and stored for each enrollment speaker by performing an L2 normalization and averaging all the d-vectors of the different speech chunks. The cosine distance between enrollment and test d-vectors is then calculated, and a threshold is applied to it to reject or accept the speaker. Ten utterances from impostors were randomly selected for each sentence coming from a genuine speaker.

³ https://github.com/mravanelli/SincNet/
⁴ https://github.com/mravanelli/pytorch-kaldi/

               TIMIT   Librispeech
DNN-MFCC        0.99      2.02
CNN-FBANK       0.86      1.55
CNN-Raw         1.65      1.00
SincNet         0.85      0.96

Table 1: Classification Error Rate (CER%) of speaker identification systems trained on the TIMIT (462 spks) and Librispeech (2484 spks) datasets. SincNet outperforms the competing alternatives.

               EER(%)
DNN-MFCC        0.88
CNN-FBANK       0.60
CNN-Raw         0.58
SincNet         0.51

Table 2: Speaker verification Equal Error Rate (EER%) on the Librispeech dataset for different systems. SincNet outperforms the competing alternatives.
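The d-vector scoring procedure described above (L2-normalize each chunk's d-vector, average them into a speaker model, then threshold a cosine similarity) can be sketched as follows (an illustrative sketch with our own function names and a made-up threshold, not the released code):

```python
import numpy as np

def enroll(chunk_dvectors):
    """Speaker model: L2-normalize each chunk d-vector, then average."""
    normed = [d / np.linalg.norm(d) for d in chunk_dvectors]
    return np.mean(normed, axis=0)

def verify(enrolled, test_dvector, threshold=0.7):
    """Accept the claimed identity if the cosine similarity is high enough."""
    cos = np.dot(enrolled, test_dvector) / (
        np.linalg.norm(enrolled) * np.linalg.norm(test_dvector))
    return bool(cos >= threshold)

# toy 2-D "d-vectors" from two enrollment chunks of the same speaker
model = enroll([np.array([1.0, 0.0]), np.array([0.9, 0.1])])
accept = verify(model, np.array([1.0, 0.05]))   # similar direction: accepted
reject = verify(model, np.array([0.0, 1.0]))    # orthogonal direction: rejected
```

In practice the threshold is tuned on a development set; the EER in Table 2 is the operating point where false acceptances equal false rejections.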
To assess our approach on a standard open-set speaker verification task, all the enrollment and test utterances were taken from a speaker pool different from that used for training the speaker-id DNN. For the sake of completeness, experiments have also been conducted with standard i-vectors. Although a detailed comparison with this technology is beyond the scope of this paper, it is worth noting that our best i-vector system achieves an EER of 1.1%, rather far from what is achieved with the DNN systems. It is well known in the literature that i-vectors provide competitive performance when more training material is available for each speaker and when longer test sentences are employed [58, 59, 60]. Under the challenging conditions faced in this work, neural networks achieve better generalization.

4.2 Speech Recognition

Table 3 reports the speech recognition performance obtained by CNN and SincNet using the TIMIT and DIRHA datasets [41]. To ensure a more accurate comparison between the architectures, five experiments varying the initialization seeds were conducted for each model and corpus. Table 3 thus reports the average speech recognition performance. Standard deviations, not reported here, range between 0.15 and 0.2 for all the experiments.

                        TIMIT   DIRHA
CNN-FBANK                18.3    40.1
CNN-Raw waveform         18.1    40.0
SincNet-Raw waveform     17.2    37.2

Table 3: Speech recognition performance obtained on the TIMIT and DIRHA datasets. For all the datasets, SincNet outperforms CNNs trained on both standard FBANK and raw waveform inputs.

The latter result confirms the effectiveness of SincNet not only in close-talking scenarios but also in challenging noisy conditions characterized by the presence of both noise and reverberation. As emerged in Sec. 2, SincNet is able to effectively tune its filter-bank front-end to better address the characteristics of the noise.
5 Conclusions and Future Work

This paper proposed SincNet, a neural architecture for directly processing waveform audio. Our model, inspired by the way filtering is conducted in digital signal processing, imposes constraints on the filter shapes through an efficient parameterization. SincNet has been extensively evaluated on challenging speaker and speech recognition tasks, consistently showing performance benefits. Beyond performance improvements, SincNet also significantly improves convergence speed over a standard CNN, is more computationally efficient thanks to the exploitation of filter symmetry, and is more interpretable than standard black-box models. Analysis of the SincNet filters, in fact, revealed that the learned filter-bank is tuned to the specific task addressed by the neural network.

In future work, we would like to evaluate SincNet on other popular speaker recognition tasks, such as VoxCeleb. Inspired by the promising results obtained in this paper, we will explore the use of SincNet for supervised and unsupervised speaker/environmental adaptation. Moreover, although this study targeted speaker and speech recognition only, we believe that the proposed approach defines a general paradigm to process time-series and can be applied in numerous other fields.

Acknowledgement

This research was enabled in part by support provided by Calcul Québec and Compute Canada.

References

[1] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016.
[2] D. Yu and L. Deng. Automatic Speech Recognition - A Deep Learning Approach. Springer, 2015.
[3] D. Bahdanau, J. Chorowski, D. Serdyuk, P. Brakel, and Y. Bengio. End-to-end attention-based large vocabulary speech recognition. In Proc. of ICASSP, pages 4945-4949, 2016.
[4] A. Graves and N. Jaitly. Towards end-to-end speech recognition with recurrent neural networks. In Proc. of ICML, pages 1764-1772, 2014.
[5] I. Goodfellow, J. Shlens, and C.
Szegedy. Explaining and harnessing adversarial examples. In Proc. of ICLR, 2015.
[6] C. Molnar. Interpretable Machine Learning: A Guide for Making Black Box Models Explainable. Leanpub, 2018.
[7] S. Chakraborty et al. Interpretability of deep learning models: A survey of results. In Proc. of SmartWorld, 2017.
[8] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In Proc. of ECCV, 2014.
[9] Q.-S. Zhang and S.-C. Zhu. Visual interpretability for deep learning: a survey. Frontiers of Information Technology & Electronic Engineering, 19(1):27-39, Jan 2018.
[10] M. T. Ribeiro, S. Singh, and C. Guestrin. "Why should I trust you?": Explaining the predictions of any classifier. In Proc. of ACM SIGKDD, pages 1135-1144, 2016.
[11] Q. Zhang, R. Cao, F. Shi, Y. N. Wu, and S.-C. Zhu. Interpreting CNN knowledge via an explanatory graph. In Proc. of AAAI, 2018.
[12] S. Sabour, N. Frosst, and G. E. Hinton. Dynamic routing between capsules. In Proc. of NIPS, pages 3856-3866, 2017.
[13] S. Becker, M. Ackermann, S. Lapuschkin, K.-R. Müller, and W. Samek. Interpreting and explaining deep neural networks for classification of audio signals. CoRR, abs/1807.03418, 2018.
[14] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735-1780, November 1997.
[15] J. Chung, Ç. Gülçehre, K. Cho, and Y. Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. In Proc. of NIPS, 2014.
[16] M. Ravanelli, P. Brakel, M. Omologo, and Y. Bengio. Improving speech recognition by revising gated recurrent units. In Proc. of Interspeech, 2017.
[17] M. Ravanelli, P. Brakel, M. Omologo, and Y. Bengio. Light gated recurrent units for speech recognition. IEEE Transactions on Emerging Topics in Computational Intelligence, 2(2):92-102, April 2018.
[18] Y. LeCun, P. Haffner, L. Bottou, and Y. Bengio. Object recognition with gradient-based learning.
In Shape, Contour and Grouping in Computer Vision, London, UK, 1999. Springer-Verlag.
[19] E. Variani, X. Lei, E. McDermott, I. L. Moreno, and J. Gonzalez-Dominguez. Deep neural networks for small footprint text-dependent speaker verification. In Proc. of ICASSP, pages 4052-4056, 2014.
[20] F. Richardson, D. A. Reynolds, and N. Dehak. A unified deep neural network for speaker and language recognition. In Proc. of Interspeech, pages 1146-1150, 2015.
[21] D. Snyder, D. Garcia-Romero, D. Povey, and S. Khudanpur. Deep neural network embeddings for text-independent speaker verification. In Proc. of Interspeech, pages 999-1003, 2017.
[22] C. Zhang, K. Koishida, and J. Hansen. Text-independent speaker verification based on triplet convolutional neural network embeddings. IEEE/ACM Trans. Audio, Speech and Lang. Proc., 26(9):1633-1644, 2018.
[23] G. Bhattacharya, J. Alam, and P. Kenny. Deep speaker embeddings for short-duration speaker verification. In Proc. of Interspeech, pages 1517-1521, 2017.
[24] A. Nagrani, J. S. Chung, and A. Zisserman. VoxCeleb: a large-scale speaker identification dataset. In Proc. of Interspeech, 2017.
[25] D. Palaz, M. Magimai-Doss, and R. Collobert. Analysis of CNN-based speech recognition system using raw speech as input. In Proc. of Interspeech, 2015.
[26] T. N. Sainath, R. J. Weiss, A. W. Senior, K. W. Wilson, and O. Vinyals. Learning the speech front-end with raw waveform CLDNNs. In Proc. of Interspeech, 2015.
[27] Y. Hoshen, R. Weiss, and K. W. Wilson. Speech acoustic modeling from raw multichannel waveforms. In Proc. of ICASSP, 2015.
[28] T. N. Sainath, R. J. Weiss, K. W. Wilson, A. Narayanan, M. Bacchiani, and A. Senior. Speaker localization and microphone spacing invariant acoustic modeling from raw multichannel waveforms. In Proc. of ASRU, 2015.
[29] Z. Tüske, P. Golik, R. Schlüter, and H. Ney.
Acoustic modeling with deep neural networks using raw time signal for LVCSR. In Proc. of Interspeech, 2014.
[30] G. Trigeorgis, F. Ringeval, R. Brueckner, E. Marchi, M. A. Nicolaou, B. Schuller, and S. Zafeiriou. Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network. In Proc. of ICASSP, pages 5200–5204, 2016.
[31] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu. WaveNet: A generative model for raw audio. In ArXiv, 2016.
[32] S. Mehri, K. Kumar, I. Gulrajani, R. Kumar, S. Jain, J. Sotelo, A. C. Courville, and Y. Bengio. SampleRNN: An unconditional end-to-end neural audio generation model. CoRR, abs/1612.07837, 2016.
[33] P. Ghahremani, V. Manohar, D. Povey, and S. Khudanpur. Acoustic modelling from the signal domain using CNNs. In Proc. of Interspeech, 2016.
[34] H. Dinkel, N. Chen, Y. Qian, and K. Yu. End-to-end spoofing detection with raw waveform CLDNNs. In Proc. of ICASSP, 2017.
[35] H. Muckenhirn, M. Magimai-Doss, and S. Marcel. Towards directly modeling raw speech signal for speaker verification using CNNs. In Proc. of ICASSP, 2018.
[36] J.-W. Jung, H.-S. Heo, I.-H. Yang, H.-J. Shim, and H.-J. Yu. A complete end-to-end speaker verification system using deep neural networks: From raw signals to verification result. In Proc. of ICASSP, 2018.
[37] J.-W. Jung, H.-S. Heo, I.-H. Yang, H.-J. Shim, and H.-J. Yu. Avoiding speaker overfitting in end-to-end DNNs using raw waveform for text-independent speaker verification. In Proc. of Interspeech, 2018.
[38] M. Ravanelli and Y. Bengio. Speaker recognition from raw waveform with SincNet. In Proc. of SLT, 2018.
[39] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett, and N. L. Dahlgren. DARPA TIMIT Acoustic Phonetic Continuous Speech Corpus CDROM, 1993.
[40] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur.
Librispeech: An ASR corpus based on public domain audio books. In Proc. of ICASSP, pages 5206–5210, 2015.
[41] M. Ravanelli, L. Cristoforetti, R. Gretter, M. Pellin, A. Sosi, and M. Omologo. The DIRHA-ENGLISH corpus and related tasks for distant-speech recognition in domestic environments. In Proc. of ASRU 2015, pages 275–282.
[42] M. Ravanelli, P. Svaizer, and M. Omologo. Realistic multi-microphone data simulation for distant speech recognition. In Proc. of Interspeech, 2016.
[43] L. R. Rabiner and R. W. Schafer. Theory and Applications of Digital Speech Processing. Prentice Hall, NJ, 2011.
[44] S. K. Mitra. Digital Signal Processing. McGraw-Hill, 2005.
[45] M. Ravanelli, D. Serdyuk, and Y. Bengio. Twin regularization for online speech recognition. In Proc. of Interspeech, 2018.
[46] T. N. Sainath, B. Kingsbury, A. R. Mohamed, and B. Ramabhadran. Learning filter banks within a deep neural network framework. In Proc. of ASRU, pages 297–302, 2013.
[47] H. Yu, Z. H. Tan, Y. Zhang, Z. Ma, and J. Guo. DNN filter bank cepstral coefficients for spoofing detection. IEEE Access, 5:4779–4787, 2017.
[48] H. Seki, K. Yamamoto, and S. Nakagawa. A deep neural network integrated with filterbank learning for speech recognition. In Proc. of ICASSP, pages 5480–5484, 2017.
[49] N. Zeghidour, N. Usunier, I. Kokkinos, T. Schatz, G. Synnaeve, and E. Dupoux. Learning filterbanks from raw speech for phone recognition. In Proc. of ICASSP, pages 5509–5513, 2018.
[50] V. Papyan, Y. Romano, and M. Elad. Convolutional neural networks analyzed via convolutional sparse coding. Journal of Machine Learning Research, 18:83:1–83:52, 2017.
[51] S. Mallat. Understanding deep convolutional networks. CoRR, abs/1601.04920, 2016.
[52] D. Palaz, M. Magimai-Doss, and R. Collobert. End-to-end acoustic modeling using convolutional neural networks for automatic speech recognition. 2016.
[53] H. Muckenhirn, M. Magimai-Doss, and S. Marcel.
On learning vocal tract system related speaker discriminative information from raw signal using CNNs. In Proc. of Interspeech, 2018.
[54] M. Ravanelli. Deep Learning for Distant Speech Recognition. PhD Thesis, Unitn, 2017.
[55] M. Ravanelli, P. Brakel, M. Omologo, and Y. Bengio. A network of deep neural networks for distant speech recognition. In Proc. of ICASSP, pages 4880–4884, 2017.
[56] M. Ravanelli and M. Omologo. Contaminated speech training methods for robust DNN-HMM distant speech recognition. In Proc. of Interspeech 2015, pages 756–760.
[57] M. Ravanelli, T. Parcollet, and Y. Bengio. The PyTorch-Kaldi speech recognition toolkit. In arXiv:1811.07453, 2018.
[58] A. K. Sarkar, D. Matrouf, P. M. Bousquet, and J. F. Bonastre. Study of the effect of i-vector modeling on short and mismatch utterance duration for speaker verification. In Proc. of Interspeech, pages 2662–2665, 2012.
[59] R. Travadi, M. Van Segbroeck, and S. Narayanan. Modified-prior i-vector estimation for language identification of short duration utterances. In Proc. of Interspeech, pages 3037–3041, 2014.
[60] A. Kanagasundaram, R. Vogt, D. Dean, S. Sridharan, and M. Mason. i-vector based speaker recognition on short utterances. In Proc. of Interspeech, pages 2341–2344, 2011.
[61] M. Matassoni, R. Astudillo, A. Katsamanis, and M. Ravanelli. The DIRHA-GRID corpus: baseline and tools for multi-room distant speech recognition using distributed microphones. In Proc. of Interspeech 2014, pages 1616–1617.
[62] E. Zwyssig, M. Ravanelli, P. Svaizer, and M. Omologo. A multi-channel corpus for distant-speech interaction in presence of known interferences. In Proc. of ICASSP 2015, pages 4480–4484.
[63] L. Cristoforetti, M. Ravanelli, M. Omologo, A. Sosi, A. Abad, M. Hagmueller, and P. Maragos. The DIRHA simulated corpus. In Proc. of LREC 2014, pages 2629–2634.
[64] D. B. Paul and J. M. Baker. The design for the Wall Street Journal-based CSR corpus.
In Proceedings of the Workshop on Speech and Natural Language, Proc. of HLT, pages 357–362, 1992.
[65] M. Ravanelli, A. Sosi, P. Svaizer, and M. Omologo. Impulse response estimation for robust speech recognition in a reverberant environment. In Proc. of EUSIPCO 2012.
[66] M. Ravanelli and M. Omologo. On the selection of the impulse responses for distant-speech recognition based on contaminated speech training. In Proc. of Interspeech 2014, pages 1028–1032.
[67] J. Ba, R. Kiros, and G. E. Hinton. Layer normalization. CoRR, abs/1607.06450, 2016.
[68] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proc. of ICML, pages 448–456, 2015.
[69] M. Ravanelli, P. Brakel, M. Omologo, and Y. Bengio. Batch-normalized joint training for DNN-based distant speech recognition. In Proc. of SLT, 2016.
[70] A. L. Maas, A. Y. Hannun, and A. Y. Ng. Rectifier nonlinearities improve neural network acoustic models. In Proc. of ICML, 2013.
[71] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proc. of AISTATS, pages 249–256, 2010.
[72] D. Povey et al. The Kaldi speech recognition toolkit. In Proc. of ASRU, 2011.
[73] A. Larcher, K. A. Lee, and S. Meignier. An extensible speaker identification SIDEKIT in Python. In Proc. of ICASSP, pages 5095–5099, 2016.

Appendix

Corpora
To provide experimental evidence on datasets characterized by different numbers of speakers, this paper considers the TIMIT (462 spks, train chunk) [39] and Librispeech (2484 spks) [40] corpora. For speaker verification experiments, non-speech intervals at the beginning and end of each sentence were removed. Moreover, the Librispeech sentences with internal silences lasting more than 125 ms were split into multiple chunks.
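The splitting step above can be sketched with a simple energy-threshold silence detector. This is an illustrative sketch only: the paper does not specify its silence detector, so the function name `split_on_silence` and the amplitude threshold `thr` are assumptions.

```python
def split_on_silence(samples, fs=16000, thr=0.01, max_sil_ms=125):
    """Split a waveform into chunks at internal silences longer than
    max_sil_ms. Hypothetical energy-threshold sketch; thr is an assumed
    amplitude threshold for deciding that a sample is silence."""
    max_sil = int(fs * max_sil_ms / 1000)
    chunks, cur, sil = [], [], 0
    for s in samples:
        if abs(s) < thr:
            sil += 1           # extend the current run of silence
            cur.append(s)
        else:
            if sil > max_sil and cur:
                # the silence run was too long: close the chunk,
                # dropping the trailing silence samples
                chunks.append(cur[:-sil])
                cur = []
            sil = 0
            cur.append(s)
    if cur:  # flush the last chunk, trimming a long trailing silence
        chunks.append(cur[: len(cur) - sil] if sil > max_sil else cur)
    return [c for c in chunks if c]

# two 0.5 s tones separated by 200 ms of silence -> two chunks
sig = [0.5] * 8000 + [0.0] * 3200 + [0.5] * 8000
parts = split_on_silence(sig)
```

In practice a smoothed frame-energy detector would be more robust than a per-sample threshold, but the chunking logic is the same.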
To address text-independent speaker recognition, the calibration sentences of TIMIT (i.e., the utterances with the same text for all speakers) have been removed. For TIMIT, five sentences for each speaker were used for training, while the remaining three were used for test. For the Librispeech corpus, the training and test material was randomly selected so as to exploit 12-15 seconds of training material for each speaker and test sentences lasting 2-6 seconds.

To evaluate the performance in a challenging distant-talking scenario, speech recognition experiments have also considered the DIRHA dataset [41]. This corpus, similarly to the other DIRHA corpora [61, 62], was developed in the context of the DIRHA project [63] and is based on WSJ sentences [64] recorded in a domestic environment. Training is based on contaminating WSJ-5k utterances with realistic impulse responses [65, 66], while the test phase consists of 409 WSJ sentences recorded by native speakers in a domestic environment (the average SNR is 10 dB).

SincNet Setup
The waveform of each speech sentence was split into chunks of 200 ms (with 10 ms overlap), which were fed into the SincNet architecture. The first layer performs sinc-based convolutions as described in Sec. 2, using 80 filters of length L = 251 samples. The architecture then employs two standard convolutional layers, both using 60 filters of length 5. Layer normalization [67] was used for both the input samples and all convolutional layers (including the SincNet input layer). Next, three fully-connected layers composed of 2048 neurons and normalized with batch normalization [68, 69] were applied. All hidden layers use leaky-ReLU [70] non-linearities. The parameters of the sinc layer were initialized using mel-scale cutoff frequencies, while the rest of the network was initialized with the well-known "Glorot" initialization scheme [71].
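The sinc-based kernels of the first layer can be sketched as below: each filter is a band-pass FIR kernel fully determined by its two cutoff frequencies, built as the difference of two low-pass sinc functions and smoothed with a window, with the cutoffs initialized on the mel scale. This is a minimal stdlib-only sketch, not the paper's exact implementation; the function names, the Hamming window choice, and the `fmin` floor are assumptions following standard DSP conventions.

```python
import math

def sinc(x):
    """sin(x)/x with the removable singularity at x = 0 filled in."""
    return 1.0 if x == 0 else math.sin(x) / x

def sinc_bandpass(f1, f2, L=251, fs=16000):
    """Band-pass FIR kernel parametrized only by its cutoffs f1 < f2 (Hz),
    as in a SincNet-style first layer; here f1, f2 would be the learned
    parameters instead of all L tap values."""
    f1, f2 = f1 / fs, f2 / fs          # normalize cutoffs to [0, 0.5]
    mid = (L - 1) // 2
    h = []
    for n in range(L):
        t = n - mid
        # difference of two low-pass sinc filters = band-pass response
        val = 2 * f2 * sinc(2 * math.pi * f2 * t) \
            - 2 * f1 * sinc(2 * math.pi * f1 * t)
        # Hamming window to attenuate truncation ripples
        w = 0.54 - 0.46 * math.cos(2 * math.pi * n / (L - 1))
        h.append(val * w)
    return h

def mel_cutoffs(n_filt=80, fs=16000, fmin=30.0):
    """Mel-spaced (f1, f2) initialization for n_filt band-pass filters."""
    mel = lambda f: 2595.0 * math.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    step = (mel(fs / 2) - mel(fmin)) / (n_filt + 1)
    pts = [imel(mel(fmin) + i * step) for i in range(n_filt + 2)]
    return [(pts[i], pts[i + 2]) for i in range(n_filt)]

h = sinc_bandpass(300.0, 3000.0)   # one filter: 251 taps from 2 parameters
bank = mel_cutoffs()               # 80 (low, high) cutoff pairs
```

The compactness claim of the paper is visible here: 80 filters require only 160 learned parameters (one cutoff pair each) instead of 80 x 251 free tap values.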
Frame-level speaker and phoneme classifications were obtained by applying a softmax classifier, providing a set of posterior probabilities over the targets. For speaker-id, a sentence-level classification was derived by averaging the frame predictions and voting for the speaker that maximizes the average posterior. Training used the RMSprop optimizer, with a learning rate lr = 0.001, α = 0.95, ε = 10⁻⁷, and minibatches of size 128. All the hyper-parameters of the architecture were tuned on TIMIT, then inherited for Librispeech as well.

The speaker verification system was derived from the speaker-id neural network using the d-vector approach [19, 24], which relies on the output of the last hidden layer and computes the cosine distance between the test and the claimed speaker d-vectors. Ten utterances from impostors were randomly selected for each sentence coming from a genuine speaker. Note that, to assess our approach on a standard open-set speaker-id task, all the impostors were taken from a speaker pool different from that used for training the speaker-id DNN.

Baseline Setups
We compared SincNet with several alternative systems. First, we considered a standard CNN fed by the raw waveform. This network is based on the same architecture as SincNet, but replaces the sinc-based convolution with a standard one. A comparison with popular hand-crafted features was also performed. To this end, we computed 39 MFCCs (13 static + Δ + ΔΔ) and 40 FBANKs using the Kaldi toolkit [72]. These features, computed every 25 ms with 10 ms overlap, were gathered to form a context window of approximately 200 ms (i.e., a context similar to that of the considered waveform-based neural network). A CNN was used for FBANK features, while a Multi-Layer Perceptron (MLP) was used for MFCCs, since CNNs exploit local correlation across features and cannot be effectively used with uncorrelated MFCC features.
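The d-vector scoring step described above reduces to two operations: averaging the last-hidden-layer outputs over frames, and comparing the result to the claimed speaker's d-vector by cosine similarity. A minimal sketch, assuming frame embeddings are available as plain lists of floats (the helper names are hypothetical, not from the paper's code):

```python
import math

def dvector(frame_embeddings):
    """Sentence-level d-vector: average of the last-hidden-layer
    outputs across all frames of the utterance."""
    n = len(frame_embeddings)
    dim = len(frame_embeddings[0])
    return [sum(f[i] for f in frame_embeddings) / n for i in range(dim)]

def cosine_similarity(a, b):
    """Cosine similarity between two d-vectors; verification accepts
    when this score exceeds a tuned threshold."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# toy example: two frame embeddings averaged, then scored against
# a claimed-speaker d-vector pointing in the same direction
test_frames = [[1.0, 0.0], [0.0, 1.0]]
claimed = [1.0, 1.0]
score = cosine_similarity(dvector(test_frames), claimed)
```

Averaging before scoring makes the d-vector length-normalizable and independent of utterance duration, which is why the same pipeline works for the 2-6 second Librispeech test sentences.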
Layer normalization was used for the FBANK network, while batch normalization was employed for the MFCC one. The hyper-parameters of these networks were also tuned using the aforementioned approach.

For speaker verification experiments, we also considered an i-vector baseline. The i-vector system was implemented with the SIDEKIT toolkit [73]. The GMM-UBM model, the Total Variability (TV) matrix, and the Probabilistic Linear Discriminant Analysis (PLDA) were trained on the Librispeech data (avoiding test and enrollment sentences). The GMM-UBM was composed of 2048 Gaussians, and the rank of the TV and PLDA eigenvoice matrices was 400. The enrollment and test phases were conducted on Librispeech using the same set of speech segments used for the DNN experiments.