Interpretable Convolutional Filters with SincNet
Authors: Mirco Ravanelli (Mila, Université de Montréal), Yoshua Bengio (Mila, Université de Montréal, CIFAR Fellow)
Abstract

Deep learning is currently playing a crucial role toward higher levels of artificial intelligence. This paradigm allows neural networks to learn complex and abstract representations that are progressively obtained by combining simpler ones. Nevertheless, the internal "black-box" representations automatically discovered by current neural architectures often suffer from a lack of interpretability, making the study of explainable machine learning techniques of primary interest. This paper summarizes our recent efforts to develop a more interpretable neural model for directly processing speech from the raw waveform. In particular, we propose SincNet, a novel Convolutional Neural Network (CNN) that encourages the first layer to discover more meaningful filters by exploiting parametrized sinc functions. In contrast to standard CNNs, which learn all the elements of each filter, only the low and high cutoff frequencies of band-pass filters are directly learned from data. This inductive bias offers a very compact way to derive a customized filter-bank front-end that only depends on a few parameters with a clear physical meaning. Our experiments, conducted on both speaker and speech recognition, show that the proposed architecture converges faster, performs better, and is more interpretable than standard CNNs.

1 Introduction

Deep learning has recently contributed to achieving unprecedented performance levels in numerous tasks, mainly thanks to the progressive maturation of supervised learning techniques [1]. The increased discrimination power of modern neural networks, however, is often obtained at the cost of a reduced interpretability of the model.
Modern end-to-end systems, whose popularity is increasing in many fields such as speech recognition [2, 3, 4], often discover "black-box" internal representations that make sense for the machine but are arguably difficult to interpret by humans. The remarkable sensitivity of current neural networks to adversarial examples [5], for instance, not only highlights how superficial the discovered representations can be but also raises crucial concerns about our ability to really interpret neural models. Such a lack of interpretability can be a major bottleneck for the development of future deep learning techniques. Having more meaningful insights into the logic behind network predictions and errors, in fact, can help us to better trust, understand, and diagnose our models, eventually guiding our efforts toward more robust deep learning. In recent years, growing interest has thus been devoted to the development of interpretable machine learning [6, 7], as witnessed by the numerous works in the field, ranging from visualization [8, 9] and diagnosis of DNNs [10] to explanatory graphs [11] and explainable models [12], just to name a few.

Interpretability is a major concern for audio and speech applications as well [13]. CNNs and Recurrent Neural Networks (RNNs) are the most popular architectures currently used in speech and speaker recognition [2]. RNNs can be employed to capture the temporal evolution of the speech signal [14, 15, 16, 17], while CNNs, thanks to their weight sharing, local filters, and pooling, are normally employed to extract robust and invariant representations [18].

32nd Conference on Neural Information Processing Systems (NIPS 2018) IRASL workshop, Montréal, Canada.

Even though standard hand-crafted features such as FBANK and Mel-Frequency Cepstral Coefficients (MFCC) are still
employed in many state-of-the-art systems [19, 20, 21], directly feeding a CNN with spectrogram bins [22, 23, 24] or even with raw audio samples [25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37] is an approach of increasing popularity. Engineered features, in fact, were originally designed from perceptual evidence, and there is no guarantee that such representations are optimal for all speech-related tasks. Standard features, for instance, smooth the speech spectrum, possibly hindering the extraction of crucial narrow-band speaker characteristics such as pitch and formants. Conversely, directly processing the raw waveform allows the network to learn low-level representations that are possibly more customized to each specific task. The downside of raw speech processing lies in the possible lack of interpretability of the filter bank learned in the first convolutional layer. In our view, this layer is arguably the most critical part of current waveform-based CNNs. It deals with high-dimensional inputs and is also more affected by vanishing gradient problems, especially when employing very deep architectures. As will be discussed in this paper, the filters learned by CNNs often take noisy and incongruous multi-band shapes, especially when few training samples are available. These filters certainly make some sense for the neural network, but they do not appeal to human intuition, nor do they appear to lead to an efficient representation of the speech signal.

To help CNNs discover more meaningful filters, this work proposes adding some constraints on their shape. Compared to standard CNNs, where the filter-bank characteristics depend on several parameters (each element of the filter vector is directly learned), SincNet convolves the waveform with a set of parametrized sinc functions that implement band-pass filters [38].
The low and high cutoff frequencies are the only parameters of the filter learned from data. This solution still offers considerable flexibility but forces the network to focus on high-level tunable parameters that have a clear physical meaning.

Our experimental validation considers both speaker and speech recognition tasks. Speaker recognition is carried out on the TIMIT [39] and Librispeech [40] datasets under challenging but realistic conditions, characterized by minimal training data (i.e., 12-15 seconds for each speaker) and short test sentences (lasting from 2 to 6 seconds). With the purpose of validating SincNet in both clean and noisy conditions, speech recognition experiments are conducted on both the TIMIT and DIRHA datasets [41, 42]. Results show that the proposed SincNet converges faster, achieves better performance, and is more interpretable than a more standard CNN.

The remainder of the paper is organized as follows. The SincNet architecture is described in Sec. 2. Sec. 3 discusses the relation to prior work. The experimental activity on both speaker and speech recognition is outlined in Sec. 4. Finally, Sec. 5 discusses our conclusions.

2 The SincNet Architecture

The first layer of a standard CNN performs a set of time-domain convolutions between the input waveform and some Finite Impulse Response (FIR) filters [43]. Each convolution is defined as follows¹:

y[n] = x[n] * h[n] = Σ_{l=0}^{L-1} x[l] · h[n−l]    (1)

where x[n] is a chunk of the speech signal, h[n] is the filter of length L, and y[n] is the filtered output. In standard CNNs, all the L elements (taps) of each filter are learned from data. Conversely, the proposed SincNet (depicted in Fig.
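As a quick sanity check of Eq. (1), the direct-form FIR convolution can be sketched in a few lines of NumPy (a minimal illustration, not part of the original paper; the function name is ours):

```python
import numpy as np

def fir_convolve(x, h):
    """Direct-form FIR filtering: y[n] = sum_l x[l] * h[n-l] (Eq. 1)."""
    L = len(h)
    y = np.zeros(len(x))
    for n in range(len(x)):
        # only indices with 0 <= l <= n and 0 <= n-l <= L-1 contribute
        for l in range(max(0, n - L + 1), n + 1):
            y[n] += x[l] * h[n - l]
    return y

x = np.array([1.0, 2.0, 3.0, 4.0])
h = np.array([0.5, 0.5])   # simple 2-tap moving-average filter
y = fir_convolve(x, h)
# agrees with np.convolve(x, h) truncated to len(x) samples
```

In a CNN first layer, h would be a learned weight vector; SincNet replaces it with the parametrized function g of Eq. (2) below.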
1) performs the convolution with a predefined function g that depends on only a few learnable parameters θ, as highlighted in the following equation:

y[n] = x[n] * g[n, θ]    (2)

A reasonable choice, inspired by standard filtering in digital signal processing, is to define g such that a filter-bank composed of rectangular bandpass filters is employed. In the frequency domain, the magnitude of a generic bandpass filter can be written as the difference between two low-pass filters:

G[f, f1, f2] = rect(f / (2 f2)) − rect(f / (2 f1)),    (3)

where f1 and f2 are the learned low and high cutoff frequencies, and rect(·) is the rectangular function in the magnitude frequency domain². After returning to the time domain (using the inverse Fourier transform [43]), the reference function g becomes:

g[n, f1, f2] = 2 f2 sinc(2π f2 n) − 2 f1 sinc(2π f1 n),    (4)

where the sinc function is defined as sinc(x) = sin(x)/x. The cutoff frequencies can be initialized randomly in the range [0, fs/2], where fs represents the sampling frequency of the input signal. As an alternative, filters can be initialized with the cutoff frequencies of the mel-scale filter-bank, which has the advantage of directly allocating more filters in the lower part of the spectrum, where crucial speech information is located.

¹ Most deep learning toolkits actually compute correlation rather than convolution. The obtained flipped (mirrored) filters do not affect the results.

[Figure 1: Architecture of SincNet. The speech waveform is processed by the sinc-based convolution, followed by pooling, layer normalization, leaky ReLU, dropout, further CNN/DNN layers, and a softmax speaker classifier.]
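A minimal sketch of the band-pass construction in Eq. (4) in plain NumPy (function names are ours; the released SincNet code implements this as a differentiable PyTorch layer):

```python
import numpy as np

def sinc(x):
    # sinc(x) = sin(x)/x, with the removable singularity sinc(0) = 1
    return np.where(x == 0, 1.0, np.sin(x) / np.where(x == 0, 1.0, x))

def sinc_bandpass(f1, f2, L, fs):
    """Time-domain band-pass filter of Eq. (4).

    f1, f2: low/high cutoffs in Hz; L: filter length (odd); fs: sampling rate.
    Cutoffs are normalized by fs so that n is measured in samples.
    """
    n = np.arange(-(L // 2), L // 2 + 1)   # symmetric support around n = 0
    f1n, f2n = f1 / fs, f2 / fs            # normalized cutoff frequencies
    return 2 * f2n * sinc(2 * np.pi * f2n * n) - 2 * f1n * sinc(2 * np.pi * f1n * n)

h = sinc_bandpass(300.0, 3000.0, 101, 16000)
# |FFT(h)| is close to 1 inside [300, 3000] Hz and close to 0 far outside,
# with ripples due to truncation (mitigated by windowing, Eq. 7)
```

Note the filter is symmetric around its center tap, which is the linear-phase property discussed below.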
To ensure f1 ≥ 0 and f2 ≥ f1, the previous equation is actually fed with the following parameters:

f1^abs = |f1|    (5)
f2^abs = f1^abs + |f2 − f1|    (6)

Note that no bound has been imposed to force f2 to be smaller than the Nyquist frequency, since we observed that this constraint is naturally fulfilled during training. Moreover, the gain of each filter is not learned at this level. This parameter is managed by the subsequent layers, which can easily attribute more or less importance to each filter output.

An ideal bandpass filter (i.e., a filter whose passband is perfectly flat and whose stopband attenuation is infinite) requires an infinite number of elements L. Any truncation of g thus inevitably leads to an approximation of the ideal filter, characterized by ripples in the passband and limited attenuation in the stopband. A popular solution to mitigate this issue is windowing [43]. Windowing is performed by multiplying the truncated function g with a window function w, which aims to smooth out the abrupt discontinuities at the ends of g:

g_w[n, f1, f2] = g[n, f1, f2] · w[n].    (7)

This paper uses the popular Hamming window [44], defined as follows:

w[n] = 0.54 − 0.46 · cos(2πn / L).    (8)

The Hamming window is particularly suitable for achieving high frequency selectivity [44].

² The phase of the rect(·) function is considered to be linear.

[Figure 2: Examples of filters learned by a standard CNN and by the proposed SincNet (using the Librispeech corpus on a speaker-id task). The first row reports the filters in the time domain, while the second one shows their magnitude frequency response.]
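The cutoff reparameterization of Eqs. (5)-(6) and the Hamming window of Eq. (8) can be sketched as follows (an illustrative NumPy sketch with our own function names):

```python
import numpy as np

def constrain_cutoffs(f1, f2):
    """Map unconstrained parameters to valid cutoffs (Eqs. 5-6):
    f1_abs >= 0 and f2_abs >= f1_abs hold by construction."""
    f1_abs = abs(f1)
    f2_abs = f1_abs + abs(f2 - f1)
    return f1_abs, f2_abs

def hamming(L):
    """Hamming window of Eq. (8)."""
    n = np.arange(L)
    return 0.54 - 0.46 * np.cos(2 * np.pi * n / L)

# Even a degenerate parameter pair (negative low cutoff, swapped order)
# is mapped to a valid band:
f1, f2 = constrain_cutoffs(-200.0, 50.0)
# f1 = 200.0, f2 = 200.0 + 250.0 = 450.0
```

Because the mapping uses only absolute values and a sum, gradients can flow through it, so the unconstrained cutoffs remain trainable by SGD.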
However, results not reported here reveal no significant performance difference when adopting other functions, such as Hann, Blackman, and Kaiser windows. Note also that the filters g are symmetric and thus do not introduce any phase distortion. Due to the symmetry, the filters can be computed efficiently by considering one side of the filter and inheriting the results for the other half.

All operations involved in SincNet are fully differentiable, and the cutoff frequencies of the filters can be jointly optimized with the other CNN parameters using Stochastic Gradient Descent (SGD) or other gradient-based optimization routines. As shown in Fig. 1, a standard CNN pipeline (pooling, normalization, activations, dropout) can be employed after the first sinc-based convolution. Multiple standard convolutional, fully-connected, or recurrent layers [15, 16, 17, 45] can then be stacked together to finally perform a classification with a softmax classifier.

Fig. 2 shows some examples of filters learned by a standard CNN and by the proposed SincNet for a speaker identification task trained on Librispeech (the frequency response is plotted between 0 and 4 kHz). As observed in the figures, the standard CNN does not always learn filters with a well-defined frequency response. In some cases, the frequency response looks noisy (see the first CNN filter), while in others it assumes multi-band shapes (see the third CNN filter). SincNet, instead, is specifically designed to implement rectangular bandpass filters, leading to a more meaningful filter-bank.

2.1 Model properties

The proposed SincNet has some remarkable properties:

• Fast Convergence: SincNet forces the network to focus only on the filter parameters with a major impact on performance. The proposed approach actually implements a natural inductive bias, utilizing knowledge about the filter shape (similar to feature extraction methods generally deployed on this task) while retaining flexibility to adapt to data.
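The symmetry shortcut mentioned above, computing one side of the filter and mirroring it, can be illustrated as (a sketch; the helper name is ours):

```python
import numpy as np

def symmetric_filter(half):
    """Build a linear-phase (symmetric) FIR filter from one half.

    `half` holds the taps from the center outward (center tap first);
    the other side is obtained by mirroring, so only about half the
    taps ever need to be computed.
    """
    return np.concatenate([half[:0:-1], half])  # mirror all but the center tap

center_to_edge = np.array([1.0, 0.6, 0.2])  # taps at offsets 0, 1, 2 from center
h = symmetric_filter(center_to_edge)
# h = [0.2, 0.6, 1.0, 0.6, 0.2], symmetric around the center
```

Symmetric (linear-phase) filters delay all frequencies equally, which is why SincNet introduces no phase distortion.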
This prior knowledge makes learning the filter characteristics much easier, helping SincNet to converge significantly faster to a better solution. Fig. 3 shows the learning curves of SincNet and CNN obtained on a speaker-id task. These results are achieved on the TIMIT dataset and highlight a faster decrease of the Frame Error Rate (FER%) when SincNet is used. Moreover, SincNet converges to better performance, leading to a FER of 33.0% against a FER of 37.7% achieved with the CNN baseline.

[Figure 3: Frame Error Rate (%) obtained on speaker-id with the TIMIT corpus (using held-out data).]

[Figure 4: Cumulative frequency response of SincNet and CNN filters on speaker-id.]

• Few Parameters: SincNet drastically reduces the number of parameters in the first convolutional layer. For instance, if we consider a layer composed of F filters of length L, a standard CNN employs F · L parameters, against the 2F considered by SincNet. If F = 80 and L = 100, we employ 8k parameters for the CNN and only 160 for SincNet. Moreover, if we double the filter length L, a standard CNN doubles its parameter count (e.g., we go from 8k to 16k), while SincNet's parameter count is unchanged (only two parameters are employed for each filter, regardless of its length L). This offers the possibility to derive very selective filters with many taps without actually adding parameters to the optimization problem. Moreover, the compactness of the SincNet architecture makes it suitable in the few-sample regime.

• Interpretability: The SincNet feature maps obtained in the first convolutional layer are definitely more interpretable and human-readable than those of other approaches. The filter bank, in fact, only depends on parameters with a clear physical meaning. Fig.
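The parameter-count comparison above is simple arithmetic, sketched here for concreteness (the helper name is ours):

```python
def first_layer_params(F, L, sincnet=False):
    """Parameters in the first convolutional layer:
    a standard CNN learns all F*L taps; SincNet learns 2 cutoffs per filter."""
    return 2 * F if sincnet else F * L

F, L = 80, 100
cnn = first_layer_params(F, L)                # 80 * 100 = 8000
sinc = first_layer_params(F, L, True)         # 2 * 80   = 160
cnn_2L = first_layer_params(F, 2 * L)         # doubling L doubles the CNN: 16000
sinc_2L = first_layer_params(F, 2 * L, True)  # SincNet unchanged: 160
```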
4, for instance, shows the cumulative frequency response of the filters learned by SincNet and CNN on a speaker-id task. The cumulative frequency response is obtained by summing up all the discovered filters and is useful to highlight which frequency bands are covered by the learned filters. Interestingly, there are three main peaks that clearly stand out from the SincNet plot (see the red line in the figure). The first one corresponds to the pitch region (the average pitch is 133 Hz for a male and 234 Hz for a female). The second peak (approximately located at 500 Hz) mainly captures first formants, whose average value over the various English vowels is indeed 500 Hz. Finally, the third peak (ranging from 900 to 1400 Hz) captures some important second formants, such as the second formant of the vowel /a/, which is located on average at 1100 Hz. This filter-bank configuration indicates that SincNet has successfully adapted its characteristics to address speaker identification. Conversely, the standard CNN does not exhibit such a meaningful pattern: the CNN filters tend to correctly focus on the lower part of the spectrum, but peaks tuned on first and second formants do not clearly appear. As one can observe from Fig. 4, the CNN curve stands above the SincNet one. SincNet, in fact, learns filters that are, on average, more selective than the CNN ones, possibly better capturing narrow-band speaker cues.

Fig. 5 shows the cumulative frequency response of a CNN and SincNet obtained on a noisy speech recognition task. In this experiment, we have artificially corrupted TIMIT with a significant quantity of noise in the band between 2.0 and 2.5 kHz (see the spectrogram) and we have analyzed how fast the two architectures learn to avoid such a useless band.
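The cumulative frequency response used in this analysis can be computed straightforwardly: sum the magnitude responses of all filters in the bank and normalize. A NumPy sketch (variable and function names are ours):

```python
import numpy as np

def cumulative_response(filters, n_fft=512):
    """Sum of magnitude frequency responses over a bank of FIR filters,
    normalized to a maximum of 1 (as in the normalized plots of Fig. 4)."""
    mags = np.abs(np.fft.rfft(filters, n=n_fft, axis=1))  # shape (F, n_fft//2 + 1)
    total = mags.sum(axis=0)
    return total / total.max()

# toy bank of 8 random 3-tap filters, just to exercise the function
rng = np.random.default_rng(0)
bank = rng.standard_normal((8, 3))
resp = cumulative_response(bank)
# resp peaks where the bank concentrates its energy
```

Applied to a learned filter bank, peaks in `resp` reveal the frequency bands the network cares about, which is how the pitch/formant peaks above were identified.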
The second row of sub-figures compares the CNN and the SincNet at a very early training stage (i.e., after having processed only one hour of speech in the first epoch), while the last row shows the cumulative frequency responses after completing the training. From the figures, it emerges that both CNN and SincNet have correctly learned to avoid the corrupted band by the end of training, as highlighted by the holes between 2.0 and 2.5 kHz in the cumulative frequency responses. SincNet, however, learns to avoid such a noisy band much earlier. In the second row of sub-figures, in fact, SincNet shows a visible valley in the cumulative spectrum even after processing only one hour of speech, while the CNN has only learned to give more importance to the lower part of the spectrum.

[Figure 5: Cumulative frequency responses obtained on a speech recognition task trained with a noisy version of TIMIT. As shown in the spectrogram, noise has been artificially added into the band 2.0-2.5 kHz. Both the CNN and SincNet learn to avoid the noisy band, but SincNet learns it much faster, after processing only one hour of speech.]

3 Related Work

Several works have recently explored the use of low-level speech representations to process audio and speech with CNNs. Most prior attempts exploit magnitude spectrogram features [22, 23, 24, 46, 47, 48].
Although spectrograms retain more information than standard hand-crafted features, their design still requires careful tuning of some crucial hyper-parameters, such as the duration, overlap, and typology of the frame window, as well as the number of frequency bins. For this reason, a more recent trend is to learn directly from raw waveforms, thus completely avoiding any feature extraction step. This approach has shown promise in speech [25, 26, 27, 28, 29], including emotion tasks [30], speaker recognition [35], spoofing detection [34], and speech synthesis [31, 32].

Similar to SincNet, some previous works have proposed adding constraints on the CNN filters, for instance forcing them to work on specific bands [46, 47]. Differently from the proposed approach, the latter works operate on spectrogram features and still learn all the L elements of the CNN filters. An idea related to the proposed method has recently been explored in [48], where a set of parameterized Gaussian filters is employed. This approach operates on the spectrogram domain, while SincNet directly considers the raw waveform in the time domain. Similarly to our work, in [49] the convolutional filters are initialized with a predefined filter shape. However, rather than focusing on cutoff frequencies only, all the basic taps of the FIR filters are still learned.

Some valuable works have recently proposed theoretical and experimental frameworks to analyze CNNs [50, 51]. In particular, [52, 35, 53] feed a standard CNN with raw audio samples and analyze the filters learned in the first layer on both speech recognition and speaker identification tasks. The authors highlight some interesting properties that emerged from analyzing the cumulative frequency response and propose a spectral dictionary interpretation of the learned filters.
Similarly to our findings, the latter works noticed that the filters tend to focus more on the lower part of the spectrum and can sometimes highlight peaks that likely correspond to the fundamental frequency. In this work, we argue that all of these interesting properties can be observed more clearly and at an earlier training stage with SincNet.

This paper extends our previous studies on SincNet [38]. To the best of our knowledge, this paper is the first to show the effectiveness of the proposed SincNet in a speech recognition application. Moreover, this work not only considers standard close-talking speech recognition, but also extends the validation of SincNet to distant-talking speech recognition [54, 55, 56].

4 Results

The proposed SincNet has been evaluated on both speech and speaker recognition using different corpora. This work considers a challenging but realistic speaker recognition scenario: for all the adopted corpora, we only employed 12-15 seconds of training material for each speaker, and we tested the system performance on short sentences lasting from 2 to 6 seconds. In the spirit of reproducible research, we release the code of SincNet for speaker identification³ and speech recognition⁴ (under the PyTorch-Kaldi project [57]). More details on the adopted datasets as well as on the SincNet and baseline setups can be found in the appendix.

4.1 Speaker Recognition

Table 1 reports the Classification Error Rates (CER%) achieved on a speaker-id task. The table shows that SincNet outperforms the other systems on both the TIMIT (462 speakers) and Librispeech (2484 speakers) datasets. The gap with a standard CNN fed by the raw waveform is larger on TIMIT, confirming the effectiveness of SincNet when few training data are available. Although this gap is reduced when Librispeech is used, we still observe a 4% relative improvement that is also obtained with faster convergence (1200 vs. 1800 epochs).
Standard FBANKs provide results comparable to SincNet only on TIMIT, but are significantly worse than our architecture when using Librispeech. With few training data, the network cannot discover filters that are much better than FBANKs, but with more data a customized filter-bank is learned and exploited to improve the performance.

Table 2 extends our validation to speaker verification, reporting the Equal Error Rate (EER%) achieved with Librispeech. All DNN models show promising performance, leading to an EER lower than 1% in all cases. The table also highlights that SincNet outperforms the other models, showing a relative performance improvement of about 11% over the standard CNN model. Note that the speaker verification system is derived from the speaker-id neural network using the d-vector technique. The d-vector [19, 24] is extracted from the last hidden layer of the speaker-id network. A speaker-dependent d-vector is computed and stored for each enrollment speaker by performing an L2 normalization and averaging all the d-vectors of the different speech chunks. The cosine distance between enrollment and test d-vectors is then calculated, and a threshold is applied to it to reject or accept the speaker. Ten utterances from impostors were randomly selected for each sentence coming from a genuine speaker.

³ https://github.com/mravanelli/SincNet/
⁴ https://github.com/mravanelli/pytorch-kaldi/

               TIMIT   Librispeech
DNN-MFCC        0.99      2.02
CNN-FBANK       0.86      1.55
CNN-Raw         1.65      1.00
SincNet         0.85      0.96

Table 1: Classification Error Rate (CER%) of speaker identification systems trained on the TIMIT (462 spks) and Librispeech (2484 spks) datasets. SincNet outperforms the competing alternatives.

               EER(%)
DNN-MFCC        0.88
CNN-FBANK       0.60
CNN-Raw         0.58
SincNet         0.51

Table 2: Speaker verification Equal Error Rate (EER%) on the Librispeech dataset for different systems. SincNet outperforms the competing alternatives.
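The d-vector scoring procedure described above (L2-normalize each chunk's d-vector, average them into a speaker model, then threshold a cosine similarity) can be sketched as follows (an illustrative sketch with our own function names and a made-up threshold, not the released code):

```python
import numpy as np

def enroll(chunk_dvectors):
    """Speaker model: L2-normalize each chunk d-vector, then average."""
    normed = [d / np.linalg.norm(d) for d in chunk_dvectors]
    return np.mean(normed, axis=0)

def verify(enrolled, test_dvector, threshold=0.7):
    """Accept the claimed identity if the cosine similarity is high enough."""
    cos = np.dot(enrolled, test_dvector) / (
        np.linalg.norm(enrolled) * np.linalg.norm(test_dvector))
    return bool(cos >= threshold)

# toy 2-D "d-vectors" from two enrollment chunks of the same speaker
model = enroll([np.array([1.0, 0.0]), np.array([0.9, 0.1])])
accept = verify(model, np.array([1.0, 0.05]))   # similar direction: accepted
reject = verify(model, np.array([0.0, 1.0]))    # orthogonal direction: rejected
```

In practice the threshold is tuned on a development set; the EER in Table 2 is the operating point where false acceptances equal false rejections.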
To assess our approach on a standard open-set speaker verification task, all the enrollment and test utterances were taken from a speaker pool different from that used for training the speaker-id DNN. For the sake of completeness, experiments have also been conducted with standard i-vectors. Although a detailed comparison with this technology is beyond the scope of this paper, it is worth noting that our best i-vector system achieves an EER of 1.1%, rather far from what is achieved with the DNN systems. It is well known in the literature that i-vectors provide competitive performance when more training material is available for each speaker and when longer test sentences are employed [58, 59, 60]. Under the challenging conditions faced in this work, neural networks achieve better generalization.

4.2 Speech Recognition

Table 3 reports the speech recognition performance obtained by CNN and SincNet using the TIMIT and DIRHA datasets [41]. To ensure a more accurate comparison between the architectures, five experiments varying the initialization seeds were conducted for each model and corpus. Table 3 thus reports the average speech recognition performance. Standard deviations, not reported here, range between 0.15 and 0.2 for all the experiments.

                        TIMIT   DIRHA
CNN-FBANK                18.3    40.1
CNN-Raw waveform         18.1    40.0
SincNet-Raw waveform     17.2    37.2

Table 3: Speech recognition performance obtained on the TIMIT and DIRHA datasets. For all the datasets, SincNet outperforms CNNs trained on both standard FBANK and raw waveform inputs.

The latter result confirms the effectiveness of SincNet not only in close-talking scenarios but also in challenging noisy conditions characterized by the presence of both noise and reverberation. As emerged in Sec. 2, SincNet is able to effectively tune its filter-bank front-end to better address the characteristics of the noise.
5 Conclusions and Future Work

This paper proposed SincNet, a neural architecture for directly processing waveform audio. Our model, inspired by the way filtering is conducted in digital signal processing, imposes constraints on the filter shapes through an efficient parameterization. SincNet has been extensively evaluated on challenging speaker and speech recognition tasks, consistently showing performance benefits. Beyond performance improvements, SincNet also significantly improves convergence speed over a standard CNN, is more computationally efficient thanks to the exploitation of filter symmetry, and is more interpretable than standard black-box models. Analysis of the SincNet filters, in fact, revealed that the learned filter-bank is tuned to the specific task addressed by the neural network.

In future work, we would like to evaluate SincNet on other popular speaker recognition tasks, such as VoxCeleb. Inspired by the promising results obtained in this paper, we will explore the use of SincNet for supervised and unsupervised speaker/environmental adaptation. Moreover, although this study targeted speaker and speech recognition only, we believe that the proposed approach defines a general paradigm to process time-series and can be applied in numerous other fields.

Acknowledgement

This research was enabled in part by support provided by Calcul Québec and Compute Canada.

References

[1] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016.
[2] D. Yu and L. Deng. Automatic Speech Recognition - A Deep Learning Approach. Springer, 2015.
[3] D. Bahdanau, J. Chorowski, D. Serdyuk, P. Brakel, and Y. Bengio. End-to-end attention-based large vocabulary speech recognition. In Proc. of ICASSP, pages 4945-4949, 2016.
[4] A. Graves and N. Jaitly. Towards end-to-end speech recognition with recurrent neural networks. In Proc. of ICML, pages 1764-1772, 2014.
[5] I. Goodfellow, J. Shlens, and C.
Szegedy. Explaining and harnessing adversarial examples. In Proc. of ICLR, 2015.
[6] C. Molnar. Interpretable Machine Learning: A Guide for Making Black Box Models Explainable. Leanpub, 2018.
[7] S. Chakraborty et al. Interpretability of deep learning models: A survey of results. In Proc. of SmartWorld, 2017.
[8] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In Proc. of ECCV, 2014.
[9] Q.-S. Zhang and S.-C. Zhu. Visual interpretability for deep learning: a survey. Frontiers of Information Technology & Electronic Engineering, 19(1):27-39, Jan 2018.
[10] M. T. Ribeiro, S. Singh, and C. Guestrin. "Why should I trust you?": Explaining the predictions of any classifier. In Proc. of ACM SIGKDD, pages 1135-1144, 2016.
[11] Q. Zhang, R. Cao, F. Shi, Y. N. Wu, and S.-C. Zhu. Interpreting CNN knowledge via an explanatory graph. In Proc. of AAAI, 2018.
[12] S. Sabour, N. Frosst, and G. E. Hinton. Dynamic routing between capsules. In Proc. of NIPS, pages 3856-3866, 2017.
[13] S. Becker, M. Ackermann, S. Lapuschkin, K.-R. Müller, and W. Samek. Interpreting and explaining deep neural networks for classification of audio signals. CoRR, abs/1807.03418, 2018.
[14] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735-1780, November 1997.
[15] J. Chung, Ç. Gülçehre, K. Cho, and Y. Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. In Proc. of NIPS, 2014.
[16] M. Ravanelli, P. Brakel, M. Omologo, and Y. Bengio. Improving speech recognition by revising gated recurrent units. In Proc. of Interspeech, 2017.
[17] M. Ravanelli, P. Brakel, M. Omologo, and Y. Bengio. Light gated recurrent units for speech recognition. IEEE Transactions on Emerging Topics in Computational Intelligence, 2(2):92-102, April 2018.
[18] Y. LeCun, P. Haffner, L. Bottou, and Y. Bengio. Object recognition with gradient-based learning.
In Shape, Contour and Grouping in Computer Vision, London, UK, 1999. Springer-Verlag.
[19] E. Variani, X. Lei, E. McDermott, I. L. Moreno, and J. Gonzalez-Dominguez. Deep neural networks for small footprint text-dependent speaker verification. In Proc. of ICASSP, pages 4052-4056, 2014.
[20] F. Richardson, D. A. Reynolds, and N. Dehak. A unified deep neural network for speaker and language recognition. In Proc. of Interspeech, pages 1146-1150, 2015.
[21] D. Snyder, D. Garcia-Romero, D. Povey, and S. Khudanpur. Deep neural network embeddings for text-independent speaker verification. In Proc. of Interspeech, pages 999-1003, 2017.
[22] C. Zhang, K. Koishida, and J. Hansen. Text-independent speaker verification based on triplet convolutional neural network embeddings. IEEE/ACM Trans. Audio, Speech and Lang. Proc., 26(9):1633-1644, 2018.
[23] G. Bhattacharya, J. Alam, and P. Kenny. Deep speaker embeddings for short-duration speaker verification. In Proc. of Interspeech, pages 1517-1521, 2017.
[24] A. Nagrani, J. S. Chung, and A. Zisserman. VoxCeleb: a large-scale speaker identification dataset. In Proc. of Interspeech, 2017.
[25] D. Palaz, M. Magimai-Doss, and R. Collobert. Analysis of CNN-based speech recognition system using raw speech as input. In Proc. of Interspeech, 2015.
[26] T. N. Sainath, R. J. Weiss, A. W. Senior, K. W. Wilson, and O. Vinyals. Learning the speech front-end with raw waveform CLDNNs. In Proc. of Interspeech, 2015.
[27] Y. Hoshen, R. Weiss, and K. W. Wilson. Speech acoustic modeling from raw multichannel waveforms. In Proc. of ICASSP, 2015.
[28] T. N. Sainath, R. J. Weiss, K. W. Wilson, A. Narayanan, M. Bacchiani, and A. Senior. Speaker localization and microphone spacing invariant acoustic modeling from raw multichannel waveforms. In Proc. of ASRU, 2015.
[29] Z. Tüske, P. Golik, R. Schlüter, and H. Ney.
Acoustic modeling with deep neural networks using raw time signal for LVCSR. In Proc. of Interspeech, 2014.
[30] G. Trigeorgis, F. Ringeval, R. Brueckner, E. Marchi, M. A. Nicolaou, B. Schuller, and S. Zafeiriou. Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network. In Proc. of ICASSP, pages 5200–5204, 2016.
[31] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu. WaveNet: A generative model for raw audio. In ArXiv, 2016.
[32] S. Mehri, K. Kumar, I. Gulrajani, R. Kumar, S. Jain, J. Sotelo, A. C. Courville, and Y. Bengio. SampleRNN: An unconditional end-to-end neural audio generation model. CoRR, abs/1612.07837, 2016.
[33] P. Ghahremani, V. Manohar, D. Povey, and S. Khudanpur. Acoustic modelling from the signal domain using CNNs. In Proc. of Interspeech, 2016.
[34] H. Dinkel, N. Chen, Y. Qian, and K. Yu. End-to-end spoofing detection with raw waveform CLDNNs. In Proc. of ICASSP, 2017.
[35] H. Muckenhirn, M. Magimai-Doss, and S. Marcel. Towards directly modeling raw speech signal for speaker verification using CNNs. In Proc. of ICASSP, 2018.
[36] J.-W. Jung, H.-S. Heo, I.-H. Yang, H.-J. Shim, and H.-J. Yu. A complete end-to-end speaker verification system using deep neural networks: From raw signals to verification result. In Proc. of ICASSP, 2018.
[37] J.-W. Jung, H.-S. Heo, I.-H. Yang, H.-J. Shim, and H.-J. Yu. Avoiding speaker overfitting in end-to-end DNNs using raw waveform for text-independent speaker verification. In Proc. of Interspeech, 2018.
[38] M. Ravanelli and Y. Bengio. Speaker recognition from raw waveform with SincNet. In Proc. of SLT, 2018.
[39] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett, and N. L. Dahlgren. DARPA TIMIT Acoustic Phonetic Continuous Speech Corpus CDROM, 1993.
[40] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur.
Librispeech: An ASR corpus based on public domain audio books. In Proc. of ICASSP, pages 5206–5210, 2015.
[41] M. Ravanelli, L. Cristoforetti, R. Gretter, M. Pellin, A. Sosi, and M. Omologo. The DIRHA-ENGLISH corpus and related tasks for distant-speech recognition in domestic environments. In Proc. of ASRU 2015, pages 275–282.
[42] M. Ravanelli, P. Svaizer, and M. Omologo. Realistic multi-microphone data simulation for distant speech recognition. In Proc. of Interspeech, 2016.
[43] L. R. Rabiner and R. W. Schafer. Theory and Applications of Digital Speech Processing. Prentice Hall, NJ, 2011.
[44] S. K. Mitra. Digital Signal Processing. McGraw-Hill, 2005.
[45] M. Ravanelli, D. Serdyuk, and Y. Bengio. Twin regularization for online speech recognition. In Proc. of Interspeech, 2018.
[46] T. N. Sainath, B. Kingsbury, A. R. Mohamed, and B. Ramabhadran. Learning filter banks within a deep neural network framework. In Proc. of ASRU, pages 297–302, 2013.
[47] H. Yu, Z. H. Tan, Y. Zhang, Z. Ma, and J. Guo. DNN filter bank cepstral coefficients for spoofing detection. IEEE Access, 5:4779–4787, 2017.
[48] H. Seki, K. Yamamoto, and S. Nakagawa. A deep neural network integrated with filterbank learning for speech recognition. In Proc. of ICASSP, pages 5480–5484, 2017.
[49] N. Zeghidour, N. Usunier, I. Kokkinos, T. Schatz, G. Synnaeve, and E. Dupoux. Learning filterbanks from raw speech for phone recognition. In Proc. of ICASSP, pages 5509–5513, 2018.
[50] V. Papyan, Y. Romano, and M. Elad. Convolutional neural networks analyzed via convolutional sparse coding. Journal of Machine Learning Research, 18:83:1–83:52, 2017.
[51] S. Mallat. Understanding deep convolutional networks. CoRR, abs/1601.04920, 2016.
[52] D. Palaz, M. Magimai-Doss, and R. Collobert. End-to-end acoustic modeling using convolutional neural networks for automatic speech recognition. 2016.
[53] H. Muckenhirn, M. Magimai-Doss, and S. Marcel.
On learning vocal tract system related speaker discriminative information from raw signal using CNNs. In Proc. of Interspeech, 2018.
[54] M. Ravanelli. Deep Learning for Distant Speech Recognition. PhD Thesis, Unitn, 2017.
[55] M. Ravanelli, P. Brakel, M. Omologo, and Y. Bengio. A network of deep neural networks for distant speech recognition. In Proc. of ICASSP, pages 4880–4884, 2017.
[56] M. Ravanelli and M. Omologo. Contaminated speech training methods for robust DNN-HMM distant speech recognition. In Proc. of Interspeech 2015, pages 756–760.
[57] M. Ravanelli, T. Parcollet, and Y. Bengio. The PyTorch-Kaldi speech recognition toolkit. In arXiv:1811.07453, 2018.
[58] A. K. Sarkar, D. Matrouf, P. M. Bousquet, and J. F. Bonastre. Study of the effect of i-vector modeling on short and mismatch utterance duration for speaker verification. In Proc. of Interspeech, pages 2662–2665, 2012.
[59] R. Travadi, M. Van Segbroeck, and S. Narayanan. Modified-prior i-vector estimation for language identification of short duration utterances. In Proc. of Interspeech, pages 3037–3041, 2014.
[60] A. Kanagasundaram, R. Vogt, D. Dean, S. Sridharan, and M. Mason. i-vector based speaker recognition on short utterances. In Proc. of Interspeech, pages 2341–2344, 2011.
[61] M. Matassoni, R. Astudillo, A. Katsamanis, and M. Ravanelli. The DIRHA-GRID corpus: baseline and tools for multi-room distant speech recognition using distributed microphones. In Proc. of Interspeech 2014, pages 1616–1617.
[62] E. Zwyssig, M. Ravanelli, P. Svaizer, and M. Omologo. A multi-channel corpus for distant-speech interaction in presence of known interferences. In Proc. of ICASSP 2015, pages 4480–4484.
[63] L. Cristoforetti, M. Ravanelli, M. Omologo, A. Sosi, A. Abad, M. Hagmueller, and P. Maragos. The DIRHA simulated corpus. In Proc. of LREC 2014, pages 2629–2634.
[64] D. B. Paul and J. M. Baker. The design for the Wall Street Journal-based CSR corpus.
In Proceedings of the Workshop on Speech and Natural Language, Proc. of HLT, pages 357–362, 1992.
[65] M. Ravanelli, A. Sosi, P. Svaizer, and M. Omologo. Impulse response estimation for robust speech recognition in a reverberant environment. In Proc. of EUSIPCO 2012.
[66] M. Ravanelli and M. Omologo. On the selection of the impulse responses for distant-speech recognition based on contaminated speech training. In Proc. of Interspeech 2014, pages 1028–1032.
[67] J. Ba, R. Kiros, and G. E. Hinton. Layer normalization. CoRR, abs/1607.06450, 2016.
[68] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proc. of ICML, pages 448–456, 2015.
[69] M. Ravanelli, P. Brakel, M. Omologo, and Y. Bengio. Batch-normalized joint training for DNN-based distant speech recognition. In Proc. of SLT, 2016.
[70] A. L. Maas, A. Y. Hannun, and A. Y. Ng. Rectifier nonlinearities improve neural network acoustic models. In Proc. of ICML, 2013.
[71] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proc. of AISTATS, pages 249–256, 2010.
[72] D. Povey et al. The Kaldi speech recognition toolkit. In Proc. of ASRU, 2011.
[73] A. Larcher, K. A. Lee, and S. Meignier. An extensible speaker identification SIDEKIT in Python. In Proc. of ICASSP, pages 5095–5099, 2016.

Appendix

Corpora
To provide experimental evidence on datasets characterized by different numbers of speakers, this paper considers the TIMIT (462 spks, train chunk) [39] and Librispeech (2484 spks) [40] corpora. For speaker verification experiments, non-speech intervals at the beginning and end of each sentence were removed. Moreover, the Librispeech sentences with internal silences lasting more than 125 ms were split into multiple chunks.
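The splitting step above can be sketched with a simple energy-threshold silence detector. This is an illustrative sketch only: the paper does not specify its silence detector, so the function name `split_on_silence` and the amplitude threshold `thr` are assumptions.

```python
def split_on_silence(samples, fs=16000, thr=0.01, max_sil_ms=125):
    """Split a waveform into chunks at internal silences longer than
    max_sil_ms. Hypothetical energy-threshold sketch; thr is an assumed
    amplitude threshold for deciding that a sample is silence."""
    max_sil = int(fs * max_sil_ms / 1000)
    chunks, cur, sil = [], [], 0
    for s in samples:
        if abs(s) < thr:
            sil += 1           # extend the current run of silence
            cur.append(s)
        else:
            if sil > max_sil and cur:
                # the silence run was too long: close the chunk,
                # dropping the trailing silence samples
                chunks.append(cur[:-sil])
                cur = []
            sil = 0
            cur.append(s)
    if cur:  # flush the last chunk, trimming a long trailing silence
        chunks.append(cur[: len(cur) - sil] if sil > max_sil else cur)
    return [c for c in chunks if c]

# two 0.5 s tones separated by 200 ms of silence -> two chunks
sig = [0.5] * 8000 + [0.0] * 3200 + [0.5] * 8000
parts = split_on_silence(sig)
```

In practice a smoothed frame-energy detector would be more robust than a per-sample threshold, but the chunking logic is the same.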
To address text-independent speaker recognition, the calibration sentences of TIMIT (i.e., the utterances with the same text for all speakers) have been removed. For TIMIT, five sentences for each speaker were used for training, while the remaining three were used for test. For the Librispeech corpus, the training and test material was randomly selected so as to exploit 12-15 seconds of training material for each speaker and test sentences lasting 2-6 seconds.

To evaluate the performance in a challenging distant-talking scenario, speech recognition experiments have also considered the DIRHA dataset [41]. This corpus, similarly to the other DIRHA corpora [61, 62], was developed in the context of the DIRHA project [63] and is based on WSJ sentences [64] recorded in a domestic environment. Training is based on contaminating WSJ-5k utterances with realistic impulse responses [65, 66], while the test phase consists of 409 WSJ sentences recorded by native speakers in a domestic environment (the average SNR is 10 dB).

SincNet Setup
The waveform of each speech sentence was split into chunks of 200 ms (with 10 ms overlap), which were fed into the SincNet architecture. The first layer performs sinc-based convolutions as described in Sec. 2, using 80 filters of length L = 251 samples. The architecture then employs two standard convolutional layers, both using 60 filters of length 5. Layer normalization [67] was used for both the input samples and all convolutional layers (including the SincNet input layer). Next, three fully-connected layers composed of 2048 neurons and normalized with batch normalization [68, 69] were applied. All hidden layers use leaky-ReLU [70] non-linearities. The parameters of the sinc layer were initialized using mel-scale cutoff frequencies, while the rest of the network was initialized with the well-known "Glorot" initialization scheme [71].
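The sinc-based kernels of the first layer can be sketched as below: each filter is a band-pass FIR kernel fully determined by its two cutoff frequencies, built as the difference of two low-pass sinc functions and smoothed with a window, with the cutoffs initialized on the mel scale. This is a minimal stdlib-only sketch, not the paper's exact implementation; the function names, the Hamming window choice, and the `fmin` floor are assumptions following standard DSP conventions.

```python
import math

def sinc(x):
    """sin(x)/x with the removable singularity at x = 0 filled in."""
    return 1.0 if x == 0 else math.sin(x) / x

def sinc_bandpass(f1, f2, L=251, fs=16000):
    """Band-pass FIR kernel parametrized only by its cutoffs f1 < f2 (Hz),
    as in a SincNet-style first layer; here f1, f2 would be the learned
    parameters instead of all L tap values."""
    f1, f2 = f1 / fs, f2 / fs          # normalize cutoffs to [0, 0.5]
    mid = (L - 1) // 2
    h = []
    for n in range(L):
        t = n - mid
        # difference of two low-pass sinc filters = band-pass response
        val = 2 * f2 * sinc(2 * math.pi * f2 * t) \
            - 2 * f1 * sinc(2 * math.pi * f1 * t)
        # Hamming window to attenuate truncation ripples
        w = 0.54 - 0.46 * math.cos(2 * math.pi * n / (L - 1))
        h.append(val * w)
    return h

def mel_cutoffs(n_filt=80, fs=16000, fmin=30.0):
    """Mel-spaced (f1, f2) initialization for n_filt band-pass filters."""
    mel = lambda f: 2595.0 * math.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    step = (mel(fs / 2) - mel(fmin)) / (n_filt + 1)
    pts = [imel(mel(fmin) + i * step) for i in range(n_filt + 2)]
    return [(pts[i], pts[i + 2]) for i in range(n_filt)]

h = sinc_bandpass(300.0, 3000.0)   # one filter: 251 taps from 2 parameters
bank = mel_cutoffs()               # 80 (low, high) cutoff pairs
```

The compactness claim of the paper is visible here: 80 filters require only 160 learned parameters (one cutoff pair each) instead of 80 x 251 free tap values.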
Frame-level speaker and phoneme classifications were obtained by applying a softmax classifier, providing a set of posterior probabilities over the targets. For speaker-id, a sentence-level classification was derived by averaging the frame predictions and voting for the speaker that maximizes the average posterior. Training used the RMSprop optimizer, with a learning rate lr = 0.001, α = 0.95, ε = 10⁻⁷, and minibatches of size 128. All the hyper-parameters of the architecture were tuned on TIMIT, then inherited for Librispeech as well.

The speaker verification system was derived from the speaker-id neural network using the d-vector approach [19, 24], which relies on the output of the last hidden layer and computes the cosine distance between the test and the claimed speaker d-vectors. Ten utterances from impostors were randomly selected for each sentence coming from a genuine speaker. Note that, to assess our approach on a standard open-set speaker-id task, all the impostors were taken from a speaker pool different from that used for training the speaker-id DNN.

Baseline Setups
We compared SincNet with several alternative systems. First, we considered a standard CNN fed by the raw waveform. This network is based on the same architecture as SincNet, but replaces the sinc-based convolution with a standard one. A comparison with popular hand-crafted features was also performed. To this end, we computed 39 MFCCs (13 static + Δ + ΔΔ) and 40 FBANKs using the Kaldi toolkit [72]. These features, computed every 25 ms with 10 ms overlap, were gathered to form a context window of approximately 200 ms (i.e., a context similar to that of the considered waveform-based neural network). A CNN was used for FBANK features, while a Multi-Layer Perceptron (MLP) was used for MFCCs, since CNNs exploit local correlation across features and cannot be effectively used with uncorrelated MFCC features.
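The d-vector scoring step described above reduces to two operations: averaging the last-hidden-layer outputs over frames, and comparing the result to the claimed speaker's d-vector by cosine similarity. A minimal sketch, assuming frame embeddings are available as plain lists of floats (the helper names are hypothetical, not from the paper's code):

```python
import math

def dvector(frame_embeddings):
    """Sentence-level d-vector: average of the last-hidden-layer
    outputs across all frames of the utterance."""
    n = len(frame_embeddings)
    dim = len(frame_embeddings[0])
    return [sum(f[i] for f in frame_embeddings) / n for i in range(dim)]

def cosine_similarity(a, b):
    """Cosine similarity between two d-vectors; verification accepts
    when this score exceeds a tuned threshold."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# toy example: two frame embeddings averaged, then scored against
# a claimed-speaker d-vector pointing in the same direction
test_frames = [[1.0, 0.0], [0.0, 1.0]]
claimed = [1.0, 1.0]
score = cosine_similarity(dvector(test_frames), claimed)
```

Averaging before scoring makes the d-vector length-normalizable and independent of utterance duration, which is why the same pipeline works for the 2-6 second Librispeech test sentences.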
Layer normalization was used for the FBANK network, while batch normalization was employed for the MFCC one. The hyper-parameters of these networks were also tuned using the aforementioned approach.

For speaker verification experiments, we also considered an i-vector baseline. The i-vector system was implemented with the SIDEKIT toolkit [73]. The GMM-UBM model, the Total Variability (TV) matrix, and the Probabilistic Linear Discriminant Analysis (PLDA) were trained on the Librispeech data (avoiding test and enrollment sentences). The GMM-UBM was composed of 2048 Gaussians, and the rank of the TV and PLDA eigenvoice matrices was 400. The enrollment and test phases were conducted on Librispeech using the same set of speech segments used for the DNN experiments.