Deep Neural Baselines for Computational Paralinguistics



Daniel Elsner 1,2, Stefan Langer 1, Fabian Ritz 1, Robert Mueller 1, Steffen Illium 1
1 LMU Munich, Germany
2 Tawny GmbH, Germany
{extern.daniel.elsner, stefan.langer, fabian.ritz, robert.mueller, steffen.illium}@ifi.lmu.de

Abstract

Detecting sleepiness from spoken language is an ambitious task, which is addressed by the Interspeech 2019 Computational Paralinguistics Challenge (ComParE). We propose an end-to-end deep learning approach to detect and classify patterns reflecting sleepiness in the human voice. Our approach is based solely on a moderately complex deep neural network architecture. It may be applied directly to the audio data without requiring any specific feature engineering and thus remains transferable to other audio classification tasks. Nevertheless, our approach performs similarly to state-of-the-art machine learning models.

Index Terms: affective computing, speech recognition, deep learning, computational paralinguistics, ComParE

1. Introduction

Computational paralinguistics describes a research field that has been established over the past twenty years through remarkable research progress, e.g. in the area of automated affect recognition or of determining illnesses from speech [1, 2, 3]. More generally, emphasis is put on the analysis of the human voice to design affective computers that possess empathic competencies such as recognizing, expressing, modeling, or communicating emotions [4]. In most paralinguistic problems, task-specific feature engineering and model tuning allow building state-of-the-art statistical learning models, e.g. machine learning (ML) classifiers [5]. Powerful software packages such as the open-source toolkits OpenSMILE [6], OpenXBOW [7] and AuDeep [8] were created to extract these relevant features from raw audio data.
However, more general approaches ease the transfer of models to related problem domains without prior expert knowledge about audio signal processing, while still providing adequate predictive performance. Recent work in deep learning focuses on end-to-end modeling of affect recognition problems, i.e. training models on raw signal data without extensive data pre-processing [9]. Even though deep learning has revolutionized artificial intelligence (AI) research areas, e.g. computer vision and machine translation, the full potential of Deep Neural Networks (DNNs) can usually only be utilized with datasets of sufficient quality and quantity [10]. The availability of public databases for affect recognition problems has especially improved through annual competitions and challenges.

This paper proposes an end-to-end deep learning approach for the Interspeech 2019 Computational Paralinguistics Challenge (ComParE) [11]. While avoiding task-specific feature engineering and providing an agnostic approach for modeling paralinguistic problems, it addresses the aforementioned growing demand for general approaches in computational paralinguistics and affective computing. Our main contribution is a moderately complex DNN architecture that is able to detect and classify sleepiness in the human voice from the SLEEP corpus. We show that its performance is comparable to the provided baseline, which fuses three support vector machines (SVMs) trained on more than 6,000 audio features extracted with the toolkits OpenSMILE, OpenXBOW and AuDeep. Also, we perform a sanity check on its transferability.

2. Deep Learning in Computational Paralinguistics

Throughout recent years there has been a shift in affective computing from classical ML approaches towards deep learning [10]. Among other areas, this has also affected computational paralinguistics.
The inherent absence of task-specific, manual feature extraction and selection in deep learning allows researchers to design complex, non-linear models. Those either reveal useful (latent) feature embeddings (e.g., unsupervised learning with deep autoencoders [8]) or learn feature representations directly from unstructured data for predictive modeling (e.g., classification of images with CNNs [12]). Generally, DNNs allow end-to-end problem modeling, where raw data (e.g. audio signal data) is fed to the model and a prediction, e.g. a class label or a continuous value, is returned. To summarize, three major causes have enabled this shift: (i) increased computing capabilities, (ii) improved learning capacity and precision of deep learning models through wider and deeper neural networks, and (iii) an ever growing amount of free-to-use, labeled datasets publicly available [10].

DNNs can be considered the state-of-the-art technology for human affect recognition software [13]. To tackle the problem of diverse data representations, DNNs can be altered accordingly in order to enhance their performance. Within the field of computational paralinguistics, CNNs as well as Recurrent Neural Networks (RNNs) are among the best known network architectures. This work emphasizes the former, which were first established and are applied extensively [12, 14, 15] in the field of image processing. Moreover, CNNs are natively able to learn feature representations at different levels of abstraction, ranging from low-level features, such as edges, to high-level features, such as eyes or noses [10, 13]. Those deep spatial features outperform hand-crafted spatial features in most cases [10, 13]. Another major advantage of CNNs is the low number of parameters compared to other network types like RNNs, and hence faster computing times during model training.
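As a toy illustration of this end-to-end idea (this example is ours, not the paper's model), a 1D convolution slides a small kernel directly over the raw waveform and produces a feature map without any hand-crafted features:

```python
import numpy as np

# Illustrative only: a single edge-like 1D filter applied to a raw
# waveform, as the first layer of a 1D CNN would do (during training,
# learned filters replace this hand-picked kernel).
waveform = np.sin(np.linspace(0, 8 * np.pi, 64))  # stand-in for raw audio
kernel = np.array([1.0, 0.0, -1.0])               # 3-tap filter

# 'valid' mode: one output per full overlap, i.e. 64 - 3 + 1 = 62 values
feature_map = np.convolve(waveform, kernel, mode="valid")
print(feature_map.shape)  # (62,)
```

With only a few such filters of kernel size three, the layer has a handful of weights, which illustrates the parameter-count advantage over RNNs mentioned above.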
CNNs can be applied in multiple dimensionalities: 3D, mostly used for video analysis; 2D for the aforementioned image processing tasks; and 1D for one-dimensional raw data inputs. Therefore, CNNs have been an attractive option for researchers in computational paralinguistics, being applied as early as 2011 [16].

3. Related Work

Greeley et al. investigate the effect of sleepiness on war fighters and civilian pilots and how machines could non-intrusively detect fatigue in very noisy environments [17]. The authors report that changes in single discrete voice parameters are not sufficient to detect whether the speaker suffers from fatigue. Therefore, they propose a more holistic approach, combining the coefficients of the cepstral transformation with an automatic speech recognition system. This correlation-based voice metric achieved on-par results with state-of-the-art fatigue measurements.

Krajewski et al. follow a somewhat classical ML procedure towards measuring fatigue in audio recordings [18]. Their results are based on speech characteristics such as prosody, articulation, and speech quality, and reach a classification accuracy of 86.1% on data collected in a sleep deprivation study. Their top result was achieved utilizing an SVM.

Within the ComParE challenge 2010, Marie-José Caraty and Claude Montacié address the problem of detecting vocal fatigue, motivated by the observation that it strongly affects work in some professions [19]. The authors build their investigations upon three experiments: prosodic analysis, a two-class SVM classifier, and a combination of multiple phoneme-based comparison functions. The two-class SVM model reached an unweighted accuracy of 68.2%. The authors state that this suggests the feasibility of vocal fatigue detection.

Recently, Cummins et al.
revisited the approaches and results of past years' paralinguistics challenges and outline a noticeable shift from classical ML techniques towards deep learning models [13]. Accordingly, the majority of submissions made use of some form of deep learning. Participants utilized DNNs for feature representation learning, for classification, or for a combination of both, which underlines the growing interest in DNNs. In 2011, a ComParE sub-challenge asked participants to distinguish between sleepy and non-sleepy speakers. The best solution at the time can be considered a classical ML solution and had not been challenged by an end-to-end deep learning approach as of 2017. This suggests a gap in research approaches concerning sleepiness detection, and that investigating more general, non-task-specific approaches is reasonable.

4. Paralinguistic Problem Modeling

The datasets provided for the 2019 ComParE challenges are, in their original form, not well suited for deep learning due to the limited number of samples. Thus, we experimented with pre-processing as well as augmentation techniques and tested different CNN architectures. These steps are outlined in the following subsections and evaluated in section 5.

4.1. SLEEP Corpus

The SLEEP corpus consists of 16,462 audio samples, each with a duration of about four seconds, which are split into training, development and test sets. Its audio files have a sampling rate of 16 kHz with a quantization of 16 bit. The audio samples were gathered from 915 subjects performing different speaking tasks, such as reading out given text passages. Recordings were carried out between 6 p.m. and midnight. Afterwards, participants and post-hoc observers had to report sleepiness on the Karolinska Sleepiness Scale (KSS), ranging from 1 (extremely alert) to 9 (very sleepy). These ratings were averaged to build the final label per recording.
The dataset was created at the Institute of Psychophysiology, Duesseldorf, Germany, and the Institute of Safety Technology, University of Wuppertal, Germany. More detailed statistics about the dataset and the label distribution are provided in Table 1 and Figure 1, respectively. The task of the challenge is to build a regression model that is able to predict the KSS rating for an audio recording [11].

Figure 1: KSS rating distribution of the training and development set.

Table 1: Number of samples for the training, development and test sets as well as the respective mean (µ), standard deviation (σ), minimum and maximum duration of the samples (in seconds).

Dataset       Samples   µ      σ      min.   max.
Training      5,564     3.87   0.64   1.56   5.00
Development   5,328     3.87   0.65   1.57   5.00
Test          5,570     3.86   0.63   1.59   5.00

4.2. Data Pre-processing and Augmentation

In classical ML problem modeling, feature engineering and feature selection are integral parts of the data preparation process. Hereby, the available data is reduced to relevant pieces of information, e.g. by excluding irrelevant data or modeling specific features. Our approach omits the extraction of problem-specific features and instead relies on the DNN models being capable of learning relevant feature representations themselves. Nevertheless, it is necessary to synthetically increase the training data volume to model the regression task as a deep learning problem. The following subsections delineate the pre-processing steps and augmentation techniques.

4.2.1. Audio Pre-processing

We experimented with different down-sampling rates for the provided audio files. This is inspired by narrowband telephony, which only transmits frequencies up to 4 kHz but still retains the majority of the information. With respect to the Nyquist-Shannon sampling theorem, we chose a down-sampling rate of at least 8 kHz. As a side effect, the down-sampling reduces the amount of data and thus increases processing speed.

4.2.2.
Sliding Windows

As mentioned previously, the small volume of training data (5,564 samples) needs to be increased by orders of magnitude to suit a deep learning setup. Therefore, we slice windows from a single audio recording in a sliding-window manner. We experimented with varying window sizes as well as different strides, i.e. step sizes. Consequently, we do not obtain one data point per audio sample, but (L − w)/s overlapping windows for a sample of length L with window size w and stride s. The best results were achieved with w = 1.5 s and s = 100 ms, which, given an exemplary sample of length L = 4 s, results in (4 s − 1.5 s)/100 ms = 25 windows with the same label. The total number of extracted sample windows is 134,395 for the training dataset and 128,808 for the development dataset.

4.2.3. Data Up- and Down-sampling

As described in section 4.1, the distribution of samples is imbalanced at the extrema of the KSS. To prevent the models from overfitting on the large corpus of samples labelled between 3 and 8, we performed up- and down-sampling of under- and overrepresented samples, respectively. Consequently, the extracted sample windows labelled 1 or 9 (extremely alert and very sleepy) were included multiple times in the dataset, i.e. up-sampled, whereas sample windows labelled between 3 and 8 were only partly included, i.e. down-sampled.

4.2.4. Data Augmentation

To further increase the volume of available training data, we tested the following augmentation techniques:

Reversing samples: We flipped each sample window and included both the reversed and the original sample in the training set. The hypothesis was that the relevant patterns might be independent of the exact sequential structure of the sample.

Background overlay: To make the models more robust, we overlayed the training samples with background noise manually extracted from parts of the training recordings where no voice was present.
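The window slicing described in section 4.2.2 can be sketched as follows (a minimal numpy sketch; the function name and defaults are ours):

```python
import numpy as np

def slice_windows(signal, sr, win_s=1.5, stride_s=0.1):
    """Slice (L - w) / s overlapping windows of size w and stride s
    from a 1-D audio signal, following the paper's scheme."""
    w = int(win_s * sr)     # window length in samples
    s = int(stride_s * sr)  # stride in samples
    n = (len(signal) - w) // s
    return np.stack([signal[i * s : i * s + w] for i in range(n)])

# A 4 s recording at 8 kHz with w = 1.5 s and s = 100 ms
# yields (4 s - 1.5 s) / 100 ms = 25 windows, as in the paper.
audio = np.zeros(4 * 8000)
windows = slice_windows(audio, sr=8000)
print(windows.shape)  # (25, 12000)
```

Each of the 25 windows inherits the KSS label of the original recording.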
Again, the original as well as the modified samples were included for training. This is a common approach and is not limited to the available training data, since background noise can be recorded separately for different background settings.

Noisy labelling: With the previously mentioned sliding-window approach, all windows extracted from a single sample were labelled with the same KSS rating. As the actual self-reported degree of sleepiness may not persist throughout the whole sample, we applied noise randomly drawn from a normal distribution to the KSS ratings. The hypothesis was that this bias might lead to more robust models, as the KSS ratings could have been ambiguous at the time of labelling (i.e. participants' self-reports compared to the post-hoc observers'). We included both the samples with the actual labels and those with noisy labels in the training dataset.

4.3. Model Architecture

The following describes our DNN architecture and the mechanism necessary to aggregate the predicted labels of the sliding windows and map them back onto the original audio samples.

4.3.1. Prediction Aggregation Mechanism

In section 4.2.2 we outlined the sliding-window approach to increase the volume of the dataset. As this approach leaves us with n samples per audio file (depending on the file's length L, window size w, and stride s), a mechanism is needed for merging the predicted labels (i.e. KSS ratings) during inference to finally generate one label for the original audio sample.

Figure 2: Proposed DNN architecture: Input (12,000 x 1) → Conv-Block (I) → Conv-Block (II) → Dense (24,000 x 1) → Dense (32) → Dropout (0.5) → BatchNorm → Output (1), where each Conv-Block comprises Conv 1D (4, 3), BatchNorm, MaxPool (2, 2) and Dropout (0.1).

For example, if an audio file a leads to 25 windows, which are fed to the model during inference (i.e. prediction), the 25 predicted labels, one for each window, have to be merged and mapped back onto the original audio sample a.
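A minimal numpy sketch of such a merging step, consistent with the paper's aggregation of per-window predictions into one KSS label per recording (names are ours):

```python
import numpy as np

def merge_window_predictions(preds, method="mean"):
    """Merge per-window KSS predictions into a single label:
    mean or median over the windows, clipped to the KSS range 1..9
    and cast to an integer."""
    agg = np.mean(preds) if method == "mean" else np.median(preds)
    return int(np.clip(agg, 1, 9))

# 25 per-window predictions for one audio file:
window_preds = [6.2, 5.8, 6.9, 7.1, 6.0] * 5
label_mean = merge_window_predictions(window_preds)             # 6
label_median = merge_window_predictions(window_preds, "median")  # 6
```

Clipping guards against out-of-range regressor outputs before the cast to an integer KSS rating.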
We aggregated the predicted labels in two ways: by taking the mean and by taking the median of the predictions. Ultimately, we clipped the resulting prediction for the original audio sample to the KSS range from 1 to 9 and cast it to an integer.

4.3.2. Convolutional Neural Network

Figure 2 depicts the proposed DNN architecture, which is capable of learning spatial, and thus short sequential, feature representations from raw audio data. The DNN consists of two 1D convolutional layers, each with four filters and a kernel size of three, that are connected through batch normalization and max pooling layers, each halving the size of the vector. After these convolutional blocks (Conv-Blocks), one fully connected layer with 32 neurons leads to a final dense layer with a single linear activation unit. The regression model was trained using the mean squared error (MSE) loss and the Adam optimization algorithm with a learning rate of 0.001 without decay. Except for the final dense layer, ReLU activations were used. To prevent overfitting during training, dropout was applied after the convolutional layers (drop rate 0.1) and the dense layer (drop rate 0.5). Evaluated on the development dataset, the described model performed best. However, we discuss further (hyper-)parameters in section 5.1.

5. Results

The baseline results are reported with Spearman's rank correlation coefficient ρ [11]. Since the actual labels of the test set are not available ex-ante, we trained our models on the training set and tested on the development set. We then submitted the best model to the challenge and report the score on the test set. For the sake of interpretability, we include our loss metric MSE as well as the mean absolute error (MAE).

Figure 3: Overview of Spearman's rank correlation coefficient (ρ) of the applied pre-processing and augmentation techniques compared to the best model without further processing (None) on the development set.
Models denoted with * were submitted to the challenge.

5.1. Development Dataset

Table 2: Evaluation scores for MSE, MAE, and ρ on the development set with different parameter combinations regarding (i) sample rate, (ii) window size, (iii) number of convolutional blocks. The proposed approach is varied in one of the three parameters at a time while the others remain fixed. Note that the stride was fixed at 0.1 s. Models denoted with * were submitted to the challenge.

Model                                               MSE    MAE    ρ
Strongest Baseline                                  –      –      0.27
Proposed Approach* (16 kHz, 1.5 s, 2 Conv-Blocks)   4.44   1.72   0.29
Smaller Window Size (1 s)                           3.96   1.67   0.28
Smaller Sampling Rate (8 kHz)*                      3.89   1.65   0.26
More Conv-Blocks (3)                                4.12   1.70   0.24

To allow a proper comparison of models with different (hyper-)parameters, we set the batch size to 64 and the number of epochs to 8 for training. This allowed testing multiple parameter combinations without large computational and time overhead. Table 2 shows the evaluation scores for MSE, MAE, and ρ on the development set for different parameters and models. The results imply that our models perform slightly better (ρ ≈ 0.29 ± 0.03) than the strongest baseline models (ρ ≈ 0.26) on the development set. Generally, we found that during training the loss decreased monotonically on the training and the development set. However, most models tended to overfit on the training set after 6 epochs, and we therefore saved the best models during the entire training process based on their loss on the development set. The different pre-processing and augmentation techniques described in section 4.2.4 did not produce better results regarding ρ (see Figure 3). However, a combination of techniques is beyond the scope of this work, as we aim to provide a generic approach rather than a tailored solution for this specific problem.

5.2.
Test Dataset

After experimenting with different model (hyper-)parameters, we selected the best performing model, trained it again on the entire development set, and submitted the resulting model to the ComParE challenge. As we only received ρ, MSE and MAE cannot be reported. Note that the distribution of predicted labels is very similar to the actual distribution of labels in the development and training sets (underrepresentation of 1, 2, 8, and 9). The best provided baseline, an ensemble of three SVMs, achieved ρ = 0.343. To test different parameter configurations and augmentation techniques, three of the described models were submitted to the challenge and performed as follows:

1. The proposed approach with a smaller sampling rate (8 kHz) achieved ρ = 0.28 on the test set.
2. The proposed approach with noisy labels (see Figure 3) achieved ρ = 0.302 on the test set.
3. The proposed approach with the default sampling rate (16 kHz) and without augmentation achieved ρ = 0.335 on the test set.

The two remaining submissions out of five possible submissions were reserved for experimentation with other approaches within the challenge.

5.3. Transferability

In order to evaluate the transferability of our DNN architecture, a similar model was trained to classify Styrian dialects within the respective ComParE sub-challenge. However, instead of solving a regression problem, the final single linear activation unit was replaced with a three-unit softmax-activated dense layer. On the development set, our best model achieved an Unweighted Average Recall (UAR) of 44.00%. The same model scored a UAR of 38.28% on the test set in our submission to the challenge. Compared to the three strongest baseline models with an average of 40.00% on the test set, we conclude that transferring our DNN architecture to other audio classification tasks is feasible.

6.
Conclusion

This paper presented an end-to-end deep learning approach to detect and classify sleepiness in the human voice. The proposed 1D convolutional DNN is capable of learning spatio-temporal feature representations from raw audio data. Its performance is comparable with models trained on features from current audio feature extraction toolkits. Moreover, we performed a sanity check on the transferability of our approach. The results indicate that our DNN architecture may be used as a problem-agnostic and straightforward baseline in addition to classical ML approaches. The end-to-end method emphasizes generalizability and transferability to other domains, e.g. in computational paralinguistics, in contrast to problem-specific feature engineering. Our proposed architecture is especially suitable for context-aware multimedia recommendation systems. In a possible use case, the system could recommend e.g. radio stations or songs depending on the fatigue level detected in the user's voice. Future work could deepen the extent of architecture search and parameter tuning, as other automatic ML approaches suggest, to ultimately further democratize AI research [20].

Acknowledgements. The HRADIO project and thus this work was funded by H2020, the EU Framework Programme for Research and Innovation.

7. References

[1] B. Schuller and A. Batliner, Computational Paralinguistics: Emotion, Affect and Personality in Speech and Language Processing. John Wiley & Sons, 2013.
[2] R. Grishman, Computational Linguistics: An Introduction. Cambridge University Press, 1986.
[3] G. Kiss, M. G. Tulics, D. Sztahó, A. Esposito, and K. Vicsi, "Language independent detection possibilities of depression by speech," in Recent Advances in Nonlinear Speech Processing. Springer, 2016, pp. 103–114.
[4] R. W. Picard, "Affective computing: challenges," International Journal of Human-Computer Studies, vol. 59, no. 1-2, pp. 55–64, 2003.
[5] B. Schuller, S. Steidl, A. Batliner, F.
Burkhardt, L. Devillers, C. Müller, and S. Narayanan, "Paralinguistics in speech and language—state-of-the-art and the challenge," Computer Speech & Language, vol. 27, no. 1, pp. 4–39, 2013.
[6] F. Eyben, M. Wöllmer, and B. Schuller, "openSMILE: the Munich versatile and fast open-source audio feature extractor," in Proceedings of the 18th ACM International Conference on Multimedia. ACM, 2010, pp. 1459–1462.
[7] M. Schmitt and B. Schuller, "openXBOW: introducing the Passau open-source crossmodal bag-of-words toolkit," The Journal of Machine Learning Research, vol. 18, no. 1, pp. 3370–3374, 2017.
[8] M. Freitag, S. Amiriparian, S. Pugachevskiy, N. Cummins, and B. Schuller, "auDeep: Unsupervised learning of representations from audio with deep recurrent neural networks," The Journal of Machine Learning Research, vol. 18, no. 1, pp. 6340–6344, 2017.
[9] A. Graves and N. Jaitly, "Towards end-to-end speech recognition with recurrent neural networks," in International Conference on Machine Learning, 2014, pp. 1764–1772.
[10] P. V. Rouast, M. Adam, and R. Chiong, "Deep learning for human affect recognition: Insights and new developments," IEEE Transactions on Affective Computing, 2019.
[11] B. W. Schuller, A. Batliner, C. Bergler, F. B. Pokorny, J. Krajewski, M. Cychosz, R. Vollmann, S.-D. Roelen, S. Schnieder, E. Bergelson et al., "The INTERSPEECH 2019 computational paralinguistics challenge: Styrian dialects, continuous sleepiness, baby sounds & orca activity," Proceedings INTERSPEECH 2019, 2019.
[12] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[13] N. Cummins, A. Baird, and B. Schuller, "Speech analysis for health: Current state-of-the-art and the increasing impact of deep learning," Methods, 2018.
[14] M. Liang and X.
Hu, "Recurrent convolutional neural network for object recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3367–3375.
[15] S. Lawrence, C. L. Giles, A. C. Tsoi, and A. D. Back, "Face recognition: A convolutional neural-network approach," IEEE Transactions on Neural Networks, vol. 8, no. 1, pp. 98–113, 1997.
[16] N. Jaitly and G. Hinton, "Learning a better representation of speech soundwaves using restricted Boltzmann machines," in 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2011, pp. 5884–5887.
[17] H. P. Greeley, E. Friets, J. P. Wilson, S. Raghavan, J. Picone, and J. Berg, "Detecting fatigue from voice using speech recognition," in 2006 IEEE International Symposium on Signal Processing and Information Technology. IEEE, 2006, pp. 567–571.
[18] J. Krajewski, A. Batliner, and M. Golz, "Acoustic sleepiness detection: Framework and validation of a speech-adapted pattern recognition approach," Behavior Research Methods, vol. 41, no. 3, pp. 795–804, 2009.
[19] M.-J. Caraty and C. Montacié, "Vocal fatigue induced by prolonged oral reading: Analysis and detection," Computer Speech & Language, vol. 28, no. 2, pp. 453–466, 2014.
[20] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le, "Learning transferable architectures for scalable image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8697–8710.
