DEPA: Self-Supervised Audio Embedding for Depression Detection

Pingyue Zhang (williamzhangsjtu@sjtu.edu.cn), Mengyue Wu† (mengyuewu@sjtu.edu.cn), Heinrich Dinkel∗ (heinrich.dinkel@gmail.com), Kai Yu† (kai.yu@sjtu.edu.cn)
MoE Key Lab of Artificial Intelligence, X-LANCE Lab, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China

ABSTRACT
Depression detection research has increased over the last few decades, with limited data availability and representation learning remaining major bottlenecks. Recently, self-supervised learning has seen success in pretraining text embeddings and has been applied broadly to related tasks with sparse data, while pretrained audio embeddings based on self-supervised learning are rarely investigated. This paper proposes DEPA, a self-supervised, pretrained depression audio embedding method for depression detection. An encoder-decoder network is used to extract DEPA on in-domain depression datasets (DAIC and MDD) and out-domain datasets (Switchboard, Alzheimer's). With DEPA as the audio embedding extracted at response level, a significant performance gain is achieved on downstream tasks, evaluated on both sparse datasets such as DAIC and a large major depressive disorder dataset (MDD). This paper not only presents a novel embedding extraction method capturing response-level representations for depression detection but, more significantly, is an exploration of self-supervised learning for a specific task within audio processing.
CCS CONCEPTS
• Computing methodologies → Neural networks; Supervised learning by classification; Supervised learning by regression; Multi-task learning.

∗ This author is currently affiliated with Xiaomi Tech. Ltd., Beijing.
† Mengyue Wu and Kai Yu are the corresponding authors.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
MM '21, October 20–24, 2021, Virtual Event, China
© 2021 Association for Computing Machinery.
ACM ISBN 978-1-4503-8651-7/21/10...$15.00
https://doi.org/10.1145/3474085.3479236

KEYWORDS
Deep neural networks; automatic depression detection; self-supervised learning; feature embedding

ACM Reference Format:
Pingyue Zhang, Mengyue Wu, Heinrich Dinkel, and Kai Yu. 2021. DEPA: Self-Supervised Audio Embedding for Depression Detection. In Proceedings of the 29th ACM International Conference on Multimedia (MM '21), October 20–24, 2021, Virtual Event, China. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3474085.3479236

1 INTRODUCTION
Depression, a disease of considerable attention, affects more than 300 million people worldwide. An increasing amount of research has been conducted on automatic depression detection and severity prediction, in particular from conversational speech, which embeds crucial information about one's mental state. Despite recent advances in deep learning, automatic depression detection from speech remains a challenging task.
Since depression is a complicated mental disorder consisting of various symptoms, traditional feature extraction methods designed for emotion recognition might lack precision in assessing each individual's mental state. Previous exploration has covered commonly-used emotion-related features such as COVAREP [5], general-purpose audio features including log-Mel spectrograms (LMS) and the combination of Short-Time Fourier Transform (STFT) and Mel-Frequency Cepstral Coefficients (MFCC) [20], and speaker-related audio embeddings like i-vectors [4]. However, since these features are not tailored for assessing mental disorders, they can be less effective for a task with such high specificity.
Another characteristic of depression is that usually only one single label (the diagnosis result) is provided for a multi-turn interview. Specifically, during a session with a doctor, it would be impossible to give a specific label y_t ∈ {0, 1}, representing the mental state (depressed or healthy), at each time step t. Here t can be chosen on any arbitrary level, such as phone, word, or sentence level. These long sequences subsequently influence depression detection performance. Previous work has hinted that extracting embeddings on segment level (e.g., sentence, response) might benefit performance [20], while modeling depression via a stationary, time-step-independent representation is likely to fail [1]. Hence, a successful audio embedding for depression detection needs to be extracted on sequence level (e.g., a spoken sentence/utterance), to capture rich, long-term spoken context as well as the emotional development within an interview.
The last important problem is that models so far are heavily restricted by the limited amount of depression data. Hence, even with the recent advances of deep learning, this data sparsity has hindered model performance enhancement and reproduction.
One potential solution to the aforementioned data sparsity problem is to pretrain a model on large data and then transfer the model's knowledge to a downstream task. However, pretraining on supervised tasks (e.g., speech recognition) is time-intensive and costly due to manual labeling. Self-supervised training, which utilizes the inherent properties of the data, can potentially remove the dependency on manual labels and thus scale easily with data.

Figure 1: The DEPA pretraining framework. The training objective is the estimation of the middle spectrogram within a sequence of 2k+1 spectrograms.

Contribution. This paper proposes DEPA, a self-supervised, pretrained depression audio embedding method for automatic depression detection (see Figure 1). To our knowledge, this is the first time self-supervised neural network pretraining has been performed for a depression detection task.
• We achieved the highest classification performance and lowest regression errors on the benchmark depression detection dataset by modeling each patient's response via the sequence of his/her uttered speech, and realize the extraction of response-level representations with DEPA.
• To highlight the necessity of using sentence-level representations for tasks like depression detection, we compared with previously-used audio features, including general-purpose features, emotion-related representations, and x-vector speaker embeddings. Results suggest a significant performance gain with the use of DEPA and the efficiency of sequence-level representations. We also design several experiments to further illustrate the performance enhancement from using sentence-level representations.
• DEPA pretrained on depression data (in-domain) and out-domain datasets are compared, including interview conversation datasets for other mental disorders and general-purpose speech datasets.
Results indicate that self-supervised pretraining on large datasets, especially those that share similar mental disorders with depression, is beneficial in the current data sparsity scenario and largely outperforms raw features without pretraining.
• We conduct a series of ablation studies to analyze possible factors that may influence the pretraining process, including the configuration of feature extraction, the hyperparameters of the pretraining process, and the pretraining strategy.

2 RELATED WORK
In this section, related work on depression detection and self-supervised learning is discussed.

2.1 Depression detection
Various methods have been proposed for automatic depression detection. Representation learning and classifier selection are the two major research areas within depression detection. Deep learning methods have been employed to extract high-level feature representations [1]. In particular, [11] utilized causal convolutional neural networks (C-CNN) to enable sequence-level feature extraction and achieved high performance by combining visual, audio, and textual modalities. Results indicate that sequence-level representations outperform frame-level ones with respect to depression detection. Notable work on pretraining audio features for depression detection includes [22], which trained an audio word-book in an unsupervised fashion using Gaussian mixture models to extract segment-level features and used a BLSTM model with max-temporal pooling as a depression classifier. [20] investigated knowledge transfer from emotion recognition to depression detection by first pretraining a recurrent neural network on a fully-labeled emotion recognition dataset. Their results suggest that emotion is a possible marker for automatic depression detection and that transfer learning enhances performance.
2.2 Self-supervised learning
Self-supervised learning is a technique where training data is autonomously labeled, yet the training procedure is supervised. A classic example of self-supervised learning is the auto-encoder [8], which aims to reconstruct a given input from a hidden representation. Learning representations with self-supervised training has led to remarkable improvements in several fields, including textual, visual, audio, and multimodal processing. In natural language processing (NLP), self-supervised text embedding pretraining can be seen as a major breakthrough, with methods such as GloVe [16], BERT [7], and ELMo [17]. Self-supervised pretraining from audio-visual signals, such as SoundNet [2], has been found to outperform traditional spectrogram-based features in acoustic environment classification. In fact, much research has focused on self-supervised audio-visual segmentation and feature extraction [15, 21, 28]. Recently, much research in the computer vision field has adopted contrastive learning as the self-supervised learning method: SimCLR [3], MoCo [12], and CoCLR [10] all use contrastive learning with some variations to obtain visual representations. In particular, pretrained approaches such as EmoAudioNet [19] have been applied to depression detection; however, its pretraining process requires a large, 1000 h (LibriSpeech), gender-labeled training dataset to be successful, which requires extensive manual labor. Our main inspiration for this work stems from Audio2Vec [25], a self-supervised approach whose objective is to extract general-purpose audio representations for mobile devices.
Comparatively little research has been conducted on self-supervised audio representation learning in depression detection or similar medical applications. The reasons could include: 1) Content-rich audio contains undesirable information, such as environmental sounds, interfering speech, and noise.
2) Features are typically low-level and extracted within a short time-scale (e.g., 40 ms), each containing little information about high-level concepts (e.g., a single phoneme contains little information about a sentence). 3) Due to the nature of depression detection, interviews are comparatively long (many minutes), which, combined with fine-scale features, means that a classifier needs to remember very long sequences while filtering out unimportant information.

3 DEPA: SELF-SUPERVISED AUDIO EMBEDDING
This paper proposes DEPA, an auditory feature extracted via a neural network to summarize spoken language. Our proposed method consists of a self-supervised convolutional encoder-decoder network, where the encoder is later used as the DEPA embedding extractor from spectrograms.
Given a spectrogram of a specific audio clip (e.g., a spoken sentence) X ∈ ℝ^{S×D}, where S is the number of frames and D the data dimension (e.g., frequency bins), we slice X into ⌊S / ((1+α)(2k+1)·T)⌋ non-overlapping samples X_i ∈ ℝ^{((2k+1)·T)×D}, where T is the number of frames in one sub-spectrogram (explained below), k is a hyperparameter that controls the number of such sub-spectrograms in one sample, and α ≥ 0 is the gap parameter such that consecutive samples are gap ∈ ℝ^{(α·(2k+1)·T)×D} apart. The gap between two segments prevents the self-supervised model from exploiting spectral leakage to shortcut and easily solve the task.

X = [X_0, gap, X_1, gap, ···, X_i, ···]

Each sample X_i is sliced into 2k+1 sub-spectrograms M_j ∈ ℝ^{T×D}. Each X_i is therefore the concatenation of a center sub-spectrogram M_0 and its k adjacent left and right context sub-spectrograms:

X_i = [M_{−k}, ···, M_{−1}, M_0, M_1, ···, M_k]

If X is shorter than (2k+1)·T frames, we pad it with zeros to fill (2k+1)·T frames. Our self-supervised learning uses a generative strategy.
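As an illustration, the slicing scheme above can be sketched in numpy. This is a minimal sketch based on the description: the function name is ours, and the exact handling of clips shorter than one sample (zero-padding to a single sample) is our reading of the paper's padding rule.

```python
import numpy as np

def slice_spectrogram(X, T=96, k=5, alpha=0.1):
    """Slice a spectrogram X (S frames x D bins) into samples of
    2k+1 sub-spectrograms each, leaving an alpha-proportional gap
    between consecutive samples. Defaults follow the paper's
    hyperparameters (k=5, T=96, alpha=0.1)."""
    S, D = X.shape
    sample_len = (2 * k + 1) * T            # frames per sample
    if S < sample_len:                      # zero-pad short clips
        X = np.vstack([X, np.zeros((sample_len - S, D))])
        S = sample_len
    stride = int((1 + alpha) * sample_len)  # sample plus gap
    n_samples = max(1, S // stride)
    samples = []
    for i in range(n_samples):
        chunk = X[i * stride : i * stride + sample_len]
        # split into 2k+1 sub-spectrograms M_{-k..k}, each (T, D)
        samples.append(chunk.reshape(2 * k + 1, T, D))
    return samples  # list of arrays shaped (2k+1, T, D)
```

With the paper's defaults, one sample spans 11 × 96 = 1056 frames, and consecutive samples start 1161 frames apart.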
The training process treats the center spectrogram M_0 as the target, taking its surrounding spectrograms M_j (j ≠ 0) to re-generate the center spectrogram, and computes the embedding loss (Equation (1)). Figure 2 shows the slicing process, and the detailed pretraining process is depicted in Figure 1.

Figure 2: Slicing an audio clip spectrogram (X) into samples X_i and sub-spectrograms M_j for DEPA training. The gap avoids spectral leakage of a sub-spectrogram to its neighbors.

L_embed = (1 / (T·D)) · Σ_{t=1}^{T} Σ_{d=1}^{D} (M_{0,t,d} − M′_{0,t,d})²   (1)

Encoder architecture. The encoder contains three downsampling blocks, followed by an extra convolution layer as well as an adaptive pooling layer. Each block consists of a convolution, average pooling, batch normalization, and rectified linear unit (ReLU) activation layer. The time axis 2kT is subsampled to 2kT/64 before being average-pooled over the time and frequency dimensions.
Decoder architecture. The decoder upsamples the encoder output v via four transposed convolutional upsampling blocks and predicts the center spectrogram M′_0 ∈ ℝ^{T×D}. The encoder-decoder architecture is shown in Figure 3.
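The generative objective, predicting M_0 from its 2k context sub-spectrograms and scoring the reconstruction with Equation (1), can be sketched as follows; the helper names are ours, and the decoder itself is treated as a black box here.

```python
import numpy as np

def make_training_pair(sample, k):
    """Split one sample of shape (2k+1, T, D) into (context, target):
    the 2k context sub-spectrograms fed to the encoder, and the
    center M_0 the decoder must reconstruct."""
    target = sample[k]                      # M_0, shape (T, D)
    context = np.delete(sample, k, axis=0)  # M_{-k..-1}, M_{1..k}
    return context, target

def embedding_loss(M0, M0_hat):
    """Mean squared error of Equation (1): average squared
    difference between M_0 and its reconstruction M_0'."""
    T, D = M0.shape
    return np.sum((M0 - M0_hat) ** 2) / (T * D)
```

In training, `context` would pass through the encoder-decoder to produce `M0_hat`, and `embedding_loss` would be minimized over all samples.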
Figure 3: DEPA pretraining encoder-decoder architecture. Shapes are given as (channels, time, frequency).

Encoder:
Input: mel-scale spectrograms of shape (1, 2kT, D)
Convolution 1 → (4, 2kT/2, D/2); Average Pooling 1 → (4, 2kT/4, D/4); BatchNorm 1 + ReLU 1 → (4, 2kT/4, D/4)
Convolution 2 → (16, 2kT/8, D/8); Average Pooling 2 → (16, 2kT/16, D/16); BatchNorm 2 + ReLU 2 → (16, 2kT/16, D/16)
Convolution 3 → (64, 2kT/16, D/16); Average Pooling 3 → (64, 2kT/32, D/32); BatchNorm 3 + ReLU 3 → (64, 2kT/32, D/32)
Convolution 4 → (256, 2kT/64, D/64); Adaptive Pooling → (256, 1, 1)
Output: (256)

Decoder:
Input: (256)
Linear projection and reshape → (64, T/16, D/16)
Transpose Convolution 1 → (32, T/8, D/8); BatchNorm 1 + ReLU 1
Transpose Convolution 2 → (16, T/4, D/4); BatchNorm 2 + ReLU 2
Transpose Convolution 3 → (4, T/2, D/2); BatchNorm 3 + ReLU 3
Transpose Convolution 4 → (1, T, D)
Output: (T, D)

After pretraining the encoder-decoder network, DEPA is extracted by feeding a variable-length audio segment R (here on patient response level) into the encoder model and obtaining a single |v| = 256-dimensional embedding per segment. The sequence of DEPA embeddings is then fed into a depression detection network, shown in Figure 4.

4 DOWNSTREAM TASK: DEPRESSION DETECTION
In this section, we detail our approach to the downstream task of depression detection on two datasets: DAIC, a small dataset used as a depression detection benchmark, and MDD, a large dataset focused specifically on female patients with major depressive disorder (see Section 5 for a detailed introduction).
Small Benchmark Data with Two Label Sets. Both a depression state and a severity score are provided in the DAIC dataset; hence, we propose a multi-task scheme combining depression state classification and depression score prediction. This approach models a patient's depression sequentially, meaning that only the patient's responses are utilized.
Due to the recent success of LSTM networks in this field [1], our depression prediction structure follows a bidirectional LSTM (BLSTM) approach with four layers of size 128. For each response r, the model outputs a two-dimensional vector (y′_c(r), y′_r(r)), representing the estimated binary patient state (y′_c(r)) as well as the PHQ-8 score (y′_r(r), a numerical metric to evaluate depression extent). Finally, first-timestep pooling is applied to compile all responses of a patient into a single vector (y′_c(0), y′_r(0)). The architecture is shown in Figure 4.

ℓ_bce(y′_c, y_c) = −[y_c · log y′_c + (1 − y_c) · log(1 − y′_c)]   (2)

ℓ_hub(y′_r, y_r) = 0.5 · (y_r − y′_r)²  if |y_r − y′_r| < 1;  |y_r − y′_r| − 0.5  otherwise   (3)

ℓ(y′_c, y_c, y′_r, y_r) = ℓ_bce(σ(y′_c), y_c) + ℓ_hub(y′_r, y_r)   (4)

Two outputs are constructed: one directly predicts the binary outcome of a participant being depressed, the other outputs the estimated PHQ-8 score. We opt to use a combination of binary cross-entropy (BCE, for binary classification, Equation (2)) and Huber loss (for regression, Equation (3)). y_c and y_r are the ground-truth binary label and PHQ-8 score, respectively, while σ is the sigmoid function. In this way, our model considers the internal relationship between binary classification and PHQ-8 score regression, where a higher PHQ-8 score commonly indicates a higher probability of being classified as depressed.
Large Data with One Classification Label. MDD, a privately collected large depression dataset, is also used in our downstream detection task. For this dataset, we merely predict the depression state, which is the only label provided. Therefore, the utilized method is similar to the one above with minor changes: the BLSTM only outputs one scalar, y′_c, and only the binary cross-entropy loss ℓ_bce is used. Similarly, we model the patient's responses in a sequential manner.
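A minimal sketch of the multi-task loss of Equations (2)-(4) in plain Python (scalar form; a real implementation would operate on batched tensors):

```python
import math

def sigmoid(x):
    """Logistic sigmoid, the sigma of Equation (4)."""
    return 1.0 / (1.0 + math.exp(-x))

def bce(p, y):
    """Binary cross-entropy on probability p and label y (Eq. 2)."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def huber(y_pred, y_true):
    """Huber loss with threshold 1 for PHQ-8 regression (Eq. 3)."""
    err = abs(y_true - y_pred)
    return 0.5 * err ** 2 if err < 1 else err - 0.5

def multitask_loss(logit_c, y_c, y_pred_r, y_r):
    """Combined loss of Equation (4): BCE on the sigmoid of the
    classification logit plus Huber on the PHQ-8 prediction."""
    return bce(sigmoid(logit_c), y_c) + huber(y_pred_r, y_r)
```

The Huber term keeps large PHQ-8 errors from dominating the gradient, while the BCE term drives the shared BLSTM toward the binary diagnosis.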
5 EXPERIMENTAL SETUP
Depression Data. A commonly used dataset within depression detection is the Distress Analysis Interview Corpus – Wizard of Oz (DAIC) [6] dataset, which encompasses 50 hours of data collected from a total of 142 patients. Two labels are provided for each participant: a binary diagnosis of depressed/healthy and the patient's eight-item Patient Health Questionnaire (PHQ-8) score. Thirty speakers within the training set (28%) and 12 within the development set (34%) are classified as having depression. The DAIC dataset is fully transcribed, including corresponding on- and offsets within the audio. While this dataset contains training, development, and test subsets, our evaluation protocol is reported on the development subset, since test subset labels are only available to participants of the 2017 Audio/Visual Emotion Challenge (AVEC).

Figure 4: Depression detection with DEPA on DAIC with the multi-task training scheme. The encoder from the proposed encoder-decoder model provides the BLSTM network with high-level auditory features. In this figure, DEPA is extracted on response level.

Table 1: Statistics regarding the number of responses for each subset. D/H represents depressed and healthy patients, respectively.

Dataset               Train    Dev     Test
DAIC  #(D/H)          30/77    12/23   -
      # responses     158      190     -
      ∅ response (s)  2.74     2.63    -
MDD   #(D/H)          516/357  101/87  105/83
      # responses     318      320     373
      ∅ response (s)  0.77     0.769   0.78

In addition, a large conversational dataset (MDD) for major depressive disorder detection, still under collection, currently consists of 1000 hours of speech conversation between interviewers and subjects, with a balanced proportion of healthy and depressed participants (722 depressed and 527 healthy).
We split the dataset into a training set (70%), a development set (15%), and a test set (15%). Unlike the fully-transcribed DAIC dataset, no annotation is provided for MDD. We hence applied the x-vector-based speaker diarization tool provided by the Kaldi Toolkit [18] to extract all patients' speaking segments from the audio. MDD is incorporated to highlight the benefit of summarizing long sequences using DEPA. Detailed statistics regarding the proportion of depressed/healthy subjects, the number of patient responses, and their average duration are displayed in Table 1.
Pretraining Data. We aim to compare DEPA with regard to pretraining on related, in-domain (depression detection) datasets and out-domain (e.g., speech recognition) datasets. Regarding in-domain data, we utilized the aforementioned DAIC and MDD datasets (we take a subset of 411 hours) for in-domain pretraining in order to compare DEPA to traditional audio feature approaches. In order to ascertain DEPA's usability, we further used the mature Switchboard (SWB) [9] dataset, containing 300 hours of English telephone speech. Lastly, we utilized the Alzheimer's disease (AD) dataset, collected by Shanghai Mental Clinic Center [13], containing about 400 hours (questions and answers) of Mandarin interview recordings from elderly participants. The four datasets are described in Table 2.

Table 2: In- and out-domain datasets used for DEPA pretraining.

Domain  Dataset  Duration (h)  Language
In      DAIC     13            English
        MDD      411           Mandarin
Out     SWB      300           English
        AD       400           Mandarin

Feature Selection. Regarding front-end features, our work investigates common LMS and log-power STFT features. Due to different sample rates across the datasets, we resample each dataset's audio to 22050 Hz. All features are extracted with a default hop length (ω_hop) of 5 ms and a Hann window length (ω_win) of four times ω_hop (i.e., 20 ms).
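A rough numpy sketch of this log-power STFT front-end (5 ms hop, 20 ms Hann window, audio resampled to 22050 Hz); the `n_fft` value and the epsilon are our illustrative choices, not the paper's exact configuration, and a library such as librosa would normally be used instead.

```python
import numpy as np

def log_power_stft(wav, sr=22050, hop_ms=5, win_ms=20, n_fft=1024):
    """Frame a waveform with a Hann window, take the real FFT of
    each frame, and return the log power spectrogram, shape
    (n_frames, n_fft // 2 + 1)."""
    hop = int(sr * hop_ms / 1000)   # 110 samples at 22050 Hz
    win = int(sr * win_ms / 1000)   # 441 samples at 22050 Hz
    window = np.hanning(win)
    n_frames = 1 + (len(wav) - win) // hop
    frames = np.stack([wav[i * hop : i * hop + win] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2
    return np.log(power + 1e-10)    # epsilon avoids log(0)
```

Log-Mel spectrograms would additionally project the power spectrum onto a 128-band Mel filterbank before the log.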
128-dimensional LMS and 512-dimensional STFT features were chosen as the default signal-processing front-end. In order to compare DEPA against non-self-supervised approaches, 553-dimensional higher-order (mean, median, variance, min, max, skewness, kurtosis) COVAREP [5] (HCVP) features were extracted on response level. HCVP can be seen as a traditional high-level representation, an ensemble of lower-level descriptors such as MFCC, pitch, glottal flow, and other features. Lastly, we also extracted 256-dimensional x-vectors using a ResNet34 structure [23] for comparison purposes. X-vectors, a state-of-the-art method within speaker recognition, have been seen to outperform traditional i-vectors, which were reported as markers for depression and some other mental diseases [4].
DEPA Pretraining Process. Our encoder-decoder training utilizes LMS and STFT front-end features, with hyperparameters k = 5, T = 96, α = 0.1, extracting a |v| = 256-dimensional DEPA embedding. The model is trained for 25 epochs using Adam optimization with a starting learning rate of 0.004 and a batch size of 512.
Depression Detection Training Process. As mentioned, for the DAIC dataset, we used a multi-task learning strategy to output both the binary classification and the PHQ-8 score with a BLSTM network structure. For the MDD dataset, the BLSTM only outputs the classification prediction. Data standardization was applied by calculating a global mean and variance on the training set and applying those to the development set. A dropout of 0.1 was applied after each BLSTM layer to prevent overfitting. Adam optimization with a starting learning rate of 4e−5 and a batch size of 1 was used.
Metrics. Following previous work [1], results are reported in terms of mean absolute error (MAE) and root mean square error (RMSE) for regression, and macro-averaged (class-wise) precision, recall, and their harmonic mean (F1 score) for classification.
6 RESULTS
Results on the two datasets are provided respectively: DAIC, a benchmark dataset for depression detection, is used to compare against previous methods and demonstrate how DEPA can boost performance in sparse-data scenarios; MDD, by contrast, provides insight into how DEPA compares with raw features, along with a different number of input responses.

6.1 DAIC Results
Our results using the proposed BLSTM approach with and without DEPA pretraining are compared to previous attempts in Table 3. The results are analyzed on multiple levels.
Feature Level Comparison. The results in Table 3 are in line with our initial assumption that frame-level audio features indeed underperform compared to response-level ones, especially for the BLSTM model. This is likely due to the model's inherent incapability to remember very long sequences (>10000 frames) for an abstract task such as depression detection, which is also commonly seen in other audio processing tasks where long sequences are harder to predict. Regarding the classification results, it can be seen that traditional HCVP features outperform LMS, STFT, and x-vector approaches. Specifically, with respect to regression, HCVP achieves an MAE of 4.95, much lower than any other frame-level feature approach (LMS, STFT, dMFCC-VT, CVP). The sub-optimal performance of the x-vector system is likely due to the short response durations in this dataset, which average ≈ 2 seconds; the performance of x-vector systems generally decreases for utterances shorter than 3 seconds [23]. Furthermore, response-level features are likely to contain more context-related information, while frame-level features tend to isolate information between frames.
Feature Comparison. Even though a multitude of features are compared (MFCC, LMS, LLD, CVP, STFT), no clear trend can be established between feature and final performance.
Regarding our BLSTM approach, STFT features consistently underperform against LMS and HCVP features in terms of MAE. This is likely due to the increased number of parameters that need to be estimated by the BLSTM model (the input layer increases from 128 to 512) in conjunction with the limited available training data. This can partially be remedied by either reducing the feature size (e.g., utilizing LMS or MFCC features) or the number of samples per speaker (e.g., using the response-averaged HCVP features). By contrast, when experimenting with DEPA features, extracting DEPA from STFT features consistently outperforms DEPA from LMS features.

Table 3: Comparison between DEPA and other audio-based depression detection methods on the DAIC development set.

Method         Feature    Level     Pretrain     Pre   Rec   F1    MAE   RMSE
[27]           dMFCC-VT   Frame     ✗            -     -     0.57  5.32  6.38
[14]           LMS        Frame     ✗            0.68  0.77  0.72  -     -
[1]            HCVP       Response  ✗            0.71  0.56  0.63  5.13  6.50
[24]           LLD        Response  ✗            -     -     -     4.96  6.32
[26]           CVP        Frame     ✗            0.63  0.69  0.66  5.36  6.74
[19]           STFT+MFCC  Frame     LibriSpeech  -     -     0.66  -     -
BLSTM (Ours)   HCVP       Response  ✗            0.73  0.66  0.69  4.95  6.45
               LMS        Frame     ✗            0.61  0.61  0.61  5.68  6.51
               STFT       Frame     ✗            0.64  0.64  0.64  6.83  9.27
               x-vector   Response  VoxCeleb     0.59  0.59  0.59  6.23  7.10
BLSTM + DEPA   LMS        Response  DAIC         0.71  0.65  0.68  5.47  6.33
               STFT       Response  DAIC         0.91  0.89  0.90  5.48  6.31
               LMS        Response  MDD          0.75  0.74  0.75  5.10  6.05
               STFT       Response  MDD          0.94  0.94  0.94  5.59  6.46
               LMS        Response  SWB          0.84  0.87  0.86  5.43  6.41
               STFT       Response  SWB          0.91  0.90  0.91  5.15  6.02
               LMS        Response  AD           0.67  0.67  0.67  5.37  6.50
               STFT       Response  AD           0.93  0.96  0.94  4.75  5.73

Pretraining Datasets Comparison. DEPA pretraining on the DAIC dataset itself can be seen to enhance performance for LMS (F1 0.61 → 0.68) and especially STFT features (F1 0.64 → 0.90). This, in turn, reinforces our initial assumption that response-level features are much more useful for depression detection.
Pretraining on large datasets (MDD, SWB, and AD) outperformed pretraining on DAIC in terms of both binary classification and regression. Further, pretraining on AD resulted in the best performance in terms of all metrics. Larger datasets (DAIC