An interpretable speech foundation model for depression detection by revealing prediction-relevant acoustic features from long speech
Speech-based depression detection tools could aid early screening. Here, we propose an interpretable speech foundation model approach to enhance the clinical applicability of such tools. We introduce a speech-level Audio Spectrogram Transformer (AST) that detects depression from long-duration speech rather than short segments, along with a novel interpretation method that reveals prediction-relevant acoustic features for clinician inspection. Our experiments show that the proposed model outperforms a segment-level AST, highlighting the impact of segment-level labelling noise and the advantage of leveraging longer speech durations for more reliable depression detection. Through interpretation, we observe that our model identifies reduced loudness and F0 as relevant depression signals, aligning with documented clinical findings. This interpretability supports a responsible AI approach to speech-based depression detection, rendering such tools more clinically applicable.
💡 Research Summary
This paper addresses two major obstacles that limit the clinical deployment of speech‑based depression detection systems: (1) reliance on short audio segments that introduce label noise because not every segment contains depression‑relevant cues, and (2) lack of interpretability, which makes clinicians hesitant to trust black‑box predictions. To overcome these issues, the authors propose a speech‑level Audio Spectrogram Transformer (AST) that processes an entire recorded monologue rather than isolated snippets, and they introduce a novel gradient‑weighted attention interpretation pipeline that extracts human‑readable acoustic features from the most influential spectrogram frames.
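The gradient-weighted attention idea can be illustrated with a minimal sketch. This is our own simplified rendering of the general technique (attention weights scaled by their gradients with respect to the prediction), not the paper's exact pipeline; the function name, the head-averaging step, and the use of the CLS row are all assumptions for illustration.

```python
import numpy as np

def frame_relevance(attn: np.ndarray, grad: np.ndarray) -> np.ndarray:
    """Sketch of gradient-weighted attention relevance.

    attn: (heads, tokens, tokens) attention weights from one layer.
    grad: same shape, gradients of the depression logit w.r.t. attn.
    Returns one relevance score per non-CLS token (here, per frame token).
    """
    weighted = np.maximum(attn * grad, 0.0)  # keep positively contributing attention
    per_token = weighted.mean(axis=0)        # average over attention heads
    return per_token[0, 1:]                  # CLS row: attention paid to frame tokens
```

The highest-scoring frame tokens can then be mapped back to their time positions, where human-readable acoustic features such as loudness and F0 are extracted for clinician interpretation.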
Data preprocessing begins with automatic transcription of YouTube vlogs using the Whisper model. The transcripts are segmented into natural sentences, each accompanied by precise word‑level timestamps. Corresponding audio segments are extracted, converted to 128‑dimensional log‑Mel filterbank features (25 ms window, 10 ms hop) and padded/truncated to a uniform size of 128 × 1024. Unlike the original AST, which splits spectrograms into 16 × 16 patches, the proposed “frame‑based” AST divides each spectrogram into temporal frames of size 128 × 2, preserving fine‑grained time resolution. Padding masks are applied to ignore empty regions.
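The pad/truncate and frame-based patching steps above can be sketched as follows. This is a minimal illustration assuming the log-Mel features (e.g. from `torchaudio.compliance.kaldi.fbank`) are already computed; the function name and return convention are ours, not the paper's.

```python
import numpy as np

def to_model_input(fbank: np.ndarray, target_frames: int = 1024,
                   n_mels: int = 128, frame_width: int = 2):
    """fbank: (num_frames, n_mels) log-Mel features (25 ms window, 10 ms hop).

    Returns the 128 x 1024 spectrogram, its 128 x 2 frame tokens,
    and a boolean padding mask over time frames.
    """
    n = fbank.shape[0]
    mask = np.zeros(target_frames, dtype=bool)
    if n < target_frames:
        pad = np.zeros((target_frames - n, n_mels), dtype=fbank.dtype)
        fbank = np.concatenate([fbank, pad], axis=0)
        mask[n:] = True       # padded region, ignored via the padding mask
    else:
        fbank = fbank[:target_frames]
    spec = fbank.T            # (128, 1024): mel bins x time frames
    # Frame-based patching: split the time axis into 128 x 2 slices -> 512 tokens,
    # instead of the original AST's 16 x 16 patches.
    patches = spec.reshape(n_mels, target_frames // frame_width, frame_width)
    patches = np.transpose(patches, (1, 0, 2))  # (512, 128, 2)
    return spec, patches, mask
```

Keeping the full 128-bin frequency axis in each token preserves fine-grained time resolution, which is what the frame-based design trades the square patches for.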
The model architecture consists of two hierarchical transformer encoders. The sentence‑level encoder re‑uses a pre‑trained AST to generate an embedding for each sentence; a second, speech‑level encoder then aggregates these sentence embeddings into a single recording‑level representation used for the depression prediction.