Reading Between the Waves: Robust Topic Segmentation Using Inter-Sentence Audio Features
Spoken content, such as online videos and podcasts, often spans multiple topics, which makes automatic topic segmentation essential for user navigation and downstream applications. However, current methods do not fully leverage acoustic features, leaving room for improvement. We propose a multi-modal approach that fine-tunes both a text encoder and a Siamese audio encoder, capturing acoustic cues around sentence boundaries. Experiments on a large-scale dataset of YouTube videos show substantial gains over text-only and multi-modal baselines. Our model also proves more resilient to ASR noise and outperforms a larger text-only baseline on three additional datasets in Portuguese, German, and English, underscoring the value of learned acoustic features for robust topic segmentation.
💡 Research Summary
This paper addresses the problem of automatic topic segmentation in spoken content such as YouTube videos and podcasts, where traditional text‑only approaches struggle due to informal transcripts and high ASR error rates. The authors propose a multimodal architecture that jointly fine‑tunes a text encoder and an audio encoder, focusing specifically on acoustic cues that appear at sentence boundaries. For each inter‑sentence boundary they extract two short audio windows (default 2 seconds) – one at the end of the preceding sentence and one at the start of the following sentence – and encode both with a shared‑weight Siamese network based on a pretrained speech model (wav2vec 2.0, HuBERT, or UniSpeech‑SAT). The resulting vectors are mean‑pooled, projected to 192 dimensions, concatenated, and passed through a tanh activation to form a 384‑dimensional acoustic boundary feature.
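The boundary-feature construction described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `encode_window` is a placeholder standing in for the pretrained speech encoder (wav2vec 2.0 outputs 768-dimensional frame embeddings at a hop of roughly 320 samples at 16 kHz), and the projection matrix is randomly initialized rather than learned. The shapes and the mean-pool → project-to-192 → concatenate → tanh pipeline follow the summary.

```python
import numpy as np

def encode_window(window, rng):
    # Placeholder for a pretrained speech encoder (e.g. wav2vec 2.0):
    # returns per-frame 768-d embeddings for one audio window.
    n_frames = max(1, len(window) // 320)  # wav2vec 2.0 hop ~ 320 samples @ 16 kHz
    return rng.standard_normal((n_frames, 768))

def boundary_feature(pre_window, post_window, proj, rng):
    # Shared-weight ("Siamese") path: the same encoder and projection
    # are applied to both windows around the sentence boundary.
    vecs = []
    for w in (pre_window, post_window):
        frames = encode_window(w, rng)   # (n_frames, 768)
        pooled = frames.mean(axis=0)     # mean-pool over time -> (768,)
        vecs.append(pooled @ proj)       # project 768 -> 192
    # Concatenate both 192-d vectors and squash -> 384-d acoustic feature.
    return np.tanh(np.concatenate(vecs))

rng = np.random.default_rng(0)
proj = rng.standard_normal((768, 192)) / np.sqrt(768)  # untrained stand-in
sr = 16_000
pre = rng.standard_normal(2 * sr)    # last 2 s of the preceding sentence
post = rng.standard_normal(2 * sr)   # first 2 s of the following sentence
feat = boundary_feature(pre, post, proj, rng)
assert feat.shape == (384,)
```

In the actual model the encoder and projection are fine-tuned end-to-end; the sketch only fixes the tensor shapes and the order of operations.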
Textual information is obtained by encoding each sentence with MiniLM (384‑dimensional). The sentence embedding and the acoustic boundary feature are concatenated into a 768‑dimensional vector, which is fed into a RoFormer encoder that produces contextualized token representations. A lightweight classifier maps each representation to a binary probability indicating whether the current sentence starts a new topic. The model is trained end‑to‑end with binary cross‑entropy loss; during each training step gradients are routed through either the text branch or the audio branch (probability 0.5) to reduce memory consumption, while the RoFormer and classifier are always updated.
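The fusion and classification step can be sketched as follows, again as an illustration rather than the paper's code: the 384-d MiniLM sentence embedding and the 384-d acoustic boundary feature are concatenated into a 768-d vector, and a lightweight binary classifier (here a single logistic unit with random, untrained weights) produces the boundary probability. The RoFormer contextualization across the document is omitted to keep the sketch self-contained.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fuse_and_classify(text_emb, audio_feat, w, b):
    # Concatenate the 384-d sentence embedding with the 384-d acoustic
    # boundary feature into one 768-d token vector.
    token = np.concatenate([text_emb, audio_feat])  # (768,)
    # The full model would first contextualize `token` across all
    # sentences with a RoFormer encoder; this sketch applies the
    # lightweight classifier directly.
    return sigmoid(token @ w + b)  # P(sentence starts a new topic)

rng = np.random.default_rng(1)
text_emb = rng.standard_normal(384)           # stand-in MiniLM embedding
audio_feat = np.tanh(rng.standard_normal(384))  # stand-in acoustic feature
w = rng.standard_normal(768) / np.sqrt(768)     # untrained classifier weights
p = fuse_and_classify(text_emb, audio_feat, w, b=0.0)
assert 0.0 < p < 1.0
```

Training would compare `p` against the gold boundary label with binary cross-entropy; the stochastic routing described above simply blocks gradients into one of the two encoders per step, while the fusion layers are always updated.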
Experiments are conducted on the YTSEG benchmark (19,299 English YouTube videos) and three additional multilingual datasets (Portuguese, German, English). MultiSeg, the proposed model, achieves an F1 score of 52.98 and Boundary Similarity of 45.09, outperforming strong text‑only baselines (MiniSeg, Cross‑segment BERT) and a prior multimodal baseline that uses L³‑Net embeddings for whole‑sentence audio. Notably, MultiSeg uses 59.9 % fewer parameters than a scaled‑up text‑only RoBERTa model while delivering higher performance, demonstrating that integrating acoustic information is more effective than merely increasing model size.
Ablation studies reveal that (1) focusing on inter‑sentence windows yields a 1.96‑point F1 gain over using full‑sentence audio, (2) fine‑tuning the audio encoder is essential (freezing wav2vec 2.0 drops F1 by 1.79 points), and (3) a 2‑second window provides the best trade‑off between accuracy and efficiency, with diminishing returns beyond three seconds. Alternative speech backbones (HuBERT, UniSpeech‑SAT) perform comparably but slightly worse than wav2vec 2.0.
Robustness to transcription errors is evaluated by re‑transcribing the test set with six off‑the‑shelf ASR systems (various Whisper models and a lightweight Vosk model). As word error rates increase from ~19 % to ~25 %, MultiSeg’s performance degrades more gracefully than the text‑only baseline, with average F1 drops of 4–6 % versus 8–13 % for the text‑only system. This confirms that acoustic boundary cues can compensate for noisy transcripts.
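For readers unfamiliar with the word error rate (WER) figures quoted above, the metric is the word-level edit distance between a reference transcript and an ASR hypothesis, normalized by reference length. A minimal pure-Python version (the paper would use a standard toolkit, not this sketch):

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words
    # (substitutions, insertions, deletions all cost 1).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution/match
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the quick brown fox", "the quik brown fox"))  # 0.25
```

A WER of ~0.19 to ~0.25, as in the experiments above, thus means roughly one in five to one in four words is transcribed incorrectly, which is where the acoustic boundary cues start to pay off.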
Qualitative analysis of videos with the largest performance gains shows that the model leverages brief pauses, pitch drops, emphatic restarts, speaker or scene changes, and short transition sounds that typically occur at topic shifts. When such cues are absent or distributed uniformly, the benefit diminishes, indicating that the audio component is most valuable when boundary‑specific signals are present.
In summary, the paper demonstrates that a simple yet targeted multimodal design—concatenating sentence‑level text embeddings with Siamese‑encoded inter‑sentence audio features and fine‑tuning both modalities—significantly improves topic segmentation for spoken documents, offers robustness to ASR noise, and generalizes across languages, providing a compelling blueprint for future multimodal NLP research.