The Impact of Automatic Speech Transcription on Speaker Attribution
Speaker attribution from speech transcripts is the task of identifying a speaker from the transcript of their speech based on patterns in their language use. This task is especially useful when the audio is unavailable (e.g. deleted) or unreliable (e.g. anonymized speech). Prior work in this area has primarily focused on the feasibility of attributing speakers using transcripts produced by human annotators. However, in real-world settings, one often only has more errorful transcripts produced by automatic speech recognition (ASR) systems. In this paper, we conduct what is, to our knowledge, the first comprehensive study of the impact of automatic transcription on speaker attribution performance. In particular, we study the extent to which speaker attribution performance degrades in the face of transcription errors, as well as how properties of the ASR system impact attribution. We find that attribution is surprisingly resilient to word-level transcription errors and that the objective of recovering the true transcript is minimally correlated with attribution performance. Overall, our findings suggest that speaker attribution on more errorful transcripts produced by ASR is as good, if not better, than attribution based on human-transcribed data, possibly because ASR transcription errors can capture speaker-specific features revealing of speaker identity.
💡 Research Summary
The paper presents the first comprehensive investigation of how automatic speech recognition (ASR) transcription errors affect speaker attribution performance. While prior work has largely relied on human‑produced “gold” transcripts to evaluate whether a speaker can be identified from the text of their speech, real‑world applications often must rely on error‑prone ASR outputs. To fill this gap, the authors conduct a systematic study using the Fisher English Training Speech Transcripts corpus, a large collection of telephone conversations that includes both audio and two styles of human transcription.
Five ASR systems of varying architectures and training data are selected: a GigaSpeech‑trained zipformer transducer, AssemblyAI’s Universal‑1 model, a wav2vec2 model fine‑tuned on Switchboard, OpenAI’s Whisper‑Turbo, and a TED‑LIUM3‑trained zipformer. These systems produce concatenated minimum‑permutation word error rates (cpWER) ranging from 0 % (the human reference) up to 32 % on the test set. The authors extract speaker‑turn audio segments using the gold timestamps, run each ASR system on each turn, and recombine the turn‑level outputs into full conversation transcripts. To evaluate transcription quality, they compute WER against the gold Fisher transcripts released by the LDC, stripping punctuation and non‑speech markers to focus on lexical errors.
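The transcript‑scoring step described above can be sketched in a few lines of Python. This is an illustrative reimplementation under assumed normalization rules (lowercasing, dropping bracketed non‑speech markers such as [laughter], stripping punctuation), not the authors’ exact preprocessing:

```python
import re

def normalize(transcript: str) -> list[str]:
    """Lowercase, drop bracketed non-speech markers like [laughter], strip punctuation."""
    transcript = re.sub(r"\[[^\]]*\]", " ", transcript.lower())
    transcript = re.sub(r"[^a-z' ]", " ", transcript)
    return transcript.split()

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions) / reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    # Levenshtein distance over word sequences, using one rolling DP row.
    row = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, row[0] = row[0], i
        for j, h in enumerate(hyp, 1):
            prev, row[j] = row[j], min(row[j] + 1,        # deletion
                                       row[j - 1] + 1,    # insertion
                                       prev + (r != h))   # substitution
    return row[len(hyp)] / max(len(ref), 1)
```

Dropping the bracketed markers before alignment means that, e.g., an untranscribed `[laughter]` tag does not count against the ASR hypothesis.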
Using the same set of text‑based authorship attribution models previously applied to human‑transcribed speech—ranging from simple n‑gram logistic‑regression and SVM classifiers to modern Transformer‑based models (BERT/RoBERTa)—the authors measure speaker identification accuracy and F1 score across three difficulty levels (base, hard, harder). The “hard” setting, in which topic control forces speakers in positive trials to discuss different topics and speakers in negative trials to discuss the same topic, is used as the primary benchmark.
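As a toy illustration of text‑based attribution, the sketch below ranks candidate speakers by character n‑gram overlap with a query transcript. This is a deliberately stripped‑down stand‑in for the paper’s n‑gram and Transformer classifiers, and the speaker profiles are invented:

```python
from collections import Counter
from math import sqrt

def char_ngrams(text: str, n: int = 3) -> Counter:
    """Character n-gram counts; character n-grams degrade gracefully under word-level errors."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[g] * b[g] for g in a if g in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def attribute(query: str, profiles: dict[str, str]) -> str:
    """Return the candidate speaker whose transcript profile is closest to the query."""
    q = char_ngrams(query)
    return max(profiles, key=lambda spk: cosine(q, char_ngrams(profiles[spk])))
```

A profile built from a disfluency‑heavy speaker (frequent “um”, “like”) stays closest to that speaker’s new utterances even when individual words are misrecognized, which is the intuition behind the robustness results reported below.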
Key findings are:
- Robustness to Word‑Level Errors – Across the five ASR systems, increasing WER from 0 % to 32 % leads to only modest declines in speaker attribution performance. Even at 20 % WER the drop is negligible, and at 30 % the reduction remains small. This suggests that the lexical patterns used by attribution models are largely preserved despite substantial transcription noise.
- ASR Errors Capture Speaker Idiosyncrasies – A detailed error analysis shows that many ASR mistakes involve filler words (“uh”, “like”), repeated disfluencies, or consistent misrecognitions of a speaker’s pronunciation. These error types tend to be speaker‑specific: the same speaker’s utterances are consistently altered in similar ways. Consequently, the erroneous tokens become part of the n‑gram or token‑frequency features that the attribution models exploit, effectively encoding a speaker’s “style” even when the content is corrupted.
- Performance at Extreme Error Rates – When transcripts are artificially degraded to over 90 % cpWER, content‑based features become unreliable, yet models still achieve non‑trivial accuracy by relying on meta‑features such as utterance length and turn‑count patterns. The authors caution that these cues may be dataset‑specific (the Fisher calls have relatively uniform turn structure) and may not generalize to other domains.
- Domain Mismatch Between Training and Test Transcripts – Models trained exclusively on gold human transcripts suffer a sharp performance loss when evaluated on ASR outputs, highlighting a domain‑shift problem. This underscores the need for either domain‑adaptation techniques or training on ASR‑style data when the target deployment scenario involves automatic transcription.
- Feature Probing Experiments – Probing classifiers trained on various feature families reveal that adding content‑based embeddings (e.g., Transformer hidden states) improves attribution, while length‑only or turn‑count‑only features help only when the content is heavily corrupted.
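The robustness and extreme‑error findings above can be probed with a simple simulation: corrupt a transcript at a target substitution rate and observe which signals survive. This toy (with an invented `<err>` garbage token) is not the paper’s methodology, but it makes concrete why meta‑features such as utterance length are untouched by word‑level errors:

```python
import random

def corrupt(words: list[str], sub_rate: float, seed: int = 0) -> list[str]:
    """Replace a fraction of word tokens with a garbage token (substitution-only noise)."""
    rng = random.Random(seed)
    out = list(words)
    for i in rng.sample(range(len(words)), round(sub_rate * len(words))):
        out[i] = "<err>"
    return out

def intact_fraction(words: list[str], sub_rate: float) -> float:
    """Fraction of original word tokens that survive corruption."""
    corrupted = corrupt(words, sub_rate)
    return sum(a == b for a, b in zip(words, corrupted)) / len(words)
```

Substitution noise leaves the token count, and hence utterance length and turn structure, exactly as it was, which is why length‑based meta‑features remain usable even at extreme error rates.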
The authors situate their findings within prior work on lexical speaker recognition, noting that earlier studies (e.g., Doddington 2001) already hinted at the usefulness of word‑level patterns, and that recent forensic research has revived interest in high‑level textual cues. Their results extend these observations by demonstrating that ASR‑induced noise does not necessarily erase speaker‑specific signals; on the contrary, certain systematic errors may amplify them.
Implications: In large‑scale speech analytics, voice cloning, or privacy‑preserving scenarios where audio may be unavailable or unreliable, practitioners can confidently employ off‑the‑shelf ASR systems without fearing a dramatic loss in speaker attribution accuracy. However, careful consideration of training‑test domain alignment remains essential, especially when moving beyond the telephone‑conversation domain or to languages other than English.
Limitations: The study is confined to English telephone speech; other languages, acoustic environments, or conversational styles (e.g., meetings, broadcast news) may exhibit different error patterns and meta‑feature relevance. Additionally, the possibility that some ASR models have seen portions of the Fisher data during pre‑training could bias results, though the authors attempted to mitigate this through system selection and documentation.
Future work should explore multilingual corpora, diverse acoustic settings, and explicit domain‑adaptation strategies (e.g., fine‑tuning attribution models on ASR‑generated data). Combining textual attribution with acoustic embeddings (x‑vectors, wav2vec2 speaker models) may further boost robustness, especially in scenarios where both modalities are partially available.
In summary, the paper convincingly shows that speaker attribution is surprisingly resilient to ASR transcription errors, that error patterns can encode speaker‑specific stylistic cues, and that automatic transcripts can be as effective—or even more effective—than human‑produced ones for the task of identifying who said what.