Benchmarking ASR Models in the German Medical Context: A Performance Analysis Based on Anamnesis Interviews
Automatic Speech Recognition (ASR) offers significant potential to reduce the workload of medical personnel, for example by automating documentation tasks. While numerous benchmarks exist for English, specific evaluations for the German-speaking medical context are still lacking, particularly with regard to dialects. In this article, we present a curated dataset of simulated doctor-patient conversations and evaluate a total of 29 ASR models. The evaluation covers both open-weight models from the Whisper, Voxtral, and Wav2Vec2 families and commercial state-of-the-art APIs (AssemblyAI, Deepgram). We report three metrics (WER, CER, BLEU) and provide an outlook on qualitative semantic analysis. The results reveal significant performance differences between the models: while the best systems already achieve very good Word Error Rates (WER), in some cases below 3%, the error rates of other models are considerably higher, especially on medical terminology and dialect-influenced speech.
💡 Research Summary
This paper addresses the gap in benchmarking automatic speech recognition (ASR) systems for the German medical domain, especially with respect to dialectal variation and specialized terminology. The authors created a curated dataset called “Med‑De‑Anamnese,” comprising four simulated doctor‑patient anamnesis scenarios: (1) standard back‑pain interview, (2) abdominal pain with diverticulitis terminology, (3) a non‑native‑speaker physician discussing deep‑vein thrombosis, and (4) a patient speaking a strong regional dialect describing Fabry disease. Each audio segment was manually transcribed into two ground‑truth formats: a plain normalized transcript and a speaker‑attributed JSON file, enabling evaluation of both transcription accuracy and diarization performance.
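To make the two ground-truth formats concrete, a minimal sketch of one speaker-attributed file is shown below. The paper does not publish the exact schema, so the field names, speaker labels, and timestamps here are illustrative assumptions, not the authors' format.

```python
# Hypothetical structure of one speaker-attributed ground-truth file from
# "Med-De-Anamnese" (schema assumed; the paper does not specify it exactly).
ground_truth = [
    {"speaker": "ARZT",    "start": 0.0, "end": 4.2,
     "text": "guten tag was führt sie heute zu mir"},
    {"speaker": "PATIENT", "start": 4.6, "end": 11.3,
     "text": "ich habe seit drei wochen starke schmerzen im unteren rücken"},
]
```

The plain normalized transcript would then simply be the concatenation of the `text` fields, which is what the WER/CER/BLEU metrics operate on, while the speaker labels feed the diarization evaluation.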
A total of 29 ASR systems were evaluated, spanning open-weight models (various sizes of OpenAI Whisper, WhisperX, Voxtral, and multilingual Wav2Vec2-XLS-R fine-tuned on German) and commercial cloud APIs (AssemblyAI Universal, Deepgram Nova-2). For consistency, all audio was resampled to 16 kHz and volume-normalized, all models were decoded greedily, and outputs were post-processed (lower-casing, punctuation stripping). The evaluation metrics were Word Error Rate (WER), Character Error Rate (CER), BLEU score, and Speaker-Attributed WER (SA-WER), the latter counting a word as correct only if both the transcription and the speaker label are correct.
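A minimal sketch of this evaluation protocol is shown below, assuming the `jiwer` and `sacrebleu` Python packages; the exact normalization rules are not published, so the regex is an assumption.

```python
# Sketch of the scoring step (not the authors' code). Assumes transcripts are
# plain strings and that normalization means lower-casing + punctuation removal.
import re
import jiwer
import sacrebleu

def normalize(text: str) -> str:
    """Lower-case and strip punctuation, mirroring the described post-processing."""
    text = text.lower()
    text = re.sub(r"[^\wäöüß\s]", " ", text)  # drop punctuation, keep umlauts
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace

def score(reference: str, hypothesis: str) -> dict:
    ref, hyp = normalize(reference), normalize(hypothesis)
    return {
        "wer": jiwer.wer(ref, hyp),    # word error rate
        "cer": jiwer.cer(ref, hyp),    # character error rate
        # sacrebleu reports BLEU on a 0-100 scale; the paper's 0.942
        # suggests it was reported on a 0-1 scale instead.
        "bleu": sacrebleu.corpus_bleu([hyp], [[ref]]).score,
    }

print(score("Die Diagnose lautet Divertikulitis.",
            "die diagnose lautet divertikulose"))
```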
Results show a clear hierarchy of performance. AssemblyAI Universal achieved the lowest average WER of 2.99 % and the highest BLEU (0.942), demonstrating the robustness of its Conformer‑based architecture across all scenarios, including those with strong dialects. Among open‑source models, Voxtral Small performed best with a 7.11 % WER, offering a strong parameter‑efficiency trade‑off. Whisper Large‑v3 (distilled) recorded a 12.6 % WER, while older Whisper variants and compact Wav2Vec2 models exceeded 20 % WER, rendering them unsuitable for clinical documentation without further adaptation.
Error analysis revealed that all models struggled with specialized medical terminology (e.g., “Divertikulitis,” “Morbus Fabry”) and with heavily accented speech. The performance gap widened in the dialect‑heavy patient scenario, where Whisper Large‑v3’s error rate increased sharply. Diarization performance, measured via SA‑WER, was superior for the commercial APIs (AssemblyAI, Deepgram), which provide integrated speaker attribution; open‑source pipelines required external diarization tools such as Pyannote, adding complexity and potential error sources.
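For illustration, here is a hedged sketch of the kind of open-source pipeline the authors allude to, pairing Whisper transcription with Pyannote speaker turns. The model identifiers and the overlap heuristic are assumptions, not the authors' configuration.

```python
# Sketch: Whisper ASR + Pyannote diarization, merged by temporal overlap.
import whisper
from pyannote.audio import Pipeline

asr = whisper.load_model("large-v3")
diarizer = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="HF_TOKEN",  # placeholder Hugging Face access token
)

result = asr.transcribe("anamnese.wav", language="de")
diarization = diarizer("anamnese.wav")

def speaker_at(start: float, end: float) -> str:
    """Assign the speaker whose diarized turn overlaps the ASR segment most."""
    best, best_overlap = "UNKNOWN", 0.0
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        overlap = min(end, turn.end) - max(start, turn.start)
        if overlap > best_overlap:
            best, best_overlap = speaker, overlap
    return best

for seg in result["segments"]:
    print(f'{speaker_at(seg["start"], seg["end"])}: {seg["text"].strip()}')
```

Each hand-off in such a pipeline (segment boundaries, overlap assignment) is a potential error source that the integrated commercial APIs avoid, which is consistent with their better SA-WER.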
From a data‑privacy perspective, the authors emphasize that on‑premise deployment of open‑weight models (Voxtral, Whisper) aligns with GDPR requirements, preserving data sovereignty while delivering competitive accuracy. Commercial APIs, while offering the highest raw transcription quality, involve transmitting patient audio to external servers, raising compliance concerns.
The discussion underscores that current ASR technology has crossed a critical threshold for safe use in German medical settings, provided that models are selected with attention to dialect robustness, terminology coverage, and privacy constraints. The paper suggests future work on domain‑specific fine‑tuning, dialect adaptation, and integration with large language models for downstream clinical NLP tasks. In summary, AssemblyAI Universal and Voxtral Small represent the state‑of‑the‑art for German medical speech transcription, balancing accuracy, efficiency, and regulatory compliance.