LALM-as-a-Judge: Benchmarking Large Audio-Language Models for Safety Evaluation in Multi-Turn Spoken Dialogues

LALM-as-a-Judge: Benchmarking Large Audio-Language Models for Safety Evaluation in Multi-Turn Spoken Dialogues
Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Spoken dialogues with and between voice agents are becoming increasingly common, yet assessing them for their socially harmful content such as violence, harassment, and hate remains text-centric and fails to account for audio-specific cues and transcription errors. We present LALM-as-a-Judge, the first controlled benchmark and systematic study of large audio-language models (LALMs) as safety judges for multi-turn spoken dialogues. We generate 24,000 unsafe and synthetic spoken dialogues in English that consist of 3-10 turns, by having a single dialogue turn including content with one of 8 harmful categories (e.g., violence) and on one of 5 grades, from very mild to severe. On 160 dialogues, 5 human raters confirmed reliable unsafe detection and a meaningful severity scale. We benchmark three open-source LALMs: Qwen2-Audio, Audio Flamingo 3, and MERaLiON as zero-shot judges that output a scalar safety score in [0,1] across audio-only, transcription-only, or multimodal inputs, along with a transcription-only LLaMA baseline. We measure the judges’ sensitivity to detecting unsafe content, the specificity in ordering severity levels, and the stability of the score in dialogue turns. Results reveal architecture- and modality-dependent trade-offs: the most sensitive judge is also the least stable across turns, while stable configurations sacrifice detection of mild harmful content. Transcription quality is a key bottleneck: Whisper-Large may significantly reduce sensitivity for transcription-only modes, while largely preserving severity ordering. Audio becomes crucial when paralinguistic cues or transcription fidelity are category-critical. We summarize all findings and provide actionable guidance for practitioners.


💡 Research Summary

The paper addresses the growing need for automatic safety assessment of spoken dialogues, where voice agents are increasingly used in customer service, tutoring, healthcare, and other domains. Existing safety‑moderation work focuses on text, ignoring audio‑specific cues such as prosody, emphasis, background sounds, and transcription errors that can dramatically affect the perception of harmful content. To fill this gap, the authors introduce “LALM‑as‑a‑Judge,” the first controlled benchmark that evaluates large audio‑language models (LALMs) as safety judges for multi‑turn spoken conversations.

Data creation starts from 100 safe dialogues drawn from the DEEPDIALOGUE corpus (3–10 turns each). For each dialogue, a single turn is replaced with a synthetic unsafe version generated by GPT‑4o. The replacement is conditioned on (i) a chosen safety category among eight (hate, harassment, dangerous, deception, violence, sexual, self‑harm, overall) and (ii) a severity level on a five‑point ordinal scale (very mild to severe). GPT‑4o also produces an emotion label for the new turn, allowing the synthesis of emotionally consistent speech. The revised turn is rendered with Coqui XTTS‑v2, preserving the original speaker’s voice, and inserted back into the dialogue, yielding 24 000 spoken dialogue variants (including the original 100 safe ones). Each variant differs from its safe counterpart in exactly one turn, enabling fine‑grained attribution of safety judgments.

A human anchor study with five independent raters evaluated 160 sampled dialogues (60 safe, 100 unsafe). The study confirmed that the synthetic unsafe turns are reliably perceived as harmful (Cohen’s κ ≈ 0.84 for binary safety, weighted κ ≈ 0.80 for severity) and that the severity scale aligns with human perception (Spearman ρ ≈ 0.59, 83.5 % within ±1 level). Detection rates increase monotonically with severity, reaching 100 % for levels 3–5.

The benchmark tests three open‑source LALMs of comparable size: Qwen2‑Audio (7 B), Audio Flamingo 3 (8.2 B), and MERaLiON (10 B). A text‑only baseline, LLaMA‑3.1‑8 B, is also evaluated. Each model is probed under three modalities—audio‑only, transcript‑only, and multimodal (audio + transcript). For audio‑involved configurations, three transcription sources are examined: ground‑truth (GT), Whisper‑Large, and Whisper‑Base. Five prompting strategies are explored (basic, chain‑of‑thought, few‑shot, rubric‑anchored, calibrated) to assess how instruction design influences scoring.

Three evaluation dimensions are defined: (i) sensitivity (true‑positive rate for unsafe content), (ii) severity ordering specificity (mean safety score drop Δ_i for each severity level i), and (iii) turn‑position stability (variance of scores across dialogue turns). Results reveal clear trade‑offs. The most sensitive configuration—Qwen2‑Audio in audio‑only mode—achieves the highest unsafe detection rate but exhibits the greatest score volatility across turns, making it less reliable for continuous monitoring. Conversely, MERaLiON in transcript‑only mode with Whisper‑Large yields the most stable scores (low variance) but misses many mild‑severity cases (TPR ≈ 68 % for severity 1‑2).

Transcription quality emerges as a critical bottleneck. Whisper‑Large reduces sensitivity by roughly 10 % compared with GT transcripts in transcript‑only settings, yet it preserves the relative ordering of severity levels, indicating that accurate ranking can survive imperfect transcriptions. Audio cues become decisive for categories where prosodic emphasis or background context conveys harmful intent (e.g., threats whispered with anger). In such cases, audio‑only or multimodal configurations outperform transcript‑only judges.

Architectural differences outweigh sheer parameter count: Audio‑centric models (Qwen2‑Audio) excel at raw detection, while multimodal models (Audio Flamingo 3) strike a balance between detection and stability. Prompt engineering also matters; calibrated prompts improve the spread of scores, reducing clustering around mid‑range values.

The authors synthesize these findings into actionable guidance: (1) prioritize audio‑only LALMs when the deployment environment includes rich acoustic signals and when detecting any unsafe content is paramount; (2) opt for transcript‑only or multimodal LALMs with high‑quality ASR when stability across turns is essential, especially in low‑noise or scripted settings; (3) select transcription engines based on expected noise levels—Whisper‑Large is acceptable for severity ordering but may miss subtle threats; (4) tailor prompting strategy to the model, using calibrated or rubric‑anchored prompts for more discriminative scoring.

In summary, “LALM‑as‑a‑Judge” provides the first open‑source benchmark for safety evaluation of spoken multi‑turn dialogues, demonstrates that LALMs can serve as effective judges, and uncovers nuanced interactions among model architecture, input modality, transcription quality, and prompting. The work lays a foundation for future research on multilingual safety, real‑time streaming assessment, and integration of LALM judges into production voice‑assistant pipelines.


Comments & Academic Discussion

Loading comments...

Leave a Comment