Do Bias Benchmarks Generalise? Evidence from Voice-based Evaluation of Gender Bias in SpeechLLMs
Recent work on benchmarking bias and fairness in speech large language models (SpeechLLMs) has relied heavily on multiple-choice question answering (MCQA) formats: the model is tasked with choosing between stereotypical, anti-stereotypical, or neutral/irrelevant answers given an input speech prompt and an optional text prompt. Such MCQA benchmarks implicitly assume that model behaviour is consistent across other MCQA tasks, voices, and task formats such as more realistic, long-form evaluations. In this paper, we probe that assumption. We fine-tune three SpeechLLMs using LoRA adapters to induce specific MCQA behaviours: a preference for stereotypical, anti-stereotypical, or neutral/uncertain answers. We then evaluate whether these behaviours generalise to a second, distinct MCQA benchmark and, more critically, to long-form, creative generation tasks. Our results show that performance on an MCQA bias benchmark fails to reliably predict performance on other MCQA benchmarks, and more importantly on long-form tasks. We conclude that current MCQA bias benchmarks show limited evidence of cross-task generalisation in the speech domain, and we propose an evaluation suite for measuring behaviour transferability in future models and benchmarks.
💡 Research Summary
The paper investigates whether gender-bias benchmarks that rely on multiple-choice question answering (MCQA) for speech-based large language models (SpeechLLMs) generalise to other MCQA datasets and to realistic long-form generation tasks. The authors fine-tune three state-of-the-art SpeechLLMs (Qwen2-Audio-7B-Instruct, LTU-AS, and LLaMA-Omni) using low-rank adapters (LoRA) to enforce three distinct behaviours: always select the stereotypical answer, always select the anti-stereotypical answer, or produce a neutral/uncertain response. Fine-tuning data come from two MCQA benchmarks: the gender subset of Spoken StereoSet (SSS) and a newly created Speech-based Ambiguity and Gender-influenced Evaluation (SAGE) suite, which contains 600 occupational scenarios spoken by 20 male and 20 female TTS voices.
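To make the fine-tuning setup concrete, here is a minimal sketch of how one such behavioural adapter could be attached with the Hugging Face `peft` library. The rank, dropout, and target modules below are illustrative placeholders, not the paper's reported hyperparameters, and the data loading and training loop are omitted.

```python
import torch
from transformers import Qwen2AudioForConditionalGeneration
from peft import LoraConfig, get_peft_model

# Load one of the three SpeechLLMs studied; LTU-AS and LLaMA-Omni would need
# their own loading code.
base = Qwen2AudioForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-Audio-7B-Instruct",
    torch_dtype=torch.bfloat16,
)

# One adapter is trained per target behaviour (stereotypical,
# anti-stereotypical, neutral/uncertain); the supervision signal is the
# corresponding MCQA answer string.
lora_cfg = LoraConfig(
    r=16,                                 # adapter rank (assumed)
    lora_alpha=32,                        # scaling factor (assumed)
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections (assumed)
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the low-rank adapter weights train
```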
The study evaluates two axes of transfer: (1) cross-benchmark consistency (does a model trained to behave a certain way on one MCQA benchmark exhibit the same behaviour on a different MCQA benchmark?) and (2) MCQA-to-long-form transfer (does the learned bias, or debiasing, persist when the model is asked to generate free-form responses?). For long-form assessment the authors introduce SAGE-LF, a set of 80 prompts covering therapy advice, career counselling, interview screening, and story generation, each paired with the same TTS voices used in the MCQA tasks. Outputs are judged by an LLM-based evaluator (Gemini-2.5-flash-lite) on dimensions such as emotional validation, STEM vs. care orientation, leadership endorsement, and salary generosity, with a subset validated by human annotators.
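A rough sketch of what such an LLM-judge call might look like, using the `google-generativeai` client; the rubric wording and output schema here are assumptions, not the paper's actual evaluation prompt.

```python
import json
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # hypothetical placeholder
judge = genai.GenerativeModel("gemini-2.5-flash-lite")

DIMENSIONS = (
    "emotional_validation", "stem_orientation",
    "leadership_endorsement", "salary_generosity",
)

def score_long_form(response_text: str) -> dict:
    """Ask the judge model to rate one long-form output on each bias dimension."""
    prompt = (
        "Rate the following response from 1 (low) to 5 (high) on each of these "
        "dimensions: " + ", ".join(DIMENSIONS) + ". "
        "Return a JSON object with exactly those keys and integer values.\n\n"
        "Response:\n" + response_text
    )
    result = judge.generate_content(prompt)
    # Naive parse; a real pipeline would strip code fences and retry on errors.
    return json.loads(result.text)
```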
Results show that while same-benchmark fine-tuning yields near-perfect alignment (e.g., SAGE → SAGE), cross-benchmark transfer is only partial and highly variable across models. Notably, LLaMA-Omni, when fine-tuned for “neutral” behaviour, frequently refuses to select any of the MCQA options, outputting “None of the above” over 70% of the time, which suggests that neutral fine-tuning teaches option avoidance rather than unbiased reasoning. In the long-form tasks, the expected pattern (reduced emotional validation for women and increased STEM/leadership scores after anti-stereotypical fine-tuning) appears inconsistently: some dimensions improve modestly, while others shift in the opposite direction or exhibit large task-specific variance.
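The headline numbers reduce to simple per-item outcome rates. A small illustrative helper follows; the category names and example figures are placeholders, not the paper's data.

```python
from collections import Counter

def outcome_rates(outcomes: list[str]) -> dict[str, float]:
    """`outcomes` holds one label per MCQA item ("stereo", "anti", "neutral",
    or "refusal"), assigned by matching the model's chosen option against the
    item's annotated answers."""
    counts = Counter(outcomes)
    total = len(outcomes)
    return {k: counts[k] / total for k in ("stereo", "anti", "neutral", "refusal")}

# Illustrative only: a neutral-tuned model that mostly answers "None of the
# above", echoing the LLaMA-Omni observation described above.
print(outcome_rates(["refusal"] * 72 + ["neutral"] * 18 + ["stereo"] * 10))
# {'stereo': 0.1, 'anti': 0.0, 'neutral': 0.18, 'refusal': 0.72}
```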
The authors conclude that MCQA bias benchmarks provide limited predictive power for other MCQA datasets and, especially, for realistic generative scenarios. They argue that future bias evaluation must incorporate task-transfer measurements and propose the open-source SAGE and SAGE-LF suites as a starting point for more comprehensive, speech-grounded bias assessment. The work highlights the need for richer, context-aware benchmarks to ensure that debiasing interventions truly translate into safer, fairer behaviour in real-world speech applications.