ABCD: All Biases Come Disguised
Multiple-choice question (MCQ) benchmarks are a standard evaluation practice for measuring LLMs’ ability to reason and answer knowledge-based questions. Using a synthetic NonsenseQA benchmark, we observe that different LLMs exhibit varying degrees of label, position, and few-shot-prompt bias: a model may rely on the answer’s position, the label preceding it, the distribution of correct answers in the few-shot prompt, or a combination of all three. We propose a simple bias-reduced evaluation protocol that replaces each question’s labels with uniform, unordered labels and prompts the LLM to generate the full answer text. Using a lightweight sentence-similarity model to match generations to options, we demonstrate improved robustness and lower standard deviation across answer permutations with only a minimal drop in performance, exposing the LLM’s capabilities under reduced evaluation artifacts, without any help from prompt examples or option labels. Across multiple benchmarks and models, this protocol substantially improves robustness to answer permutations, reducing mean accuracy variance $3\times$ with only a minimal decrease in mean accuracy. Through ablation studies on various embedding models and similarity functions, we show that the method is consistently more robust than standard protocols.
💡 Research Summary
The paper “ABCD: All Biases Come Disguised” investigates hidden sources of bias in multiple‑choice question (MCQ) evaluation of large language models (LLMs). While prior work has documented label‑bias (the model prefers certain answer letters) and position‑bias (the model prefers answers in particular slots), the authors identify a third, under‑explored bias: few‑shot prompt distribution bias. This occurs when the set of exemplar questions supplied in a few‑shot prompt contains a non‑uniform distribution of correct answer labels, which LLMs can exploit as a statistical cue.
To expose these biases, the authors construct a synthetic benchmark called NonsenseQA. Each instance consists of a random‑word question and four random‑word answer options, with the “golden” answer assigned uniformly at random. In theory, performance should hover around chance (25%). However, when evaluated with the standard Select‑and‑Letter (S&L) protocol—where options are labeled A, B, C, D and the model is asked to output a single label—some models achieve >90% accuracy, demonstrating that they are leveraging label, position, and prompt‑distribution cues rather than any semantic understanding.
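To make the construction concrete, here is a minimal sketch of how such a diagnostic instance might be generated. The word list and field names are illustrative placeholders, not the paper’s actual construction; the only essential properties are that question and options are semantically meaningless and that the gold index is uniform.

```python
import random

# Illustrative nonsense vocabulary; the paper's actual word source may differ.
WORDS = ["blorp", "zinth", "quav", "mrex", "flonk", "durst", "plim", "vexor"]

def random_phrase(rng, n_words=3):
    return " ".join(rng.choice(WORDS) for _ in range(n_words))

def make_nonsense_instance(rng, n_options=4):
    """Build one NonsenseQA-style item: a random-word question,
    random-word options, and a gold index drawn uniformly at random,
    so any accuracy above 1/n_options signals an exploited artifact."""
    return {
        "question": random_phrase(rng) + "?",
        "options": [random_phrase(rng) for _ in range(n_options)],
        "gold": rng.randrange(n_options),  # uniform => chance accuracy is 25%
    }

rng = random.Random(0)
item = make_nonsense_instance(rng)
```

Because no option is semantically related to the question, any model that beats 25% on such items must be reading labels, positions, or prompt statistics.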
Based on insights from NonsenseQA, the authors propose a bias‑reduced evaluation protocol called Matched‑and‑Dashed (M&D). The protocol has three key components:
- Uniform, unordered labels – All options are prefixed with the same dash (“‑”), removing any ordinal information that could be inferred from distinct symbols.
- Full‑text answer generation – Instead of asking the model to output a label, the prompt instructs the model to generate the complete answer sentence. This aligns with the way instruction‑tuned models naturally respond.
- Semantic matching – The generated answer is embedded using a lightweight sentence‑embedding model (Qwen3‑Embedding‑0.6B) and compared to each candidate option via cosine similarity. The option with the highest similarity is selected as the model’s prediction.
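The matching step can be sketched as follows. This is a dependency-free toy: a real pipeline would embed with a dense sentence-embedding model such as Qwen3-Embedding-0.6B, whereas here a bag-of-words embedding stands in so the cosine-similarity selection logic is visible.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding'; a real pipeline would call a
    sentence-embedding model (e.g. Qwen3-Embedding-0.6B) here."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def match_answer(generated, options):
    """Return the index of the option most similar to the model's
    free-text generation -- the M&D prediction rule."""
    g = embed(generated)
    sims = [cosine(g, embed(o)) for o in options]
    return max(range(len(options)), key=sims.__getitem__)

options = ["the mitochondria produce ATP",
           "the nucleus stores DNA",
           "ribosomes synthesize proteins"]
pred = match_answer("ATP is produced by the mitochondria", options)
# pred == 0: the generation matches the first option despite paraphrasing
```

Note that the prediction depends only on the option texts, never on a label or a slot index, which is what removes the label and position cues.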
Crucially, M&D requires only a single forward pass, no fine‑tuning, and no access to internal logits or attention maps. The additional computational cost is modest (≈3% extra time for embedding and similarity computation).
The authors evaluate 13 open‑source LLMs (8B–32B parameters, including DeepSeek‑R1, Qwen3, Llama‑3.1, Gemma‑3, etc.) across five real‑world MCQ benchmarks: CommonsenseQA, ARC, MMLU‑Pro, GPQA, and a multilingual subset of INCLUDE (Spanish, French, Italian, German). For each model they conduct “answer‑moving attacks” that permute the position of the correct answer both in the test question and in the few‑shot exemplars, measuring accuracy across many permutations. They report two main metrics:
- Mean accuracy variance – The spread of accuracies across permutations. M&D reduces this variance by roughly a factor of three compared to S&L.
- SCORE robustness metric – A similarity‑based consistency score that captures how often the model’s predictions remain semantically stable across permutations. M&D achieves higher SCORE values while only incurring a small drop (≈1–2%) in average accuracy.
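An answer-moving attack of this kind can be sketched as below. The harness evaluates a predictor under every cyclic shift of the option list and reports one accuracy per permutation; the `always_first` "model" is a deliberately biased stand-in (not any model from the paper) whose accuracy swings wildly across shifts, which is exactly the variance the protocol is designed to suppress.

```python
from statistics import pvariance

def answer_moving_accuracies(predict, items):
    """Evaluate predict(question, options) under every cyclic shift of
    the option list (a simple answer-moving attack); returns one
    accuracy per permutation."""
    n = len(items[0]["options"])
    accs = []
    for shift in range(n):
        correct = 0
        for it in items:
            opts = it["options"][shift:] + it["options"][:shift]
            gold = (it["gold"] - shift) % n  # gold index after the shift
            if predict(it["question"], opts) == gold:
                correct += 1
        accs.append(correct / len(items))
    return accs

# A position-biased "model" that always picks the first option.
always_first = lambda question, options: 0
items = [{"question": "q", "options": ["a", "b", "c", "d"], "gold": 0}]
accs = answer_moving_accuracies(always_first, items)
spread = pvariance(accs)  # large spread exposes the position bias
# accs == [1.0, 0.0, 0.0, 0.0]
```

A content-based predictor would produce the same accuracy under every shift, driving this variance toward zero.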
Ablation studies confirm that (a) simply replacing distinct labels with dashes already yields a large reduction in bias, (b) the choice of embedding model (Qwen3‑Embedding‑0.6B vs. alternatives) and similarity function (cosine vs. dot product) has minimal impact on final performance, and (c) the protocol remains effective across languages.
In summary, the paper makes three contributions:
- A practical, low‑overhead debiasing protocol (M&D) that eliminates label and position cues and mitigates few‑shot distribution bias without requiring model‑specific modifications.
- The NonsenseQA diagnostic dataset, which isolates and quantifies the extent to which LLMs rely on superficial MCQ artifacts.
- A comprehensive bias analysis, revealing that LLMs can exploit a combination of label, position, and prompt‑distribution signals even on semantically meaningless inputs.
The work highlights that many reported LLM capabilities on MCQ benchmarks may be inflated by exploitable artifacts. By adopting the M&D protocol, researchers can obtain a more faithful assessment of true reasoning ability, paving the way for more reliable benchmark design and model evaluation in the future.