BenchMarker: An Education-Inspired Toolkit for Highlighting Flaws in Multiple-Choice Benchmarks
Multiple-choice question answering (MCQA) is standard in NLP, but benchmarks lack rigorous quality control. We present BenchMarker, an education-inspired toolkit using LLM judges to flag three common MCQ flaws: 1) contamination - items appearing verbatim online; 2) shortcuts - cues in the choices that enable guessing; and 3) writing errors - structural/grammatical issues based on a 19-rule education rubric. We validate BenchMarker with human annotations, then run the tool to audit 12 benchmarks, revealing that: 1) contaminated MCQs tend to inflate accuracy, while writing errors tend to lower it and change model rankings beyond random variation; and 2) prior benchmark repairs address their targeted issues (e.g., lowering accuracy with LLM-written distractors) but inadvertently add new flaws (e.g., implausible distractors, multiple correct answers). Overall, flaws in MCQs degrade NLP evaluation, but education research offers a path forward. We release BenchMarker to bridge the fields and improve MCQA benchmark design.
💡 Research Summary
Multiple‑choice question answering (MCQA) has become a cornerstone of natural‑language‑processing (NLP) evaluation, yet most MCQA benchmarks are built without the rigorous quality‑control practices that educators apply to classroom tests. This paper introduces BenchMarker, an education‑inspired toolkit that automatically flags three pervasive flaws in MCQ datasets: (1) contamination, where a question‑answer pair appears verbatim on the web; (2) shortcuts, where superficial cues in the answer choices enable a model to guess the correct answer without reading the stem; and (3) writing errors, which encompass grammatical, structural, and stylistic violations defined by a 19‑rule rubric from educational measurement research.
BenchMarker leverages large language models (LLMs) as judges. For contamination, it queries multiple search APIs (Google, Bing, DuckDuckGo, Brave) with the stem and gold answer, then asks an LLM to decide whether the exact item is present in the retrieved results. For shortcuts, three high‑performing LLMs (GPT‑5, Gemini 2.5 Pro, Claude 4.5 Sonnet) are prompted to answer using only the answer choices; a secondary LLM then judges whether the inferred question derived from this “choices‑only” answer matches the original stem. If the model succeeds with the choices alone and the inferred question diverges, the item is labeled as having a shortcut. For writing errors, each of the 19 rubric rules is encoded in a prompt with definition and six examples (three flawed, three correct). An LLM evaluates whether a given MCQ violates each rule, producing a binary flag per rule.
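The shortcut check described above can be sketched in a few lines. This is a hedged illustration, not the paper's actual implementation: `ask_llm` is a hypothetical placeholder for a real LLM API call (here a deterministic stub so the control flow runs end to end), and the exact-string comparison stands in for the secondary LLM judge that decides whether the inferred question matches the original stem.

```python
# Sketch of the "choices-only" shortcut check: flag an item if a model can
# answer correctly from the choices alone AND the question it infers from
# those choices diverges from the original stem.

def ask_llm(prompt: str) -> str:
    """Hypothetical placeholder for a real LLM call (e.g., to GPT-5).
    Stubbed deterministically here for illustration only."""
    if "Answer using ONLY the choices" in prompt:
        return "B"  # choices-only guess
    if "What question" in prompt:
        return "Which planet is largest?"  # inferred question
    return ""

def has_shortcut(stem: str, choices: dict[str, str], gold: str) -> bool:
    choices_text = "\n".join(f"{k}) {v}" for k, v in choices.items())
    # Step 1: answer from the choices alone, without showing the stem.
    guess = ask_llm(f"Answer using ONLY the choices:\n{choices_text}")
    if guess != gold:
        return False  # the choices alone do not give the answer away
    # Step 2: reconstruct the question from the choices; in the real
    # pipeline a secondary LLM judges the match, here we compare strings.
    inferred = ask_llm(f"What question do these choices answer?\n{choices_text}")
    diverges = inferred.strip().lower() != stem.strip().lower()
    return diverges

flagged = has_shortcut(
    stem="Which planet is fifth from the Sun?",
    choices={"A": "Mars", "B": "Jupiter", "C": "Venus"},
    gold="B",
)
print(flagged)  # the stub guesses correctly and infers a different question
```

Here the stub answers "B" (correct) from the choices alone, but infers a different question than the stem, so the item is flagged as having a shortcut.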
To validate the system, the authors assembled a human‑annotated benchmark covering 12 popular MCQA datasets (TruthfulQA, HellaSwag, MMLU, ARC, PIQA, etc.). They sampled up to 10 flawed and 10 non‑flawed items per flaw type per dataset, yielding 8,042 annotated instances: 229 for contamination, 271 for shortcuts, and 3,419 for writing errors on NLP data, plus 4,123 existing labels from higher‑education exams for out‑of‑domain validation. Human annotators performed web searches across four engines, compared inferred versus original questions, and judged rule violations using a protocol in which two expert annotators achieved >80% inter‑annotator agreement.
BenchMarker’s predictions were compared against human labels using accuracy, F1, and Cohen’s κ. Across 23 LLMs spanning seven families (Gemini, GPT‑5, Claude, Command, Qwen‑3, Gemma‑3, LLaMA‑3) and six search APIs, the best-performing configurations (GPT‑5, Gemini Pro, Claude Sonnet with Google/Bing) achieved κ scores up to 0.78 for contamination, 0.71 for shortcuts, and 0.74 for writing errors—substantially higher than baseline heuristics and the existing SAQUT tool for rule detection.
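Cohen's κ, the agreement statistic reported above, corrects raw accuracy for agreement expected by chance. A stdlib-only sketch of the computation (the label vectors below are illustrative, not from the paper):

```python
# Cohen's kappa: chance-corrected agreement between two binary labelers,
# e.g., BenchMarker's flags versus human annotations.
from collections import Counter

def cohens_kappa(pred: list[int], gold: list[int]) -> float:
    n = len(pred)
    # Observed agreement: fraction of items both labelers agree on.
    po = sum(p == g for p, g in zip(pred, gold)) / n
    pc, gc = Counter(pred), Counter(gold)
    # Expected agreement if each labeler assigned labels independently
    # according to their own marginal label frequencies.
    pe = sum((pc[k] / n) * (gc[k] / n) for k in set(pred) | set(gold))
    return (po - pe) / (1 - pe)

pred = [1, 1, 0, 1, 0, 0, 1, 0]  # illustrative tool flags
gold = [1, 1, 0, 0, 0, 0, 1, 1]  # illustrative human labels
kappa = cohens_kappa(pred, gold)
print(round(kappa, 3))  # -> 0.5 (observed 0.75, chance 0.5)
```

With balanced labels, 75% raw agreement reduces to κ = 0.5, which is why the reported κ values of 0.71 to 0.78 indicate substantial agreement well above chance.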
Applying BenchMarker to the 12 datasets revealed systematic quality problems, especially in automatically generated or crowdsourced collections. Notably, 47% of TruthfulQA items were found online, 21% of ScholarIQA items exhibited shortcuts, and 100% of HellaSwag items violated at least two writing rules. The authors then examined the impact of these flaws on model performance. Contaminated splits inflated LLM accuracy by 4–6 percentage points (e.g., GPT‑5's score rose from 68% to 73% on a contaminated TruthfulQA split), indicating memorization rather than reasoning. Conversely, items with multiple writing errors reduced accuracy by 3–5 percentage points and caused model‑ranking changes that exceeded random‑permutation thresholds, undermining the reliability of comparative evaluations.
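The "beyond random permutation thresholds" criterion can be made concrete with a small simulation. This is a hedged sketch under assumed data, not the paper's procedure: the model names, accuracies, and per-item correctness below are synthetic. The idea is to build a null distribution of ranking changes from random item subsets, then ask whether the change observed on flawed items exceeds, say, the 95th percentile of that null.

```python
# Null-distribution sketch: how much does a model ranking shift just from
# resampling items at random? (All data here is synthetic/illustrative.)
import random

def kendall_tau_distance(r1: list[str], r2: list[str]) -> int:
    """Number of model pairs ordered differently in the two rankings."""
    pos = {m: i for i, m in enumerate(r2)}
    return sum(
        1
        for i in range(len(r1))
        for j in range(i + 1, len(r1))
        if pos[r1[i]] > pos[r1[j]]
    )

def rank(scores: dict[str, float]) -> list[str]:
    return sorted(scores, key=scores.get, reverse=True)

random.seed(0)
models = ["m1", "m2", "m3", "m4"]          # hypothetical models
n_items = 200
# Synthetic per-item correctness at assumed accuracy levels.
correct = {m: [random.random() < p for _ in range(n_items)]
           for m, p in zip(models, [0.70, 0.66, 0.62, 0.58])}

full_rank = rank({m: sum(c) / n_items for m, c in correct.items()})

def subset_distance(idx: list[int]) -> int:
    sub = rank({m: sum(correct[m][i] for i in idx) / len(idx) for m in models})
    return kendall_tau_distance(full_rank, sub)

# Null: ranking shift on 1,000 random half-size item subsets.
null = [subset_distance(random.sample(range(n_items), n_items // 2))
        for _ in range(1000)]
threshold = sorted(null)[int(0.95 * len(null))]  # 95th-percentile shift
print("random-subset threshold (tau distance):", threshold)
```

A ranking change on a flawed subset that exceeds this threshold cannot be explained by item sampling noise alone, which is the sense in which writing errors changed rankings "beyond random."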
The paper also audits prior “benchmark repair” efforts. For example, MMLU‑Pro introduced LLM‑generated distractors to lower accuracy, but this created implausible distractors and, in some cases, multiple correct answers—new flaws not captured by the original repair goal. Such findings underscore that fixing one defect can inadvertently introduce others, highlighting the need for an iterative, comprehensive quality‑control pipeline.
BenchMarker itself is built on the InspectAI library, providing standardized prompts, judge logs, and a lightweight UI for monitoring runs. The authors release the code, the 8,042‑instance validation set, and the full audit results, inviting the community to adopt education‑derived standards for MCQA benchmark construction and maintenance.
In sum, the contributions are: (1) a novel, LLM‑based toolkit that operationalizes three education‑validated MCQ quality dimensions; (2) extensive human validation demonstrating high agreement with LLM judges; (3) empirical evidence that contamination, shortcuts, and writing errors materially affect NLP model evaluation; (4) a critical analysis showing that existing benchmark revisions may create new problems; and (5) open‑source resources to enable systematic, repeatable quality assessment of future MCQA datasets. BenchMarker thus bridges educational measurement and NLP, offering a path toward more trustworthy, construct‑valid benchmarks for language understanding.