Criterion-referenceability determines LLM-as-a-judge validity across physics assessment formats
As large language models (LLMs) are increasingly considered for automated assessment and feedback, understanding when LLM marking can be trusted is essential. We evaluate LLM-as-a-judge marking across three physics assessment formats - structured questions, written essays, and scientific plots - comparing GPT-5.2, Grok 4.1, Claude Opus 4.5, DeepSeek-V3.2, Gemini Pro 3, and committee aggregations against human markers under blind, solution-provided, false-solution, and exemplar-anchored conditions. For $n=771$ blind university exam questions, models achieve fractional mean absolute errors (fMAE) $\approx 0.22$ with robust discriminative validity (Spearman $\rho > 0.6$). For secondary and university structured questions ($n=1151$), providing official solutions reduces MAE and strengthens validity (committee $\rho = 0.88$); false solutions degrade absolute accuracy but leave rank ordering largely intact (committee $\rho = 0.77$; individual models $\rho \geq 0.59$). Essay marking behaves fundamentally differently. Across $n=55$ scripts ($n=275$ essays), blind AI marking is harsher and more variable than human marking, with discriminative validity already poor ($\rho \approx 0.1$). Adding a mark scheme does not improve discrimination ($\rho \approx 0$; all confidence intervals include zero). Anchored exemplars shift the AI mean close to the human mean and compress variance below the human standard deviation, but discriminative validity remains near-zero - distributional agreement can occur without valid discrimination. For code-based plot elements ($n=1400$), models achieve exceptionally high discriminative validity ($\rho > 0.84$) with near-linear calibration. Across all task types, validity tracks criterion-referenceability - the extent to which a task maps to explicit, observable grading features - and benchmark reliability, rather than raw model capability.
💡 Research Summary
This paper investigates the conditions under which large language models (LLMs) can be trusted as automated graders in physics education. The authors introduce the notion of “criterion‑referenceability” – the degree to which grading criteria can be made explicit, observed, and consistently applied to student work. They argue that tasks with high criterion‑referenceability (e.g., structured numerical problems) should yield higher absolute accuracy and stronger discriminative validity for LLM judges, whereas low‑referenceability tasks (e.g., holistic essays) will not.
Three assessment formats were examined: (1) structured physics questions, (2) short‑form essays, and (3) scientific coding/plot artifacts. Five state‑of‑the‑art LLMs (GPT‑5.2, Grok 4.1, Claude Opus 4.5, DeepSeek‑V3.2, Gemini Pro 3) and a committee ensemble were evaluated against human markers under four prompting conditions: blind (no solution), solution‑provided, false‑solution (intentionally corrupted), and exemplar‑anchored.
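The exact prompt wording is not reproduced in this summary; the sketch below illustrates, under assumed phrasing, how the four conditions could be assembled for a single question. All field names and strings are hypothetical placeholders, not the paper's prompts.

```python
# Minimal sketch (wording assumed, not taken from the paper) of how the four
# grading conditions differ when assembled into a marking prompt.

def build_prompt(question, student_answer, condition,
                 solution=None, exemplars=None, max_marks=10):
    header = (
        f"You are marking a physics question worth {max_marks} marks.\n"
        f"Question:\n{question}\n\nStudent answer:\n{student_answer}\n\n"
    )
    if condition == "blind":
        context = ""  # no solution provided: the model must verify the physics itself
    elif condition in ("solution", "false_solution"):
        # the false-solution condition passes an intentionally corrupted solution here
        context = f"Official solution:\n{solution}\n\n"
    elif condition == "exemplar":
        context = "Marked exemplar answers:\n" + "\n---\n".join(exemplars) + "\n\n"
    else:
        raise ValueError(f"unknown condition: {condition!r}")
    return header + context + f"Return a single integer mark out of {max_marks}."
```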
Structured questions came in two sets. On the 771 blind university exam items, the models achieved a fractional mean absolute error (fMAE) of ≈0.22 and Spearman ρ > 0.6 even without a solution, indicating robust rank‑ordering. On the 1,151 secondary and university structured questions, supplying the official solution reduced MAE and raised the committee's ρ to 0.88. Using a false solution increased absolute error but left rank ordering relatively intact (committee ρ = 0.77; individual models ρ ≥ 0.59), suggesting that LLMs perform some independent physics verification rather than mere pattern matching.
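To make the two headline metrics concrete, the sketch below computes fMAE and Spearman ρ on invented marks, assuming fMAE means the absolute error normalized by the marks available for each question (one natural reading; the paper's exact normalization may differ).

```python
import numpy as np
from scipy.stats import spearmanr

# Invented marks for five questions, purely to illustrate the two headline metrics.
human     = np.array([8, 5, 9, 3, 7], dtype=float)   # human marks
ai        = np.array([7, 6, 9, 2, 5], dtype=float)   # AI marks on the same questions
max_marks = np.array([10, 10, 10, 5, 10], dtype=float)

# Fractional MAE: absolute error scaled by the marks available for each question
# (assumed definition; fMAE ~ 0.22 would mean being off by roughly a fifth of the marks).
fmae = np.mean(np.abs(ai - human) / max_marks)

# Discriminative validity: does the AI rank the answers the way the humans do?
rho, p_value = spearmanr(ai, human)
print(f"fMAE = {fmae:.3f}, Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```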
Essays (55 scripts, 275 individual essays) revealed a stark contrast. Human inter‑rater reliability was already low (average pairwise ρ ≈ 0.054, ICC ≈ 0.035). Blind AI grading was harsher and more variable than human grading, with poor discriminative validity (ρ ≈ 0.1), and supplying a mark scheme did not help (ρ ≈ 0, with all confidence intervals including zero). Anchoring the model with exemplar essays aligned the AI mean with the human mean and compressed variance, yet discriminative validity stayed near zero. Thus, distributional agreement can be achieved without meaningful rank discrimination.
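The human-baseline statistics quoted above can be computed from a scripts-by-markers matrix of essay marks; the sketch below shows one way to do so, assuming a simple one-way ICC(1,1), which may not be the exact variant the study uses.

```python
import numpy as np
from itertools import combinations
from scipy.stats import spearmanr

def human_agreement(marks):
    """marks: (n_scripts, n_raters) array of essay marks; data layout assumed."""
    n, k = marks.shape

    # Average pairwise Spearman rho across all rater pairs
    pair_rhos = [spearmanr(marks[:, i], marks[:, j])[0]
                 for i, j in combinations(range(k), 2)]

    # One-way ICC(1,1) from the classical ANOVA decomposition
    # (assumed variant; the paper may use a different ICC form)
    script_means = marks.mean(axis=1)
    ms_between = k * np.sum((script_means - marks.mean()) ** 2) / (n - 1)
    ms_within  = np.sum((marks - script_means[:, None]) ** 2) / (n * (k - 1))
    icc = (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

    return float(np.mean(pair_rhos)), float(icc)
```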
Scientific plots (1,400 code‑generated visual artifacts) combine an open‑ended format with highly concrete grading features. All models achieved ρ > 0.84 and displayed near‑linear calibration between AI and human scores, demonstrating that when grading criteria are explicit and observable (e.g., axis labels, data‑point accuracy, code comments) LLMs can serve as highly valid judges.
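"Near-linear calibration" can be checked with an ordinary linear fit of AI scores against human scores; the sketch below (with invented placeholder data) shows the slope, intercept, and r² one would inspect.

```python
import numpy as np
from scipy.stats import linregress

# Placeholder scores; in the study these would be the 1,400 per-element plot marks.
human_scores = np.array([0.2, 0.4, 0.5, 0.7, 0.9, 1.0])
ai_scores    = np.array([0.25, 0.35, 0.55, 0.65, 0.85, 0.95])

# "Near-linear calibration": slope near 1 and intercept near 0 with high r^2 means
# AI scores can be read on the human scale without any rescaling.
fit = linregress(human_scores, ai_scores)
print(f"slope = {fit.slope:.2f}, intercept = {fit.intercept:.2f}, r^2 = {fit.rvalue**2:.2f}")
```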
Across all formats, discriminative validity tracked criterion‑referenceability and the intrinsic reliability of the task rather than raw model capability. Model committees consistently outperformed single models, reducing systematic biases. The false‑solution experiment confirmed that LLMs are not simply copying provided answers but can evaluate physics content.
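The aggregation rule behind the committee is not spelled out in this summary; a simple per-item mean across models, shown below with invented marks, is one plausible scheme for how an ensemble judge could be formed and compared with human marks.

```python
import numpy as np
from scipy.stats import spearmanr

# Invented per-model marks on the same five items (rows = models, columns = items).
model_marks = np.array([
    [7, 5, 9, 3, 6],
    [8, 4, 9, 2, 7],
    [6, 5, 8, 3, 5],
], dtype=float)
human_marks = np.array([8, 5, 9, 3, 7], dtype=float)

# Mean aggregation is one plausible committee rule (not necessarily the paper's).
committee = model_marks.mean(axis=0)

rho_single    = [spearmanr(m, human_marks)[0] for m in model_marks]
rho_committee = spearmanr(committee, human_marks)[0]
print("individual rho:", np.round(rho_single, 2), "| committee rho:", round(rho_committee, 2))
```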
Implications: Before deploying AI grading, educators should assess the criterion‑referenceability of each assessment type. High‑referenceability tasks (structured problems, plot evaluation) are suitable for autonomous LLM grading, while low‑referenceability tasks (holistic essays) require human oversight or at least AI assistance limited to feedback rather than final scores. Providing explicit rubrics or exemplars can improve mean agreement but does not guarantee valid rank ordering. The study offers a practical framework for integrating LLMs into physics assessment pipelines, emphasizing task‑specific design over blanket reliance on model size or sophistication.