When AI Evaluates Its Own Work: Validating Learner-Initiated, AI-Generated Physics Practice Problems

Notice: This research summary and analysis were generated automatically using AI technology. For full accuracy, please refer to the original arXiv source.

Large language models (LLMs) can now generate physics practice problems in real time, yet the educational value of these items hinges on rapid, reliable post-generation vetting. In this exploratory study, we investigated which automated checks are both technically feasible and pedagogically meaningful when exercises are produced on demand within a chatbot interface. A cohort of 34 introductory-physics students generated and attempted 543 practice problems during exam preparation. Each item was labeled by an expert on a wide range of quality attributes and presented to the learners in pairs to record their preference. We then (i) benchmarked three commodity LLMs as "judges" against the expert labels, (ii) quantified which attributes predict student choice via random-forest models, and (iii) triangulated these results with free-form exit surveys. Only a small subset of the original metrics proved necessary to reliably capture student preferences, either directly or by proxy. The study demonstrates that scalable formative assessment does not require exhaustive scoring: a carefully curated core of structural and learner-visible checks is sufficient to ensure both technical soundness and user appeal. The findings provide a practical blueprint for deploying real-time, AI-generated practice in physics and other quantitative disciplines.


💡 Research Summary

The paper investigates how large language models (LLMs) can be used not only to generate physics practice problems on demand but also to automatically evaluate the quality of those problems in real time. The motivation stems from the observation that, in large introductory physics courses, students frequently request additional, topic‑specific practice items during exam preparation. Human instructors cannot meet these individualized demands at scale, and while generative AI promises an equitable solution, AI‑generated items often suffer from misalignment with course content, ambiguous wording, unrealistic numerical values, or outright physical errors. Such defects undermine the formative value of the problems.

To address this, the authors conducted an exploratory study with 34 undergraduate physics students who interacted with a chatbot (built on the Ethel platform at ETH Zurich) during a simulated exam‑preparation session. Over the course of the study the chatbot generated 543 practice problems (both multiple‑choice and numerical) in response to student prompts. Each problem was subsequently annotated by a domain expert on a comprehensive set of quality attributes—approximately thirty metrics covering technical soundness (e.g., physical consistency, completeness of given data), pedagogical soundness (e.g., relevance of distractors, clarity of wording, realism of numerical values), and learner‑visible cues (e.g., presence of a solution explanation).

The core research questions were: (RQ1) Which metrics can be reliably judged by commodity LLMs when compared to expert labels? (RQ2) Which metrics predict the problems that students actually prefer when presented with a pair of candidate items? (RQ3) Among the reliable and relevant metrics, which subset can be assessed automatically, cheaply, and quickly enough for real‑time deployment?

For RQ1 the authors benchmarked three widely available LLMs—GPT‑4, Claude‑2, and Llama‑2‑70B—using a standardized JSON‑based prompting schema that asked each model to output a 0/1 score for every metric. Agreement with expert labels was measured with Cohen’s κ. GPT‑4 consistently achieved the highest agreement (κ ≥ 0.78 for most metrics), Claude‑2 performed moderately well, and Llama‑2‑70B lagged behind, especially on nuanced pedagogical attributes such as “distractor subtlety.”
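To make the agreement measure concrete, the sketch below computes Cohen's κ between an LLM judge's binary verdicts and expert labels for one quality metric. The labels shown are illustrative placeholders, not data from the study; the metric name and the implementation are assumptions for demonstration only.

```python
def cohens_kappa(expert, judge):
    """Cohen's kappa for two binary (0/1) raters over the same items."""
    assert len(expert) == len(judge)
    n = len(expert)
    # Observed agreement: fraction of items where the raters match
    p_o = sum(e == j for e, j in zip(expert, judge)) / n
    # Expected agreement if the raters were independent
    p_e = (sum(expert) / n) * (sum(judge) / n) \
        + (1 - sum(expert) / n) * (1 - sum(judge) / n)
    return (p_o - p_e) / (1 - p_e) if p_e < 1 else 1.0

# Hypothetical expert/judge labels for a metric such as "physical consistency"
expert = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
judge  = [1, 1, 0, 1, 1, 1, 1, 0, 1, 0]
print(round(cohens_kappa(expert, judge), 3))  # → 0.524
```

κ corrects raw agreement for chance: two raters who both label most items "1" will agree often even at random, so κ is the more honest benchmark when metric labels are imbalanced.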

To answer RQ2, the authors presented students with pairs of generated problems and recorded which one they chose to attempt. They then trained random‑forest classifiers using the full set of expert‑rated metrics as features, separately for multiple‑choice and numerical items. Feature importance analysis revealed that a small handful of attributes—problem clarity, physical consistency, realism of numerical values, subtlety of distractors, and presence of an explanatory solution—were the strongest predictors of student choice, accounting for over 70 % of the variance. Free‑form exit‑survey responses corroborated these findings: students emphasized realistic contexts, clear statements, and well‑designed wrong answers as key to perceived usefulness.
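The feature-importance step can be sketched as follows. This is a minimal illustration with synthetic data, not the study's dataset: the metric names, the pairwise encoding (metric differences between the two items in a pair), and the preference-generating rule are all assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
metrics = ["clarity", "physical_consistency", "realistic_values",
           "distractor_subtlety", "has_solution_explanation"]

# Each row encodes one problem pair as metric differences (item A minus item B);
# the label is 1 if the student chose item A. Synthetic: preference is driven
# mostly by clarity, partly by physical consistency.
X = rng.integers(-1, 2, size=(543, len(metrics)))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.5, 543) > 0).astype(int)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
for name, imp in sorted(zip(metrics, rf.feature_importances_),
                        key=lambda t: -t[1]):
    print(f"{name:28s} {imp:.3f}")
```

Because the synthetic preferences are driven mainly by the first feature, the fitted forest assigns it the largest importance, mirroring how the study surfaced clarity and consistency as top predictors of student choice.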

For RQ3 the authors examined the computational cost of the LLM‑as‑judge pipeline. Using the same JSON prompts, the models returned judgments in an average of 0.8 seconds per problem, with cloud‑compute costs compatible with high‑throughput, real‑time services. By iteratively pruning the metric set and re‑evaluating predictive performance, they identified a minimal core of six to eight metrics that together retained >78 % accuracy in predicting student preference while dramatically reducing evaluation time and cost.
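The pruning idea can be sketched as a greedy backward elimination: repeatedly drop the metric whose removal hurts preference-prediction accuracy least, stopping before accuracy falls below a target. The function names, the accuracy threshold, and the synthetic data below are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def prune_metrics(X, y, names, min_accuracy=0.78):
    """Greedily remove metrics while cross-validated accuracy stays above a floor."""
    keep = list(range(X.shape[1]))

    def score(cols):
        rf = RandomForestClassifier(n_estimators=100, random_state=0)
        return cross_val_score(rf, X[:, cols], y, cv=5).mean()

    current = score(keep)
    while len(keep) > 1:
        # Try removing each remaining metric; pick the least damaging removal
        trials = [(score([c for c in keep if c != r]), r) for r in keep]
        best_score, worst = max(trials)
        if best_score < min_accuracy:
            break
        keep.remove(worst)
        current = best_score
    return [names[c] for c in keep], current

# Synthetic check: only the first two of six metrics actually drive preference
rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(300, 6)).astype(float)
y = (X[:, 0] + X[:, 1] > 0).astype(int)
names = [f"metric_{i}" for i in range(6)]
kept, acc = prune_metrics(X, y, names, min_accuracy=0.9)
print(kept, round(acc, 2))
```

On this toy data the procedure discards the four uninformative metrics and keeps the two that matter, which is the same shape of result the authors report: a minimal core of checks retaining most of the predictive power at a fraction of the evaluation cost.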

The study concludes that exhaustive scoring of AI‑generated physics items is unnecessary for scalable formative assessment. A carefully curated subset of structural (e.g., physical consistency) and learner‑visible (e.g., clarity, realistic numbers, distractor quality) checks suffices to ensure both technical soundness and student appeal. This “LLM‑as‑judge” framework provides a practical blueprint for deploying on‑demand, AI‑generated practice problems not only in physics but in any quantitative discipline where rapid, automated quality assurance is essential.

