Benchmarking Large Language Models for Diagnosing Students' Cognitive Skills from Handwritten Math Work
Students’ handwritten math work provides a rich resource for diagnosing cognitive skills, as it captures intermediate reasoning beyond final answers. However, student responses vary widely, often omitting steps or providing only vague, contextually implicit evidence, and despite recent advances in large language models’ (LLMs’) multimodal and reasoning capabilities, their performance under such conditions remains underexplored. To address this gap, we constructed MathCog, a benchmark dataset containing 3,036 diagnostic verdicts across 639 student responses to 110 math problems, annotated by teachers using TIMSS-grounded cognitive skill checklists with evidential strength labels (Evident/Vague). Evaluating 18 LLMs, we find that (1) all models underperform (F1 < 0.5) regardless of capability, and (2) performance degrades sharply under vague evidence. Error analysis reveals systematic patterns: models frequently misattribute Vague evidence as Evident, overthink minimal cues, and hallucinate nonexistent evidence. We discuss implications for evidence-aware, teacher-in-the-loop designs for LLM-based cognitive diagnosis in educational settings.
💡 Research Summary
The paper introduces MathCog, a novel benchmark designed to evaluate large language models (LLMs) on the task of diagnosing students’ cognitive skills from handwritten mathematics work. Recognizing that handwritten problem‑solving captures intermediate reasoning that is invisible in final answers, the authors collect 639 student responses to 110 middle‑school math problems from the AI‑Hub repository. Each response is paired with a TIMSS‑based diagnostic checklist covering 15 cognitive skills (focused on the “Knowing” and “Applying” domains) and annotated by 15 experienced teachers for both skill correctness (Yes/No) and evidential strength (Evident/Vague). Only items with at least 70% inter‑rater agreement are retained, yielding a high‑quality set of 3,036 diagnostic judgments.
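The agreement-based filtering step described above can be sketched as follows. This is a minimal illustration, not the authors' actual pipeline: the function names, the majority-vote retention rule, and the toy annotation data are all assumptions made for the example.

```python
from collections import Counter

def agreement_rate(verdicts):
    """Fraction of annotators who agree with the majority verdict."""
    majority_count = Counter(verdicts).most_common(1)[0][1]
    return majority_count / len(verdicts)

def filter_items(items, threshold=0.70):
    """Keep only checklist items whose annotations reach the agreement
    threshold; the retained label is the majority verdict."""
    kept = {}
    for item_id, verdicts in items.items():
        if agreement_rate(verdicts) >= threshold:
            kept[item_id] = Counter(verdicts).most_common(1)[0][0]
    return kept

# Toy annotations from three hypothetical teachers per checklist item.
items = {
    "q1-skill3": ["Yes-Evident", "Yes-Evident", "Yes-Evident"],  # 3/3 agree -> kept
    "q2-skill1": ["No-Vague", "Yes-Vague", "No-Evident"],        # 1/3 majority -> dropped
}
print(filter_items(items))  # {'q1-skill3': 'Yes-Evident'}
```

Note that with three annotators a 70% threshold effectively requires unanimity; the paper does not state how many teachers labeled each individual item, so the per-item annotator count here is illustrative.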
To probe the capabilities of current LLMs, the authors evaluate 18 models spanning text‑only, multimodal, and reasoning‑enhanced families across small, medium, and large sizes. Prompting follows a chain‑of‑thought (CoT) style: the model restates each checklist item, identifies supporting evidence in the student’s work, provides a brief explanation, and finally outputs a verdict in one of four categories (Yes‑Evident, Yes‑Vague, No‑Evident, No‑Vague). Both OCR‑derived LaTeX transcriptions and, for multimodal models, the original image of the handwritten work are supplied. Few‑shot examples covering all four verdict types are also tested. All runs use temperature 0 to eliminate stochastic variation.
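A minimal sketch of this prompting and verdict-parsing protocol is shown below. The exact prompt wording, template fields, and output format are assumptions for illustration; only the four verdict categories and the four CoT steps come from the paper's description.

```python
import re

# The four verdict categories used in the benchmark.
VERDICTS = ("Yes-Evident", "Yes-Vague", "No-Evident", "No-Vague")

# Hypothetical prompt template mirroring the four CoT steps in the paper.
PROMPT_TEMPLATE = """\
Problem: {problem}
Student work (OCR-derived LaTeX): {work}

Checklist item: {item}
1. Restate the checklist item.
2. Identify supporting evidence in the student's work.
3. Briefly explain your reasoning.
4. Output exactly one line: "Verdict: <{choices}>"
"""

def build_prompt(problem, work, item):
    return PROMPT_TEMPLATE.format(
        problem=problem, work=work, item=item, choices=" | ".join(VERDICTS)
    )

def parse_verdict(model_output):
    """Extract the final verdict; return None if the model broke format."""
    match = re.search(r"Verdict:\s*(Yes|No)-(Evident|Vague)", model_output)
    return f"{match.group(1)}-{match.group(2)}" if match else None
```

A strict parser like this also surfaces format violations (returned as `None`), which matters when aggregating thousands of verdicts per model.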
Performance is measured with macro‑averaged F1, overall accuracy, and two evidence‑specific metrics: Evidence Over‑Attribution (the rate at which a model labels a Vague case as Evident) and Evidence False‑Attribution (the rate at which a model assigns Evident to an incorrect skill judgment). Results are sobering: every model achieves macro F1 below 0.5, with the best scores hovering around 0.44. Accuracy ranges from roughly 0.60 to 0.78, but both Over‑Attribution and False‑Attribution are high, especially on Vague evidence. For many models, Over‑Attribution exceeds 0.6, indicating a systematic tendency to over‑state the presence of clear evidence even when the student’s work is ambiguous. False‑Attribution values are also substantial (0.4–0.8), showing that incorrect skill predictions are frequently accompanied by an unjustified claim of strong evidence.
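The two evidence-specific metrics can be made concrete with a short sketch. The function names and toy data are assumptions; the definitions follow the paper's descriptions (Over-Attribution: a gold-Vague case predicted Evident; False-Attribution: an incorrect Yes/No judgment accompanied by an Evident claim).

```python
def over_attribution(gold, pred):
    """Rate at which gold-Vague cases are predicted as Evident."""
    vague = [(g, p) for g, p in zip(gold, pred) if g.endswith("Vague")]
    if not vague:
        return 0.0
    return sum(p.endswith("Evident") for _, p in vague) / len(vague)

def false_attribution(gold, pred):
    """Rate at which wrong Yes/No judgments are labeled Evident."""
    wrong = [(g, p) for g, p in zip(gold, pred)
             if g.split("-")[0] != p.split("-")[0]]
    if not wrong:
        return 0.0
    return sum(p.endswith("Evident") for _, p in wrong) / len(wrong)

# Toy example: a model that always answers "Yes-Evident".
gold = ["Yes-Vague", "No-Evident", "Yes-Evident"]
pred = ["Yes-Evident", "Yes-Evident", "Yes-Evident"]
print(over_attribution(gold, pred))   # 1.0
print(false_attribution(gold, pred))  # 1.0
```

The degenerate always-Evident model illustrates why accuracy alone is misleading here: it can score reasonably on accuracy while both attribution metrics saturate.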
A qualitative error analysis uncovers three recurring failure modes. First, “evidence over‑interpretation”: models treat minor notational cues or a single correct sub‑step as sufficient to declare Evident support. Second, “excessive inference”: the model extrapolates a full cognitive skill from a fragmentary student action, effectively guessing the missing reasoning. Third, “hallucinated evidence”: the model fabricates steps or calculations that never appear in the student’s work, inserting them into its explanation to justify a verdict. These patterns persist across model families and sizes, suggesting that current LLMs lack robust grounding in the visual‑textual nuances of handwritten mathematics and in the pedagogical criteria used by teachers.
The authors discuss two practical implications. First, LLMs should not be deployed as autonomous graders for cognitive diagnosis; instead, a teacher‑in‑the‑loop workflow is advisable, where the model provides an initial draft that educators can verify and correct. Second, future model development must prioritize evidence‑aware training: datasets should explicitly label evidential strength, and training objectives should reward accurate evidence attribution as well as correct skill classification. Multimodal pre‑training that jointly processes image and text, combined with meta‑learning or reinforcement learning from teacher feedback, may mitigate over‑attribution and hallucination.
In conclusion, this work delivers the first large‑scale, expert‑annotated benchmark for the challenging problem of diagnosing cognitive skills from handwritten math work. The systematic evaluation reveals that despite impressive advances in LLM reasoning and multimodality, these models fall short of the nuanced, evidence‑sensitive judgments required in educational assessment. The paper therefore charts a clear research agenda: richer evidence‑centric data, teacher‑guided fine‑tuning, and more grounded multimodal architectures are needed to bring LLMs closer to reliable, scalable tools for mathematics education.