FoundationalASSIST: An Educational Dataset for Foundational Knowledge Tracing and Pedagogical Grounding of LLMs
Can Large Language Models understand how students learn? As LLMs are deployed for adaptive testing and personalized tutoring, this question becomes urgent, yet we cannot answer it with existing resources. Current educational datasets provide only question identifiers and binary correctness labels, rendering them opaque to LLMs that reason in natural language. We address this gap with FoundationalASSIST, the first English educational dataset providing the complete information needed for research on LLMs in education: full question text, actual student responses (not just right/wrong), records of which wrong answers students chose, and alignment to Common Core K-12 standards. These 1.7 million interactions from 5,000 students enable research directions that were previously impossible to pursue, from fine-tuning student models to analyzing misconception patterns. To demonstrate the dataset’s utility, we evaluate four frontier models (GPT-OSS-120B, Llama-3.3-70B, Qwen3-Next-80B variants) on two complementary task families: Knowledge Tracing, which tests whether LLMs can predict both whether a student will answer a question correctly and the exact answer the student will give; and Pedagogical Grounding, which tests whether LLMs understand the properties that make assessment items effective. Our evaluation reveals significant gaps in current LLM capabilities. Every model barely exceeds a trivial baseline on knowledge tracing. All models fall below random chance on item discrimination, indicating that LLMs do not understand what makes one problem more diagnostic than another. Models do show competence at judging relative difficulty (up to 68.6%), but this partial success only highlights the gaps elsewhere. These results establish that substantial advances are needed before LLMs can reliably support personalized learning at scale. We release FoundationalASSIST to support progress on these foundational challenges.
💡 Research Summary
The paper introduces FoundationalASSIST, a novel English‑language educational dataset designed to bridge the gap between large language models (LLMs) and student‑centered learning research. Existing knowledge‑tracing benchmarks (e.g., ASSISTments, EdNet) provide only problem identifiers and binary correctness, which are opaque to LLMs that operate on natural language. FoundationalASSIST contains 1.7 million interactions from 5,000 K‑12 students on 3,400 mathematics problems. For each interaction it records the full problem text, the exact student response (multiple‑choice selection or free‑form answer), the specific distractor chosen when wrong, and a mapping to Common Core State Standards (grade level and knowledge component). This rich, text‑based representation enables LLMs to reason about content, difficulty, and misconception patterns.
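Concretely, each interaction can be pictured as a small structured record combining the fields listed above. A minimal sketch in Python (field names and the example values are illustrative assumptions, not the dataset's actual released schema):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Interaction:
    """One student-problem interaction as described for FoundationalASSIST.

    Field names are illustrative; the released schema may differ.
    """
    student_id: str
    problem_id: str
    problem_text: str                  # full natural-language question text
    response: str                      # exact answer the student gave
    is_correct: bool                   # binary correctness label
    chosen_distractor: Optional[str]   # the specific wrong option, if incorrect
    ccss_standard: str                 # Common Core standard identifier
    grade_level: int                   # K-12 grade of the standard

# Hypothetical record for an incorrectly answered multiple-choice item
ex = Interaction(
    student_id="S0042",
    problem_id="P1187",
    problem_text="What is 3/4 expressed as a percent?",
    response="34%",
    is_correct=False,
    chosen_distractor="34%",
    ccss_standard="6.RP.A.3c",
    grade_level=6,
)
```

The point of the text-based representation is that every field here is something an LLM can read directly, unlike the opaque problem IDs of earlier knowledge-tracing benchmarks.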
To demonstrate the dataset’s utility, the authors define two families of tasks. Knowledge Tracing (KT) asks models to (1) predict whether a student will answer a given question correctly and (2) generate the exact answer the student is likely to produce. Pedagogical Grounding (PG) asks models to compare two items on (a) relative difficulty, (b) relative discrimination, (c) which distractor is most frequently chosen, and (d) which distractor is least frequently chosen. Together these six subtasks probe the foundational capabilities required for adaptive testing, item selection, and personalized feedback.
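Both task families reduce to simple text queries: KT conditions on a student's interaction history, while PG is a pairwise comparison between two items. A sketch of what such queries might look like (the paper's actual prompt wording is not reproduced here, so the templates below are assumptions):

```python
def kt_correctness_prompt(history, question_text):
    """KT subtask 1: predict correct/incorrect given the student's history.

    `history` is a list of (question_text, was_correct) pairs.
    """
    past = "\n".join(
        f"Q: {q} -> answered {'correctly' if c else 'incorrectly'}"
        for q, c in history
    )
    return (f"Student history:\n{past}\n\n"
            f"New question: {question_text}\n"
            "Will the student answer correctly? Reply 'correct' or 'incorrect'.")

def pg_pairwise_prompt(item_a, item_b, property_name):
    """PG subtasks: compare two items on a psychometric property
    (e.g. difficulty or discrimination)."""
    return (f"Item A: {item_a}\nItem B: {item_b}\n"
            f"Which item has higher {property_name}? Reply 'A' or 'B'.")
```

A pairwise framing like this makes the chance baseline for the PG comparisons 50%, which is the reference point for the below-chance discrimination results reported later.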
Four state‑of‑the‑art LLMs are evaluated: GPT‑OSS‑120B, Llama‑3.3‑70B‑Instruct, Qwen3‑Next‑80B‑Instruct, and Qwen3‑Next‑80B‑Thinking (the latter generates explicit reasoning chains). All models are used in a zero‑shot or minimally prompted setting; no fine‑tuning on the new data is performed.
Results show that current LLMs are far from ready for educational deployment. In the binary correctness prediction task, a trivial baseline that always predicts “correct” achieves 51.3% accuracy; the best model improves on this by only ~5 percentage points. Moreover, models exhibit a strong optimistic bias: they correctly identify correct responses (e.g., Llama‑3.3‑70B reaches 85.4% on correctly answered items) but detect only ≈12% of incorrectly answered items, far below chance, making them unsuitable for identifying struggling learners.
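The two headline numbers here, the majority-class baseline and the per-class accuracies that expose the optimistic bias, are straightforward to compute. A minimal sketch with toy data (the data is invented for illustration):

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most common label."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)

def per_class_accuracy(labels, preds):
    """Accuracy on correctly vs. incorrectly answered items separately,
    which is how the optimistic bias described above shows up."""
    out = {}
    for cls in (True, False):
        idx = [i for i, y in enumerate(labels) if y == cls]
        out[cls] = sum(preds[i] == cls for i in idx) / len(idx)
    return out

# Toy example: an "optimistic" predictor that almost always says correct.
labels = [True] * 6 + [False] * 4
preds  = [True] * 6 + [True, True, True, False]
# majority_baseline_accuracy(labels) -> 0.6
# per_class_accuracy(labels, preds)  -> {True: 1.0, False: 0.25}
```

Overall accuracy for this toy predictor is 0.9, yet it misses three of the four struggling-student cases, which is exactly the failure mode the per-class breakdown is designed to surface.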
For the exact‑answer generation subtask, performance is similarly weak, with only marginal gains over random guessing. In the PG suite, models can compare difficulty reasonably well (up to 80% accuracy when the difficulty gap is large) but completely fail at discrimination judgments, scoring below random chance across all models. Distractor prediction shows an asymmetric pattern: models modestly exceed baseline (≈48% vs. 36%) when identifying the most common wrong answer, yet fall far below chance when predicting the least chosen distractor. The “Thinking” variant of Qwen3‑Next‑80B, which produces chain‑of‑thought reasoning, attains the highest discrimination score (≈47%) but the lowest distractor‑prediction accuracy (≈20%).
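The ground truth for the two distractor subtasks follows directly from the response logs: tally how often each wrong option was chosen across all incorrect attempts on an item. A sketch of that label derivation (the paper's exact tie-breaking rules are not specified here, so this is an assumption):

```python
from collections import Counter

def distractor_labels(wrong_choices):
    """Given the distractors chosen across all incorrect responses to one
    item, return (most_chosen, least_chosen) labels for the two
    distractor-prediction subtasks.

    Note: a distractor chosen zero times never appears in the log, so
    "least chosen" here means least frequent among those observed.
    """
    ranked = Counter(wrong_choices).most_common()
    return ranked[0][0], ranked[-1][0]

# Invented log: "34%" is the dominant misconception, "7/4" is rarely chosen.
most, least = distractor_labels(["34%", "34%", "0.75%", "34%", "7/4"])
# most -> "34%", least -> "7/4"
```

This also clarifies why the two subtasks differ in difficulty: the most-chosen distractor usually reflects a common, nameable misconception, while the least-chosen one depends on fine-grained frequency differences that content reasoning alone cannot recover.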
These findings suggest that while LLMs possess strong natural‑language understanding and can reason about problem content, they lack the internal representations of student cognition, misconception prevalence, and psychometric properties needed for effective knowledge tracing and item selection. The authors argue that FoundationalASSIST opens a research avenue for fine‑tuning, prompt engineering, and multimodal training that could close this gap.
The paper also discusses limitations: the dataset is tied to the U.S. Common Core curriculum, which may limit cross‑cultural generalization; the evaluation uses only zero‑shot prompting, leaving open the possibility that domain‑specific fine‑tuning could substantially improve results; and the benchmark focuses on static accuracy metrics rather than dynamic tutoring scenarios (e.g., real‑time hint generation).
In conclusion, FoundationalASSIST provides the first large‑scale, text‑rich, response‑rich, standards‑aligned dataset for evaluating LLMs on both knowledge tracing and pedagogical grounding. The baseline experiments reveal substantial deficiencies in current models, underscoring the need for dedicated model development before LLMs can be reliably integrated into personalized learning systems.