Synthetic Student Responses: LLM-Extracted Features for IRT Difficulty Parameter Estimation
Educational assessment relies heavily on knowing question difficulty, traditionally determined through resource-intensive pre-testing with students. This creates significant barriers for both classroom teachers and assessment developers. We investigate whether Item Response Theory (IRT) difficulty parameters can be accurately estimated without student testing by modeling the response process and explore the relative contribution of different feature types to prediction accuracy. Our approach combines traditional linguistic features with pedagogical insights extracted using Large Language Models (LLMs), including solution step count, cognitive complexity, and potential misconceptions. We implement a two-stage process: first training a neural network to predict how students would respond to questions, then deriving difficulty parameters from these simulated response patterns. Using a dataset of over 250,000 student responses to mathematics questions, our model achieves a Pearson correlation of approximately 0.78 between predicted and actual difficulty parameters on completely unseen questions.
💡 Research Summary
This paper tackles a long‑standing bottleneck in educational assessment: the need for costly, time‑consuming pre‑testing to obtain Item Response Theory (IRT) difficulty parameters for new questions. The authors propose a two‑stage framework that removes the need to collect real student responses for new items while still producing psychometrically valid difficulty estimates.
In the first stage, a neural network predicts the probability that a given student will answer a particular mathematics question correctly. The model ingests five sources of information: (1) a semantic embedding of the full question text (including answer options) generated by ModernBERT, (2) traditional linguistic and structural question features (word count, number of symbols, presence of LaTeX, etc.), (3) analogous features derived from the answer options, (4) pedagogical features automatically extracted by a large language model (LLM) – such as estimated solution‑step count, cognitive complexity level according to Bloom’s taxonomy, and likely student misconceptions – and (5) a learned student embedding that captures individual ability differences. By jointly training on 251,851 real student‑question interactions from the Chilean adaptive learning platform Zapien, the network learns to simulate how a diverse population of learners would perform on any unseen item.
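The first-stage predictor can be pictured as a feed-forward network over the concatenation of the five feature vectors. The sketch below is a minimal numpy illustration; the dimensions, hidden-layer size, and randomly initialised weights are assumptions for demonstration, not the paper's actual architecture or trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative feature dimensions (assumptions, not the paper's exact sizes):
# question-text embedding, linguistic features, option features,
# LLM-extracted pedagogical features, learned student embedding.
D_TEXT, D_LING, D_OPT, D_PED, D_STUDENT = 16, 6, 4, 5, 8
D_IN = D_TEXT + D_LING + D_OPT + D_PED + D_STUDENT
D_HIDDEN = 32

# Random weights stand in for parameters learned on the 251,851 interactions.
W1 = rng.normal(0, 0.1, (D_IN, D_HIDDEN))
b1 = np.zeros(D_HIDDEN)
W2 = rng.normal(0, 0.1, (D_HIDDEN, 1))
b2 = np.zeros(1)

def predict_correct_prob(text_emb, ling_feats, opt_feats, ped_feats, student_emb):
    """Probability that this student answers this question correctly."""
    x = np.concatenate([text_emb, ling_feats, opt_feats, ped_feats, student_emb])
    h = np.maximum(0.0, x @ W1 + b1)        # ReLU hidden layer
    logit = (h @ W2 + b2)[0]
    return 1.0 / (1.0 + np.exp(-logit))     # sigmoid output

# One simulated student-question interaction.
p = predict_correct_prob(
    rng.normal(size=D_TEXT), rng.normal(size=D_LING),
    rng.normal(size=D_OPT), rng.normal(size=D_PED),
    rng.normal(size=D_STUDENT),
)
```

Calling the predictor over every (student, item) pair yields the simulated correctness matrix that the second stage consumes.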
The second stage treats the entire predicted correctness matrix as if it were real response data and fits a 1‑Parameter Logistic (1PL) IRT model using maximum‑likelihood estimation. The resulting item difficulty parameters are directly comparable to the “ground‑truth” difficulties obtained by fitting the same 1PL model to the actual student responses.
The dataset comprises 4,696 unique math items and 1,875 students. Original Spanish items were translated into English with Google Gemini 2.0 Flash to enable the use of English‑language embeddings. Items were split into training (70%), validation (20%), and hold‑out test (10%) sets using stratified sampling on both estimated difficulty and average correctness, ensuring comparable distributions across splits.
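One way to realise such a stratified split is to bin items by difficulty quantile and split within each bin. The sketch below stratifies on difficulty alone for brevity (the paper stratifies on both difficulty and average correctness); the function name, bin count, and toy data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy item table: one difficulty estimate per item (stand-in for real values).
n_items = 1000
difficulty = rng.normal(0, 1, n_items)

def stratified_split(values, fracs=(0.7, 0.2, 0.1), n_bins=5, seed=0):
    """Split item indices into train/val/test, keeping each quantile bin balanced."""
    rng = np.random.default_rng(seed)
    edges = np.quantile(values, np.linspace(0, 1, n_bins + 1)[1:-1])
    bins = np.digitize(values, edges)
    splits = [[], [], []]
    for b in np.unique(bins):
        idx = np.flatnonzero(bins == b)
        rng.shuffle(idx)
        cut1 = int(len(idx) * fracs[0])
        cut2 = cut1 + int(len(idx) * fracs[1])
        splits[0].extend(idx[:cut1])
        splits[1].extend(idx[cut1:cut2])
        splits[2].extend(idx[cut2:])
    return [np.array(s) for s in splits]

train_idx, val_idx, test_idx = stratified_split(difficulty)
```

Splitting within each bin guarantees that easy, medium, and hard items appear in all three sets in the same proportions.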
Empirical results on the hold‑out set show a Pearson correlation of approximately 0.78 between predicted and true difficulty parameters and an RMSE of about 0.42, indicating strong alignment with psychometric standards. An extensive ablation study reveals that the LLM‑derived pedagogical features raise the correlation by roughly 0.07 over a baseline that uses only text embeddings and traditional linguistic features. Feature importance analysis further confirms that each component (semantic embeddings, linguistic/structural metrics, option characteristics, and LLM‑extracted signals) adds unique predictive value.
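The two reported evaluation metrics are standard and straightforward to compute. The snippet below uses made-up numbers, not the paper's data, purely to show the calculation.

```python
import numpy as np

def pearson_r(a, b):
    """Pearson correlation between predicted and true difficulty parameters."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(np.corrcoef(a, b)[0, 1])

def rmse(a, b):
    """Root-mean-square error between predicted and true difficulties."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(np.sqrt(np.mean((a - b) ** 2)))

# Illustrative predicted vs. true difficulties (not from the paper).
pred = [0.1, -0.5, 1.2, 0.8]
true = [0.0, -0.4, 1.0, 0.9]
r = pearson_r(pred, true)
err = rmse(pred, true)
```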
The authors discuss several limitations. First, the reliance on English translation may introduce subtle semantic shifts that affect feature quality. Second, the use of a 1PL model captures only difficulty; extending the approach to 2‑ or 3‑parameter IRT models would enable simultaneous estimation of discrimination and guessing parameters. Third, the student embeddings are learned purely from response patterns without explicit demographic grounding, which limits interpretability for educators.
Future work is outlined along three axes: (a) integrating multi‑parameter IRT models to enrich the psychometric profile of items, (b) employing multilingual LLMs to extract pedagogical features directly from source‑language items, and (c) validating the simulated response patterns against expert teacher judgments and external performance metrics.
Overall, the study demonstrates that combining LLM‑generated educational insights with a response‑simulation neural network can faithfully reproduce the IRT difficulty estimation process without any real‑world testing. This paradigm promises substantial cost savings for test developers, faster turnaround for classroom assessments, and a scalable pathway to generate psychometrically sound item banks in diverse subject domains.