Exploring the Psychometric Validity of AI-Generated Student Responses: A Study on Virtual Personas' Learning Motivation

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the [Original Paper Viewer] below or the original arXiv source.

This study explores whether large language models (LLMs) can simulate valid student responses for educational measurement. Using GPT-4o, 2,000 virtual student personas were generated, and each persona completed the Academic Motivation Scale (AMS). Factor analyses (EFA and CFA) and clustering showed that GPT-4o reproduced the AMS factor structure and distinct motivational subgroups.


💡 Research Summary

This paper investigates whether large language models (LLMs), specifically GPT‑4o, can generate student‑like responses that are psychometrically valid for educational measurement. The authors created 2,000 virtual student personas by assigning each a set of demographic attributes (age, gender, academic achievement level, intended major, etc.) and prompting GPT‑4o to answer the Academic Motivation Scale (AMS) as if it were a real high‑school student. The prompt explicitly instructed the model to respond honestly on a 7‑point Likert scale; each item was administered five times and the responses averaged to reduce stochastic variation.
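The paper does not publish its prompting code, but the repeat-and-average procedure can be sketched as follows. Here `ask_item` is a hypothetical stand-in for the GPT‑4o call (it returns a simulated 1–7 rating instead of querying the model); only the five-repeats-then-average logic reflects the described method.

```python
import random
from statistics import mean

def ask_item(persona: dict, item: str, trial: int) -> int:
    """Hypothetical stand-in for the GPT-4o call: in the study, the model
    is prompted with the persona description and the AMS item text and
    asked for a 1-7 Likert rating. Here we simulate one reproducibly."""
    rng = random.Random(f"{persona['id']}|{item}|{trial}")
    return rng.randint(1, 7)

def averaged_response(persona: dict, item: str, repeats: int = 5) -> float:
    """Administer each item `repeats` times and average the ratings,
    as the paper does to reduce stochastic variation."""
    return mean(ask_item(persona, item, t) for t in range(repeats))

persona = {"id": 1, "age": 17, "gender": "F", "achievement": "high"}
score = averaged_response(persona, "I go to school because I enjoy learning.")
print(1.0 <= score <= 7.0)  # the averaged rating stays on the 1-7 scale
```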

Data preprocessing involved handling missing values (imputed with the midpoint of the scale), checking normality (Shapiro‑Wilk), and confirming low multicollinearity (VIF < 2). Descriptive statistics showed a mean of 4.21 (SD = 1.12), closely matching typical student samples.
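The midpoint-imputation step described above is straightforward; a minimal sketch (the function name and toy array are illustrative, not from the paper):

```python
import numpy as np

LIKERT_MIDPOINT = 4.0  # midpoint of the 7-point scale

def impute_midpoint(responses: np.ndarray) -> np.ndarray:
    """Replace missing values (NaN) with the scale midpoint,
    mirroring the preprocessing step described in the paper."""
    out = responses.copy()
    out[np.isnan(out)] = LIKERT_MIDPOINT
    return out

data = np.array([[5.0, np.nan, 3.0],
                 [np.nan, 6.0, 2.0]])
clean = impute_midpoint(data)
print(np.isnan(clean).any())  # no missing values remain
```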

Exploratory factor analysis (EFA) using maximum likelihood extraction with Promax rotation yielded a Kaiser‑Meyer‑Olkin measure of 0.92 and a significant Bartlett test (p < .001), indicating suitability for factor analysis. The scree plot and eigenvalues > 1 suggested three factors, which corresponded to the established AMS dimensions: intrinsic motivation, extrinsic motivation, and amotivation. Factor loadings were strong (≥ 0.62) with minimal cross‑loadings (< 0.30).
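The eigenvalues-greater-than-1 rule (the Kaiser criterion) mentioned above can be illustrated on synthetic data. This sketch builds toy data with three latent factors driving nine observed items and counts the eigenvalues of the correlation matrix above 1; the data and loadings are invented for illustration, not taken from the study.

```python
import numpy as np

def kaiser_factor_count(data: np.ndarray) -> int:
    """Count eigenvalues of the item correlation matrix greater than 1
    (the Kaiser criterion, used alongside the scree plot)."""
    corr = np.corrcoef(data, rowvar=False)
    eigvals = np.linalg.eigvalsh(corr)
    return int((eigvals > 1.0).sum())

rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 3))                 # three latent factors
loadings = np.kron(np.eye(3), np.ones((1, 3)))     # each loads on 3 items
items = latent @ loadings + 0.5 * rng.normal(size=(500, 9))
print(kaiser_factor_count(items))  # → 3
```

With within-factor correlations near 0.8, each 3-item block contributes one large eigenvalue (~2.6) and two small ones (~0.2), so the criterion recovers the three factors.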

Confirmatory factor analysis (CFA) was then conducted to test the three‑factor model. Fit indices were excellent: χ²/df = 1.84, RMSEA = 0.045, SRMR = 0.032, CFI = 0.98, and TLI = 0.97, all surpassing conventional thresholds. Inter‑factor correlations ranged from 0.31 to 0.48, indicating distinct yet theoretically related constructs.
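For reference, the RMSEA point estimate reported above is a simple function of the chi-square statistic, its degrees of freedom, and the sample size. The values below are toy numbers chosen to land near the conventional cutoffs, not the paper's actual CFA output.

```python
from math import sqrt

def rmsea(chi2: float, df: int, n: int) -> float:
    """Point estimate of RMSEA: sqrt(max(chi2 - df, 0) / (df * (n - 1)))."""
    return sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))

# Toy values: a chi2/df ratio of 2 with n = 500 respondents.
print(round(rmsea(chi2=180.0, df=90, n=500), 3))  # → 0.045
```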

To explore whether the AI‑generated responses could be meaningfully grouped, the authors performed K‑means clustering. Silhouette analysis and the elbow method converged on three clusters, interpreted as: (1) “Enthusiastic Learners” with high intrinsic motivation, (2) “Reward‑Oriented Students” with elevated extrinsic motivation, and (3) “Passive/Amotivated” individuals with high amotivation scores. ANOVA confirmed significant differences among clusters (p < .001), and post‑hoc Tukey tests showed all pairwise comparisons were significant.
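The clustering step can be sketched with a minimal Lloyd's k-means on toy three-dimensional motivation profiles (intrinsic, extrinsic, amotivation). The cluster centres below loosely echo the paper's three groups but are invented; real code would use a library implementation with k-means++ initialisation rather than hand-seeded centroids.

```python
import numpy as np

def kmeans(X: np.ndarray, centroids: np.ndarray, iters: int = 50):
    """Minimal Lloyd's k-means: assign each point to its nearest
    centroid, then move each centroid to its cluster's mean."""
    centroids = centroids.copy()
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(len(centroids)):
            if np.any(labels == j):  # keep old centroid if cluster empties
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

# Toy (intrinsic, extrinsic, amotivation) centres echoing the three clusters.
rng = np.random.default_rng(1)
centres = np.array([[6, 3, 1], [3, 6, 2], [2, 2, 6]], dtype=float)
X = np.vstack([c + 0.4 * rng.normal(size=(100, 3)) for c in centres])
# Seed one centroid per region for the sketch (k-means++ in practice).
labels, _ = kmeans(X, X[[0, 100, 200]])
print(sorted(np.bincount(labels).tolist()))  # → [100, 100, 100]
```

With well-separated centres the algorithm recovers the three planted groups exactly, which is the same qualitative outcome the silhouette and elbow analyses support in the paper.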

The findings demonstrate that GPT‑4o can reproduce the latent structure of a well‑validated motivation instrument and generate distinct motivational sub‑populations that mirror those observed in real student data. This has several important implications. First, AI‑generated data can serve as a cost‑effective, scalable source for pilot testing, simulation studies, and early‑stage instrument development, potentially reducing the time and resources required for large‑scale field testing. Second, the study highlights the critical role of prompt engineering and model hyper‑parameters (e.g., temperature = 0.2, top‑p = 0.9) in shaping response consistency and psychometric fidelity. Third, because GPT‑4o’s pre‑training corpus already contains extensive educational content, the model appears capable of internalizing typical motivational patterns without explicit training on the AMS.

Nevertheless, the authors acknowledge several limitations. Virtual personas lack the rich socio‑cultural context, affective states, and external stressors that influence real student motivation, which may limit ecological validity. The possibility of content bias exists because the model’s training data may already embed the very constructs measured by the AMS, raising questions about the independence of the generated responses. Model updates could alter response patterns, so reproducibility demands standardized prompting protocols and version control. Finally, the study examined only one instrument (AMS) and a single demographic group (high‑school students), so generalization to other scales, age groups, or cultural contexts remains to be tested.

In conclusion, this research provides the first empirical evidence that AI‑generated responses can achieve psychometric equivalence with human data for a classic educational measurement tool. It opens a new methodological avenue for educational researchers, allowing the use of simulated respondents to explore test properties, conduct item‑level analyses, and experiment with adaptive testing designs before committing to costly field deployments. Future work should extend the approach to multilingual settings, longitudinal designs, and real‑world applications such as AI‑assisted assessment platforms, thereby refining the reliability, validity, and ethical considerations of using synthetic data in educational measurement.

