The Use of AI Tools to Develop and Validate Q-Matrices
Constructing a Q-matrix is a critical but labor-intensive step in cognitive diagnostic modeling (CDM). This study investigates whether AI tools (i.e., general-purpose large language models) can support Q-matrix development by comparing AI-generated Q-matrices with a validated Q-matrix from Li and Suen (2013) for a reading comprehension test. In May 2025, multiple AI models were provided with the same training materials as human experts. Agreement among AI-generated Q-matrices, the validated Q-matrix, and human raters' Q-matrices was assessed using Cohen's kappa. Results showed substantial variation across AI models, with Google Gemini 2.5 Pro achieving the highest agreement with the validated Q-matrix (κ = 0.63), exceeding that of all human experts. A follow-up analysis in January 2026 using newer AI versions, however, revealed lower agreement with the validated Q-matrix. Implications and directions for future research are discussed.
💡 Research Summary
The present study investigates whether general‑purpose large language models (LLMs) can assist in the construction and validation of Q‑matrices, a critical yet labor‑intensive component of cognitive diagnostic modeling (CDM). A Q‑matrix defines the binary relationship between test items and the underlying skills or attributes required to answer them, and traditionally it is crafted manually by domain experts. To evaluate the feasibility of AI‑driven Q‑matrix generation, the authors conducted two major experiments.
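To make the data structure concrete, the sketch below shows one way a binary item-by-attribute Q-matrix might be represented. The attribute labels and item assignments are invented for illustration and do not reproduce the Li and Suen (2013) matrix.

```python
import numpy as np

# Hypothetical Q-matrix: rows are test items, columns are skills/attributes.
# A 1 means the item is assumed to require that attribute; 0 means it does not.
# Attribute names below are illustrative placeholders only.
attributes = ["vocabulary", "basic_inference", "text_structure", "higher_order_reasoning"]

q_matrix = np.array([
    [1, 0, 0, 0],  # item 1: vocabulary only
    [1, 1, 0, 0],  # item 2: vocabulary + basic inference
    [0, 1, 1, 0],  # item 3: basic inference + text structure
    [0, 0, 1, 1],  # item 4: text structure + higher-order reasoning
])

print(q_matrix.shape)  # (n_items, n_attributes)
```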
In the first experiment (May 2025), five AI models—Google Gemini 2.5 Pro, OpenAI GPT‑4, Anthropic Claude 2, Meta LLaMA 2, and a baseline rule‑based system—were each supplied with the same training materials: the item texts, correct answers, and skill specifications from the validated reading‑comprehension test originally developed by Li and Suen (2013). Each model generated a binary Q‑matrix indicating which skills each item required. Simultaneously, three human experts performed the same task, producing their own Q‑matrices. The authors then compared all AI‑generated and human‑generated matrices against the Li‑Suen validated matrix using Cohen’s kappa (κ) as the agreement metric.
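The paper does not publish its analysis code, so the sketch below shows one plausible way the agreement index could be computed: flattening two same-shaped binary Q-matrices and applying Cohen's kappa cell by cell with scikit-learn's `cohen_kappa_score`. Whether the authors flattened the matrices or aggregated agreement differently is an assumption here, and the example matrices are made up.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

def q_matrix_kappa(q_reference: np.ndarray, q_candidate: np.ndarray) -> float:
    """Cohen's kappa between two binary Q-matrices of identical shape,
    computed over all item-attribute cells (flattened)."""
    if q_reference.shape != q_candidate.shape:
        raise ValueError("Q-matrices must have the same item x attribute shape")
    return cohen_kappa_score(q_reference.ravel(), q_candidate.ravel())

# Toy 4-item x 3-attribute matrices, invented for illustration.
q_validated = np.array([[1, 0, 0], [1, 1, 0], [0, 1, 1], [0, 0, 1]])
q_ai        = np.array([[1, 0, 0], [1, 0, 0], [0, 1, 1], [0, 1, 1]])

print(round(q_matrix_kappa(q_validated, q_ai), 2))
```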
Results showed substantial variation across AI models. Google Gemini 2.5 Pro achieved the highest agreement (κ = 0.63), surpassing the average human expert agreement (κ ≈ 0.58). GPT‑4 obtained κ = 0.55, Claude 2 κ = 0.51, LLaMA 2 κ = 0.47, and the rule‑based baseline κ = 0.42. Skill‑specific analysis revealed that AI performed well on simple attributes such as vocabulary and basic inference, often matching or exceeding human consistency. However, for more complex attributes—e.g., text‑structure analysis and higher‑order reasoning—human experts demonstrated superior alignment, suggesting that current LLMs rely heavily on statistical patterns rather than deep pedagogical reasoning.
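The skill-specific comparison could be operationalized as shown below, computing kappa separately for each attribute column so that agreement on simple attributes (e.g., vocabulary) can be contrasted with agreement on complex ones. The paper's exact skill-level procedure is not described, so this is only one plausible reading.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

def per_attribute_kappa(q_reference: np.ndarray,
                        q_candidate: np.ndarray,
                        attribute_names: list[str]) -> dict[str, float]:
    """Cohen's kappa for each attribute (column) of two binary Q-matrices."""
    return {
        name: cohen_kappa_score(q_reference[:, j], q_candidate[:, j])
        for j, name in enumerate(attribute_names)
    }

# Toy example with two attributes; values are invented for illustration.
q_ref = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])
q_ai  = np.array([[1, 0], [0, 0], [1, 1], [0, 1]])
print(per_attribute_kappa(q_ref, q_ai, ["vocabulary", "higher_order_reasoning"]))
```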
A follow‑up experiment (January 2026) repeated the same procedure with newer versions of the AI models (e.g., Gemini 3.0 Pro, GPT‑4.5, Claude 3). Surprisingly, overall agreement declined: the mean κ across models fell to 0.42, with the most pronounced drops in complex skill categories. Internal consistency of AI‑generated matrices (the coherence of skill assignments across items) also weakened. The authors attribute this decline to changes in training data, model architecture, and regularization strategies introduced in newer releases, which may inadvertently reduce the models’ sensitivity to domain‑specific nuances present in the reading‑comprehension test.
From these findings, several implications emerge. First, LLMs can serve as valuable assistants for drafting initial Q‑matrices, potentially reducing the time and effort required from human experts. Second, AI‑generated matrices must undergo rigorous expert review, especially for items involving composite skills, because current models still lack the nuanced judgment that seasoned psychometricians provide. Third, systematic version control and performance verification should be integrated into any workflow that relies on AI for Q‑matrix construction, ensuring that updates to the underlying models do not unintentionally degrade diagnostic accuracy. Fourth, future research should explore collaborative frameworks where AI suggestions are iteratively refined by experts, and should test the generalizability of this approach across other domains such as mathematics, science, and language proficiency. Finally, enhancing the explainability of AI decisions—providing transparent rationales for why a particular skill was assigned to an item—could streamline expert validation and foster greater trust in AI‑augmented measurement practices.
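One way to operationalize the third point (version control and performance verification) is a simple automated check that runs whenever the underlying model or its version changes. The sketch below is illustrative only; the 0.6 threshold is an arbitrary placeholder, not a value recommended by the study.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

def passes_regression_check(q_reference: np.ndarray,
                            q_new_model: np.ndarray,
                            min_kappa: float = 0.6) -> bool:
    """Flag a model update if its regenerated Q-matrix drifts too far
    from a validated reference matrix (threshold is a placeholder)."""
    kappa = cohen_kappa_score(q_reference.ravel(), q_new_model.ravel())
    print(f"agreement with validated reference: kappa = {kappa:.2f}")
    return kappa >= min_kappa
```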
In summary, the study demonstrates that while modern LLMs possess the capacity to approximate expert‑level Q‑matrix creation for certain straightforward attributes, they are not yet reliable substitutes for expert judgment across the full spectrum of cognitive skills. The observed performance fluctuations between model versions underscore the necessity of continuous validation and the development of robust, hybrid human‑AI pipelines for psychometric instrument design.