EduEVAL-DB: A Role-Based Dataset for Pedagogical Risk Evaluation in Educational Explanations
This work introduces EduEVAL-DB, a role-based dataset designed to support the evaluation and training of automatic pedagogical evaluators and AI tutors for instructional explanations. The dataset comprises 854 explanations corresponding to 139 questions from a curated subset of the ScienceQA benchmark, spanning science, language, and social science across K-12 grade levels. For each question, one human-teacher explanation is provided and six are generated by LLM-simulated teacher roles. These roles are inspired by instructional styles and shortcomings observed in real educational practice and are instantiated via prompt engineering. We further propose a pedagogical risk rubric aligned with established educational standards, operationalizing five complementary risk dimensions: factual correctness, explanatory depth and completeness, focus and relevance, student-level appropriateness, and ideological bias. All explanations are annotated with binary risk labels through a semi-automatic process with expert teacher review. Finally, we present preliminary validation experiments assessing the suitability of EduEVAL-DB for evaluation. We benchmark a state-of-the-art education-oriented model (Gemini 2.5 Pro) against a lightweight local Llama 3.1 8B model and examine whether supervised fine-tuning on EduEVAL-DB supports pedagogical risk detection with models deployable on consumer hardware.
💡 Research Summary
This paper introduces EduEVAL‑DB, a role‑based dataset designed to support the evaluation and fine‑tuning of automatic pedagogical evaluators and AI tutors for K‑12 instructional explanations. The authors curated 139 questions from the ScienceQA benchmark, covering science, language arts, and social science across three grade bands (K‑5, 6‑8, 9‑12). For each question they collected one human‑teacher explanation and six LLM‑generated explanations, each conditioned on a distinct teacher persona that reflects common instructional styles or shortcomings observed in real classrooms. The six simulated roles are: (1) Confidently Inaccurate Teacher, (2) Concise but Incomplete Teacher, (3) Enthusiastic Rambling Teacher, (4) Overly Advanced and Insensitive Teacher, (5) Sarcastic Teacher, and (6) Exemplary Teacher. Prompt engineering and few‑shot examples were used with the GPT‑5 API to instantiate these personas, explicitly providing the target grade level to ensure age‑appropriate language.
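The role-conditioned generation described above can be sketched as a small prompt-construction helper. This is an illustrative assumption, not the released prompts: the role names come from the paper, but the function name, template wording, and persona descriptions are hypothetical (the actual prompts are published with the dataset).

```python
# Hypothetical sketch of role-conditioned prompt construction.
# Role keys follow the six personas named in the paper; the persona
# descriptions and template wording here are illustrative assumptions.
TEACHER_ROLES = {
    "confidently_inaccurate": "a Confidently Inaccurate Teacher who states wrong facts with full confidence",
    "concise_incomplete": "a Concise but Incomplete Teacher who answers correctly but omits key steps",
    "enthusiastic_rambling": "an Enthusiastic Rambling Teacher who drifts into off-topic tangents",
    "advanced_insensitive": "an Overly Advanced and Insensitive Teacher who ignores the student's level",
    "sarcastic": "a Sarcastic Teacher whose tone undercuts the student",
    "exemplary": "an Exemplary Teacher who explains clearly, completely, and at grade level",
}

def build_persona_prompt(role: str, question: str, grade_band: str) -> str:
    """Compose a role-conditioned instruction, explicitly passing the
    target grade band so the model can match age-appropriate language."""
    persona = TEACHER_ROLES[role]
    return (
        f"You are {persona}. A student in grades {grade_band} asks:\n"
        f"{question}\n"
        f"Write your explanation in language appropriate for grades {grade_band}."
    )
```

In the paper's setup, such a prompt (plus few-shot examples) would be sent to the GPT-5 API once per role per question, yielding the six simulated explanations.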
A central contribution is a five‑dimensional pedagogical risk rubric aligned with established educational standards (Instructional Core, Zone of Proximal Development, Cognitive Load Theory) and the three core alignment principles of AI safety (Honesty, Helpfulness, Harmlessness). The dimensions are: (i) Factual Correctness (Epistemic Risk, H1), (ii) Explanatory Depth & Completeness (Pedagogical Risk, H2), (iii) Focus & Relevance (Cognitive Risk, H2), (iv) Student‑Level Appropriateness (Developmental Risk, H3), and (v) Ideological Bias (Normative Risk, H3). Each explanation is annotated with binary risk labels for all five dimensions, yielding 4,270 labels (854 explanations × 5). Annotation followed a semi‑automatic pipeline: an LLM first suggested risk tags, then expert teachers reviewed and corrected them, balancing scalability with label quality. The distribution shows roughly 16 % of explanations flagged for the first four risks, while ideological bias appears in only 2 % of cases, highlighting its relative rarity but continued importance.
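The annotation scheme above amounts to one binary flag per rubric dimension per explanation. A minimal sketch of that record layout, with a check that the dimension and explanation counts reproduce the reported 4,270 labels (field names are assumptions, not the dataset's actual column names):

```python
from dataclasses import dataclass, fields

@dataclass
class RiskLabels:
    """One binary risk flag per rubric dimension.
    Field names are illustrative; the H1-H3 mapping follows the paper."""
    factual_correctness: bool      # Epistemic Risk (H1: Honesty)
    depth_completeness: bool       # Pedagogical Risk (H2: Helpfulness)
    focus_relevance: bool          # Cognitive Risk (H2: Helpfulness)
    level_appropriateness: bool    # Developmental Risk (H3: Harmlessness)
    ideological_bias: bool         # Normative Risk (H3: Harmlessness)

N_EXPLANATIONS = 854
N_DIMENSIONS = len(fields(RiskLabels))
# 854 explanations x 5 dimensions = 4,270 binary labels, as reported.
assert N_EXPLANATIONS * N_DIMENSIONS == 4270
```

In the semi-automatic pipeline, an LLM would propose an initial `RiskLabels` record per explanation, which expert teachers then review and correct.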
The dataset is released publicly (GitHub) together with the role prompts, enabling reproducibility and extension. To validate EduEVAL‑DB, the authors benchmarked two models: Gemini 2.5 Pro (a state‑of‑the‑art education‑oriented model) and Llama 3.1 8B (a lightweight open‑source model suitable for consumer GPUs). Both models were evaluated on the five risk dimensions using the binary labels as ground truth. Gemini achieved high overall accuracy (>90 %) but struggled with ideological bias detection (recall ≈45 %), indicating that even large models have difficulty spotting subtle normative risks. Llama 3.1 8B, after supervised fine‑tuning on EduEVAL‑DB, improved its recall for factual correctness, depth, and relevance by 6–8 percentage points, while maintaining inference latency of ~30 ms per explanation on an RTX 3080, demonstrating feasibility for real‑time classroom tools.
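The per-dimension recall figures cited above can be computed directly from the binary labels. A self-contained sketch (the evaluation harness itself is not described in this summary, so this is an assumption about the metric, not the authors' code):

```python
def recall(y_true: list[int], y_pred: list[int]) -> float:
    """Per-dimension recall over binary risk labels: the fraction of
    truly risky explanations that the model flags as risky."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    actual_pos = sum(y_true)
    return true_pos / actual_pos if actual_pos else 0.0

# Illustrative only: 5 risky explanations, the model catches 4 of them.
assert recall([1, 1, 1, 1, 1, 0, 0], [1, 1, 1, 1, 0, 0, 1]) == 0.8
```

The rarity of ideological-bias positives (~2 % of explanations) makes recall, rather than accuracy, the informative metric for that dimension: a model that never flags bias would still score high accuracy.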
The experiments show that (1) role‑based explanations expose distinct risk profiles, (2) a multi‑dimensional rubric captures complementary aspects of pedagogical quality beyond mere factuality, and (3) fine‑tuning on a modestly sized, well‑annotated dataset can substantially boost a lightweight model’s risk detection capabilities, making it deployable on consumer hardware.
Limitations include the fixed set of six teacher personas, which may not cover the full spectrum of real‑world teaching behaviors, and the English‑centric source material, limiting cultural and linguistic diversity. The binary nature of the risk labels also precludes nuanced severity grading. Future work should expand the role taxonomy, incorporate multilingual questions, explore multi‑level risk scoring, and conduct longitudinal studies linking risk detection to actual learning outcomes.
In summary, EduEVAL‑DB fills a gap in educational AI resources by providing a role‑aware, risk‑annotated corpus for K‑12 explanations. It offers a concrete benchmark for evaluating pedagogical safety and a practical pathway for fine‑tuning compact models that can operate on everyday devices, thereby advancing the responsible deployment of LLM‑driven tutors in schools.