Rethinking AI Evaluation in Education: The TEACH-AI Framework and Benchmark for Generative AI Assistants
📝 Original Info
- Title: Rethinking AI Evaluation in Education: The TEACH-AI Framework and Benchmark for Generative AI Assistants
- ArXiv ID: 2512.04107
- Date: 2025-11-28
- Authors: Shi Ding, Brian Magerko
📝 Abstract
As generative artificial intelligence (AI) continues to transform education, most existing AI evaluations rely primarily on technical performance metrics such as accuracy or task efficiency while overlooking human identity, learner agency, contextual learning processes, and ethical considerations. In this paper, we present TEACH-AI (Trustworthy and Effective AI Classroom Heuristics), a domain-independent, pedagogically grounded, and stakeholder-aligned benchmark framework with measurable indicators and a practical toolkit for guiding the design, development, and evaluation of generative AI systems in educational contexts. Built on an extensive literature review and synthesis, the ten-component assessment framework and toolkit checklist provide a foundation for scalable, value-aligned AI evaluation in education. TEACH-AI rethinks "evaluation" through sociotechnical, educational, theoretical, and applied lenses, engaging designers, developers, researchers, and policymakers across AI and education. Our work invites the community to reconsider what constitutes "effective" AI in education and to design model evaluation approaches that promote co-creation, inclusivity, and long-term human, social, and educational impact.
📄 Full Content
User experience (UX) evaluation plays a crucial role in assessing the effectiveness and acceptability of educational AI systems, particularly in human-centered contexts. Evaluation frameworks typically combine core criteria such as accuracy, clarity, feedback usefulness, or engagement potential [24,25,26], and human evaluation remains critical [27,28]. However, studies show that AI-generated feedback, although often perceived as an immediate assistant, also poses unique challenges related to explainability, bias, and ethical use [29,30,31], as well as trust, fairness, academic integrity, and the need for AI literacy among educators and learners [32,33,34]. A notable limitation of existing UX evaluation research is its emphasis on domain-specific systems, such as tutoring systems for STEAM platforms like code.org [15] or for writing tasks [35]. Our work addresses this gap by proposing a domain-independent UX evaluation framework that can generalize across creative, interdisciplinary learning environments such as Scratch, a block-based programming platform that supports storytelling and games [36]; Teachable Machine, which lets users train machine learning models with images and sounds [37,38]; and EarSketch, an expressive programming platform that teaches both music and coding [39]. These platforms support diverse, cross-disciplinary learning, but they lack standardized frameworks to assess outcomes such as adaptability, ethical awareness, human values, and stakeholder alignment. A flexible, domain-independent benchmark is needed to capture the broader educational impact of these tools across varied contexts [40,41]. In this paper, we define "domain-independent evaluation" as evaluating AI across multiple subject areas, which requires generalizable, content-neutral metrics [42,40].
For decades, traditional Intelligent Tutoring Systems (ITS) have aimed to deliver individualized instruction by modeling student knowledge and guiding problem-solving (the inner loop) and instructional sequencing (the outer loop) through predefined rule-based decision trees. These systems have shown positive learning outcomes through features such as immediate feedback and adaptivity [4,43]. However, their reliance on rules limits responsiveness to dynamic learning scenarios and diverse student behaviors. In contrast, generative AI tutors are emerging to address these limitations using adaptive techniques such as retrieval-augmented generation for producing context-aware, coherent responses [44,45], reinforcement learning to optimize teaching strategies based on student feedback [46], deep knowledge tracing to model and predict student understanding over time [47], and approaches that prioritize long-term retention and self-regulated learning over immediate correctness to support deeper learning [25,48].
Benchmarks are critical for evaluating educational AI systems, offering standardized tasks, datasets, and metrics to assess performance. In this context, we adopt the definition of a benchmark as "a combination of task, dataset, and metric" used to evaluate how AI systems support learning [42,25]. However, most existing benchmarks for large language models (LLMs) focus on general reasoning or factual recall [40], with few targeting pedagogical efficacy in real-world learning contexts [49]. This gap raises concerns about the lack of stakeholder validation and limited alignment with teaching and learning needs and contexts [42]. As Shute and Ventura emphasize, educational evaluation must move beyond correctness to include formative, contextual, and human-centered outcomes [25,28]. Anderson et al. [50] further demonstrated how AI tutors can be benchmarked to support procedural knowledge through structured feedback. Building on this, our work addresses the need for pedagogically grounded, stakeholder-aligned benchmarks that reflect how generative AI supports learning in authentic, situated contexts [6,7].
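To make the adopted definition concrete, a benchmark entry under the "task, dataset, and metric" reading can be modeled as a simple triple. The sketch below is our own illustrative data structure; the field names, example task, and placeholder metric are assumptions, not artifacts from the paper.

```python
from dataclasses import dataclass
from typing import Callable, List, Dict

@dataclass
class EducationalBenchmark:
    """A benchmark as 'task + dataset + metric', per the definition adopted above."""
    task: str                             # what the AI system is asked to do
    dataset: List[Dict[str, str]]         # items the system is evaluated on
    metric: Callable[[List[str]], float]  # scores the system's outputs on those items

# Hypothetical usage: a pedagogical-efficacy metric would replace the placeholder below.
feedback_benchmark = EducationalBenchmark(
    task="give formative feedback on a Scratch project",
    dataset=[{"project": "maze game", "student_goal": "add scoring"}],
    metric=lambda outputs: sum(len(o) > 0 for o in outputs) / len(outputs),  # placeholder metric
)
```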
Our contribution in this paper is to address these research gaps by proposing the TEACH-AI (Trustworthy and Effective AI Classroom Heuristics) Benchmark Framework, a domain-independent, value-aligned, human-centered conceptual benchmark, along with a practical toolkit for evaluating generative AI tutors. Informed by a synthesis of over 126 publications, our framework serves as a starting point to guide the design and evaluation of pedagogically meaningful AI-driven learning experiences.
We conducted a scoping review following Arksey and O’Malley’s framework [51,52] to examine how AI agents are evaluated in educational environments. Guided by the question “How are AI agents evaluated in educational environments?”, we performed targeted searches across major venues (e.g., CHI, NeurIPS, IDC, AIED) and Google Scholar. In total, we reviewed 126 relevant sources, including 27 conference papers, 78 journal articles, and 21 books and gray literature. These were categorized into three thematic phases: the pre-LLM era (pre-2017, focused on early ITS and HCI) with 37 papers, the transformer era (2017-2022, marked by the rise of XAI and AI literacy) [53,34] with 43 papers, and the generative AI phase (2023-present, emphasizing co-design and agent collaboration) [54,55], with 36 papers.
Through iterative coding and synthesis, we identified ten recurring components relevant to human-centered evaluation, including explainability, adaptivity, usability, ethical use, and accessibility. These insights informed a practical toolkit of reflective prompts [56,7] and a simplified scoring structure inspired by Meadows' leverage points [57]. Regular weekly meetings with a senior faculty advisor facilitated thematic validation and iterative refinement of interpretations, ensuring conceptual rigor and alignment with human-centered AI evaluation principles in both the TEACH-AI benchmark and toolkit design.
In this section, we revisit our research question and present an initial benchmark framework along with a practical toolkit, drawing on existing literature, to address which evaluation components constitute effective, value-aligned human-AI collaboration in educational domains. We define each component in detail and synthesize these findings into preliminary design implications to inform future benchmark development for generative AI tutoring agents. The framework adopts a value-sensitive, human-centered perspective; it structures the analysis to address gaps in existing evaluation approaches by strengthening the focus on both cognitive and sociotechnical arguments, and it offers a foundation for iterative refinement through future research.
To address the first part of the research question, What criteria define effective, value-aligned human-AI collaboration in education?, we first define ten core components that form the basis of our evaluation framework (see Table 1): explainability, helpfulness, adaptivity, consistency, creative exploration, system usability, ethical responsibility, accessibility, workflow, and refinement. We then provide a detailed table outlining sub-components with indicators or metrics and relevant key references.
Explainability: The agent's ability to present its reasoning and decision-making in clear, contextually meaningful, and human-understandable terms [58,59,60].
Helpfulness: The extent to which the agent supports educational stakeholders, such as teachers and learners, in achieving their goals through actionable, pedagogically appropriate assistance [61,62,16].
Adaptivity: The system's responsiveness to user preferences, contexts, and needs through personalization and dynamic guidance, including flexible exploration that fosters learner autonomy and confidence [63,64,65,26].
Consistency: The stability and trustworthiness of system outputs under similar conditions, and the alignment of behavior and language across tasks and situations [58,26].
Creative Exploration: The agent's capacity to foster curiosity, support diverse solution paths, and encourage reflective, open-ended inquiry and long-term learner autonomy [6,66,16,67,26].
System Usability: The effectiveness and ease of interaction, supporting efficient, intuitive, role-shifting, and error-resistant interactions between users and AI systems [24,26,68].
Ethical Responsibility: The system's ability to act in alignment with human values; legal, ethical, and educational norms; and cultural sensitivities, even under adversarial conditions. It requires agents to avoid harm, ensure fairness, protect privacy, and safeguard student data and voice [69,70,31,71,40].
Accessibility: The extent to which the system is usable by, and provides equitable access to, people with diverse abilities, including those using assistive technologies [72,73,55,74,75].
Workflow: The agent's ability to support multi-step, human-AI collaboration among teachers, students, and other stakeholders while maintaining adaptability in dynamic learning contexts [55,24].
Refinement: The system's ability to support iterative improvement through (a) users correcting AI errors, (b) users adjusting vague or biased feedback, and (c) ethically traceable revisions [76,69,26].
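For readers who want to operationalize the framework in software, one minimal way to represent the ten components (with room for the toolkit's reflective prompts) is sketched below. The definitions are paraphrased and the sample prompt is hypothetical; Table 1's sub-components, indicators, and references would attach to the same structure.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Component:
    """One TEACH-AI evaluation component (definitions paraphrased from the framework)."""
    name: str
    definition: str
    reflective_prompts: List[str] = field(default_factory=list)  # toolkit questions

TEACH_AI_COMPONENTS = [
    Component("Explainability",
              "Reasoning presented in clear, contextually meaningful, human-understandable terms",
              ["Can users see why the agent produced this response?"]),  # hypothetical prompt
    Component("Helpfulness", "Actionable, pedagogically appropriate assistance toward stakeholder goals"),
    Component("Adaptivity", "Responsiveness to preferences, contexts, and needs via personalization"),
    Component("Consistency", "Stable, trustworthy outputs and aligned behavior across tasks"),
    Component("Creative Exploration", "Support for curiosity, diverse solution paths, and open-ended inquiry"),
    Component("System Usability", "Efficient, intuitive, role-shifting, error-resistant interaction"),
    Component("Ethical Responsibility", "Alignment with human values, norms, fairness, privacy, and safety"),
    Component("Accessibility", "Equitable use by people with diverse abilities, including assistive technologies"),
    Component("Workflow", "Support for multi-step, multi-stakeholder human-AI collaboration"),
    Component("Refinement", "Iterative improvement via error correction and traceable revisions"),
]
```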
Overall, the TEACH-AI benchmark framework addresses three critical, interconnected arguments in evaluation: (1) the agent's capacity for explainability, adaptability, helpfulness, and consistency, including interpretable, context-aware justifications [6,63], dynamic adaptation to human needs [62,77,78,79], and stable, reliable outputs across similar conditions [26,80]; (2) the extent to which the agent fosters creative exploration, emotional engagement, and deep thinking by scaffolding open-ended problem-solving, supporting divergent approaches, encouraging productive struggle, and enabling transferable learning processes across domains [24,43,81,63,26]; and (3) the degree to which the agent operates responsibly and accessibly and is open to refinement, including ethical behavior under adversarial conditions [69,31,40,82], equitable access for diverse learners, and support for iterative improvement through feedback, error recovery, and coordination in multi-step, multi-stakeholder workflows [26,83].
To illustrate how TEACH-AI can inform early-stage evaluation, we outline how the TEACH-AI framework could be applied to evaluate domain-independent generative AI assistants in educational settings [110]. The framework’s ten components can be selectively applied depending on research goals, stakeholder roles, and contextual factors. For instance, studies involving a single agent may emphasize components such as helpfulness or explainability, whereas multi-agent settings may prioritize coordination or workflow support. Similarly, accessibility considerations should be adapted based on the characteristics and needs of the target user population.
More broadly, TEACH-AI encourages researchers and designers to reflect on how generative AI systems support education, creativity, values, and human agency. By applying the framework iteratively, practitioners can identify where the system meets expectations and where further refinement is needed, guiding more thoughtful and contextually grounded algorithmic design decisions.
We also introduce a preliminary toolkit intended to help practitioners apply TEACH-AI in practice. The toolkit offers a set of reflective questions aligned with each framework component, supporting structured evaluation across different educational and design contexts. Rather than serving as a prescriptive checklist, these prompts help users identify strengths, gaps, and opportunities for improvement in an AI system's behavior and its alignment with human-centered values. The goal of the toolkit is to guide consistent reflection and comparison across contexts, whether in classroom use, design or model-development reviews, or early research prototyping. Future iterations will refine these prompts and explore ways to support broader, scalable evaluation workflows.
The checklist can be used by educators, researchers, and designers to assess human-centered AI alignment, and it is intended to support reflective practice rather than function as a prescriptive to-do list. It translates abstract values (e.g., explainability) into actionable criteria that can be applied across technical design, policymaking, training, and research contexts [20]. In classroom settings, including those using tools like ChatGPT, the checklist guides scalable evaluation by allowing raters to assess each criterion using either a simple Yes/No option or a progressive scale [57]. This approach provides a clear foundation for assessing an AI system's alignment with human values, contextual demands, and broad AI development principles such as transparency and safety across diverse educational environments. The resulting TEACH-AI index can support both reflective classroom practice and quantitative research analysis, enabling consistent comparisons and guiding ongoing model evaluation efforts in the educational domain.
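To illustrate how the Yes/No and progressive-scale ratings might roll up into the TEACH-AI index mentioned above, here is a minimal scoring sketch. The equal weighting, the 0-3 scale, and the normalization to a 0-1 index are our assumptions for illustration; the toolkit does not prescribe a specific formula.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ComponentRating:
    """A single rater's judgment for one TEACH-AI component."""
    component: str
    score: float      # 0/1 for a Yes/No item, or e.g. 0-3 on a progressive scale
    max_score: float  # top of the scale used for this item

def teach_ai_index(ratings: List[ComponentRating]) -> float:
    """Aggregate ratings into a 0-1 index (hypothetical, equal-weight rule).

    Each rated component contributes its normalized score, so Yes/No items
    and scaled items can be mixed in one evaluation pass.
    """
    if not ratings:
        raise ValueError("At least one component rating is required.")
    return sum(r.score / r.max_score for r in ratings) / len(ratings)

# Example: mixing a Yes/No item with two progressive-scale items.
example = [
    ComponentRating("explainability", 1, 1),          # Yes
    ComponentRating("helpfulness", 2, 3),             # partially met
    ComponentRating("ethical_responsibility", 3, 3),  # fully met
]
print(f"TEACH-AI index: {teach_ai_index(example):.2f}")  # 0.89
```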
In summary, we introduce TEACH-AI, a ten-component, human-centered benchmark framework and toolkit for evaluating generative AI systems in education. While the current version is primarily conceptual, it highlights the need for evaluation approaches that align with emerging educational needs, ethical design principles, and human values. Importantly, TEACH-AI bridges human-generated feedback and LLM-generated feedback by providing a unified structure that supports both human evaluators and LLM-as-judge methods.
Moving forward, our work will involve co-design with diverse stakeholders and iterative refinement of the framework across different educational contexts. We also plan to explore technical development, such as integrating TEACH-AI into a scalable digital prototype for large-scale benchmarking. This direction aligns with broader trends in human-centered AI evaluation, for example the use of LLM-as-judge methods for automated assessment and research on Reinforcement Learning from AI Feedback (RLAIF), which highlight the growing emphasis on reliable feedback signals for shaping AI behavior. Our long-term goal is to support the development of accessible, responsible, and pedagogically aligned AI evaluation ecosystems that drive meaningful impact in real educational settings.
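As one possible shape for the scalable prototype and the LLM-as-judge direction mentioned above, the sketch below structures an automated rating pass around a subset of TEACH-AI components. The `call_llm` callable, the prompt wording, and the JSON schema are placeholders and assumptions on our part, not part of the framework or any specific vendor API.

```python
import json
from typing import Callable, Dict

# Components scored in this hypothetical judging pass (a subset of the ten).
RUBRIC_COMPONENTS = ["explainability", "helpfulness", "adaptivity", "consistency"]

JUDGE_PROMPT = """You are evaluating an AI tutor's response against the TEACH-AI
components listed below. For each component, give a score from 0 (not met) to
3 (fully met) and a one-sentence justification. Return strict JSON of the form
{{"scores": {{"<component>": {{"score": 0, "reason": "..."}}}}}}.

Components: {components}

Student prompt:
{student_prompt}

Tutor response:
{tutor_response}
"""

def judge_response(call_llm: Callable[[str], str],
                   student_prompt: str,
                   tutor_response: str) -> Dict:
    """Run one LLM-as-judge pass and parse the JSON verdict.

    `call_llm` is a placeholder for whatever LLM client is available; it takes
    a prompt string and returns the model's text response.
    """
    prompt = JUDGE_PROMPT.format(
        components=", ".join(RUBRIC_COMPONENTS),
        student_prompt=student_prompt,
        tutor_response=tutor_response,
    )
    raw = call_llm(prompt)
    return json.loads(raw)  # production use would need to handle malformed JSON
```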