EduEval: An Evaluation Benchmark for Large Language Models in Chinese K-12 Education
📝 Abstract
Large language models (LLMs) demonstrate significant potential for educational applications. However, their unscrutinized deployment poses risks to educational standards, underscoring the need for rigorous evaluation. We introduce EduEval, a comprehensive hierarchical benchmark for evaluating LLMs in Chinese K-12 education. This benchmark makes three key contributions: (1) Cognitive Framework: We propose the EduAbility Taxonomy, which unifies Bloom’s Taxonomy and Webb’s Depth of Knowledge to organize tasks across six cognitive dimensions: Memorization, Understanding, Application, Reasoning, Creativity, and Ethics; (2) Authenticity: Our benchmark integrates real exam questions, classroom conversations, student essays, and expert-designed prompts to reflect genuine educational challenges; (3) Scale: EduEval comprises 24 distinct task types with over 11,000 questions spanning primary to high school levels. We evaluate 14 leading LLMs under both zero-shot and few-shot settings, revealing that while models perform well on factual tasks, they struggle with classroom dialogue classification and exhibit inconsistent results in creative content generation. Interestingly, several open-source models outperform proprietary systems on complex educational reasoning. Few-shot prompting shows varying effectiveness across cognitive dimensions, suggesting that different educational objectives require tailored approaches. These findings provide targeted benchmarking metrics for developing LLMs specifically optimized for diverse Chinese educational tasks. The EduEval dataset and code are publicly available at https://github.com/Maerzs/E_edueval .
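The hierarchical organization the abstract describes (cognitive dimension → task type → question, across school levels) can be pictured as a simple data model with per-dimension scoring. The sketch below is a minimal illustration under assumed field names and example values; it is not the actual EduEval schema or evaluation code.

```python
from dataclasses import dataclass

# The six cognitive dimensions of the EduAbility Taxonomy, as named in the abstract.
DIMENSIONS = ("Memorization", "Understanding", "Application",
              "Reasoning", "Creativity", "Ethics")

@dataclass
class BenchmarkItem:
    """One question in a hierarchical benchmark (hypothetical schema)."""
    dimension: str        # one of DIMENSIONS
    task_type: str        # e.g. "exam_mcq" -- illustrative name, not an EduEval task ID
    school_level: str     # "primary", "middle", or "high"
    question: str
    reference_answer: str

def accuracy_by_dimension(items, predictions):
    """Exact-match accuracy grouped by cognitive dimension."""
    totals, correct = {}, {}
    for item, pred in zip(items, predictions):
        totals[item.dimension] = totals.get(item.dimension, 0) + 1
        if pred.strip() == item.reference_answer.strip():
            correct[item.dimension] = correct.get(item.dimension, 0) + 1
    return {d: correct.get(d, 0) / n for d, n in totals.items()}
```

Reporting results per dimension rather than as a single aggregate score is what lets a benchmark like this expose the uneven profile the abstract mentions (strong on factual recall, weak on dialogue classification).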
📄 Content
EduEval: A Hierarchical Cognitive Benchmark for Evaluating Large Language Models in Chinese Education

Guoqing Ma1, Jia Zhu1, Hanghui Guo1, Weijie Shi2, Yue Cui2, Jiawei Shen1, Zilong Li1, Yidan Liang1
1Zhejiang Normal University, Zhejiang, China
2Hong Kong University of Science and Technology, Hong Kong, China
1 Introduction

The rapid advancement of large language models (LLMs) in natural language processing (NLP) and artificial intelligence (AI) has opened new possibilities for their application in education. With strong capabilities in language understanding, reasoning, and text generation, LLMs have been widely adopted in intelligent tutoring, automated grading, and educational content generation, significantly enhancing personalized learning and instructional efficiency (Brown et al., 2020; Ouyang et al., 2022; Wei et al., 2022). Their impact now extends from primary and secondary education to higher education and vocational training (Glaser et al., 2001; Kasneci et al., 2023; Peng et al., 2023), with great potential shown in real-world teaching activities like lesson preparation assistance and interactive classroom support (Kasneci et al., 2023; Chang et al., 2024; Gan et al., 2023; Huang et al., 2025).

However, evaluating LLMs in authentic educational settings remains a significant challenge (Guo et al., 2021). The educational process is inherently dynamic and heterogeneous (Koopmans, 2020), with diverse student cognitive profiles requiring constant adaptation (Haelermans, 2022). Current mainstream evaluation methods rely heavily on static assessments (Chen et al., 2024), such as standardized multiple-choice questions, which fail to capture the complex reasoning, generative expression, and ethical decision making required in real-world classrooms (Allen and Kendeou, 2024; Yan et al., 2024). As a result, these evaluations tend to underestimate the true educational potential of LLMs (Srivastava et al., 2022). While general-purpose benchmarks such as MMLU (Hendrycks et al., 2020), BIG-bench (Srivastava et al., 2022), and HELM (Liang et al., 2022) cover broad abilities, they fall short in addressing the specific needs of educational applications.
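The zero-shot versus few-shot settings mentioned in the abstract can be illustrated with a minimal prompt builder. The template wording below is an assumption for illustration only, not the prompt format EduEval actually uses.

```python
def build_prompt(question, examples=None):
    """Build a zero-shot prompt (examples=None) or a k-shot prompt from
    (question, answer) demonstration pairs. Hypothetical Chinese QA template."""
    parts = []
    for q, a in (examples or []):      # demonstrations prepended for few-shot
        parts.append(f"问题: {q}\n答案: {a}")
    parts.append(f"问题: {question}\n答案:")  # the item to be answered
    return "\n\n".join(parts)
```

Keeping the demonstration format identical to the query format is the standard few-shot convention; the paper's finding that few-shot gains vary by cognitive dimension suggests the number and choice of demonstrations may need to differ per task.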
Similarly, Chinese benchmarks like E-Eval (Hou et al., 2024), AGIEval (Zhong et al., 2024), and C-Eval (Huang et al., 2023) are still dominated by low-level cognitive questions and single-turn formats, lacking systematic evaluation of higher-order abilities and providing limited diversity in task contexts and cognitive depth (Long et al., 2024).

arXiv:2512.00290v1 [cs.CL] 29 Nov 2025

To bridge this gap, a robust benchmark must assess a spectrum of cognitive abilities grounded in authentic pedagogical contexts. Established frameworks like Bloom’s Taxonomy of Educational Objectives (Krathwohl, 2002) and Webb’s Depth of Knowledge (DOK) model (Masharipova, 2024) provide a foundation for structuring cognitive levels, but a comprehensive evaluation requires a synthesized framework tailored specifically for the multifaceted demands of K-12 education. We introduce EduEval, a hierarchical benchmark designed to meet this need by systematically evaluating LLMs on Chinese K-12 educational tasks. EduEval prioritizes educational authenticity by incorpora