TeachBench: A Syllabus-Grounded Framework for Evaluating Teaching Ability in Large Language Models


Large language models (LLMs) show promise as teaching assistants, yet their teaching capability remains insufficiently evaluated. Existing benchmarks mainly focus on problem-solving or problem-level guidance, leaving knowledge-centered teaching underexplored. We propose a syllabus-grounded evaluation framework that measures LLM teaching capability via student performance improvement after multi-turn instruction. By restricting teacher agents to structured knowledge points and example problems, the framework avoids information leakage and enables reuse of existing benchmarks. We instantiate the framework on Gaokao data across multiple subjects. Experiments reveal substantial variation in teaching effectiveness across models and domains: some models perform well in mathematics, while teaching remains challenging in physics and chemistry. We also find that incorporating example problems does not necessarily improve teaching, as models often shift toward example-specific error correction. Overall, our results highlight teaching ability as a distinct and measurable dimension of LLM behavior.


💡 Research Summary

TeachBench introduces a syllabus‑grounded evaluation framework that shifts the focus of large language model (LLM) assessment from pure problem‑solving to genuine teaching ability. The authors construct a hierarchical knowledge tree from the Chinese National College Entrance Examination (Gaokao) syllabus, using Gemini‑3 and GPT‑5 to extract and refine topics, then manually verify the structure. Each exam question is tagged with one or more root‑to‑leaf knowledge paths via a depth‑first LLM‑based tagger, establishing a fine‑grained mapping between items and the underlying concepts that should be learned.
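The depth-first tagging step can be sketched as a recursive walk over the knowledge tree: at each node, a relevance judge decides which children to descend into, and every relevant leaf yields a root-to-leaf path. A minimal sketch follows; the node shape, the toy tree, and the keyword-based `judge` are illustrative assumptions — in the paper the relevance decision at each node is made by an LLM.

```python
# Hypothetical sketch of the depth-first, LLM-based tagger. `is_relevant`
# stands in for the LLM relevance judgment at each tree node.

def tag_question(question, node, path=(), is_relevant=None):
    """Return every root-to-leaf knowledge path relevant to `question`."""
    path = path + (node["name"],)
    children = node.get("children", [])
    if not children:                    # leaf-level knowledge point reached
        return [list(path)]
    paths = []
    for child in children:
        if is_relevant(question, child["name"]):   # an LLM call in the real pipeline
            paths.extend(tag_question(question, child, path, is_relevant))
    return paths

# Toy knowledge tree and a keyword-matching stand-in for the LLM judge.
tree = {
    "name": "Mathematics",
    "children": [
        {"name": "Functions", "children": [
            {"name": "Quadratic functions"},
            {"name": "Exponential functions"},
        ]},
        {"name": "Geometry", "children": [{"name": "Solid geometry"}]},
    ],
}
judge = lambda q, topic: all(w.lower() in q.lower() for w in topic.split())

paths = tag_question(
    "Compare exponential functions and quadratic functions growth", tree,
    is_relevant=judge,
)
print(paths)   # one root-to-leaf path per matched topic
```

Because a question may match several leaves, the tagger naturally produces the "one or more knowledge paths" per item that the summary describes.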

For every leaf‑level knowledge point, a question generator (Gemini‑2.5‑Pro) creates three example problems of increasing difficulty (easy, medium, hard). The generator first attempts web retrieval; if suitable items are unavailable or low‑quality, it synthesizes new problems while enforcing constraints that prevent direct reuse of official Gaokao items. A second‑pass verifier checks alignment with the knowledge point, answer correctness, solution validity, and difficulty calibration, iterating until the item passes all checks. This pipeline yields a curated set of pedagogical materials that the teacher LLM can use without ever seeing the target exam questions, thereby eliminating information leakage.
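The generate-and-verify loop above can be summarized as: produce a candidate item, run it through the checks, and regenerate with the failed checks as feedback until everything passes. The sketch below is an assumption about control flow only — `generate` and `verify` are hypothetical stubs standing in for the Gemini-2.5-Pro generator and the second-pass verifier, not the paper's actual prompts.

```python
# Illustrative generate-and-verify loop for one knowledge point and difficulty.

def curate_problem(knowledge_point, difficulty, generate, verify, max_rounds=5):
    """Regenerate until the item passes every verifier check, or give up."""
    feedback = None
    for _ in range(max_rounds):
        item = generate(knowledge_point, difficulty, feedback)
        checks = verify(item)   # alignment, answer, solution, difficulty
        if all(checks.values()):
            return item
        feedback = [name for name, ok in checks.items() if not ok]
    return None  # item could not be repaired within the budget

# Toy stubs: the first draft fails the difficulty check, the second passes.
attempts = []
def gen(kp, diff, fb):
    attempts.append(fb)
    return {"kp": kp, "difficulty": diff, "try": len(attempts)}
def ver(item):
    return {"alignment": True, "answer": True,
            "solution": True, "difficulty": item["try"] > 1}

item = curate_problem("Quadratic functions", "medium", gen, ver)
```

Feeding the list of failed checks back into the generator is one plausible way to realize the "iterating until the item passes all checks" behavior described above.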

The teaching scenario involves two agents: a teacher LLM (the model under evaluation) and a student LLM of fixed, moderate capability that serves as a proxy for a human learner. The teacher receives only the knowledge points and the generated example problems, not the actual test items. It conducts a multi‑turn dialogue, providing explanations, hints, and feedback for each knowledge point. After each student response, the teacher decides whether the concept has been mastered or whether further clarification is needed. The interaction continues until the teacher judges all target concepts mastered, at which point the session ends.
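The two-agent protocol reduces to a loop: the teacher addresses the current knowledge point, the student responds, and the teacher then judges whether to move on. Here is a minimal sketch; the `teacher`, `student`, and `judge` callables are placeholder stand-ins for the evaluated LLM, the fixed-capability proxy learner, and the teacher's mastery decision, and the turn cap is an assumed safeguard.

```python
# Minimal sketch of the multi-turn teaching dialogue between two agents.

def teach(knowledge_points, teacher, student, judge, max_turns=20):
    """Run the dialogue until the teacher judges every point mastered."""
    history, pending, turns = [], list(knowledge_points), 0
    while pending and turns < max_turns:
        point = pending[0]
        history.append(("teacher", teacher(point, history)))  # explain / hint / correct
        history.append(("student", student(history)))         # proxy learner replies
        turns += 1
        if judge(point, history):   # teacher's per-point mastery decision
            pending.pop(0)
    return history, pending

# Toy agents: the teacher explains, the student acknowledges, and the
# judge declares mastery after a single exchange per point.
history, unfinished = teach(
    ["Quadratic functions", "Solid geometry"],
    teacher=lambda point, h: f"Let's work through {point}.",
    student=lambda h: "I think I understand now.",
    judge=lambda point, h: True,
)
```

The accumulated `history` is exactly the artifact the evaluation step reuses: it is handed to the student before the post-instruction attempt.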

Evaluation proceeds in three steps. First, the student LLM attempts the held‑out test questions to obtain a pre‑instruction accuracy baseline. Second, the teaching dialogue is conducted in a fresh session, with the teacher never seeing the test items. Third, the student LLM, now equipped with the full dialogue history, re‑answers the same test questions. The difference between post‑ and pre‑instruction accuracy constitutes the “teaching effectiveness score.” This metric directly quantifies how much the teacher model improves a learner’s performance, rather than measuring the teacher’s ability to produce a correct answer itself.
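Numerically, the score is just the post-instruction accuracy minus the pre-instruction baseline, reported here in percentage points. The sketch below assumes exact-match grading of answer labels; the sample answer lists are invented for illustration.

```python
# Teaching effectiveness = post-instruction accuracy - pre-instruction accuracy.

def accuracy(answers, gold):
    """Fraction of exact-match correct answers."""
    return sum(a == g for a, g in zip(answers, gold)) / len(gold)

def teaching_effectiveness(pre_answers, post_answers, gold):
    """Percentage-point gain attributable to the teaching dialogue."""
    return 100 * (accuracy(post_answers, gold) - accuracy(pre_answers, gold))

gold = ["B", "C", "A", "D"]
pre  = ["B", "A", "A", "A"]   # 2/4 correct before instruction
post = ["B", "C", "A", "A"]   # 3/4 correct after reading the dialogue history
print(teaching_effectiveness(pre, post, gold))   # 25.0 points
```

Reporting the gain in points rather than a ratio matches how the results below are stated (e.g. a +7.63-point gain in Mathematics).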

Experiments cover multiple subjects (Mathematics, Physics, Chemistry, Biology, History, Geography, Politics) and a range of contemporary LLMs, including Qwen‑3‑235B‑A22B‑Instruct, Claude‑4, GPT‑5‑mini, DeepSeek‑V3.2, and others. Results reveal several key patterns:

  1. Model‑by‑subject variance – Qwen‑3‑235B‑A22B‑Instruct achieves the largest gain in Mathematics (+7.63 points), while most models show modest or negligible improvements in Physics and Chemistry. This suggests that current LLMs excel at teaching domains where knowledge points map directly to procedural steps, but struggle when deep integration of concepts is required.

  2. Impact of example problems – Supplying example problems does not uniformly boost teaching effectiveness. Many models gravitate toward “example‑based error correction,” focusing on fixing mistakes in the provided examples rather than delivering syllabus‑grounded explanations. Consequently, the student’s performance on unseen test items sometimes deteriorates, indicating a lack of generalization.

  3. Dialogue efficiency – The average number of turns needed to complete a teaching session varies widely. Claude‑4 often requires more than 12 turns, whereas Qwen‑3‑235B‑A22B‑Instruct typically finishes within 6–8 turns, reflecting more concise instructional strategies. Longer dialogues do not guarantee higher gains, highlighting the importance of effective feedback timing.

  4. Leakage control and reusability – By restricting teacher inputs to knowledge points and curated examples, the framework prevents the teacher from simply regurgitating target questions. This design enables the reuse of existing exam benchmarks (Gaokao, CMMLU, etc.) for teaching evaluation without compromising test integrity.

The authors conclude that teaching ability constitutes a distinct, measurable dimension of LLM behavior, separate from raw problem‑solving prowess. TeachBench provides a systematic, reproducible methodology for quantifying this dimension and exposes current limitations: inadequate handling of complex, integrative subjects and over‑reliance on example‑driven correction. Future work is suggested in three directions: extending the framework to other curricula (e.g., SAT, International Baccalaureate), conducting human‑in‑the‑loop studies to validate proxy‑student results, and developing teacher models that incorporate meta‑learning or knowledge‑graph reasoning to better integrate disparate concepts.

Overall, TeachBench represents a pioneering step toward evaluating and ultimately improving LLMs as genuine educators, offering a benchmark that aligns AI development with the pedagogical goals of real‑world educational systems.

