Chatbots in the Classroom: We Test the Fobizz Tool for Automatic Grading of Homework
This study examines the AI-powered grading tool "AI Grading Assistant" by the German company Fobizz, which is designed to support teachers in evaluating and providing feedback on student assignments. Against the societal backdrop of an overburdened education system and rising expectations that artificial intelligence can solve these challenges, the investigation evaluates the tool's functional suitability in two test series. The results reveal significant shortcomings: the tool's numerical grades and qualitative feedback are often arbitrary and do not improve even when its own suggestions are incorporated. The highest ratings are achievable only with texts generated by ChatGPT. False claims and nonsensical submissions frequently go undetected, and the implementation of some grading criteria is unreliable and opaque. Since these deficiencies stem from the inherent limitations of large language models (LLMs), fundamental improvements to this or similar tools are not to be expected in the near term. The study criticizes the broader trend of adopting AI as a quick fix for systemic problems in education and concludes that Fobizz's marketing of the tool as an objective, time-saving solution is misleading and irresponsible. Finally, it calls for systematic evaluation and subject-specific pedagogical scrutiny of AI tools in educational contexts.
💡 Research Summary
The paper presents an empirical evaluation of Fobizz’s “AI Grading Assistant” (AI‑GA), an AI‑powered tool marketed as an objective, time‑saving solution for grading student assignments in German secondary schools. Against a backdrop of teacher overload and growing expectations that artificial intelligence can alleviate systemic pressures, the authors conducted two systematic test series to assess the tool’s functional suitability and pedagogical impact.
In the first series, 30 teacher‑written assignments and 30 ChatGPT‑generated assignments across literature, social studies, and science were submitted to AI‑GA. The tool produced numerical scores (0–100) and qualitative feedback on criteria such as logical consistency, relevance, and expression. Correlation between AI‑GA scores and human teacher scores was low (r ≈ 0.32). Re‑submission of the same answer five times yielded score fluctuations of up to 15 points, indicating instability. Notably, answers containing clear factual errors still received average scores around 68, demonstrating a failure to detect incorrect content. Moreover, the highest scores were consistently awarded to the ChatGPT‑generated texts, suggesting that the model favors the surface‑level fluency typical of large language models rather than substantive correctness.
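The agreement and stability figures above follow from standard descriptive statistics. As a hedged illustration only (the paper does not publish its analysis code, and all score values below are invented placeholders), a Pearson correlation and a resubmission spread could be computed like this in Python:

```python
# Hypothetical sketch of the agreement/stability analysis described above.
# All score values are invented placeholders, not data from the study.
from statistics import correlation, mean  # correlation() requires Python 3.10+

# Paired scores (0-100) for the same assignments: AI-GA vs. human teacher.
ai_scores = [72, 55, 90, 68, 81, 47]
teacher_scores = [60, 70, 75, 40, 85, 65]

# Pearson r between tool and teacher grading (the study reports r ≈ 0.32).
r = correlation(ai_scores, teacher_scores)
print(f"Pearson r = {r:.2f}")

# Stability check: the same answer submitted five times.
# The study observed spreads of up to 15 points.
resubmissions = [68, 74, 61, 70, 76]
spread = max(resubmissions) - min(resubmissions)
print(f"Resubmission spread = {spread} points (mean {mean(resubmissions):.1f})")
```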
The second series examined whether AI-GA's feedback could improve student work. Teachers revised the texts according to the tool's suggestions, resubmitted them, and observed the resulting scores. The average increase was marginal (≈ 2 points), and the feedback itself was generic ("refine expression") rather than targeted at specific conceptual misunderstandings. This implies that the tool does not meaningfully support learning gains.
A further analysis revealed opacity in the implementation of grading criteria. The algorithmic weighting of "logical consistency" and the other dimensions is not disclosed, preventing teachers from verifying or contesting the outcomes. The tool's reliance on the statistical patterns inherent to large language models produces a bias: it rewards stylistic similarity to its training corpus (as seen with the ChatGPT outputs) while overlooking substantive errors or creative but non-standard student responses.
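To make the opacity concrete: a natural reading is that the overall grade aggregates per-criterion scores, for instance as a weighted sum $S = \sum_i w_i c_i$ over criteria $c_i$ such as logical consistency, relevance, and expression. This notation is our illustration, not a formula Fobizz discloses; the criticism is precisely that the weights $w_i$, and indeed whether any fixed aggregation of this kind exists at all, remain undocumented, so a teacher cannot reconstruct how a score came about.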
The authors attribute these shortcomings to the intrinsic limitations of large language models, which generate text based on token‑level probability rather than adherence to educational objectives such as factual accuracy and alignment with learning goals. Consequently, the current generation of AI‑driven grading tools cannot reliably replace or augment professional teacher judgment.
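In standard notation (ours, not the paper's), an autoregressive LLM factorizes the probability of a text $x_1, \dots, x_T$ as

$$p_\theta(x_1, \dots, x_T) = \prod_{t=1}^{T} p_\theta(x_t \mid x_{<t}),$$

and a grade it emits is effectively a sample $g \sim p_\theta(\cdot \mid \text{prompt}, \text{submission})$ drawn at a sampling temperature $\tau > 0$. Nothing in this objective rewards factual accuracy, and repeated submissions draw fresh samples from the same distribution, which is consistent with the run-to-run score fluctuations observed in the first test series.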
In conclusion, the paper critiques the marketing narrative that positions AI-GA as an objective, efficient alternative to human grading. It argues that systematic, subject-specific pilot testing and transparent algorithmic disclosure are essential before wide-scale adoption. Policymakers and school administrators are urged to treat AI grading solutions with caution, ensuring that any implementation is grounded in rigorous pedagogical evaluation rather than hype.