Evaluating Austrian A-Level German Essays with Large Language Models for Automated Essay Scoring

Notice: This research summary and analysis were generated automatically using AI technology. For accuracy, please refer to the original arXiv source.

Automated Essay Scoring (AES) has been explored for decades with the goal of supporting teachers by reducing grading workload and mitigating subjective biases. While early systems relied on handcrafted features and statistical models, recent advances in Large Language Models (LLMs) have made it possible to evaluate student writing with unprecedented flexibility. This paper investigates the application of state-of-the-art open-weight LLMs for the grading of Austrian A-level German texts, with a particular focus on rubric-based evaluation. A dataset of 101 anonymised student exams across three text types was processed and evaluated. Four LLMs (DeepSeek-R1 32B, Qwen3 30B, Mixtral 8x7B, and Llama 3.3 70B) were evaluated with different contexts and prompting strategies. The LLMs reached a maximum of 40.6% agreement with the human rater on the rubric-provided sub-dimensions, and only 32.8% of final grades matched those given by a human expert. The results indicate that even though smaller models are able to use standardised rubrics for German essay grading, they are not accurate enough to be used in a real-world grading environment.


💡 Research Summary

This paper investigates the feasibility of using state‑of‑the‑art open‑weight large language models (LLMs) for automated essay scoring (AES) of Austrian A‑Level German examinations. The authors assembled a dataset of 101 anonymised student essays drawn from the 2023 and 2024 exam cycles, covering three text types (Commentary, Letter to the Editor, Literary Interpretation). Each essay is evaluated according to a nationally standardised rubric that splits grading into four sections (Content, Structure, Style & Expression, Language Conventions) and ultimately produces a final grade on a five‑point scale (1 = best, 5 = worst).

Four LLMs were selected based on size and German language capability: DeepSeek‑R1 (32 B parameters), Qwen3 (30 B), Mixtral (8 × 7 B) and Llama 3.3 (70 B). Llama 3.3 received the most thorough analysis because the other models exhibited pronounced flaws. The experimental design explores three major axes: (1) prompting strategy (baseline system prompt vs. few‑shot prompting), (2) context provision via Retrieval‑Augmented Generation (RAG), and (3) the optional use of Chain‑of‑Thought (CoT) reasoning.
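The three experimental axes combine multiplicatively; a minimal sketch of the resulting configuration grid (option names here are illustrative stand-ins, not the authors' identifiers):

```python
from itertools import product

# Three axes from the paper's experimental design; the concrete option
# labels are assumptions for illustration.
prompting = ["baseline", "few-shot"]
rag_context = ["none", "best-average-worst", "most-similar", "range-of-examples"]
chain_of_thought = [False, True]

configs = list(product(prompting, rag_context, chain_of_thought))
print(len(configs))  # 16 configurations per model
```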

RAG contexts were implemented in three variants. “Best‑Average‑Worst” supplies a fixed set of three exemplar essays representing grades 1, 3 and 5 for every task, ensuring the model sees the full grading spectrum. “Most‑similar‑matches” builds a vector store using a German RoBERTa sentence transformer and retrieves the most semantically similar past essays for each candidate. “Range‑of‑examples” selects five essays covering all grades (1‑5) to provide a balanced distribution.
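The "Most-similar-matches" variant amounts to nearest-neighbour retrieval over essay embeddings. A minimal pure-Python sketch, assuming the embedding vectors have already been produced by a sentence transformer (toy 2-D vectors stand in for real model outputs; the `essay_id`/`vec` field names are illustrative):

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def most_similar(query_vec, store, k=3):
    """Return the k essay ids whose embeddings are closest to the query."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item["vec"]), reverse=True)
    return [item["essay_id"] for item in ranked[:k]]

# Toy embeddings standing in for German RoBERTa sentence-transformer outputs.
store = [
    {"essay_id": "e1", "vec": [1.0, 0.0]},
    {"essay_id": "e2", "vec": [0.9, 0.1]},
    {"essay_id": "e3", "vec": [0.0, 1.0]},
]
print(most_similar([1.0, 0.05], store, k=2))  # ['e1', 'e2']
```

The retrieved essays (with their known grades) would then be placed into the prompt as context before the candidate essay is graded.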

Few‑shot prompting was realised as a turn‑based interaction: the model first predicts a grade, receives the correct grade as feedback, and then proceeds to the next essay. Various ordering schemes of the exemplar grades were tested (e.g., good→average→bad, inverted, mixed). The most stable configuration turned out to be the “Best‑Average‑Worst” fixed context combined with a sequential ordering from high to low grades. Adding CoT instructions did not yield measurable improvements, suggesting that the current models rely more on pattern matching than on explicit reasoning for this task.

Evaluation employed Quadratic Weighted Kappa (QWK) for each rubric sub‑dimension, alongside exact agreement with the human expert. The highest sub‑dimension agreement achieved was 40.6 %, and only 32.8 % of final grades matched the human expert's scores. Performance varied by text type: longer Literary Interpretation essays showed the greatest discrepancy, while shorter Letter to the Editor pieces attained relatively higher alignment. Model size mattered modestly; Llama 3.3 consistently outperformed the smaller counterparts but still fell far short of human reliability.
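For reference, QWK penalises disagreements quadratically by their distance on the grade scale. A minimal pure-Python implementation (the paper does not publish its evaluation code, so this is a standard textbook formulation, not the authors'):

```python
def quadratic_weighted_kappa(rater_a, rater_b, min_grade=1, max_grade=5):
    """Quadratic Weighted Kappa between two lists of integer grades.
    1.0 = perfect agreement, 0.0 = chance-level agreement."""
    n = max_grade - min_grade + 1
    total = len(rater_a)
    # Observed confusion matrix between the two raters.
    observed = [[0.0] * n for _ in range(n)]
    for a, b in zip(rater_a, rater_b):
        observed[a - min_grade][b - min_grade] += 1
    # Expected matrix from the outer product of the raters' marginals.
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(n)) for j in range(n)]
    num = den = 0.0
    for i in range(n):
        for j in range(n):
            w = ((i - j) ** 2) / ((n - 1) ** 2)   # quadratic disagreement weight
            expected = hist_a[i] * hist_b[j] / total
            num += w * observed[i][j]
            den += w * expected
    return 1.0 - num / den

print(quadratic_weighted_kappa([1, 2, 3, 4, 5], [1, 2, 3, 4, 5]))  # 1.0
```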

The authors discuss several limitations. First, OCR errors and the exclusion of handwritten essays reduced the usable sample to 101, potentially biasing results. Second, open‑source LLMs are trained on heterogeneous data and may lack the fine‑grained German educational language exposure that commercial models (e.g., GPT‑4) possess. Third, the rubric’s nuanced criteria can be interpreted in multiple ways, making it difficult for a model to infer the exact weighting without extensive domain‑specific tuning.

In conclusion, while open‑weight LLMs can process German A‑Level essays and produce rubric‑based scores, their current accuracy is insufficient for deployment as primary graders in high‑stakes examinations. Nevertheless, they may serve as useful auxiliary tools to reduce teacher workload, provided that further work is undertaken: (i) collecting larger, high‑quality German essay corpora, (ii) applying parameter‑efficient fine‑tuning (e.g., LoRA) or domain‑specific instruction tuning, (iii) refining prompt engineering and feedback loops, and (iv) exploring ensemble methods that combine multiple models’ judgments. The paper fills a gap in the literature by presenting the first systematic evaluation of open LLMs on Austrian German A‑Level AES under a strict rubric, and it outlines a clear roadmap for future improvements.

