Can We Improve Educational Diagram Generation with In-Context Examples? Not if a Hallucination Spoils the Bunch
Generative artificial intelligence (AI) has found widespread use in computing education; at the same time, the quality of generated materials raises concerns among educators and students. This study addresses the issue by introducing a novel method for diagram code generation with in-context examples based on Rhetorical Structure Theory (RST), which aims to improve diagram generation by aligning model output with user expectations. Our approach is evaluated by computer science educators, who assessed 150 diagrams generated with large language models (LLMs) for logical organization, connectivity, layout aesthetics, and AI hallucination. The assessment dataset is additionally investigated for its utility in automated diagram evaluation. Preliminary results suggest that our method decreases the rate of factual hallucination and improves diagram faithfulness to the provided context; however, owing to the stochasticity of LLMs, the quality of the generated diagrams varies. Additionally, we present an in-depth analysis and discussion of the connection between AI hallucination and the quality of generated diagrams, which reveals that more complex text contexts lead to higher rates of hallucination and that LLMs often fail to detect mistakes in their own output.
💡 Research Summary
This paper tackles the growing reliance on generative AI for producing instructional diagrams in computing education, a domain where visual representations are essential yet time‑consuming to create. While large language models (LLMs) can generate diagram code (e.g., Graphviz DOT), their outputs often suffer from verbosity, poor layout, and especially hallucinations—outputs that are factually incorrect or misaligned with the supplied context. Reported hallucination rates for similar tasks range dramatically from 3 % to 86 %, even for top‑performing models such as GPT‑4.
To mitigate these problems, the authors introduce a novel in‑context learning (ICL) strategy that leverages Rhetorical Structure Theory (RST). RST provides a hierarchical discourse representation by breaking a text into elementary discourse units (EDUs) and linking them with coherence relations (mononuclear or multinuclear). The authors hypothesize that by selecting in‑context examples whose RST structures closely match the input text, an LLM will be guided to produce diagrams that are more faithful to the source material and less prone to hallucination.
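To make the EDU/relation vocabulary concrete, here is a toy RST fragment for a single sentence, sketched as plain Python data; the segmentation and relation labels are illustrative assumptions, not drawn from the paper's dataset.

```python
# Toy RST analysis of one sentence, illustrating EDUs and the two
# relation kinds. Segmentation and labels are illustrative only.
rst_fragment = {
    "relation": "Elaboration",  # mononuclear: one nucleus, one satellite
    "nucleus": "A stack is a last-in, first-out data structure,",
    "satellite": "which supports push and pop operations.",
}

# A multinuclear relation (e.g. Joint) links EDUs of equal importance:
multinuclear = {
    "relation": "Joint",
    "nuclei": [
        "Push adds an element to the top.",
        "Pop removes the element at the top.",
    ],
}
```

Example selection then amounts to comparing such trees between the input text and the stored demonstrations.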
Two RST‑guided pipelines are implemented:
- RST1 supplies the LLM with pairs of raw source text and hand‑crafted Graphviz code as demonstrations.
- RST2 supplies the LLM with the RST analysis of the source text (the EDU‑relation tree) together with the corresponding code.
Both pipelines share a seven‑step generation workflow: (1) RST analysis of the input; (2) similarity search against a pre‑built example dictionary; (3) construction of the most relevant demonstration; (4) ICL‑driven code generation; (5) automatic repair of rendering errors; (6) layout‑improvement prompting; and (7) a final repair loop to guarantee a renderable image. A baseline zero‑shot (0‑shot) condition skips steps 1‑3 and directly prompts the model with generic diagram‑generation rules.
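The seven steps can be sketched as a small Python driver. All helper names here (`rst_parse`, `find_similar_example`, `renders_ok`) and the prompt wording are hypothetical stand-ins, not the authors' implementation; a real pipeline would run an actual RST parser and invoke Graphviz to test rendering.

```python
def rst_parse(text):
    # Stand-in for step (1): a real pipeline would run an RST parser.
    return {"edus": text.split(". "), "relations": []}

def find_similar_example(tree, example_bank):
    # Steps (2)-(3): pick the stored demonstration whose RST structure is
    # closest to the input. Naive similarity here: closeness in EDU count.
    return min(example_bank,
               key=lambda ex: abs(len(ex["rst"]["edus"]) - len(tree["edus"])))

def renders_ok(dot_code):
    # Stand-in for steps (5)/(7): real code would call Graphviz and
    # catch rendering errors instead of this syntactic check.
    return dot_code.strip().startswith("digraph")

def generate_diagram(text, example_bank, llm, max_repairs=2):
    tree = rst_parse(text)                           # (1) RST analysis
    demo = find_similar_example(tree, example_bank)  # (2)+(3) demonstration
    code = llm(f"Example:\n{demo['code']}\n\nDraw a diagram for:\n{text}")  # (4)
    if not renders_ok(code):                         # (5) first repair pass
        code = llm(f"Fix this DOT code so it renders:\n{code}")
    code = llm(f"Improve the layout of this DOT code:\n{code}")  # (6) layout
    for _ in range(max_repairs):                     # (7) final repair loop
        if renders_ok(code):
            break
        code = llm(f"Fix this DOT code so it renders:\n{code}")
    return code
```

The 0-shot baseline would simply call `llm` once with generic rules, skipping `rst_parse` and `find_similar_example`.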
The experimental setup uses 25 educational texts. For each text, diagrams are generated with three methods (RST1, RST2, 0‑shot) and two LLMs (GPT‑4o and GPT‑3.5), yielding 150 diagrams. Four experts in computing education and computational linguistics evaluate each diagram using a five‑point rubric covering:
- C1 – Logical Organization (flow, clarity, adherence to flow‑chart conventions),
- C2 – Connectivity (uniformity, presence of orphan nodes),
- C3 – Layout Aesthetic (crossings, obscured elements, color/size readability, symmetry, alignment, width, homogeneity).
In addition, hallucinations are classified following Huang et al. (2025) into factual hallucination (H_fact) and three faithfulness sub‑types: adherence to layout instructions (H_ae), logical inconsistencies (H_log), and mismatches with the supplied context (H_c). Each diagram is scored by two independent raters; inter‑rater reliability is measured with Krippendorff’s α and Kendall’s W.
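Of the two agreement statistics, Kendall's W is simple enough to sketch from its textbook definition for m raters ranking the same n items (ignoring tie correction); this is a generic illustration, not the paper's evaluation code:

```python
def kendalls_w(ratings):
    """Kendall's W = 12*S / (m^2 * (n^3 - n)) for m rank vectors over n items.

    ratings: list of m lists, each a rank vector over the same n items.
    No tie correction is applied; 1.0 = perfect agreement, 0.0 = none.
    """
    m = len(ratings)
    n = len(ratings[0])
    # Sum of ranks each item received across raters.
    rank_sums = [sum(r[i] for r in ratings) for i in range(n)]
    mean = sum(rank_sums) / n
    s = sum((x - mean) ** 2 for x in rank_sums)  # squared deviations
    return 12 * s / (m ** 2 * (n ** 3 - n))
```

With two raters who agree perfectly the statistic is 1.0; with exactly opposite rankings it drops to 0.0.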
Results show that both RST‑guided pipelines outperform zero‑shot generation. Average scores improve by 0.42 (C1), 0.35 (C2), and 0.48 (C3) points. Factual hallucination drops from 28 % in the zero‑shot condition to 22 % (RST1) and 20 % (RST2). However, overall hallucination (including faithfulness types) remains around 30 %, indicating that while RST helps, it does not eliminate the problem. A key finding is that text complexity—measured by the number and depth of RST relations—correlates positively with hallucination probability; more complex discourse structures increase the chance of both factual and faithfulness errors.
The multi‑step pipeline also reveals error propagation: an early rendering failure that requires repair can introduce subtle layout choices that later steps inherit, sometimes leading to new hallucinations.
To answer whether automated evaluation can replace human judgment, three automatic methods are tested: (E1) implicit learning with nine ICL examples, (E2) explicit instruction plus ICL examples, and (E3) instruction‑based reflection. The explicit‑instruction model (E2) achieves the highest correlation with human scores (r ≈ 0.71) but only 68 % accuracy in detecting hallucinations, underscoring the difficulty of fully automated quality control.
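The E2 setup (explicit rubric instructions plus scored in-context examples) can be sketched as a prompt builder; the field names and wording below are assumptions for illustration, not the authors' actual prompts:

```python
def build_e2_prompt(rubric, scored_examples, diagram_code):
    # E2-style judge prompt: explicit rubric first, then scored
    # demonstrations, then the diagram to grade. Structure is assumed.
    parts = [
        "You are grading a Graphviz diagram on a 1-5 scale.",
        "Rubric:\n" + rubric,
    ]
    for ex in scored_examples:
        parts.append(f"Diagram:\n{ex['code']}\nScore: {ex['score']}")
    parts.append(f"Diagram:\n{diagram_code}\nScore:")
    return "\n\n".join(parts)
```

E1 would drop the rubric and rely on the examples alone, while E3 would instead ask the model to reflect on its own grading instructions.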
The authors conclude that RST‑guided example selection is a promising avenue for improving LLM‑generated educational diagrams, as it reduces factual hallucination and boosts structural quality. Nonetheless, stochastic model behavior, sensitivity to input complexity, and error propagation across pipeline stages remain significant challenges. Future work is suggested in three directions: (1) improving the accuracy and efficiency of automatic RST parsing, (2) developing meta‑learning strategies for dynamic example selection, and (3) building robust post‑generation verification mechanisms (e.g., multimodal consistency checks) to catch hallucinations before diagrams reach learners. Extending the approach beyond Graphviz to UML, sequence diagrams, and other visual notations is also highlighted as a natural next step.