Evaluating Large Language Models on Solved and Unsolved Problems in Graph Theory: Implications for Computing Education

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

Large Language Models are increasingly used by students to explore advanced material in computer science, including graph theory. As these tools become integrated into undergraduate and graduate coursework, it is important to understand how reliably they support mathematically rigorous thinking. This study examines the performance of an LLM on two related graph-theoretic problems: a solved problem concerning the gracefulness of line graphs and an open problem for which no solution is currently known. We use an eight-stage evaluation protocol that reflects authentic mathematical inquiry, including interpretation, exploration, strategy formation, and proof construction. The model performed strongly on the solved problem, producing correct definitions, identifying relevant structures, recalling appropriate results without hallucination, and constructing a valid proof confirmed by a graph theory expert. For the open problem, the model generated coherent interpretations and plausible exploratory strategies but did not advance toward a solution. It did not fabricate results and instead acknowledged uncertainty, consistent with the explicit prompting instructions that directed it to avoid inventing theorems or unsupported claims. These findings indicate that LLMs can support exploration of established material but remain limited in tasks requiring novel mathematical insight or critical structural reasoning. For computing education, this distinction highlights the importance of guiding students to use LLMs for conceptual exploration while relying on independent verification and rigorous argumentation for formal problem solving.


💡 Research Summary

The paper investigates how large language models (LLMs) perform on mathematically rigorous tasks in graph theory, focusing on both a solved problem and an open problem. The solved problem asks whether the line graph L(G) of a non‑graceful graph G must be graceful; this implication is known to be false, and counter‑examples exist in the literature. The open problem asks the converse: if L(G) is graceful, must G be graceful? No general answer is known, making it a suitable test of the model’s behavior under genuine uncertainty.
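To make the two central definitions concrete, here is a minimal Python sketch (not from the paper) of the standard constructions: a graceful labeling of a graph with m edges is an injection from its vertices into {0, …, m} whose induced edge labels |f(u) − f(v)| are exactly {1, …, m}, and the line graph L(G) has one vertex per edge of G, with two vertices adjacent iff the corresponding edges share an endpoint. The brute-force check is only practical for very small graphs.

```python
from itertools import combinations, permutations

def line_graph(edges):
    """Edge list of L(G): one vertex of L(G) per edge of G; two such
    vertices are adjacent iff the edges of G share an endpoint."""
    edges = [tuple(e) for e in edges]
    return [(e, f) for e, f in combinations(edges, 2) if set(e) & set(f)]

def is_graceful(vertices, edges):
    """Brute-force test for a graceful labeling: an injection
    f: V -> {0..m} whose edge labels |f(u)-f(v)| are exactly {1..m}."""
    m = len(edges)
    for labels in permutations(range(m + 1), len(vertices)):
        f = dict(zip(vertices, labels))
        if {abs(f[u] - f[v]) for u, v in edges} == set(range(1, m + 1)):
            return True
    return False

# Paths are graceful; the 5-cycle is a classic non-graceful graph
# (cycles C_n are graceful iff n = 0 or 3 mod 4).
print(is_graceful([0, 1, 2, 3], [(0, 1), (1, 2), (2, 3)]))            # True
print(is_graceful(range(5), [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]))  # False
```

With helpers like these, a student can empirically probe both directions of the question on small graphs, which is exactly the kind of exploratory computation the open problem invites before any proof attempt.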

To evaluate the LLM, the authors designed an eight‑stage protocol that mirrors the typical workflow of students using AI for mathematical inquiry: (1) problem understanding and restatement, (2) brainstorming possible approaches, (3) identifying related sub‑areas, (4) recalling relevant theorems and results, (5) forming high‑level strategies, (6) attempting a formal proof, (7) self‑evaluation of the proof, and (8) revision and final presentation. The same prompt templates were used for both problems, with explicit instructions to avoid fabricating theorems or claims and to flag uncertainty when appropriate.
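The staged protocol above lends itself to a simple prompt pipeline. The sketch below is illustrative only: the stage names follow the paper, but the instruction wording, the guardrail text, and the function name are hypothetical, not the authors' actual templates.

```python
# Hypothetical reconstruction of the eight-stage prompting protocol.
# Stage names mirror the paper; instruction text is illustrative.
STAGES = [
    ("restate",    "Restate the problem in your own words and define every term."),
    ("brainstorm", "Brainstorm possible approaches without committing to one."),
    ("subareas",   "Identify related sub-areas of graph theory."),
    ("recall",     "Recall relevant theorems and known results."),
    ("strategy",   "Form one or more high-level proof strategies."),
    ("proof",      "Attempt a formal proof based on the chosen strategy."),
    ("critique",   "Self-evaluate the proof: identify gaps or errors."),
    ("revise",     "Revise the argument and present the final version."),
]

# Anti-fabrication guardrail applied to every stage, per the paper's setup.
GUARDRAIL = ("Do not invent theorems or cite unverified results; "
             "flag any uncertainty explicitly.")

def build_prompt(problem, stage_instruction):
    """Assemble one stage's prompt from the shared template."""
    return f"{GUARDRAIL}\n\nProblem: {problem}\n\nTask: {stage_instruction}"
```

Using one template across both problems, as the authors did, keeps the comparison fair: any behavioral difference between the solved and open problem is then attributable to the problem, not the prompt.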

Experiments were conducted with ChatGPT‑5.1 via its web interface, providing the model with two background papers on line graphs and graceful labeling. All interactions were saved verbatim, and no corrective feedback was given during the process. A graph‑theory expert, unfamiliar with AI specifics, qualitatively assessed each output against criteria tailored to each stage (definition accuracy, relevance of identified concepts, correctness of recalled theorems, logical validity of proof steps, and the quality of self‑critique).

Results show a clear dichotomy. For the solved problem, the model accurately restated the question, listed all necessary definitions (line graph, graceful labeling, non‑graceful graph), recalled pertinent results without hallucination, and produced a step‑by‑step proof that matched known counter‑examples. The expert confirmed the proof’s correctness and praised the model’s disciplined adherence to the “do not invent” instruction. For the open problem, the model again demonstrated solid comprehension and generated plausible exploratory strategies (e.g., investigating structural invariants, searching for families of graphs with identical line graphs, considering known partial results). However, it did not produce any new theorem, counter‑example, or substantive advance toward a solution. Importantly, the model explicitly expressed uncertainty and refrained from conjecturing unsupported statements, aligning with the prompting constraints.

The authors interpret these findings as evidence that current LLMs excel at organizing, recalling, and applying existing mathematical knowledge but struggle with tasks that require original insight, creative abstraction, or the synthesis of novel arguments. From an educational perspective, the paper recommends using LLMs as “conceptual scaffolds” for students—helpful for clarifying definitions, surveying related literature, and brainstorming—but stresses that final proof construction, verification, and critical assessment should remain the student’s responsibility, supported by instructor oversight. The study also highlights the value of structured prompting and self‑evaluation stages to mitigate hallucination and to encourage the model to flag its own limitations.

In conclusion, the work contributes an empirical characterization of LLM behavior on both a well‑solved and an unsolved graph‑theoretic problem, demonstrating reliable performance on the former and bounded exploratory usefulness on the latter. It underscores the need for pedagogical designs that harness LLM strengths while guarding against over‑reliance, and it points to future research directions such as extending the protocol to other mathematical domains, refining prompts to elicit deeper reasoning, and developing automated tools for detecting and correcting model‑generated logical errors.

