Root Cause Analysis of Radiation Oncology Incidents Using Large Language Models


Purpose: To evaluate the reasoning capabilities of large language models (LLMs) in performing root cause analysis (RCA) of radiation oncology incidents using narrative reports from the Radiation Oncology Incident Learning System (RO-ILS), and to assess their potential utility in supporting patient safety efforts.

Methods and Materials: Four LLMs (Gemini 2.5 Pro, GPT-4o, o3, and Grok 3) were prompted with the ‘Background and Incident Overview’ sections of 19 public RO-ILS cases. Using a standardized prompt based on AAPM RCA guidelines, each model was instructed to identify root causes, lessons learned, and suggested actions. Outputs were assessed using semantic similarity metrics (cosine similarity via Sentence Transformers), semi-subjective evaluations (precision, recall, F1-score, accuracy, hallucination rate, and four performance criteria: relevance, comprehensiveness, justification, and solution quality), and subjective expert ratings (reasoning quality and overall performance) from five board-certified medical physicists.

Results: The LLMs showed promising performance. GPT-4o had the highest cosine similarity (0.831), while Gemini 2.5 Pro had the highest recall (0.799) and accuracy (0.918). Hallucination rates ranged from 11% to 61%. Gemini 2.5 Pro outperformed the other models across the performance criteria and received the highest expert rating (4.8/5). Statistically significant differences in accuracy, hallucination rate, and subjective scores were observed (p < 0.05).

Conclusion: LLMs show emerging promise as tools for RCA in radiation oncology. They can generate relevant, accurate analyses aligned with expert judgment and may support incident analysis and quality improvement efforts to enhance patient safety in clinical practice.


💡 Research Summary

This paper investigates the capability of contemporary large language models (LLMs) to perform root cause analysis (RCA) on radiation oncology incidents, using narrative reports from the Radiation Oncology Incident Learning System (RO-ILS). Nineteen publicly available RO-ILS cases were selected to represent a range of incident types and complexities. For each case, only the "Background and Incident Overview" section (an unstructured narrative) was supplied to four state-of-the-art LLMs: Gemini 2.5 Pro (Google), GPT-4o (OpenAI), o3 (OpenAI), and Grok 3 (xAI). The models were prompted with a standardized instruction derived from the AAPM RCA guidelines, asking them to produce a chronological sequence of events, a cause-and-effect diagram, explicit causal statements, and concise bullet-point lists of lessons learned and suggested actions. A sketch of this prompting step is shown below.
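
As a rough illustration, the standardized prompting step might look like the following sketch, shown here with the OpenAI Python client and GPT-4o. The prompt wording and the analyze_incident helper are assumptions modeled on the paper's description, not the authors' exact instrument.

```python
# Illustrative sketch only: the prompt text is an assumption based on the
# paper's description of the AAPM-guided RCA instructions.
from openai import OpenAI

RCA_PROMPT = """You are assisting with a root cause analysis (RCA) of a
radiation oncology incident, following AAPM RCA guidelines. From the
incident narrative below, produce: (1) a chronological sequence of events,
(2) a cause-and-effect diagram in text form, (3) explicit causal
statements, and (4) concise bullet-point lists of lessons learned and
suggested actions.

Incident narrative:
{narrative}
"""

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def analyze_incident(narrative: str) -> str:
    # Fill the standardized template with one case's narrative and send it.
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": RCA_PROMPT.format(narrative=narrative)}],
    )
    return response.choices[0].message.content
```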

Model outputs were evaluated through three complementary lenses. First, an objective semantic similarity assessment employed Sentence‑Transformers (all‑mpnet‑base‑v2) to compute cosine similarity between model‑generated text and the ground‑truth sections (“Root Causes/Contributing Factors” and “Lessons Learned/Suggestions and Actions”). GPT‑4o achieved the highest overall similarity (0.831 ± 0.051), while Gemini 2.5 Pro led in the “Root Causes” subsection (0.71 ± 0.12).
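
For concreteness, this similarity computation can be reproduced in a few lines with the sentence-transformers library; only the model name (all-mpnet-base-v2) comes from the paper, and the two example texts below are hypothetical stand-ins for model output and ground truth.

```python
# Minimal sketch of the semantic-similarity metric; the texts are invented.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-mpnet-base-v2")

llm_output = "Root cause: the plan was not re-verified after the isocenter shift."
ground_truth = "Contributing factor: no secondary check was performed after replanning."

# Embed both texts and score them with cosine similarity.
embeddings = model.encode([llm_output, ground_truth], convert_to_tensor=True)
score = util.cos_sim(embeddings[0], embeddings[1]).item()
print(f"cosine similarity: {score:.3f}")
```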

Second, a semi-subjective evaluation was conducted by five board-certified medical physicists with extensive incident-analysis experience. Each case was reviewed by two physicists independently, and discrepancies were reconciled. The reviewers scored precision, recall, F1-score, and accuracy for root-cause identification, as well as a binary hallucination flag indicating fabricated or unsupported information. Gemini 2.5 Pro recorded the best performance across all four quantitative metrics (precision 0.705 ± 0.297, recall 0.799 ± 0.256, F1 0.727 ± 0.268, accuracy 0.918 ± 0.19). Hallucination rates varied markedly: Gemini 2.5 Pro 11%, Grok 3 29%, GPT-4o 32%, and o3 61%.
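
To make the scoring concrete, the sketch below shows how per-case counts could be turned into precision, recall, and F1; the counts are hypothetical, and the paper's exact accuracy and hallucination definitions are not reproduced here.

```python
# Hedged sketch: reviewers tally true positives (root causes the LLM
# correctly identified), false positives (spurious causes), and false
# negatives (missed causes); the standard metric definitions then apply.
def rca_scores(tp: int, fp: int, fn: int) -> dict:
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Hypothetical case: 3 correct causes, 1 spurious, 1 missed.
print(rca_scores(tp=3, fp=1, fn=1))  # {'precision': 0.75, 'recall': 0.75, 'f1': 0.75}
```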

Third, output quality was rated on four 5-point Likert scales (relevance, comprehensiveness, quality of justification, and quality of solutions), plus two overall subjective scores (reasoning capability and overall performance). Gemini 2.5 Pro consistently received the highest averages (≈4.6/5) and the top overall performance rating of 4.8/5 from the expert panel.

Statistical analysis using Friedman’s test revealed significant differences among models for accuracy (p = 0.002), hallucination rate (p < 0.001), and the two subjective scores (p < 0.05). Differences were also significant for the separate text‑similarity scores of the “Root Causes” (p = 0.010) and “Lessons Learned/Suggestions” (p = 0.001) sections, though not for the combined text (p = 0.126).
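
As a sketch, this kind of comparison can be run with SciPy's Friedman test on paired per-case scores; the short arrays below are hypothetical stand-ins for the 19-case data.

```python
# Friedman test across the four models; each list holds one (invented)
# accuracy score per case, so the samples are paired by case.
from scipy.stats import friedmanchisquare

gemini = [0.9, 1.0, 0.8, 1.0, 0.9]
gpt4o  = [0.8, 0.9, 0.7, 0.9, 0.8]
o3     = [0.7, 0.8, 0.6, 0.8, 0.7]
grok3  = [0.8, 0.8, 0.7, 0.9, 0.7]

stat, p = friedmanchisquare(gemini, gpt4o, o3, grok3)
print(f"Friedman chi-square = {stat:.2f}, p = {p:.4f}")
```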

The authors conclude that LLMs, particularly Gemini 2.5 Pro, can generate RCA outputs that align closely with expert judgments, suggesting a promising role as assistive tools in radiation oncology safety programs. However, the variability in hallucination rates underscores the necessity for rigorous verification and possibly human‑in‑the‑loop oversight before clinical deployment. The paper recommends future work to expand the case corpus, explore prompt engineering (e.g., chain‑of‑thought or tree‑of‑thought techniques), test model ensembles, and develop integrated human‑AI workflows to mitigate hallucinations and enhance reliability. Ultimately, while LLMs show emerging utility for incident analysis and quality‑improvement initiatives, careful validation and domain‑specific safeguards remain essential for safe adoption in high‑stakes radiation therapy environments.

