Enhancing Pancreatic Cancer Staging with Large Language Models: The Role of Retrieval-Augmented Generation

Notice: This research summary and analysis were automatically generated using AI technology. For complete accuracy, please refer to the original arXiv source.

Purpose: Retrieval-augmented generation (RAG) is a technique that enhances the functionality and reliability of large language models (LLMs) by retrieving relevant information from reliable external knowledge (REK). RAG has gained interest in radiology, and we previously reported the utility of NotebookLM, an LLM with RAG (RAG-LLM), for lung cancer staging. However, because the comparator LLM differed from NotebookLM's internal model, it remained unclear whether NotebookLM's advantage stemmed from RAG or from inherent model differences. To better isolate RAG's impact and assess its utility across different cancers, we compared NotebookLM with its internal LLM, Gemini 2.0 Flash, in a pancreatic cancer staging experiment.

Materials and Methods: A summary of Japan's pancreatic cancer staging guidelines was used as the REK. We compared three groups - REK+/RAG+ (NotebookLM with REK), REK+/RAG- (Gemini 2.0 Flash with REK), and REK-/RAG- (Gemini 2.0 Flash without REK) - in staging 100 fictional pancreatic cancer cases based on CT findings. Staging criteria included TNM classification, local invasion factors, and resectability classification. In REK+/RAG+, retrieval accuracy was quantified based on the sufficiency of the retrieved REK excerpts.

Results: REK+/RAG+ achieved a staging accuracy of 70%, outperforming REK+/RAG- (38%) and REK-/RAG- (35%). For TNM classification, REK+/RAG+ attained 80% accuracy, exceeding REK+/RAG- (55%) and REK-/RAG- (50%). Additionally, REK+/RAG+ explicitly presented the retrieved REK excerpts, achieving a retrieval accuracy of 92%.

Conclusion: NotebookLM, a RAG-LLM, outperformed its internal LLM, Gemini 2.0 Flash, in a pancreatic cancer staging experiment, suggesting that RAG may improve an LLM's staging accuracy. Furthermore, its ability to retrieve and present REK excerpts provides transparency for physicians, highlighting its applicability for clinical diagnosis and classification.


💡 Research Summary

This study investigates the impact of Retrieval‑Augmented Generation (RAG) on the performance of large language models (LLMs) in the specific clinical task of pancreatic cancer staging based on computed tomography (CT) findings. The authors previously demonstrated that NotebookLM, a Google‑developed RAG‑LLM, outperformed other models in lung cancer staging, but the comparison involved different underlying LLMs, leaving the contribution of RAG itself ambiguous. To isolate the effect of RAG, the current work directly compares NotebookLM (which internally uses the Gemini 2.0 Flash model) with the same Gemini 2.0 Flash model run without RAG.

A reliable external knowledge (REK) source was constructed by summarizing the latest Japanese pancreatic cancer staging guidelines (the eighth edition of the Japan Pancreas Society classification) into a 4,376-word document. Two radiologists authored narrative CT reports for 100 fictional pancreatic cancer patients, assigning ground-truth values for the T, N, and M categories, the local invasion factors (CH, DU, S, RP, PV, A, PL, OO), and the resectability classification (R, BR, UR). These cases were reviewed by additional radiologists and a gastroenterologist to ensure consistency.

Three experimental conditions were defined: (1) REK+/RAG+ – NotebookLM with the REK uploaded to its web interface, enabling automatic retrieval; (2) REK+/RAG- – Gemini 2.0 Flash with the same REK manually inserted into the prompt, but without retrieval; (3) REK-/RAG- – Gemini 2.0 Flash without any REK, relying solely on its internal knowledge. All three conditions received the same five sequential prompts (diagnose the local invasion factors, then determine the T, N, and M categories and the resectability classification) together with each case's CT report.

Performance was measured at two levels. The primary outcome was overall staging accuracy, defined as correct classification of every component (TNM, local invasion, resectability) for a given case. Secondary outcomes included accuracy for each individual component and, for the REK+/RAG+ group, retrieval accuracy: the proportion of cases in which the excerpts retrieved from the REK contained sufficient information to support the correct staging decision.
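The strict all-components-correct metric can be made concrete with a short sketch. The data structure and field names below are assumptions for illustration; the paper describes the metric only in prose. A case counts toward overall accuracy only if the T, N, and M categories, the full set of invasion factors, and the resectability class all match the ground truth simultaneously, which is why overall accuracy is necessarily no higher than any single component's accuracy.

```python
# Minimal sketch of the scoring scheme; field names are hypothetical.
from dataclasses import dataclass

@dataclass(frozen=True)
class Staging:
    t: str                  # e.g. "T3"
    n: str                  # e.g. "N1"
    m: str                  # e.g. "M0"
    invasion: frozenset     # e.g. frozenset({"S", "RP", "PV"})
    resectability: str      # "R", "BR", or "UR"

def overall_accuracy(predictions, ground_truth) -> float:
    """Fraction of cases where every staging component matches."""
    correct = sum(p == g for p, g in zip(predictions, ground_truth))
    return correct / len(ground_truth)

def component_accuracy(predictions, ground_truth, field: str) -> float:
    """Fraction of cases where one named component matches."""
    correct = sum(
        getattr(p, field) == getattr(g, field)
        for p, g in zip(predictions, ground_truth)
    )
    return correct / len(ground_truth)
```

Retrieval accuracy in the REK+/RAG+ group would be computed analogously, as the fraction of cases whose retrieved excerpts were judged sufficient, but that judgment was made by human raters rather than by an exact match.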

Results showed a clear hierarchy. NotebookLM with RAG achieved 70% overall staging accuracy, substantially higher than Gemini 2.0 Flash with REK (38%) and Gemini 2.0 Flash without REK (35%). For the TNM classification alone, NotebookLM reached 80% accuracy, outperforming the other two groups (55% and 50%, respectively). The advantage was most pronounced for the T and N factors; resectability classification showed a smaller gap. Retrieval accuracy for NotebookLM was 92%, indicating that in the vast majority of cases the model correctly identified and displayed the relevant guideline excerpts. Error analysis revealed a few failure modes: (i) occasional mis-retrieval of irrelevant or incomplete passages (e.g., missing "Resectable: R" wording), leading to incorrect resectability decisions; (ii) misinterpretation of anatomical terms (confusing the splenic vein with the portal vein) despite correct retrieval, resulting in a wrong T stage.

The discussion emphasizes that RAG provides two synergistic benefits. First, it supplies up-to-date, authoritative medical knowledge that the base LLM may not have internalized, thereby reducing hallucinations and improving multi-criterion decisions such as cancer staging. Second, by presenting the retrieved excerpts alongside its answers, the model offers traceability, allowing clinicians to verify the reasoning and increasing trust. The authors note that simply inserting the REK into the prompt (the REK+/RAG- condition) yields modest gains over no REK, but falls far short of the automated retrieval pipeline.

Limitations include reliance on synthetic CT narratives rather than real imaging data, which may not capture the full variability and ambiguity of clinical reports. The REK itself omitted some later guideline sections (e.g., detailed resectability criteria), which constrained retrieval completeness. Moreover, the study evaluated only one cancer type and a single language (Japanese), limiting generalizability.

In conclusion, the experiment demonstrates that RAG can markedly improve LLM performance in a complex, guideline-driven medical task, delivering higher accuracy and greater transparency. Future work should test RAG-LLMs on authentic patient cohorts, expand to multimodal inputs (e.g., image embeddings), and explore the integration of multiple guideline sources across languages and regions to build robust, clinically ready AI assistants.

