An Integrated Framework for Endoscopic Image Classification and Large Language Model-Based Clinical Reasoning


📝 Abstract

Medical image classifiers detect gastrointestinal diseases well, but they do not explain their decisions. Large language models can generate clinical text, yet they struggle with visual reasoning and often produce unstable or incorrect explanations. This leaves a gap between what a model sees and the type of reasoning a clinician expects. We introduce a framework that links image classification with structured clinical reasoning. A new hybrid model, MobileCoAtNet, is designed for endoscopic images and achieves high accuracy across eight stomach-related classes. Its outputs are then used to drive reasoning by several LLMs. To judge this reasoning, we build two expert-verified benchmarks covering causes, symptoms, treatment, lifestyle, and follow-up care. Thirty-two LLMs are evaluated against these gold standards. Strong classification improves the quality of their explanations, but none of the models reach human-level stability. Even the best LLMs change their reasoning when prompts vary. Our study shows that combining DL with LLMs can produce useful clinical narratives, but current LLMs remain unreliable for high-stakes medical decisions. The framework provides a clearer view of their limits and a path for building safer reasoning systems. The complete source code and datasets used in this study are available at https://github.com/souravbasakshuvo/DL3M.

📄 Content

The gastrointestinal (GI) system is vital for digesting food and maintaining health through organs like the stomach and intestines. However, this system is susceptible to disruptive disorders like polyps, colon cancer, and ulcerative colitis, which can affect organs such as the stomach, intestines, liver, and pancreas (Ahamed et al., 2024a; Şener and Ergen, 2025). If left unmanaged, some of these abnormalities can ultimately develop into cancer (Wang et al., 2025). GI cancers represent a major global health challenge, accounting for approximately 4.8 million new cases and 3.2 million deaths in 2022, which together constitute nearly one third of all cancer-related mortalities worldwide (Ahamed et al., 2024b; Staff, 2020). A considerable proportion of this burden lies in Asia, with East Asia alone reporting 1,469,225 new gastrointestinal cancer cases and 837,360 deaths in the same year, corresponding to 43.1 percent and 41.7 percent of the global totals, respectively (Chong et al., 2024). Doctors use several tools to examine and treat problems in the GI tract, including colonoscopy, upper GI endoscopy, capsule endoscopy, radiographic imaging, and lab tests (Kulinna-Cosentini et al., 2024). However, diagnostic results can vary with a doctor's experience, which makes the process unreliable at times. Reviewing thousands of endoscopic images manually is also a major challenge, increasing the chance of missed diagnoses and delays in treatment (Xie et al., 2024; Fernandes et al., 2024), leading to conditions that can quickly progress to serious stages.

Endoscopic image analysis faces issues such as heavy visual clutter, inconsistent lighting, overlapping tissue patterns, and unwanted artifacts (Wang et al., 2026; Hayat et al., 2025; Lin et al., 2024). Many diseases from different categories appear very similar, while the same disease can show different patterns in different patients. These factors reduce diagnostic accuracy (Zeng et al., 2024). The quality and detail of the captured endoscopic images also play a large role in whether a condition is correctly identified (Tomazic et al., 2021; McCafferty et al., 2018).

To improve diagnosis, several classification methods have been developed. These methods can classify images with better precision, helping doctors detect and diagnose conditions more efficiently. Moreover, in underdeveloped and developing countries, the doctor-to-patient ratio is quite imbalanced (Karan et al., 2021; Nawaz et al., 2025). As a result, doctors cannot pay enough attention to low-risk medical conditions such as stomach infections (Su and Liu, 2025). This raises a serious need for an automated system that could reduce the workload of doctors (Gumilar et al., 2025; Ghorbian et al., 2025) and the average patient waiting time, while increasing reliability (He et al., 2020; Yilmaz et al., 2024). Such a system should be able to collect, process, and classify images, and then give reasoning for normal and abnormal findings, in order to serve as a reliable medical tool (El Ogri et al., 2026). This would lead to faster, more accurate treatment decisions and help lower the number of deaths from GI diseases worldwide (John et al., 2024).

Over the past few years, a significant number of studies have been conducted on the classification of stomach infections using deep learning (Naseem et al., 2025; Siddiqui et al., 2025; Shuvo and Chowdhury, 2024). Several studies have explored the usability of large language models (LLMs) in the medical domain for clinical question answering, report generation, and decision support. In some studies, LLMs have been evaluated for medical text understanding, diagnosis, and prediction from textual data. However, none of these studies combined image-based disease classification with reasoning tasks.

In our experiment, we explore the effectiveness of LLMs for deep learning model-assisted reasoning generation. This approach is necessary because LLMs are text-based models and cannot directly interpret visual data such as endoscopic images. If we send an image in a different representation (such as a NumPy array or a matrix) to a text-based LLM through a prompt, there is a large information gap and the model fails to extract important features from the image. Our solution uses DL models to first detect the disease from images, and then uses LLMs to provide appropriate reasoning and initial clinical advice on the diagnosis. We developed a benchmark dataset to evaluate and select the best-performing LLMs for medical reasoning.
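The handoff described above — a classifier produces a label, which is then serialized into a structured text prompt for an LLM — can be sketched as follows. This is a minimal illustration, not the paper's actual code: the function names, the stubbed prediction, and the prompt wording are all hypothetical, and a real implementation would load a trained model such as MobileCoAtNet in place of the stub.

```python
def classify_endoscopic_image(image_path: str) -> tuple[str, float]:
    """Stand-in for a trained image classifier (e.g., MobileCoAtNet).

    A real implementation would preprocess the image and run model
    inference; here we return a fixed (label, confidence) pair.
    """
    return "ulcerative-colitis", 0.94  # stub prediction for illustration


def build_reasoning_prompt(label: str, confidence: float) -> str:
    """Convert the classifier output into a structured LLM prompt.

    Text-only LLMs cannot read raw pixels, so the visual finding is
    passed as text along with the reasoning categories the paper's
    benchmarks cover.
    """
    return (
        f"An endoscopic image classifier predicted '{label}' "
        f"with confidence {confidence:.2f}.\n"
        "Provide structured clinical reasoning covering: likely causes, "
        "typical symptoms, first-line treatment, lifestyle advice, "
        "and follow-up care."
    )


label, conf = classify_endoscopic_image("sample.jpg")
prompt = build_reasoning_prompt(label, conf)
# `prompt` would then be sent to each LLM under evaluation.
```

The key design point is that the only channel between the two models is this text prompt, which is why classification quality directly bounds the quality of the downstream reasoning.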

Therefore, to summarize the research insights, this study focuses on the following Research Questions (RQs):

• RQ1: How does classification influence LLMs?
• RQ2: How effective are LLMs in clinical reasoning when DL is used as a classifier?
• RQ3: Can LLMs simulate human-like expertise for reasoning?

Most of the existing studies either focus on classifying medical images without explaining the results or evaluate LLMs only through text-based t

