MedChat: A Multi-Agent Framework for Multimodal Diagnosis with Large Language Models


The integration of deep learning-based glaucoma detection with large language models (LLMs) presents an automated strategy to mitigate ophthalmologist shortages and improve clinical reporting efficiency. However, applying general LLMs to medical imaging remains challenging due to hallucinations, limited interpretability, and insufficient domain-specific medical knowledge, which can potentially reduce clinical accuracy. Although recent approaches combining imaging models with LLM reasoning have improved reporting, they typically rely on a single generalist agent, restricting their capacity to emulate the diverse and complex reasoning found in multidisciplinary medical teams. To address these limitations, we propose MedChat, a multi-agent diagnostic framework and platform that combines specialized vision models with multiple role-specific LLM agents, all coordinated by a director agent. This design enhances reliability, reduces hallucination risk, and enables interactive diagnostic reporting through an interface tailored for clinical review and educational use. Code available at https://github.com/Purdue-M2/MedChat.


💡 Research Summary

MedChat introduces a multi‑agent framework that couples deep‑learning based glaucoma detection with large language models (LLMs) to produce clinically reliable diagnostic reports. The system first processes retinal fundus images using two vision modules: a Swin‑V2 classifier that outputs a probability of glaucoma, and a SegFormer segmenter that delineates the optic disc and optic cup. The probability is discretized into four verbal grades (no glaucoma, possible glaucoma, likely glaucoma, glaucoma detected) and the cup‑to‑disc ratio (CDR) is computed from the segmentation masks and expressed in natural language. These quantitative outputs, together with any optional clinician notes, are concatenated into a “core prompt” that serves as the factual basis for all downstream language agents.
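The vision-to-prompt step above can be sketched as follows. This is a minimal illustration, not the authors' code: the probability thresholds for the four verbal grades are assumptions (the summary does not specify the cut-offs), and the CDR is computed here as a vertical cup-to-disc ratio from binary masks, which is one common convention.

```python
def grade_probability(p: float) -> str:
    """Map the classifier's glaucoma probability to one of the four
    verbal grades used in the core prompt. Thresholds are illustrative
    assumptions; the paper does not state the exact cut-offs."""
    if p < 0.25:
        return "no glaucoma"
    if p < 0.50:
        return "possible glaucoma"
    if p < 0.75:
        return "likely glaucoma"
    return "glaucoma detected"

def vertical_cdr(cup_mask, disc_mask) -> float:
    """Vertical cup-to-disc ratio from binary segmentation masks
    (lists of rows of 0/1): cup height divided by disc height."""
    def height(mask):
        rows = [i for i, row in enumerate(mask) if any(row)]
        return (max(rows) - min(rows) + 1) if rows else 0
    disc_h = height(disc_mask)
    return height(cup_mask) / disc_h if disc_h else 0.0

def build_core_prompt(p, cup_mask, disc_mask, notes: str = "") -> str:
    """Concatenate the quantitative findings (and optional clinician
    notes) into the natural-language "core prompt" that grounds all
    downstream agents."""
    parts = [
        f"Classifier assessment: {grade_probability(p)} (probability {p:.2f}).",
        f"Vertical cup-to-disc ratio: {vertical_cdr(cup_mask, disc_mask):.2f}.",
    ]
    if notes:
        parts.append(f"Clinician notes: {notes}")
    return " ".join(parts)
```

Expressing the visual evidence in this discretized, verbal form is what lets the downstream LLM agents reason over it without ever seeing the image.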

A GPT‑4.1 model is then prompted to generate a set of clinically relevant roles for the case (e.g., ophthalmologist, optometrist, pharmacist, glaucoma specialist). For each role, a dedicated GPT‑4.1 instance receives the core prompt plus a role‑specific instruction such as “As a {role}, please analyze this case from your domain expertise. Only include observations and recommendations relevant to your specialty.” Each role‑specific agent independently produces a sub‑report that reflects its professional perspective, avoiding overlap with other agents and refraining from mentioning the underlying vision models.
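The fan-out to role-specific agents can be sketched as below. The role instruction is quoted from the summary; the `llm` callable is a hypothetical stand-in for a GPT-4.1 client (any `prompt -> text` function), so the exact API call is an assumption.

```python
# Role instruction quoted from the paper's prompt design.
ROLE_TEMPLATE = (
    "As a {role}, please analyze this case from your domain expertise. "
    "Only include observations and recommendations relevant to your specialty."
)

def run_role_agents(core_prompt: str, roles, llm) -> dict:
    """Run one independent LLM instance per clinical role.

    `llm` is any callable mapping a prompt string to generated text;
    in the paper this would be a dedicated GPT-4.1 instance per role.
    Returns a dict of role -> sub-report.
    """
    reports = {}
    for role in roles:
        prompt = f"{core_prompt}\n\n{ROLE_TEMPLATE.format(role=role)}"
        reports[role] = llm(prompt)
    return reports
```

Because each role gets the same factual core prompt but a different instruction, the sub-reports diverge in perspective while staying anchored to the same visual evidence.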

All sub‑reports are aggregated and fed to a second GPT‑4.1 instance, the director agent. The director’s prompt asks it to synthesize the information, identify consensus, resolve minor contradictions, and generate a unified diagnostic report in a formal, neutral tone suitable for a medical record. The director does not cite the sources of the sub‑reports, ensuring the final output reads as a single authoritative document.
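The aggregation step might look like the sketch below. The director instruction here paraphrases the behavior described above (synthesize, find consensus, resolve minor contradictions, formal neutral tone, no source citations); the exact wording and the `llm` callable are assumptions.

```python
# Paraphrase of the director agent's instruction as described in the
# summary; the exact prompt wording is an assumption.
DIRECTOR_INSTRUCTION = (
    "Synthesize the sub-reports below into a single unified diagnostic "
    "report. Identify consensus, resolve minor contradictions, and use "
    "a formal, neutral tone suitable for a medical record. Do not cite "
    "the individual sub-reports as sources."
)

def run_director(sub_reports: dict, llm) -> str:
    """Feed all role sub-reports to one LLM call (the director agent)
    and return the unified report. `llm` is a prompt -> text stub."""
    body = "\n\n".join(f"[{role}]\n{text}" for role, text in sub_reports.items())
    return llm(f"{DIRECTOR_INSTRUCTION}\n\n{body}")
```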

Key design insights include: (1) anchoring LLM reasoning to explicit, verifiable visual evidence (probability and CDR) to suppress hallucinations; (2) distributing reasoning across role‑specific agents to capture the breadth of multidisciplinary clinical judgment; (3) providing transparency by preserving each sub‑report, allowing clinicians to inspect individual viewpoints or ask follow‑up questions via an interactive chat interface. The modular architecture permits swapping vision models (e.g., OCT‑based 3D CNNs), fine‑tuning or replacing LLMs with domain‑adapted variants, and adding new roles without redesigning the whole pipeline.

The authors built a complete web platform: a Python backend orchestrates image analysis, prompt construction, LLM calls, and report synthesis, while a React frontend lets users upload fundus images and optional clinical notes, trigger report generation, download the final report as a PDF, and engage in a Q&A dialogue with the system. Evaluation on the publicly available Harvard‑FAIRVision fundus dataset shows that MedChat's multi‑agent approach improves diagnostic accuracy and report consistency and reduces hallucination frequency compared with single‑agent baselines such as ChatCAD.

In summary, MedChat demonstrates that a structured multi‑agent collaboration can overcome the reliability and interpretability limitations of single‑LLM diagnostic pipelines, delivering more trustworthy, comprehensive, and transparent AI‑assisted reports. The framework is readily extensible to other imaging modalities and diseases, and future work will explore fine‑tuning role agents on medical corpora, real‑time integration into clinical workflows, and broader cross‑domain validation.

