Multimodal Large Language Models (LLMs) introduce an emerging paradigm for medical imaging by interpreting scans through the lens of extensive clinical knowledge, offering a transformative approach to disease classification. This study presents a critical comparison between two fundamentally different AI architectures, the specialized open-source agent MedGemma and the proprietary large multimodal model GPT-4, for the diagnosis of six diseases. The MedGemma-4b-it model, fine-tuned using Low-Rank Adaptation (LoRA), demonstrated superior diagnostic capability, achieving a mean test accuracy of 80.37% compared with 69.58% for the untuned GPT-4. MedGemma also exhibited notably higher sensitivity in high-stakes clinical tasks such as cancer and pneumonia detection. Quantitative analysis via confusion matrices and classification reports provides comprehensive insight into model performance across all categories. These results emphasize that domain-specific fine-tuning is essential for minimizing hallucinations in clinical deployment, positioning MedGemma as a sophisticated tool for complex, evidence-based medical reasoning.
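To make the fine-tuning setup concrete, the following is a minimal, illustrative sketch of LoRA adaptation for MedGemma-4b-it using the Hugging Face transformers and peft libraries. The model identifier, target modules, and hyperparameters shown here are assumptions chosen for illustration, not the configuration reported in this study.

# Illustrative sketch only: LoRA adaptation of a multimodal (image-text-to-text) model.
# Model ID, target modules, and hyperparameters below are assumptions, not the paper's settings.
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText
from peft import LoraConfig, get_peft_model

model_id = "google/medgemma-4b-it"  # assumed Hugging Face identifier
processor = AutoProcessor.from_pretrained(model_id)  # handles image + text preprocessing
model = AutoModelForImageTextToText.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# LoRA injects trainable low-rank update matrices into selected projection layers
# while the original pretrained weights stay frozen.
lora_config = LoraConfig(
    r=16,                     # rank of the low-rank update (assumed)
    lora_alpha=32,            # scaling factor (assumed)
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections (assumed)
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of parameters are trainable

In practice, the wrapped model would then be trained on image-plus-prompt examples from each disease dataset with a standard language-modeling loss; only the LoRA parameters are updated, which keeps fine-tuning memory and compute requirements modest.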
Chronic diseases affect the lives of approximately 589 million adults (20-79 years) worldwide [1]. Key examples underscore this severity: breast cancer accounts for 2.3 million new cases annually [2]; pneumonia causes 14% of all deaths in children under five years of age [3]; Alzheimer's disease and other dementias affect over 55 million people globally [4]; cardiovascular diseases caused 19.8 million deaths in 2022 [5]; and chronic kidney disease affects over 674 million people worldwide [6]. These figures illustrate the severity of chronic diseases and the growing need to address them.

While traditional deep learning models have demonstrated excellent disease classification accuracy, they often function as black boxes, offering high accuracy but limited transparency. These models frequently fail to provide the reasoning behind a diagnosis, making it difficult for clinicians to trust or verify their outputs without manual review. AI-powered disease classification uses convolutional neural networks (CNNs) and other deep learning techniques to analyze images for diagnostic purposes, but earlier works often neglected textual clinical data and multimodal integration, which limits their applicability to diverse datasets [7]. In contrast, modern Large Language Models (LLMs) and specialized multimodal frameworks bridge this gap by seamlessly integrating textual clinical records with medical imaging, providing a comprehensive diagnostic context. The potential of LLMs has been studied in medical question answering and diagnostic performance, yet they have shown shortcomings in handling complex medical terminology when deployed in real-world settings [8]-[10]. General-purpose LLMs like GPT-4 struggle with personalized medicine and genomic data integration [11], and their integration into clinical workflows raises concerns about transparency and misinformation risks [12]. Because general-purpose LLMs are not specifically trained on medical data, gaps also remain in conversational diagnostics, medical care, and standardized reporting [13]. This lack of specialized training makes it difficult for general-purpose models to provide the clear, evidence-based reasoning that healthcare professionals require.
This study addresses these limitations by evaluating MedGemma, a medically specialized multimodal model, against the widely used general-purpose LLM GPT-4 across datasets covering six chronic diseases: skin cancer, Alzheimer's disease, breast cancer, cardiovascular disease, pneumonia, and chronic kidney disease. Through a comprehensive evaluation on this wide variety of datasets, the study explores the potential of the medically specialized agentic AI MedGemma in clinical applications. For each dataset, both models' predictions are compared against ground-truth labels using accuracy, confusion matrices, and per-class classification reports, as sketched below.
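The snippet below is a minimal sketch of this per-dataset evaluation using scikit-learn, assuming the models' predictions and the ground-truth labels have already been collected as lists of class names. The function name, label names, and sample values are hypothetical placeholders for illustration.

# Minimal evaluation sketch: accuracy, confusion matrix, and classification report
# for one disease dataset. Labels and sample predictions are illustrative only.
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

def evaluate_predictions(y_true, y_pred, labels):
    """Compute overall accuracy, a confusion matrix, and a per-class report."""
    acc = accuracy_score(y_true, y_pred)
    cm = confusion_matrix(y_true, y_pred, labels=labels)
    report = classification_report(y_true, y_pred, labels=labels, zero_division=0)
    return acc, cm, report

# Hypothetical usage for a binary pneumonia dataset:
labels = ["normal", "pneumonia"]
y_true = ["pneumonia", "normal", "pneumonia", "normal"]
y_pred = ["pneumonia", "normal", "normal", "normal"]
acc, cm, report = evaluate_predictions(y_true, y_pred, labels)
print(f"accuracy = {acc:.2%}")
print(cm)
print(report)

The confusion matrix makes per-class errors explicit (e.g., pneumonia cases predicted as normal), which is what supports the sensitivity comparisons reported for the high-stakes tasks.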
General-purpose LLMs like GPT-4 have revolutionized natural language processing, achieving state-of-the-art performance in everyday tasks such as text generation, sentiment analysis, and question answering [14], [15]. GPT models have also been trained on medical datasets for diagnosis, classification, and reasoning [16]. GPT-4 performs remarkably well on medical licensing exams and can suggest differential diagnoses from textual symptoms. However, these models are medical generalists and can suffer from hallucinations or a lack of grounding in specific clinical guidelines [17]. GPT models are not well suited for clinical text classification, in part because they struggle to interpret medical abbreviations [18]. LLMs such as PaLM can encode clinical knowledge, but their general-purpose nature limits precision in clinical diagnostic tasks: Med-PaLM initially achieved only 67.6% on MedQA, while Med-PaLM 2 reached 86.5% through medical-specific fine-tuning, surpassing prior models [19]. In a similar study, generative AI models achieved an overall diagnostic accuracy of 52.1%, closer to a random guess than to a clinically reliable prediction [20]. This reiterates the limitations of general-purpose models.
Medically specialized models, such as BioBERT, ClinicalBERT, and MedGemma, have been developed to address these limitations. They are pretrained on biomedical corpora, including PubMed abstracts, clinical notes, and medical guidelines [20]-[22]. BioBERT is pretrained on PubMed and PMC and excels in tasks such as named entity recognition and relation extraction. ClinicalBERT is fine-tuned on MIMIC-III clinical notes and performs strongly in clinical text summarization [21]. MedGemma, a multimodal agentic AI, leverages a transformer architecture; it is pretrained on a diverse medical corpus and captures specific medical patterns effectively [22]. A related study [24] investigates the integration of Explainable Artificial Intelligence (XAI) as a vital mechanism for bridging the gap between high-performance diagnostic algorithms and clinical trust: while deep learning models demonstrate superior diagnostic potential, their black-box architecture limits interpretability and clinicians' ability to trust their outputs.