Clinical Validation of Medical-based Large Language Model Chatbots on Ophthalmic Patient Queries with LLM-based Evaluation
Domain-specific large language models are increasingly used to support patient education, triage, and clinical decision making in ophthalmology, making rigorous evaluation essential to ensure safety and accuracy. This study evaluated four small medical LLMs (Meerkat-7B, BioMistral-7B, OpenBioLLM-8B, and MedLLaMA3-v20) in answering ophthalmology-related patient queries and assessed the feasibility of LLM-based evaluation against clinician grading. In this cross-sectional study, 180 ophthalmology patient queries were answered by each model, generating 2,160 responses. Models were selected with parameter sizes under 10 billion to enable resource-efficient deployment. Responses were evaluated by three ophthalmologists of differing seniority and by GPT-4-Turbo using the S.C.O.R.E. framework, which assesses Safety, Consensus and context, Objectivity, Reproducibility, and Explainability, with ratings assigned on a five-point Likert scale. Agreement between LLM and clinician grading was assessed using Spearman rank correlation, Kendall tau statistics, and kernel density estimate analyses. Meerkat-7B achieved the highest performance, with mean scores of 3.44 from Senior Consultants, 4.08 from Consultants, and 4.18 from Residents. MedLLaMA3-v20 performed poorest, with 25.5% of responses containing hallucinations or clinically misleading content, including fabricated terminology. GPT-4-Turbo grading showed strong alignment with clinician assessments overall (Spearman rho = 0.80, Kendall tau = 0.67), although Senior Consultants graded more conservatively. Overall, the medical LLMs showed potential for safe ophthalmic question answering, but gaps remained in clinical depth and consensus. These findings support the feasibility of LLM-based evaluation for large-scale benchmarking and highlight the need for hybrid automated-and-clinician review frameworks to guide safe clinical deployment.
💡 Research Summary
This study provides a systematic clinical validation of four small‑parameter, medical‑domain large language models (LLMs) – Meerkat‑7B, BioMistral‑7B, OpenBioLLM‑8B, and MedLLaMA3‑v20 – on a curated set of 180 ophthalmology‑related patient queries. Each query was presented to the models using an identical prompt that instructed the model to act as an experienced ophthalmologist and to produce a clinically accurate, comprehensive, and concise answer. All models were run in 4‑bit quantized form on identical hardware (Google Colab T4 GPUs) with the same decoding settings (max 512 tokens, temperature 0.3, top‑k 100, top‑p 0.6) to ensure that performance differences could be attributed to intrinsic model capabilities rather than experimental variance.
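The shared decoding settings described above can be sketched as a configuration fragment. The parameter values (512 tokens, temperature 0.3, top-k 100, top-p 0.6) come from the study; the Hugging Face transformers-style key names and the `do_sample` flag are assumptions about how such settings would typically be passed:

```python
# Sketch of the decoding configuration applied identically to all four models.
# Key names follow Hugging Face transformers conventions; the study's exact
# loading and generation code is not given, so this is illustrative only.
GENERATION_CONFIG = {
    "max_new_tokens": 512,   # cap each response at 512 generated tokens
    "temperature": 0.3,      # low temperature for more deterministic answers
    "top_k": 100,
    "top_p": 0.6,
    "do_sample": True,       # sampling is implied by temperature/top-k/top-p
}

# The four sub-10B-parameter models compared in the study.
MODELS = [
    "Meerkat-7B",
    "BioMistral-7B",
    "OpenBioLLM-8B",
    "MedLLaMA3-v20",
]
```

Holding this configuration fixed (along with 4-bit quantization and identical hardware) is what lets the authors attribute score differences to the models themselves rather than to decoding variance.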
The generated 2,160 responses were evaluated using the S.C.O.R.E. framework, which rates five dimensions—Safety, Consensus & Context, Objectivity, Reproducibility, and Explainability—on a 5‑point Likert scale. Three ophthalmologists of varying seniority (Senior Consultant, Consultant, Resident) independently scored each response, and each question was re‑prompted twice to assess reproducibility. In parallel, GPT‑4‑Turbo performed the same scoring automatically, allowing a direct comparison between human and LLM‑based evaluation.
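A per-response S.C.O.R.E. rating can be aggregated into a single mean score as follows. The five dimension names come from the framework; the aggregation function and the sample ratings are illustrative assumptions, not the study's actual data:

```python
from statistics import mean

# The five S.C.O.R.E. dimensions, each rated on a 1-5 Likert scale.
DIMENSIONS = ["safety", "consensus_context", "objectivity",
              "reproducibility", "explainability"]

def mean_score(ratings: dict) -> float:
    """Average one grader's five dimension ratings for a single response."""
    return mean(ratings[d] for d in DIMENSIONS)

# Illustrative ratings for one response (hypothetical, not study data).
ratings = {"safety": 5, "consensus_context": 4, "objectivity": 4,
           "reproducibility": 4, "explainability": 5}
print(mean_score(ratings))  # → 4.4
```

Averaging such per-response scores across all 180 queries, separately for each grader, is one plausible route to the per-seniority means (e.g., 3.44, 4.08, 4.18 for Meerkat-7B) reported below.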
Key findings:
- Meerkat‑7B consistently achieved the highest mean scores (Senior Consultant 3.44, Consultant 4.08, Resident 4.18) and excelled in Safety and Explainability, indicating fewer hallucinations and clearer reasoning. However, it still produced clinically relevant errors on complex surgical queries (e.g., mis‑describing LASIK Xtra and DCR complications).
- BioMistral‑7B and OpenBioLLM‑8B performed at an intermediate level, showing reasonable contextual relevance and objectivity but lacking the overall robustness of Meerkat‑7B.
- MedLLaMA3‑v20 performed poorest; 25.5 % of its answers contained hallucinations or fabricated terminology (e.g., “laser photophosphorylation,” “treacle”), posing a clear patient safety risk.
- GPT‑4‑Turbo’s automated grading correlated strongly with the ophthalmologists (Spearman ρ = 0.80, Kendall τ = 0.67). Alignment was strongest with Resident scores and weaker with Senior Consultant scores, reflecting the more conservative grading style of senior clinicians. Kernel density estimates showed the automated system and Residents clustering around high scores (4–5), while Senior Consultants displayed a broader distribution (3–4).
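The agreement statistics above (Spearman ρ and Kendall τ) can be sketched with a small stdlib-only implementation. The paired scores here are illustrative, not the study's data:

```python
from itertools import combinations

def rankdata(xs):
    """Assign average 1-based ranks, handling ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # average rank within a tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman rho = Pearson correlation of the rank vectors."""
    rx, ry = rankdata(x), rankdata(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant) / total pairs."""
    conc = disc = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            conc += 1
        elif s < 0:
            disc += 1
    n = len(x)
    return (conc - disc) / (n * (n - 1) / 2)

# Illustrative paired mean scores: clinician vs. GPT-4-Turbo grading of
# six responses (hypothetical values, chosen to show near-agreement).
clinician = [3.0, 3.4, 4.0, 4.2, 4.6, 5.0]
llm_judge = [3.2, 3.8, 4.4, 4.0, 4.8, 5.0]

rho = spearman_rho(clinician, llm_judge)
tau = kendall_tau(clinician, llm_judge)
print(f"Spearman rho = {rho:.2f}, Kendall tau = {tau:.2f}")
# → Spearman rho = 0.94, Kendall tau = 0.87
```

As in the study, τ is lower than ρ on the same data: a single swapped pair penalizes Kendall's pairwise concordance count more heavily than it perturbs the rank differences underlying Spearman's coefficient.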
The study demonstrates that LLM‑based automatic evaluation can serve as an efficient first‑pass filter for large‑scale benchmarking of medical chatbots, dramatically reducing the labor required for manual expert review. Nevertheless, the observed discrepancies—especially in high‑risk clinical scenarios and among senior clinicians—underscore the necessity of a hybrid validation pipeline that couples automated scoring with expert oversight. The authors recommend extending the evaluation to larger, more diverse ophthalmic datasets, incorporating additional subspecialties, and testing newer, larger models (e.g., LLaMA‑3 70B) and retrieval‑augmented generation techniques to improve factual grounding.
In conclusion, while small medical LLMs show promise for safe and comprehensible ophthalmic patient education, gaps remain in clinical depth, consensus, and error mitigation. Hybrid frameworks that blend LLM‑based scoring with specialist review are essential to ensure patient safety and to guide the responsible deployment of AI assistants in ophthalmology and broader healthcare contexts.