Ran Score: an LLM-based Evaluation Score for Radiology Report Generation
Chest X-ray report generation and automated evaluation are limited by poor recognition of low-prevalence abnormalities and inadequate handling of clinically important language, including negation and ambiguity. We develop a clinician-guided framework combining human expertise and large language models for multi-label finding extraction from free-text chest X-ray reports and use it to define Ran Score, a finding-level metric for report evaluation. Using three non-overlapping MIMIC-CXR-EN cohorts from a public chest X-ray dataset and an independent ChestX-CN validation cohort, we optimize prompts, establish radiologist-derived reference labels, and evaluate report generation models. The optimized framework improves the macro-averaged F1 score from 0.753 to 0.956 on the MIMIC-CXR-EN development cohort, exceeds the CheXbert benchmark by 15.7 percentage points on directly comparable labels, and shows robust generalization on the ChestX-CN validation cohort. Here we show that clinician-guided prompt optimization improves agreement with a radiologist-derived reference standard and that Ran Score enables finding-level evaluation of report fidelity, particularly for low-prevalence abnormalities.
💡 Research Summary
The paper introduces Ran Score, a finding‑level evaluation metric for chest X‑ray report generation, and a clinician‑guided Human‑LLM collaborative framework to produce high‑quality multi‑label annotations. The authors first constructed a 21‑category taxonomy of radiographic abnormalities by exploratory mining of 3,000 MIMIC‑CXR reports, deliberately extending beyond the limited CheXbert label set to include low‑prevalence findings such as tracheobronchial abnormalities, cavities, cysts, mediastinal changes, calcifications, and vascular abnormalities. Six board‑certified thoracic radiologists independently annotated a development set of 300 MIMIC‑CXR‑EN reports and a validation set of 150 ChestX‑CN reports using this taxonomy; a majority vote (≥4/6) defined the reference standard, preserving inherent inter‑reader variability while excluding cases labeled uncertain (−1).
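The majority-vote rule described above can be sketched as a small helper. This is a minimal illustration, not the authors' code; the use of −1 as the uncertain sentinel follows the summary's notation, and the function name is hypothetical.

```python
from collections import Counter

UNCERTAIN = -1  # sentinel for uncertain annotations, as in the paper's notation


def majority_vote(annotations, threshold=4):
    """Derive a reference label for one finding from six radiologist votes.

    annotations: list of 0/1/UNCERTAIN votes for a single finding in one report.
    Returns 1 or 0 when at least `threshold` readers agree on that value;
    otherwise returns UNCERTAIN, and the case is excluded from the reference
    standard.
    """
    counts = Counter(v for v in annotations if v != UNCERTAIN)
    for label in (1, 0):
        if counts.get(label, 0) >= threshold:
            return label
    return UNCERTAIN


# Five of six readers mark a finding present: the reference label is positive.
majority_vote([1, 1, 1, 1, 1, 0])  # -> 1
```

A 3–3 split (or a split diluted by uncertain votes) yields no consensus and the case is dropped, which matches the summary's exclusion of uncertain cases.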
The core of the methodology is an iterative prompt‑engineering loop applied to the open‑weight LLM Qwen‑3‑14B. Initial prompts enforced strict binary classification and excluded ambiguous labels. Through three rounds of Delphi consensus with the radiologists, the authors set a target of 90% label‑specific accuracy (Cohen's κ ≥ 0.90). Error analysis highlighted three failure modes: synonym variation, negation handling, and ambiguous phrasing. The prompt was refined by (1) adding core terms and synonyms, (2) clarifying vague descriptions, and (3) inserting explicit positive and negative examples for each label, especially for under‑represented abnormalities. After several iterations, Qwen‑3‑14B achieved a macro‑averaged F1 of 0.956 on the development set, surpassing CheXbert by 15.7 percentage points on directly comparable labels. The same prompt was evaluated on other LLMs (Qwen‑Plus, GPT‑3.5‑Turbo, GPT‑4o‑mini, DeepSeek‑R1), confirming that the optimized prompt consistently improves performance across models.
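The per-label agreement target (Cohen's κ ≥ 0.90) used to decide when to stop iterating on a prompt can be computed with a short, dependency-free function. This is a sketch under the assumption of binary (0/1) labels; it is not taken from the paper's code.

```python
def cohens_kappa(y_true, y_pred):
    """Cohen's kappa for binary label vectors.

    Measures chance-corrected agreement between LLM extractions (y_pred)
    and the radiologist reference standard (y_true) for a single finding.
    """
    n = len(y_true)
    # Observed agreement: fraction of reports where the two labelers match.
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Expected agreement under independence, from each labeler's positive rate.
    p_true1 = sum(y_true) / n
    p_pred1 = sum(y_pred) / n
    p_e = p_true1 * p_pred1 + (1 - p_true1) * (1 - p_pred1)
    return 1.0 if p_e == 1 else (p_o - p_e) / (1 - p_e)
```

In the loop described above, a label whose kappa falls below 0.90 would trigger another round of synonym, negation, or example refinement for that label's prompt section.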
To assess report generation, seven state‑of‑the‑art chest X‑ray report generators (RGRG, XrayGPT, R2GenGPT, PromptMRG, LLM‑RG4, Libra‑1.0‑3B, MedKit) produced reports for a held‑out test set of 3,000 MIMIC‑CXR‑EN studies. The frozen Qwen‑3‑14B extraction pipeline was applied to both the generated reports and the original reference reports, yielding binary predictions for the 21 labels. The macro‑averaged F1 between these two label sets is defined as Ran Score. Because each label contributes equally, Ran Score is particularly sensitive to omissions of low‑prevalence findings, a known weakness of surface‑level metrics such as BLEU, ROUGE, or CIDEr. Ran Score correlated more strongly with radiologist judgments than traditional lexical overlap scores and demonstrated that many generation models still miss clinically important, rare abnormalities.
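The Ran Score computation described above, macro-averaged F1 between the label sets extracted from generated and reference reports, can be sketched as follows. This is an illustrative implementation, not the released code; the convention of scoring 1.0 for a label absent from both sets is an assumption.

```python
def _f1(tp, fp, fn):
    """Per-label F1; a label absent from both reports' extractions is
    treated as perfect agreement (assumed edge-case convention)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0


def ran_score(ref_labels, gen_labels):
    """Macro-averaged F1 over the finding taxonomy (21 labels in the paper).

    ref_labels, gen_labels: lists of equal-length binary vectors, one per
    report, holding the frozen extraction pipeline's output for the
    reference and generated reports respectively. Because every label is
    weighted equally, omissions of low-prevalence findings are penalized
    as heavily as errors on common ones.
    """
    n_labels = len(ref_labels[0])
    per_label = []
    for j in range(n_labels):
        tp = sum(r[j] and g[j] for r, g in zip(ref_labels, gen_labels))
        fp = sum((not r[j]) and g[j] for r, g in zip(ref_labels, gen_labels))
        fn = sum(r[j] and (not g[j]) for r, g in zip(ref_labels, gen_labels))
        per_label.append(_f1(tp, fp, fn))
    return sum(per_label) / n_labels
```

Equal per-label weighting is what distinguishes this from micro-averaging: a model that reproduces frequent findings but drops a rare one (say, a cavity) loses a full 1/21 of the attainable score, whereas surface metrics like BLEU barely register the omission.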
Cross‑lingual robustness was examined using the independent ChestX‑CN cohort (Chinese reports). Applying the same taxonomy and optimized prompt resulted in a macro‑averaged F1 of 0.938, indicating minimal degradation despite language differences and confirming the framework’s generalizability.
The study’s contributions are threefold: (1) a clinician‑in‑the‑loop prompt optimization strategy that dramatically improves LLM‑based multi‑label extraction without model retraining; (2) the release of a large‑scale, publicly available 21‑label annotation set covering the entire MIMIC‑CXR corpus (~377 k reports); and (3) the definition of Ran Score, a finding‑level, label‑balanced metric that better reflects clinical fidelity, especially for rare pathologies. Limitations include the exclusion of uncertain cases, the fixed 21‑label schema (which may omit finer‑grained diagnoses), and reliance on binary labeling that may not capture disease severity. Future work should explore probabilistic labeling of uncertainty, expansion of the taxonomy, multimodal end‑to‑end training that jointly optimizes image encoding and report generation, and real‑time integration of the Human‑LLM framework into clinical workflows.