CRIMSON: A Clinically-Grounded LLM-Based Metric for Generative Radiology Report Evaluation
Mohammed Baharoon (1,*,†), Thibault Heintz (2,*), Siavash Raissi (1,*), Mahmoud Alabbad (3), Mona Alhammad (4), Hassan AlOmaish (5), Sung Eun Kim (1), Oishi Banerjee (1), and Pranav Rajpurkar (1)

1 Department of Biomedical Informatics, Harvard Medical School, Boston, MA
2 Department of Radiation Oncology, Mass General Brigham, Boston, MA
3 King Fahad Hospital, Al-Ahsa Health Cluster, Al Hofuf, Saudi Arabia
4 Ras-Tanura General Hospital, Ministry of Health, Eastern Province, Saudi Arabia
5 Department of Medical Imaging, King Abdulaziz Medical City, Ministry of National Guard, Riyadh, Saudi Arabia
* These authors contributed equally
† Contact: MohammedSalimAB@outlook.com

Abstract

We introduce CRIMSON, a clinically grounded evaluation framework for chest X-ray report generation that assesses reports based on diagnostic correctness, contextual relevance, and patient safety. Unlike prior metrics, CRIMSON incorporates full clinical context, including patient age, indication, and guideline-based decision rules, and prevents normal or clinically insignificant findings from exerting disproportionate influence on the overall score. The framework categorizes errors into a comprehensive taxonomy covering false findings, missing findings, and eight attribute-level errors (e.g., location, severity, measurement, and diagnostic overinterpretation). Each finding is assigned a clinical significance level (urgent, actionable non-urgent, non-actionable, or expected/benign), based on a guideline developed in collaboration with attending cardiothoracic radiologists, enabling severity-aware weighting that prioritizes clinically consequential mistakes over benign discrepancies. CRIMSON is validated through strong alignment with clinically significant error counts annotated by six board-certified radiologists in ReXVal (Kendall's τ = 0.61–0.71; Pearson's r = 0.71–0.84), and through two additional benchmarks that we introduce. In RadJudge, a targeted suite of clinically challenging pass–fail scenarios, CRIMSON shows consistent agreement with expert judgment. In RadPref, a larger radiologist preference benchmark of over 100 pairwise cases with structured error categorization, severity modeling, and 1–5 overall quality ratings from three cardiothoracic radiologists, CRIMSON achieves the strongest alignment with radiologist preferences. We release the metric, the evaluation benchmarks RadJudge and RadPref, and a fine-tuned MedGemma model to enable reproducible evaluation of report generation, all available at https://github.com/rajpurkarlab/CRIMSON.

1 Introduction

Automated radiology report generation has advanced rapidly with the emergence of large vision-language models, yet reliable evaluation remains a fundamental challenge [27, 23, 11]. Recent radiology-specific metrics have moved beyond surface-level text similarity and instead assess factual correctness through structured error counting and finding-level comparison [15, 29, 8, 28, 21, 5].
These approaches represent important progress toward clinically meaningful evaluation by explicitly detecting hallucinations and omissions.

Figure 1: Representative RadJudge cases illustrating core design principles of CRIMSON. Top: Patient context sensitivity. The clinical impact of an omission (e.g., aortic atherosclerosis) varies by age and indication, and CRIMSON adjusts severity accordingly. Middle: Normal finding handling. CRIMSON does not reward mentioning normal findings, preventing score inflation. Bottom: Clinical significance weighting. Errors are weighted by consequence, prioritizing clinically important findings. In each case, CRIMSON aligns with radiologist expectations, whereas prior metrics fail.

Despite these advances, current metrics largely treat detected errors as either uniformly important or binary (significant vs. not significant), and evaluate findings in relative isolation from broader clinical context. In practice, the clinical consequences of errors vary substantially. For instance, failing to report a life-threatening pneumothorax is categorically different from missing age-related aortic calcification. Moreover, the relevance and interpretation of findings depend on a patient's age and indication. Existing evaluation frameworks do not explicitly encode clinical severity as a fine-grained spectrum; instead, they collapse findings into coarse categories, lack sufficient clinical context to determine true significance, or rely on LLM judgments without structured in-context guidelines [15, 8, 21]. Consequently, these frameworks conflate minor, clinically inconsequential discrepancies with omissions that directly impact patient safety.

To address these limitations, we introduce CRIMSON, a clinically grounded LLM-based evaluation framework designed to align automated assessment with real-world radiologic reasoning.
CRIMSON evaluates reports at the level of individual findings while incorporating full clinical context, including patient age and indication. The framework models a comprehensive taxonomy of errors (false findings, missing findings, and attribute-level errors such as location, severity, and measurement) together with significance labels (urgent, actionable non-urgent, non-actionable, or expected/benign), defined according to a rubric developed with cardiothoracic radiologists. The severity labels determine weights within a principled scoring formulation that prioritizes clinically consequential errors over benign discrepancies and supports partial credit for partially correct findings under fine-grained attribute rules. We validate CRIMSON through alignment with radiologist-annotated clinically significant error counts in ReXVal [24, 23] (PhysioNet Credentialed Health Data License 1.5.0). We also introduce and validate the metric using RadJudge, a targeted ranking test suite of clinically challenging scenarios, and RadPref, a comprehensive radiologist preference benchmark, demonstrating improved agreement with expert judgment compared to existing metrics. To facilitate reproducibility and adoption, we publicly release the metric and additionally fine-tune MedGemma [17] to generate CRIMSON predictions, enabling privacy-preserving, fully local deployment for hospitals and institutions without transmitting patient reports to external APIs.

2 Related Work

Radiology Report Evaluation. Early evaluation of radiology report generation relied on general-purpose metrics such as BLEU [16], ROUGE [13], and METEOR [3], which measure lexical overlap and are poorly aligned with clinical correctness. Radiology-specific frameworks such as CheXbert [19] and RadGraph [9] shift evaluation toward label-based or entity-based comparisons; however, they remain constrained by predefined label spaces or extraction pipelines and do not explicitly model variation in clinical severity across findings. BERTScore [25] offers embedding-based semantic similarity but does not capture clinical correctness and lacks interpretability. RadCliQ [23] combines multiple automated metrics to approximate radiologist judgment through a composite score. RaTEScore [29] evaluates reports using entity-aware semantic similarity to better align with clinical content. Both metrics, however, lack explicit finding-level interpretability and do not incorporate clinical context to modulate error severity.

LLM-Based Evaluation Frameworks. More recently, evaluation methods have leveraged large language models (LLMs) to assess factual correctness through error categorization and counting [15, 21, 8, 12, 4, 28, 10, 6]. Compared to label- or graph-based approaches, LLM-based evaluators offer greater flexibility in identifying nuanced discrepancies and can produce structured rationales alongside error counts, improving interpretability and transparency. GREEN [15] and LLM-RadJudge [21] both leverage LLMs to identify and classify errors in candidate reports, producing interpretable evaluations validated against radiologist judgments. However, these and most other LLM-based metrics either treat all errors uniformly or rely on a binary significant versus insignificant distinction, without modeling clinical severity along a fine-grained spectrum or incorporating full patient-level context [15, 21, 12, 28].
FineRadScore [8] introduces error weighting within a line-level correction framework; however, its severity assessment remains implicitly determined by the LLM without explicit clinician-defined guidelines or comprehensive patient-level context. Other LLM-based metrics take alternative approaches: CLEAR [10] transforms reports into structured condition–attribute tables for attribute-level fidelity assessment, while ICARE [6] employs a dual-agent question-answering framework for clinically grounded evaluation. Zhu et al. [30] demonstrate that incorporating professional radiologists' expertise into LLM-based evaluation pipelines can substantially improve alignment with clinical judgment.

3 CRIMSON: Clinically-Grounded Report Evaluation

CRIMSON evaluates a candidate radiology report against a reference report by identifying and categorizing discrepancies at the finding level, weighing each error by its clinical significance, and computing a normalized score that reflects clinical preferences. GPT-5.2 [18] is used as the backbone throughout this pipeline. The framework operates in three stages: (1) finding extraction and clinical significance assignment, (2) error detection and classification, and (3) severity-aware score computation.

3.1 Finding Extraction and Clinical Significance Assignment

Given a reference report R_ref and a candidate report R_cand, CRIMSON first extracts all abnormal findings from each report. Normal findings are excluded from evaluation because including them can introduce spurious variability due to stylistic differences across radiologists [2, 1]. Each extracted finding f is assigned a clinical significance weight w(f) based on standard radiological practice and a structured severity framework adapted from [20] with input from attending cardiothoracic radiologists. The clinical significance weight w(f) is defined as:

\[
w(f) =
\begin{cases}
1.0 & \text{if urgent} \\
0.5 & \text{if actionable, not urgent} \\
0.25 & \text{if not actionable, not urgent} \\
0.0 & \text{if expected/benign}
\end{cases}
\tag{1}
\]

Findings classified as urgent correspond to abnormalities requiring immediate intervention or indicating life-threatening conditions, such as a tension pneumothorax. Actionable non-urgent findings are those that alter patient management but are not immediately critical, including nodules, moderate pleural effusions, or consolidations. Non-actionable findings have minimal clinical impact but remain worth documenting, such as a cervical rib or appropriately positioned support devices. Expected/benign findings are expected or age-appropriate changes with no impact on care, such as degenerative spine changes or a tortuous aorta. We separate non-actionable from expected/benign findings because reporting of expected benign findings varies substantially by radiologist style; penalizing reports based on them introduces unnecessary randomness.

Clinical significance assignment incorporates patient context when available, including age and indication. For example, aortic calcification in a 75-year-old patient is classified as expected/benign, whereas the same finding in a 25-year-old patient may be considered actionable non-urgent due to atypical early onset.
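To make the weighting in Equation (1) concrete, the following minimal Python sketch maps the four significance categories to their weights and illustrates how patient context can shift a category, mirroring the aortic calcification example above. The enum, the helper names, and the age cutoff are illustrative assumptions and not part of the released implementation.

```python
from enum import Enum


class Significance(Enum):
    URGENT = "urgent"
    ACTIONABLE_NON_URGENT = "actionable, not urgent"
    NON_ACTIONABLE = "not actionable, not urgent"
    EXPECTED_BENIGN = "expected/benign"


# Clinical significance weights w(f) from Equation (1).
SIGNIFICANCE_WEIGHTS = {
    Significance.URGENT: 1.0,
    Significance.ACTIONABLE_NON_URGENT: 0.5,
    Significance.NON_ACTIONABLE: 0.25,
    Significance.EXPECTED_BENIGN: 0.0,
}


def weight(level: Significance) -> float:
    """Return the clinical significance weight w(f) for a finding."""
    return SIGNIFICANCE_WEIGHTS[level]


def classify_aortic_calcification(age: int) -> Significance:
    """Hypothetical context-dependent rule: expected/benign in older patients,
    actionable non-urgent when atypically early (the age threshold is a placeholder)."""
    return (Significance.EXPECTED_BENIGN if age >= 65
            else Significance.ACTIONABLE_NON_URGENT)


print(weight(classify_aortic_calcification(75)))  # 0.0 (expected/benign)
print(weight(classify_aortic_calcification(25)))  # 0.5 (actionable, not urgent)
```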
3.2 Error Taxonomy and Classification

CRIMSON characterizes discrepancies through three primary error categories: false findings, missing findings, and attribute errors. False findings are abnormal findings present in R_cand but absent from R_ref, representing hallucinations. Missing findings are abnormal findings present in R_ref but absent from R_cand, representing diagnostically meaningful omissions.

Findings that appear in both reports are considered "matched." For such findings, CRIMSON evaluates attribute-level correctness across eight dimensions: (1) anatomical location or laterality, (2) severity or extent, (3) morphological descriptors, (4) quantitative measurements, (5) certainty level, (6) diagnostic underinterpretation, (7) overinterpretation, and (8) temporal or comparison descriptors. Each attribute error e is assigned a severity-based weight w_attr(e), where w_attr(e) = 0.5 if the error is labeled as significant and w_attr(e) = 0.0 if it is labeled as negligible. Significant attribute errors are those that could alter treatment decisions or patient management, whereas negligible errors correspond to clinically inconsequential differences. For example, incorrect lung laterality is considered significant, whereas positional differences within the same lobe, such as "apical" vs. "lateral," are considered negligible. For pulmonary nodules smaller than 6 mm, measurement discrepancies exceeding 2 mm are considered significant; for nodules 6 mm or larger, discrepancies exceeding 4 mm are considered significant, reflecting established practice [14]. Changes in severity descriptors that affect urgency, such as "small" versus "large," are significant, whereas "small" versus "tiny" is negligible. A single matched finding may contain multiple attribute errors, each evaluated independently.
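As one concrete illustration of these attribute rules, the sketch below encodes the nodule measurement thresholds from the paragraph above (2 mm tolerance for nodules under 6 mm, 4 mm at or above 6 mm) together with the 0.5/0.0 attribute-error weights. Anchoring the threshold on the reference measurement is an assumption made for illustration, not a statement of the released implementation.

```python
def measurement_error_weight(ref_mm: float, cand_mm: float) -> float:
    """Attribute-error weight w_attr for a nodule size discrepancy (Section 3.2).

    Nodules smaller than 6 mm tolerate up to 2 mm of discrepancy; nodules of
    6 mm or larger tolerate up to 4 mm. Significant errors receive weight 0.5,
    negligible ones 0.0.
    """
    tolerance_mm = 2.0 if ref_mm < 6.0 else 4.0
    return 0.5 if abs(ref_mm - cand_mm) > tolerance_mm else 0.0


# A 5 mm nodule reported as 8 mm is a significant error; a 10 mm nodule
# reported as 12 mm falls within the 4 mm tolerance and is negligible.
assert measurement_error_weight(5.0, 8.0) == 0.5
assert measurement_error_weight(10.0, 12.0) == 0.0
```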
3.3 Severity-Aware Scoring

The framework produces a score in the range (−1, 1] that can be easily interpreted in clinical workflows. The scale is grounded at 0, which corresponds to a normal candidate report. This reflects a practical assumption that a radiologist begins from a normal template and modifies the report by adding abnormal findings. A score greater than zero indicates that the candidate report contains more correct findings than errors after severity weighting. A score equal to zero indicates that the report is no more informative than submitting a normal template, except when the reference report is also normal, in which case a correct normal report receives a score of 1. A score less than zero indicates that the report contains more errors than correct findings, implying that a radiologist would likely spend more effort correcting it than editing a normal template. The upper bound of 1 represents a perfect report with no missed findings, no false positives, and no significant attribute errors. Negative scores approach −1 asymptotically because errors are theoretically unbounded: a candidate can always become worse by introducing additional false findings.

For each matched finding m_i, let its clinical significance weight be w_i = w(m_i). Attribute-level penalties are aggregated as E_attr,i = Σ_j w_attr(e_{j,i}), where e_{j,i} refers to attribute error j for finding i, and the total credit C across matched findings is

\[
C = \sum_{i \in \text{matched}} w_i \cdot \frac{w_i}{w_i + E_{\text{attr},i}}.
\]

Let W_ref = Σ_{f ∈ R_ref} w(f) denote the total weighted clinical significance of the reference report, and let E_false = Σ_{f ∈ false} w(f) denote the weighted sum of false positive findings. The raw score is defined as:

\[
S =
\begin{cases}
\dfrac{C - E_{\text{false}}}{W_{\text{ref}}} & \text{if } W_{\text{ref}} > 0, \\[4pt]
-E_{\text{false}} & \text{if } W_{\text{ref}} = 0 \text{ and } E_{\text{false}} > 0, \\[4pt]
1 & \text{if } W_{\text{ref}} = 0 \text{ and } E_{\text{false}} = 0.
\end{cases}
\tag{2}
\]

To bound the negative range while preserving relative ordering, let A = E_false − C, which represents the excess weighted errors relative to correct findings. The final score is:

\[
\text{CRIMSON} =
\begin{cases}
S & \text{if } S \geq 0, \\[4pt]
-\dfrac{A}{1 + A} & \text{if } S < 0.
\end{cases}
\tag{3}
\]
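The scoring logic in Equations (2) and (3) can be summarized in a short function. The sketch below assumes that the extraction and classification stages have already produced significance weights for matched, missing, and false findings and attribute-error weights for matched findings; W_ref is taken here as the sum of weights over matched plus missing reference findings. The data structures are illustrative, not the released implementation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MatchedFinding:
    weight: float  # clinical significance w_i of the matched finding
    attr_error_weights: List[float] = field(default_factory=list)  # w_attr(e_{j,i})


def crimson_score(matched: List[MatchedFinding],
                  missing_weights: List[float],
                  false_weights: List[float]) -> float:
    """Severity-aware score following Equations (2) and (3)."""
    # Partial credit for matched findings: w_i * w_i / (w_i + E_attr,i).
    credit = 0.0
    for m in matched:
        e_attr = sum(m.attr_error_weights)
        if m.weight > 0:
            credit += m.weight * m.weight / (m.weight + e_attr)

    # Reference weight W_ref (matched + missing reference findings) and
    # weighted false findings E_false.
    w_ref = sum(m.weight for m in matched) + sum(missing_weights)
    e_false = sum(false_weights)

    # Raw score S, Equation (2).
    if w_ref > 0:
        s = (credit - e_false) / w_ref
    elif e_false > 0:
        s = -e_false
    else:
        return 1.0  # normal reference matched by a normal candidate

    # Final score, Equation (3): negative scores are squashed into (-1, 0).
    if s >= 0:
        return s
    a = e_false - credit  # excess weighted errors relative to correct findings
    return -a / (1.0 + a)


# Example: one matched urgent finding with a significant attribute error,
# one missed actionable non-urgent finding, and one expected/benign false finding.
print(crimson_score([MatchedFinding(1.0, [0.5])], [0.5], [0.0]))  # ~0.44
```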
Figure 2: RadJudge results across ten categories (false finding penalization, patient context sensitivity, normal finding handling, paraphrase robustness, clinical practicality, location error handling, measurement error sensitivity, diagnostic precision, clinical significance weighting, and partial credit assignment). For each case, metrics are evaluated based on whether their relative ranking of multiple candidate reports agrees with the expected ordering determined with agreement across three attending cardiothoracic radiologists. Each category contains three cases; entries are cases passed (out of 3), with totals out of 30.

4 Results

We perform three complementary forms of validation: correlation with radiologist-annotated clinically significant error counts (Section 4.1), a radiologist-guided pass–fail clinical judgment test on RadJudge (Section 4.2), and large-scale radiologist preference alignment on RadPref (Section 4.3). All metrics except CRIMSON were computed using RadEval [22].

4.1 Correlation with Radiologist-Annotated Significant Errors

We evaluated CRIMSON on 50 cases from ReXVal [24] annotated by six board-certified radiologists. We computed Kendall's τ and Pearson r correlations between automatic metric scores and radiologist-derived clinically significant error counts. As shown in Table 1, CRIMSON demonstrates strong alignment with these expert annotations. Furthermore, error counts (E) exhibited even stronger alignment, and severity-weighted errors (Weighted E) achieved the highest correlations overall, demonstrating that explicitly modeling clinical consequence further improves agreement with expert judgment.

Table 1: Kendall τ and Pearson r correlations (95% CI) between automatic metrics and radiologist-derived clinically significant error counts (n = 50). Columns refer to different candidate reports on ReXVal, each of which was chosen to optimize a specific metric [24]. GREEN E and CRIMSON E denote the total (unweighted) error count, while CRIMSON Weighted E applies clinical severity-based weighting to errors. All other metrics were calculated using RadEval [22]. *CRIMSON results are averaged across 5 runs due to non-deterministic API outputs.

Candidate reports optimized for CheXbert and BERTScore:

| Metric | CheXbert Kendall τ | CheXbert Pearson r | BERTScore Kendall τ | BERTScore Pearson r |
|---|---|---|---|---|
| RadGraph [9] | 0.41 [0.19, 0.61] | 0.59 [0.41, 0.75] | 0.54 [0.36, 0.68] | 0.65 [0.51, 0.78] |
| BLEU [16] | 0.49 [0.30, 0.65] | 0.60 [0.47, 0.72] | 0.36 [0.16, 0.54] | 0.48 [0.32, 0.63] |
| BERTScore [25] | 0.52 [0.35, 0.67] | 0.65 [0.52, 0.78] | 0.49 [0.30, 0.66] | 0.60 [0.44, 0.74] |
| GREEN [15] | 0.62 [0.46, 0.75] | 0.75 [0.64, 0.86] | 0.67 [0.54, 0.78] | 0.70 [0.59, 0.80] |
| ROUGE-L [13] | 0.58 [0.44, 0.71] | 0.71 [0.60, 0.81] | 0.54 [0.37, 0.70] | 0.62 [0.46, 0.75] |
| CheXbert [19] | 0.46 [0.26, 0.63] | 0.45 [0.18, 0.70] | 0.30 [0.08, 0.51] | 0.34 [0.09, 0.60] |
| RaTEScore [29] | 0.39 [0.17, 0.57] | 0.52 [0.31, 0.69] | 0.49 [0.32, 0.65] | 0.56 [0.37, 0.73] |
| RadCliQ-v1 [23] | 0.34 [0.21, 0.46] | 0.34 [0.19, 0.52] | 0.35 [0.21, 0.48] | 0.35 [0.22, 0.53] |
| CRIMSON* | 0.68 [0.54, 0.79] | 0.84 [0.76, 0.90] | 0.71 [0.60, 0.80] | 0.82 [0.74, 0.89] |
| GREEN E | 0.71 [0.59, 0.81] | 0.75 [0.65, 0.86] | 0.75 [0.63, 0.85] | 0.85 [0.75, 0.92] |
| CRIMSON E | 0.73 [0.61, 0.83] | 0.88 [0.79, 0.94] | 0.72 [0.62, 0.80] | 0.86 [0.77, 0.93] |
| CRIMSON Weighted E | 0.78 [0.67, 0.86] | 0.90 [0.85, 0.95] | 0.80 [0.71, 0.87] | 0.91 [0.88, 0.95] |

Candidate reports optimized for RadGraph and BLEU:

| Metric | RadGraph Kendall τ | RadGraph Pearson r | BLEU Kendall τ | BLEU Pearson r |
|---|---|---|---|---|
| RadGraph | 0.59 [0.46, 0.71] | 0.60 [0.50, 0.72] | 0.64 [0.50, 0.75] | 0.75 [0.65, 0.84] |
| BLEU | 0.13 [0.09, 0.34] | 0.23 [0.02, 0.39] | 0.52 [0.34, 0.68] | 0.67 [0.53, 0.79] |
| BERTScore | 0.46 [0.29, 0.61] | 0.54 [0.39, 0.68] | 0.58 [0.41, 0.72] | 0.72 [0.60, 0.83] |
| GREEN | 0.62 [0.48, 0.74] | 0.65 [0.53, 0.77] | 0.70 [0.55, 0.83] | 0.79 [0.68, 0.89] |
| ROUGE-L | 0.54 [0.38, 0.67] | 0.60 [0.49, 0.70] | 0.67 [0.52, 0.79] | 0.80 [0.70, 0.87] |
| CheXbert | 0.29 [0.08, 0.48] | 0.33 [0.08, 0.55] | 0.18 [0.05, 0.40] | 0.23 [0.04, 0.47] |
| RaTEScore | 0.57 [0.42, 0.70] | 0.62 [0.45, 0.78] | 0.54 [0.39, 0.68] | 0.67 [0.54, 0.79] |
| RadCliQ-v1 | 0.12 [0.05, 0.28] | 0.06 [0.11, 0.28] | 0.28 [0.11, 0.43] | 0.16 [0.01, 0.53] |
| CRIMSON* | 0.61 [0.45, 0.75] | 0.71 [0.53, 0.85] | 0.67 [0.54, 0.79] | 0.81 [0.71, 0.89] |
| GREEN E | 0.71 [0.59, 0.81] | 0.75 [0.65, 0.86] | 0.80 [0.71, 0.88] | 0.88 [0.82, 0.93] |
| CRIMSON E | 0.73 [0.61, 0.83] | 0.86 [0.78, 0.92] | 0.74 [0.61, 0.84] | 0.87 [0.78, 0.93] |
| CRIMSON Weighted E | 0.77 [0.67, 0.85] | 0.86 [0.80, 0.93] | 0.78 [0.69, 0.86] | 0.88 [0.82, 0.93] |
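For reference, correlations of the kind reported in Table 1 can be computed with standard tooling once metric scores and per-case radiologist error counts are available. The sketch below uses SciPy with a simple case-level bootstrap for 95% confidence intervals; the resampling scheme, iteration count, and any sign convention (e.g., negating scores so that they increase with error count) are assumptions rather than the authors' exact procedure.

```python
import numpy as np
from scipy.stats import kendalltau, pearsonr


def correlation_with_ci(metric_scores, error_counts, n_boot=1000, seed=0):
    """Kendall tau and Pearson r against radiologist error counts, with
    bootstrap 95% confidence intervals over cases."""
    x = np.asarray(metric_scores, dtype=float)
    y = np.asarray(error_counts, dtype=float)
    tau, _ = kendalltau(x, y)
    r, _ = pearsonr(x, y)

    rng = np.random.default_rng(seed)
    taus, rs = [], []
    for _ in range(n_boot):
        idx = rng.integers(0, len(x), size=len(x))  # resample cases with replacement
        taus.append(kendalltau(x[idx], y[idx])[0])
        rs.append(pearsonr(x[idx], y[idx])[0])
    return ((tau, np.percentile(taus, [2.5, 97.5])),
            (r, np.percentile(rs, [2.5, 97.5])))
```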
4.2 Radiologist-Guided Clinical Judgment Test

We also developed RadJudge, a targeted pass–fail test suite reflecting real-world radiologist intuition. RadJudge comprises 30 curated cases across 10 clinically nuanced categories in which multiple candidate reports are compared and independently reviewed by three cardiothoracic radiologists, with agreement required to establish the reference preference. A metric passes if it ranks reports in accordance with expert judgment or assigns equivalent scores when radiologists deem them clinically indistinguishable, defined as differences within a threshold of 0.01. Three representative cases are shown in Figure 1. The suite probes challenging scenarios, including urgent omissions versus benign hallucinations, context-dependent findings, diagnostic over- and under-interpretation, and situations reflecting the clinical reality of imperfect reference reports that omit localization or age-expected benign findings.

As shown in Figure 2, CRIMSON is the only metric that correctly solves all 30 out of 30 cases, consistently ranking candidate reports in accordance with expert radiologist judgment. In contrast, all prior metrics perform substantially worse, correctly resolving fewer than 35% of cases, highlighting their limited ability to capture nuanced clinical judgment.

4.3 Radiologist Preference Alignment

Correlation with error counts may not fully capture radiologist preference, as experts do not weigh different types of errors equally in overall judgment. To directly assess preference alignment, we introduce RadPref, a benchmark of 100 cases, each containing a reference report from ReXGradient-160K [26] and two candidate reports randomly generated using diverse regimes inspired by [15]: report generation with MedGemma [17], randomly sampled reports, BERT similarity-matched reports, and LLM-based editing, addition, or removal of findings. Each candidate was rated on a 1–5 scale by three cardiothoracic radiologists based on overall clinical quality and correctness relative to the reference report. The scale was defined as: 1 = completely wrong or clinically dangerous; 2 = major errors, with most key findings missing or false; 3 = partially correct, with some significant errors; 4 = mostly accurate, with only minor or negligible errors; and 5 = clinically equivalent to the ground truth. Scores were computed separately for both candidates.

Figure 3: Radiologist Preference Alignment (RadPref). Correlation between metric score and radiologist rating differences across 100 pairwise cases. Each point corresponds to a case comparing two candidate reports for the same reference report.

Figure 3 shows that CRIMSON demonstrates the strongest alignment with radiologist pairwise preferences among all evaluated metrics. Across Kendall's τ_b and Pearson r, CRIMSON consistently outperforms prior approaches and approaches the level of inter-rater radiologist agreement. These findings indicate that CRIMSON more faithfully reflects expert clinical preference in relative report quality.

4.4 MedGemma Fine-tuning and Analysis

To enable privacy-preserving, fully local deployment of CRIMSON, we fine-tuned MedGemma [17] on GPT-5-generated CRIMSON annotations using the full ReXGradient-160K [26] training set of 140,000 report pairs for 10 epochs using LoRA [7]. For each pair, the candidate report was generated using the same regimes described in Section 4.3, except without any image-based generation. GPT-5 generated structured finding-level error labels and severity assignments, which served as supervision to train MedGemma to replicate CRIMSON-style outputs. Additional training details are provided on the model's Hugging Face page; a schematic sketch of this kind of LoRA setup is given below. We compare the fine-tuned MedGemma (MedGemmaCRIMSON) against GPT-5.2 on RadPref preference alignment and severity categorization (Figure 4).
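The sketch below shows one plausible shape for such a LoRA fine-tuning run using the Hugging Face PEFT and TRL libraries. The model identifier, LoRA hyperparameters, batch size, learning rate, and prompt formatting are assumptions for illustration; the actual recipe (beyond ReXGradient-160K pairs, 10 epochs, and LoRA) lives on the released model's Hugging Face page, and the exact API surface depends on the TRL version in use.

```python
from datasets import Dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# Hypothetical supervision format: the prompt carries the reference and candidate
# reports, the completion is the CRIMSON-style structured error/severity annotation.
train_data = Dataset.from_list([
    {"text": "### Reference:\n<reference report>\n"
             "### Candidate:\n<candidate report>\n"
             "### CRIMSON evaluation:\n<structured errors and severities>"},
    # ... one example per report pair
])

peft_config = LoraConfig(              # illustrative LoRA hyperparameters
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

trainer = SFTTrainer(
    model="google/medgemma-4b-it",     # assumed MedGemma checkpoint
    train_dataset=train_data,
    peft_config=peft_config,
    args=SFTConfig(
        num_train_epochs=10,           # matches the 10 epochs reported above
        per_device_train_batch_size=4,
        learning_rate=2e-4,
        output_dir="medgemma-crimson",
    ),
)
trainer.train()
```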
Notably, RadPref provides not only pairwise preference ratings but also structured error categorization and clinical severity annotations across all three cardiothoracic radiologists, enabling direct evaluation of both preference alignment and clinically significant error modeling. MedGemmaCRIMSON achieves comparable mean absolute error to GPT-5.2 across false findings, missing findings, and attribute errors, with similar behavior particularly on attribute-level discrepancies. For clinical significance labeling, MedGemmaCRIMSON closely mirrors GPT-5.2 in reproducing radiologist-assigned categories across all three radiologists, achieving agreement rates that are slightly lower but within a narrow margin of GPT-5.2 (Radiologist 1: 80.3% vs 81.6%; Radiologist 2: 76.7% vs 80.5%; Radiologist 3: 73.5% vs 75.4%), with most disagreements occurring between adjacent severity levels rather than extreme misclassifications.

Figure 4: MedGemmaCRIMSON vs GPT-5.2. A) Mean absolute error across false findings, missing findings, and attribute errors per radiologist. B) Severity categorization confusion matrices between three radiologists and CRIMSON, computed only on matched errors (i.e., findings for which both the radiologist and CRIMSON identified an error in the same category). Titles show the percentage of cases for which the radiologist and CRIMSON agree on error category. Color intensity represents the within-row percentage.

5 Discussion

We introduce CRIMSON, a clinically grounded and severity-aware framework for fine-grained radiology report evaluation that explicitly models patient context, diagnostic consequence, and structured attribute-level errors. By incorporating clinician-defined clinical significance weights and score normalization, CRIMSON aligns automated evaluation more closely with real-world radiologist reasoning than prior approaches. While CRIMSON leverages GPT-5.2 for structured evaluation, we additionally demonstrate that a fine-tuned open-weight model (MedGemmaCRIMSON) can closely approximate its behavior, enabling privacy-preserving, fully local deployment.

A core motivation of CRIMSON is the principle that generated reports should be evaluated according to how they would function under radiologist oversight, rather than solely through aggregate accuracy or raw error counts. Instead of treating all discrepancies equally or as binary (significant vs. not significant), CRIMSON explicitly models whether an error would be clinically consequential or potentially dangerous. Missing a life-threatening abnormality should dominate the evaluation, whereas minor descriptor differences should not.

This principle also motivates CRIMSON's partial-credit design for attribute errors. When a model correctly identifies a clinically important finding but misstates a secondary attribute (e.g., mild severity mismatch or imprecise localization), it may still provide value by directing the radiologist's attention to the relevant abnormality. CRIMSON therefore rewards correct detection while penalizing clinically meaningful attribute mistakes in a severity-aware manner, reflecting that some errors increase downstream review effort without necessarily creating the same patient-safety risk as a complete omission or major hallucination.
Across three complementary validation settings: (1) correlation with radiologist-annotated clinically significant error counts, (2) the RadJudge clinical judgment suite, and (3) the RadPref radiologist preference benchmark, CRIMSON consistently demonstrates stronger agreement with expert judgment than existing metrics. Notably, severity-weighted modeling further improves alignment, highlighting the importance of distinguishing clinically consequential errors from benign ones.

A limitation of this work is that much of CRIMSON's prompting framework, severity rubric, and structured evaluation guidelines were developed specifically for chest X-ray reports. The clinical significance taxonomy, attribute rules, and measurement thresholds were designed in collaboration with cardiothoracic radiologists and tailored to common CXR findings and reporting conventions. Although the underlying evaluation framework is modality-agnostic in principle, applying CRIMSON to other imaging domains will require adaptation of prompts, finding ontologies, and severity criteria to align with modality-specific clinical standards. Future work will extend CRIMSON beyond chest X-ray to additional imaging modalities where anatomical detail, multimodal context, and diagnostic complexity are considerably greater.

Acknowledgments

This research was supported in part by the Harvard Medical School Dean's Innovation Award for Accelerating Foundation Model Research.

References

[1] Baharoon, M., Ma, J., Fang, C., Toma, A., Wang, B.: Exploring the design space of 3D MLLMs for CT report generation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 237–246. Springer (2025)

[2] Baharoon, M., Raissi, S., Jun, J.S., Heintz, T., Alabbad, M., Alburkani, A., Kim, S.E., Kleinschmidt, K., Alhumaydhi, A.O., Alghamdi, M.M.G., et al.: RadGame: An AI-powered platform for radiology education. arXiv preprint arXiv:2509.13270 (2025)

[3] Banerjee, S., Lavie, A.: METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In: Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. pp. 65–72 (2005)

[4] Bannur, S., Bouzid, K., Castro, D.C., Schwaighofer, A., Thieme, A., Bond-Taylor, S., Ilse, M., Pérez-García, F., Salvatelli, V., Sharma, H., et al.: MAIRA-2: Grounded radiology report generation. arXiv preprint arXiv:2406.04449 (2024)

[5] Chaves, J.M.Z., Huang, S.C., Xu, Y., Xu, H., Usuyama, N., Zhang, S., Wang, F., Xie, Y., Khademi, M., Yang, Z., et al.: Towards a clinically accessible radiology foundation model: open-access and lightweight, with automated evaluation. arXiv preprint arXiv:2403.08002 (2024)

[6] Dua, R., Joon, Y., Dogra, S., Freedman, D., Ruan, D., Nashawaty, M., Rigau, D., Alber, D.A., Zhang, K., Cho, K., et al.: Clinically grounded agent-based report evaluation: An interpretable metric for radiology report generation. arXiv preprint arXiv:2508.02808 (2025)

[7] Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al.: LoRA: Low-rank adaptation of large language models. ICLR 1(2), 3 (2022)

[8] Huang, A., Banerjee, O., Wu, K., Reis, E.P., Rajpurkar, P.: FineRadScore: A radiology report line-by-line evaluation technique generating corrections with severity scores. arXiv preprint arXiv:2405.20613 (2024)
[9] Jain, S., Agrawal, A., Saporta, A., Truong, S.Q., Duong, D.N., Bui, T., Chambon, P., Zhang, Y., Lungren, M.P., Ng, A.Y., et al.: RadGraph: Extracting clinical entities and relations from radiology reports. arXiv preprint arXiv:2106.14463 (2021)

[10] Jiang, Y., Chen, C., Wang, S., Li, F., Tang, Z., Mervak, B.M., Chelala, L., Straus, C.M., Chahine, R., Armato III, S.G., et al.: CLEAR: A clinically-grounded tabular framework for radiology report evaluation. arXiv preprint arXiv:2505.16325 (2025)

[11] Li, R., Li, J., Jian, B., Yuan, K., Zhu, Y.: ReEvalMed: Rethinking medical report evaluation by aligning metrics with real-world clinical judgment. In: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. pp. 11823–11837 (2025)

[12] Li, Y., Liu, Y., Wang, Z., Liang, X., Liu, L., Wang, L., Zhou, L.: S-RRG-Bench: Structured radiology report generation with fine-grained evaluation framework. Meta-Radiology, 100171 (2025)

[13] Lin, C.Y.: ROUGE: A package for automatic evaluation of summaries. In: Text Summarization Branches Out. pp. 74–81 (2004)

[14] MacMahon, H., Naidich, D.P., Goo, J.M., Lee, K.S., Leung, A.N., Mayo, J.R., Mehta, A.C., Ohno, Y., Powell, C.A., Prokop, M., et al.: Guidelines for management of incidental pulmonary nodules detected on CT images: from the Fleischner Society 2017. Radiology 284(1), 228–243 (2017)

[15] Ostmeier, S., Xu, J., Chen, Z., Varma, M., Blankemeier, L., Bluethgen, C., Md, A.E.M., Moseley, M., Langlotz, C., Chaudhari, A.S., et al.: GREEN: Generative radiology report evaluation and error notation. In: Findings of the Association for Computational Linguistics: EMNLP 2024. pp. 374–390 (2024)

[16] Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. pp. 311–318 (2002)

[17] Sellergren, A., Kazemzadeh, S., Jaroensri, T., Kiraly, A., Traverse, M., Kohlberger, T., Xu, S., Jamil, F., Hughes, C., Lau, C., et al.: MedGemma technical report. arXiv preprint arXiv:2507.05201 (2025)

[18] Singh, A., Fry, A., Perelman, A., Tart, A., Ganesh, A., El-Kishky, A., McLaughlin, A., Low, A., Ostrow, A., Ananthram, A., et al.: OpenAI GPT-5 system card. arXiv preprint (2025)

[19] Smit, A., Jain, S., Rajpurkar, P., Pareek, A., Ng, A.Y., Lungren, M.P.: CheXbert: combining automatic labelers and expert annotations for accurate radiology report labeling using BERT. arXiv preprint arXiv:2004.09167 (2020)

[20] Tian, K., Hartung, S.J., Li, A.A., Jeong, J., Behzadi, F., Calle-Toro, J., Adithan, S., Pohlen, M., Osayande, D., Rajpurkar, P.: ReFiSco: Report fix and score dataset for radiology report generation. PhysioNet (2023)

[21] Wang, Z., Luo, X., Jiang, X., Li, D., Qiu, L.: LLM-RadJudge: Achieving radiologist-level evaluation for X-ray report generation. arXiv preprint arXiv:2404.00998 (2024)

[22] Xu, J., Zhang, X., Abderezaei, J., Bauml, J., Boodoo, R., Haghighi, F., Ganjizadeh, A., Brattain, E., Van Veen, D., Meng, Z., et al.: RadEval: A framework for radiology text evaluation. In: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. pp. 546–557 (2025)
[23] Yu, F., Endo, M., Krishnan, R., Pan, I., Tsai, A., Reis, E.P., Fonseca, E.K.U.N., Lee, H.M.H., Abad, Z.S.H., Ng, A.Y., et al.: Evaluating progress in automatic chest X-ray radiology report generation. Patterns 4(9) (2023)

[24] Yu, F., Endo, M., Krishnan, R., Pan, I., Tsai, A., Reis, E.P., Fonseca, E., Lee, H., Shakeri, Z., Ng, A., et al.: Radiology report expert evaluation (ReXVal) dataset (2023)

[25] Zhang, T., Kishore, V., Wu, F., Weinberger, K.Q., Artzi, Y.: BERTScore: Evaluating text generation with BERT. arXiv preprint arXiv:1904.09675 (2019)

[26] Zhang, X., Acosta, J.N., Miller, J., Huang, O., Rajpurkar, P.: ReXGradient-160K: A large-scale publicly available dataset of chest radiographs with free-text reports. arXiv preprint arXiv:2505.00228 (2025)

[27] Zhang, X., Acosta, J.N., Yang, X., Adithan, S., Luo, L., Zhou, H.Y., Miller, J., Huang, O., Zhou, Z., Hamamci, I.E., et al.: Automated chest X-ray report generation remains unsolved. In: Biocomputing 2026: Proceedings of the Pacific Symposium. pp. 236–250. World Scientific (2025)

[28] Zhang, Z., Lee, K., Deng, W., Zhou, H., Jin, Z., Huang, J., Gao, Z., Marshall, D.C., Fang, Y., Yang, G.: GEMA-Score: Granular explainable multi-agent score for radiology report evaluation. arXiv preprint arXiv:2503.05347 (2025)

[29] Zhao, W., Wu, C., Zhang, X., Zhang, Y., Wang, Y., Xie, W.: RaTEScore: A metric for radiology report generation. arXiv preprint arXiv:2406.16845 (2024)

[30] Zhu, Q., Chen, X., Jin, Q., Hou, B., Mathai, T.S., Mukherjee, P., Gao, X., Summers, R.M., Lu, Z.: Leveraging professional radiologists' expertise to enhance LLMs' evaluation for radiology reports. arXiv preprint arXiv:2401.16578 (2024)