CRIMSON: A Clinically-Grounded LLM-Based Metric for Generative Radiology Report Evaluation
Mohammed Baharoon (1,*,†), Thibault Heintz (2,*), Siavash Raissi (1,*), Mahmoud Alabbad (3), Mona Alhammad (4), Hassan AlOmaish (5), Sung Eun Kim (1), Oishi Banerjee (1), and Pranav Rajpurkar (1)

1 Department of Biomedical Informatics, Harvard Medical School, Boston, MA
2 Department of Radiation Oncology, Mass General Brigham, Boston, MA
3 King Fahad Hospital, Al-Ahsa Health Cluster, Al Hofuf, Saudi Arabia
4 Ras-Tanura General Hospital, Ministry of Health, Eastern Province, Saudi Arabia
5 Department of Medical Imaging, King Abdulaziz Medical City, Ministry of National Guard, Riyadh, Saudi Arabia
* These authors contributed equally
† Contact: MohammedSalimAB@outlook.com

Abstract

We introduce CRIMSON, a clinically grounded evaluation framework for chest X-ray report generation that assesses reports based on diagnostic correctness, contextual relevance, and patient safety. Unlike prior metrics, CRIMSON incorporates full clinical context, including patient age, indication, and guideline-based decision rules, and prevents normal or clinically insignificant findings from exerting disproportionate influence on the overall score. The framework categorizes errors into a comprehensive taxonomy covering false findings, missing findings, and eight attribute-level errors (e.g., location, severity, measurement, and diagnostic overinterpretation). Each finding is assigned a clinical significance level (urgent, actionable non-urgent, non-actionable, or expected/benign), based on a guideline developed in collaboration with attending cardiothoracic radiologists, enabling severity-aware weighting that prioritizes clinically consequential mistakes over benign discrepancies. CRIMSON is validated through strong alignment with clinically significant error counts annotated by six board-certified radiologists in ReXVal (Kendall's τ = 0.61–0.71; Pearson's r = 0.71–0.84), and through two additional benchmarks that we introduce. In RadJudge, a targeted suite of clinically challenging pass–fail scenarios, CRIMSON shows consistent agreement with expert judgment. In RadPref, a larger radiologist preference benchmark of over 100 pairwise cases with structured error categorization, severity modeling, and 1–5 overall quality ratings from three cardiothoracic radiologists, CRIMSON achieves the strongest alignment with radiologist preferences. We release the metric, the evaluation benchmarks RadJudge and RadPref, and a fine-tuned MedGemma model to enable reproducible evaluation of report generation, all available at https://github.com/rajpurkarlab/CRIMSON.

1 Introduction

Automated radiology report generation has advanced rapidly with the emergence of large vision-language models, yet reliable evaluation remains a fundamental challenge [27, 23, 11]. Recent radiology-specific metrics have moved beyond surface-level text similarity and instead assess factual correctness through structured error counting and finding-level comparison [15, 29, 8, 28, 21, 5].
These approaches represent important progress toward clinically meaningful evaluation by explicitly detecting hallucinations and omissions.

Figure 1: Representative RadJudge cases illustrating core design principles of CRIMSON. Top: Patient context sensitivity. The clinical impact of an omission (e.g., aortic atherosclerosis) varies by age and indication, and CRIMSON adjusts severity accordingly. Middle: Normal finding handling. CRIMSON does not reward mentioning normal findings, preventing score inflation. Bottom: Clinical significance weighting. Errors are weighted by consequence, prioritizing clinically important findings. In each case, CRIMSON aligns with radiologist expectations, whereas prior metrics fail.

Despite these advances, current metrics largely treat detected errors as either uniformly important or binary (significant vs. not significant), and evaluate findings in relative isolation from broader clinical context. In practice, the clinical consequences of errors vary substantially. For instance, failing to report a life-threatening pneumothorax is categorically different from missing age-related aortic calcification. Moreover, the relevance and interpretation of findings depend on a patient's age and indication. Existing evaluation frameworks do not explicitly encode clinical severity as a fine-grained spectrum; instead, they collapse findings into coarse categories, lack sufficient clinical context to determine true significance, or rely on LLM judgments without structured in-context guidelines [15, 8, 21]. Consequently, these frameworks conflate minor, clinically inconsequential discrepancies with omissions that directly impact patient safety.

To address these limitations, we introduce CRIMSON, a clinically grounded LLM-based evaluation framework designed to align automated assessment with real-world radiologic reasoning.
CRIMSON evaluates reports at the level of individual findings while incorporating full clinical context, including patient age and indication. The framework models a comprehensive taxonomy of errors (false findings, missing findings, and attribute-level errors such as location, severity, and measurement) together with significance labels (urgent, actionable non-urgent, non-actionable, or expected/benign), defined according to a rubric developed with cardiothoracic radiologists. The severity labels determine weights within a principled scoring formulation that prioritizes clinically consequential errors over benign discrepancies and supports partial credit for partially correct findings under fine-grained attribute rules. We validate CRIMSON through alignment with radiologist-annotated clinically significant error counts in ReXVal [24, 23] (PhysioNet Credentialed Health Data License 1.5.0). We also introduce and validate the metric using RadJudge, a targeted ranking test suite of clinically challenging scenarios, and RadPref, a comprehensive radiologist preference benchmark, demonstrating improved agreement with expert judgment compared to existing metrics. To facilitate reproducibility and adoption, we publicly release the metric and additionally fine-tune MedGemma [17] to generate CRIMSON predictions, enabling privacy-preserving, fully local deployment for hospitals and institutions without transmitting patient reports to external APIs.

2 Related Work

Radiology Report Evaluation. Early evaluation of radiology report generation relied on general-purpose metrics such as BLEU [16], ROUGE [13], and METEOR [3], which measure lexical overlap and are poorly aligned with clinical correctness. Radiology-specific frameworks such as CheXbert [19] and RadGraph [9] shift evaluation toward label-based or entity-based comparisons; however, they remain constrained by predefined label spaces or extraction pipelines and do not explicitly model variation in clinical severity across findings. BERTScore [25] offers embedding-based semantic similarity but does not capture clinical correctness and lacks interpretability. RadCliQ [23] combines multiple automated metrics to approximate radiologist judgment through a composite score. RaTEScore [29] evaluates reports using entity-aware semantic similarity to better align with clinical content. Both metrics, however, lack explicit finding-level interpretability and do not incorporate clinical context to modulate error severity.

LLM-Based Evaluation Frameworks. More recently, evaluation methods have leveraged large language models (LLMs) to assess factual correctness through error categorization and counting [15, 21, 8, 12, 4, 28, 10, 6]. Compared to label- or graph-based approaches, LLM-based evaluators offer greater flexibility in identifying nuanced discrepancies and can produce structured rationales alongside error counts, improving interpretability and transparency. GREEN [15] and LLM-RadJudge [21] both leverage LLMs to identify and classify errors in candidate reports, producing interpretable evaluations validated against radiologist judgments. However, these and most other LLM-based metrics either treat all errors uniformly or rely on a binary significant versus insignificant distinction, without modeling clinical severity along a fine-grained spectrum or incorporating full patient-level context [15, 21, 12, 28].
FineRadScore [8] introduces error weighting within a line-level correction framework; however, its severity assessment remains implicitly determined by the LLM without explicit clinician-defined guidelines or comprehensive patient-level context. Other LLM-based metrics take alternative approaches: CLEAR [10] transforms reports into structured condition–attribute tables for attribute-level fidelity assessment, while ICARE [6] employs a dual-agent question-answering framework for clinically grounded evaluation. Zhu et al. [30] demonstrate that incorporating professional radiologists' expertise into LLM-based evaluation pipelines can substantially improve alignment with clinical judgment.

3 CRIMSON: Clinically-Grounded Report Evaluation

CRIMSON evaluates a candidate radiology report against a reference report by identifying and categorizing discrepancies at the finding level, weighing each error by its clinical significance, and computing a normalized score that reflects clinical preferences. GPT-5.2 [18] is used as the backbone throughout this pipeline. The framework operates in three stages: (1) finding extraction and clinical significance assignment, (2) error detection and classification, and (3) severity-aware score computation.

3.1 Finding Extraction and Clinical Significance Assignment

Given a reference report R_ref and a candidate report R_cand, CRIMSON first extracts all abnormal findings from each report. Normal findings are excluded from evaluation because including them can introduce spurious variability due to stylistic differences across radiologists [2, 1]. Each extracted finding f is assigned a clinical significance weight w(f) based on standard radiological practice and a structured severity framework adapted from [20] with input from attending cardiothoracic radiologists. The clinical significance weight w(f) is defined as:

\[
w(f) =
\begin{cases}
1.0 & \text{if urgent} \\
0.5 & \text{if actionable, not urgent} \\
0.25 & \text{if not actionable, not urgent} \\
0.0 & \text{if expected/benign}
\end{cases}
\tag{1}
\]

Findings classified as urgent correspond to abnormalities requiring immediate intervention or indicating life-threatening conditions, such as a tension pneumothorax. Actionable non-urgent findings are those that alter patient management but are not immediately critical, including nodules, moderate pleural effusions, or consolidations. Non-actionable findings have minimal clinical impact but remain worth documenting, such as a cervical rib or appropriately positioned support devices. Expected/benign findings are expected or age-appropriate changes with no impact on care, such as degenerative spine changes or a tortuous aorta. We separate non-actionable from expected/benign findings because reporting of expected benign findings varies substantially by radiologist style; penalizing reports based on them introduces unnecessary randomness.

Clinical significance assignment incorporates patient context when available, including age and indication. For example, aortic calcification in a 75-year-old patient is classified as expected/benign, whereas the same finding in a 25-year-old patient may be considered actionable non-urgent due to atypical early onset.
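To make the weighting in Equation (1) concrete, the following minimal Python sketch maps the four significance categories to their weights and illustrates how patient context can shift a category, mirroring the aortic calcification example above. The enum, the helper names, and the age cutoff are illustrative assumptions and not part of the released implementation.

```python
from enum import Enum


class Significance(Enum):
    URGENT = "urgent"
    ACTIONABLE_NON_URGENT = "actionable, not urgent"
    NON_ACTIONABLE = "not actionable, not urgent"
    EXPECTED_BENIGN = "expected/benign"


# Clinical significance weights w(f) from Equation (1).
SIGNIFICANCE_WEIGHTS = {
    Significance.URGENT: 1.0,
    Significance.ACTIONABLE_NON_URGENT: 0.5,
    Significance.NON_ACTIONABLE: 0.25,
    Significance.EXPECTED_BENIGN: 0.0,
}


def weight(level: Significance) -> float:
    """Return the clinical significance weight w(f) for a finding."""
    return SIGNIFICANCE_WEIGHTS[level]


def classify_aortic_calcification(age: int) -> Significance:
    """Hypothetical context-dependent rule: expected/benign in older patients,
    actionable non-urgent when atypically early (the age threshold is a placeholder)."""
    return (Significance.EXPECTED_BENIGN if age >= 65
            else Significance.ACTIONABLE_NON_URGENT)


print(weight(classify_aortic_calcification(75)))  # 0.0 (expected/benign)
print(weight(classify_aortic_calcification(25)))  # 0.5 (actionable, not urgent)
```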
3.2 Error Taxonomy and Classification

CRIMSON characterizes discrepancies through three primary error categories: false findings, missing findings, and attribute errors. False findings are abnormal findings present in R_cand but absent from R_ref, representing hallucinations. Missing findings are abnormal findings present in R_ref but absent from R_cand, representing diagnostically meaningful omissions.

Findings that appear in both reports are considered "matched." For such findings, CRIMSON evaluates attribute-level correctness across eight dimensions: (1) anatomical location or laterality, (2) severity or extent, (3) morphological descriptors, (4) quantitative measurements, (5) certainty level, (6) diagnostic underinterpretation, (7) overinterpretation, and (8) temporal or comparison descriptors. Each attribute error e is assigned a severity-based weight w_attr(e), where w_attr(e) = 0.5 if the error is labeled as significant and w_attr(e) = 0.0 if it is labeled as negligible. Significant attribute errors are those that could alter treatment decisions or patient management, whereas negligible errors correspond to clinically inconsequential differences. For example, incorrect lung laterality is considered significant, whereas positional differences within the same lobe, such as "apical" vs. "lateral," are considered negligible. For pulmonary nodules smaller than 6 mm, measurement discrepancies exceeding 2 mm are considered significant; for nodules 6 mm or larger, discrepancies exceeding 4 mm are considered significant, reflecting established practice [14]. Changes in severity descriptors that affect urgency, such as "small" versus "large," are significant, whereas "small" versus "tiny" is negligible. A single matched finding may contain multiple attribute errors, each evaluated independently.
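As one concrete illustration of these attribute rules, the sketch below encodes the nodule measurement thresholds from the paragraph above (2 mm tolerance for nodules under 6 mm, 4 mm at or above 6 mm) together with the 0.5/0.0 attribute-error weights. Anchoring the threshold on the reference measurement is an assumption made for illustration, not a statement of the released implementation.

```python
def measurement_error_weight(ref_mm: float, cand_mm: float) -> float:
    """Attribute-error weight w_attr for a nodule size discrepancy (Section 3.2).

    Nodules smaller than 6 mm tolerate up to 2 mm of discrepancy; nodules of
    6 mm or larger tolerate up to 4 mm. Significant errors receive weight 0.5,
    negligible ones 0.0.
    """
    tolerance_mm = 2.0 if ref_mm < 6.0 else 4.0
    return 0.5 if abs(ref_mm - cand_mm) > tolerance_mm else 0.0


# A 5 mm nodule reported as 8 mm is a significant error; a 10 mm nodule
# reported as 12 mm falls within the 4 mm tolerance and is negligible.
assert measurement_error_weight(5.0, 8.0) == 0.5
assert measurement_error_weight(10.0, 12.0) == 0.0
```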
3.3 Severity-Aware Scoring

The framework produces a score in the range (−1, 1] that can be easily interpreted in clinical workflows. The scale is grounded at 0, which corresponds to a normal candidate report. This reflects a practical assumption that a radiologist begins from a normal template and modifies the report by adding abnormal findings. A score greater than zero indicates that the candidate report contains more correct findings than errors after severity weighting. A score equal to zero indicates that the report is no more informative than submitting a normal template, except when the reference report is also normal, in which case a correct normal report receives a score of 1. A score less than zero indicates that the report contains more errors than correct findings, implying that a radiologist would likely spend more effort correcting it than editing a normal template. The upper bound of 1 represents a perfect report with no missed findings, no false positives, and no significant attribute errors. Negative scores approach −1 asymptotically because errors are theoretically unbounded: a candidate can always become worse by introducing additional false findings.

For each matched finding m_i, let its clinical significance weight be w_i = w(m_i). Attribute-level penalties are aggregated as E_attr,i = Σ_j w_attr(e_{j,i}), where e_{j,i} refers to attribute error j for finding i, and the total credit C across matched findings is

\[
C = \sum_{i \in \text{matched}} w_i \cdot \frac{w_i}{w_i + E_{\text{attr},i}}.
\]

Let W_ref = Σ_{f ∈ R_ref} w(f) denote the total weighted clinical significance of the reference report, and let E_false = Σ_{f ∈ false} w(f) denote the weighted sum of false positive findings. The raw score is defined as:

\[
S =
\begin{cases}
\dfrac{C - E_{\text{false}}}{W_{\text{ref}}} & \text{if } W_{\text{ref}} > 0, \\[4pt]
-E_{\text{false}} & \text{if } W_{\text{ref}} = 0 \text{ and } E_{\text{false}} > 0, \\[4pt]
1 & \text{if } W_{\text{ref}} = 0 \text{ and } E_{\text{false}} = 0.
\end{cases}
\tag{2}
\]

To bound the negative range while preserving relative ordering, let A = E_false − C, which represents the excess weighted errors relative to correct findings. The final score is:

\[
\text{CRIMSON} =
\begin{cases}
S & \text{if } S \geq 0, \\[4pt]
-\dfrac{A}{1 + A} & \text{if } S < 0.
\end{cases}
\tag{3}
\]
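The scoring logic in Equations (2) and (3) can be summarized in a short function. The sketch below assumes that the extraction and classification stages have already produced significance weights for matched, missing, and false findings and attribute-error weights for matched findings; W_ref is taken here as the sum of weights over matched plus missing reference findings. The data structures are illustrative, not the released implementation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MatchedFinding:
    weight: float  # clinical significance w_i of the matched finding
    attr_error_weights: List[float] = field(default_factory=list)  # w_attr(e_{j,i})


def crimson_score(matched: List[MatchedFinding],
                  missing_weights: List[float],
                  false_weights: List[float]) -> float:
    """Severity-aware score following Equations (2) and (3)."""
    # Partial credit for matched findings: w_i * w_i / (w_i + E_attr,i).
    credit = 0.0
    for m in matched:
        e_attr = sum(m.attr_error_weights)
        if m.weight > 0:
            credit += m.weight * m.weight / (m.weight + e_attr)

    # Reference weight W_ref (matched + missing reference findings) and
    # weighted false findings E_false.
    w_ref = sum(m.weight for m in matched) + sum(missing_weights)
    e_false = sum(false_weights)

    # Raw score S, Equation (2).
    if w_ref > 0:
        s = (credit - e_false) / w_ref
    elif e_false > 0:
        s = -e_false
    else:
        return 1.0  # normal reference matched by a normal candidate

    # Final score, Equation (3): negative scores are squashed into (-1, 0).
    if s >= 0:
        return s
    a = e_false - credit  # excess weighted errors relative to correct findings
    return -a / (1.0 + a)


# Example: one matched urgent finding with a significant attribute error,
# one missed actionable non-urgent finding, and one expected/benign false finding.
print(crimson_score([MatchedFinding(1.0, [0.5])], [0.5], [0.0]))  # ~0.44
```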
Figure 2: RadJudge results across ten categories (false finding penalization, patient context sensitivity, normal finding handling, paraphrase robustness, clinical practicality, location error handling, measurement error sensitivity, diagnostic precision, clinical significance weighting, and partial credit assignment). For each case, metrics are evaluated based on whether their relative ranking of multiple candidate reports agrees with the expected ordering determined with agreement across three attending cardiothoracic radiologists. Each category contains three cases; entries are cases passed (out of 3), with totals out of 30.

4 Results

We perform three complementary forms of validation: correlation with radiologist-annotated clinically significant error counts (Section 4.1), a radiologist-guided pass–fail clinical judgment test on RadJudge (Section 4.2), and large-scale radiologist preference alignment on RadPref (Section 4.3). All metrics except CRIMSON were computed using RadEval [22].

4.1 Correlation with Radiologist-Annotated Significant Errors

We evaluated CRIMSON on 50 cases from ReXVal [24] annotated by six board-certified radiologists. We computed Kendall's τ and Pearson r correlations between automatic metric scores and radiologist-derived clinically significant error counts. As shown in Table 1, CRIMSON demonstrates strong alignment with these expert annotations. Furthermore, error counts (E) exhibited even stronger alignment, and severity-weighted errors (Weighted E) achieved the highest correlations overall, demonstrating that explicitly modeling clinical consequence further improves agreement with expert judgment.

Table 1: Kendall τ and Pearson r correlations (95% CI) between automatic metrics and radiologist-derived clinically significant error counts (n = 50). Columns refer to different candidate reports on ReXVal, each of which was chosen to optimize a specific metric [24]. GREEN E and CRIMSON E denote the total (unweighted) error count, while CRIMSON Weighted E applies clinical severity-based weighting to errors. All other metrics were calculated using RadEval [22]. *CRIMSON results are averaged across 5 runs due to non-deterministic API outputs.

Candidate reports optimized for CheXbert and BERTScore:

| Metric | CheXbert Kendall τ | CheXbert Pearson r | BERTScore Kendall τ | BERTScore Pearson r |
|---|---|---|---|---|
| RadGraph [9] | 0.41 [0.19, 0.61] | 0.59 [0.41, 0.75] | 0.54 [0.36, 0.68] | 0.65 [0.51, 0.78] |
| BLEU [16] | 0.49 [0.30, 0.65] | 0.60 [0.47, 0.72] | 0.36 [0.16, 0.54] | 0.48 [0.32, 0.63] |
| BERTScore [25] | 0.52 [0.35, 0.67] | 0.65 [0.52, 0.78] | 0.49 [0.30, 0.66] | 0.60 [0.44, 0.74] |
| GREEN [15] | 0.62 [0.46, 0.75] | 0.75 [0.64, 0.86] | 0.67 [0.54, 0.78] | 0.70 [0.59, 0.80] |
| ROUGE-L [13] | 0.58 [0.44, 0.71] | 0.71 [0.60, 0.81] | 0.54 [0.37, 0.70] | 0.62 [0.46, 0.75] |
| CheXbert [19] | 0.46 [0.26, 0.63] | 0.45 [0.18, 0.70] | 0.30 [0.08, 0.51] | 0.34 [0.09, 0.60] |
| RaTEScore [29] | 0.39 [0.17, 0.57] | 0.52 [0.31, 0.69] | 0.49 [0.32, 0.65] | 0.56 [0.37, 0.73] |
| RadCliQ-v1 [23] | 0.34 [0.21, 0.46] | 0.34 [0.19, 0.52] | 0.35 [0.21, 0.48] | 0.35 [0.22, 0.53] |
| CRIMSON* | 0.68 [0.54, 0.79] | 0.84 [0.76, 0.90] | 0.71 [0.60, 0.80] | 0.82 [0.74, 0.89] |
| GREEN E | 0.71 [0.59, 0.81] | 0.75 [0.65, 0.86] | 0.75 [0.63, 0.85] | 0.85 [0.75, 0.92] |
| CRIMSON E | 0.73 [0.61, 0.83] | 0.88 [0.79, 0.94] | 0.72 [0.62, 0.80] | 0.86 [0.77, 0.93] |
| CRIMSON Weighted E | 0.78 [0.67, 0.86] | 0.90 [0.85, 0.95] | 0.80 [0.71, 0.87] | 0.91 [0.88, 0.95] |

Candidate reports optimized for RadGraph and BLEU:

| Metric | RadGraph Kendall τ | RadGraph Pearson r | BLEU Kendall τ | BLEU Pearson r |
|---|---|---|---|---|
| RadGraph | 0.59 [0.46, 0.71] | 0.60 [0.50, 0.72] | 0.64 [0.50, 0.75] | 0.75 [0.65, 0.84] |
| BLEU | 0.13 [0.09, 0.34] | 0.23 [0.02, 0.39] | 0.52 [0.34, 0.68] | 0.67 [0.53, 0.79] |
| BERTScore | 0.46 [0.29, 0.61] | 0.54 [0.39, 0.68] | 0.58 [0.41, 0.72] | 0.72 [0.60, 0.83] |
| GREEN | 0.62 [0.48, 0.74] | 0.65 [0.53, 0.77] | 0.70 [0.55, 0.83] | 0.79 [0.68, 0.89] |
| ROUGE-L | 0.54 [0.38, 0.67] | 0.60 [0.49, 0.70] | 0.67 [0.52, 0.79] | 0.80 [0.70, 0.87] |
| CheXbert | 0.29 [0.08, 0.48] | 0.33 [0.08, 0.55] | 0.18 [0.05, 0.40] | 0.23 [0.04, 0.47] |
| RaTEScore | 0.57 [0.42, 0.70] | 0.62 [0.45, 0.78] | 0.54 [0.39, 0.68] | 0.67 [0.54, 0.79] |
| RadCliQ-v1 | 0.12 [0.05, 0.28] | 0.06 [0.11, 0.28] | 0.28 [0.11, 0.43] | 0.16 [0.01, 0.53] |
| CRIMSON* | 0.61 [0.45, 0.75] | 0.71 [0.53, 0.85] | 0.67 [0.54, 0.79] | 0.81 [0.71, 0.89] |
| GREEN E | 0.71 [0.59, 0.81] | 0.75 [0.65, 0.86] | 0.80 [0.71, 0.88] | 0.88 [0.82, 0.93] |
| CRIMSON E | 0.73 [0.61, 0.83] | 0.86 [0.78, 0.92] | 0.74 [0.61, 0.84] | 0.87 [0.78, 0.93] |
| CRIMSON Weighted E | 0.77 [0.67, 0.85] | 0.86 [0.80, 0.93] | 0.78 [0.69, 0.86] | 0.88 [0.82, 0.93] |
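For reference, correlations of the kind reported in Table 1 can be computed with standard tooling once metric scores and per-case radiologist error counts are available. The sketch below uses SciPy with a simple case-level bootstrap for 95% confidence intervals; the resampling scheme, iteration count, and any sign convention (e.g., negating scores so that they increase with error count) are assumptions rather than the authors' exact procedure.

```python
import numpy as np
from scipy.stats import kendalltau, pearsonr


def correlation_with_ci(metric_scores, error_counts, n_boot=1000, seed=0):
    """Kendall tau and Pearson r against radiologist error counts, with
    bootstrap 95% confidence intervals over cases."""
    x = np.asarray(metric_scores, dtype=float)
    y = np.asarray(error_counts, dtype=float)
    tau, _ = kendalltau(x, y)
    r, _ = pearsonr(x, y)

    rng = np.random.default_rng(seed)
    taus, rs = [], []
    for _ in range(n_boot):
        idx = rng.integers(0, len(x), size=len(x))  # resample cases with replacement
        taus.append(kendalltau(x[idx], y[idx])[0])
        rs.append(pearsonr(x[idx], y[idx])[0])
    return ((tau, np.percentile(taus, [2.5, 97.5])),
            (r, np.percentile(rs, [2.5, 97.5])))
```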
4.2 Radiologist-Guided Clinical Judgment Test

We also developed RadJudge, a targeted pass–fail test suite reflecting real-world radiologist intuition. RadJudge comprises 30 curated cases across 10 clinically nuanced categories in which multiple candidate reports are compared and independently reviewed by three cardiothoracic radiologists, with agreement required to establish the reference preference. A metric passes if it ranks reports in accordance with expert judgment or assigns equivalent scores when radiologists deem them clinically indistinguishable, defined as differences within a threshold of 0.01. Three representative cases are shown in Figure 1. The suite probes challenging scenarios, including urgent omissions versus benign hallucinations, context-dependent findings, diagnostic over- and under-interpretation, and situations reflecting the clinical reality of imperfect reference reports that omit localization or age-expected benign findings.

As shown in Figure 2, CRIMSON is the only metric that correctly solves all 30 out of 30 cases, consistently ranking candidate reports in accordance with expert radiologist judgment. In contrast, all prior metrics perform substantially worse, correctly resolving fewer than 35% of cases, highlighting their limited ability to capture nuanced clinical judgment.

4.3 Radiologist Preference Alignment

Correlation with error counts may not fully capture radiologist preference, as experts do not weigh different types of errors equally in overall judgment. To directly assess preference alignment, we introduce RadPref, a benchmark of 100 cases, each containing a reference report from ReXGradient-160K [26] and two candidate reports randomly generated using diverse regimes inspired by [15]: report generation with MedGemma [17], randomly sampled reports, BERT similarity-matched reports, and LLM-based editing, addition, or removal of findings. Each candidate was rated on a 1–5 scale by three cardiothoracic radiologists based on overall clinical quality and correctness relative to the reference report. The scale was defined as: 1 = completely wrong or clinically dangerous; 2 = major errors, with most key findings missing or false; 3 = partially correct, with some significant errors; 4 = mostly accurate, with only minor or negligible errors; and 5 = clinically equivalent to the ground truth. Scores were computed separately for both candidates.

Figure 3: Radiologist Preference Alignment (RadPref). Correlation between metric score and radiologist rating differences across 100 pairwise cases. Each point corresponds to a case comparing two candidate reports for the same reference report.

Figure 3 shows that CRIMSON demonstrates the strongest alignment with radiologist pairwise preferences among all evaluated metrics. Across Kendall's τ_b and Pearson r, CRIMSON consistently outperforms prior approaches and approaches the level of inter-rater radiologist agreement. These findings indicate that CRIMSON more faithfully reflects expert clinical preference in relative report quality.

4.4 MedGemma Fine-tuning and Analysis

To enable privacy-preserving, fully local deployment of CRIMSON, we fine-tuned MedGemma [17] on GPT-5-generated CRIMSON annotations using the full ReXGradient-160K [26] training set of 140,000 report pairs for 10 epochs using LoRA [7]. For each pair, the candidate report was generated using the same regimes described in Section 4.3, except without any image-based generation. GPT-5 generated structured finding-level error labels and severity assignments, which served as supervision to train MedGemma to replicate CRIMSON-style outputs. Additional training details are provided on the model's Hugging Face page; a schematic sketch of this kind of LoRA setup is given below. We compare the fine-tuned MedGemma (MedGemmaCRIMSON) against GPT-5.2 on RadPref preference alignment and severity categorization (Figure 4).
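The sketch below shows one plausible shape for such a LoRA fine-tuning run using the Hugging Face PEFT and TRL libraries. The model identifier, LoRA hyperparameters, batch size, learning rate, and prompt formatting are assumptions for illustration; the actual recipe (beyond ReXGradient-160K pairs, 10 epochs, and LoRA) lives on the released model's Hugging Face page, and the exact API surface depends on the TRL version in use.

```python
from datasets import Dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# Hypothetical supervision format: the prompt carries the reference and candidate
# reports, the completion is the CRIMSON-style structured error/severity annotation.
train_data = Dataset.from_list([
    {"text": "### Reference:\n<reference report>\n"
             "### Candidate:\n<candidate report>\n"
             "### CRIMSON evaluation:\n<structured errors and severities>"},
    # ... one example per report pair
])

peft_config = LoraConfig(              # illustrative LoRA hyperparameters
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

trainer = SFTTrainer(
    model="google/medgemma-4b-it",     # assumed MedGemma checkpoint
    train_dataset=train_data,
    peft_config=peft_config,
    args=SFTConfig(
        num_train_epochs=10,           # matches the 10 epochs reported above
        per_device_train_batch_size=4,
        learning_rate=2e-4,
        output_dir="medgemma-crimson",
    ),
)
trainer.train()
```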
Notably, RadPref provides not only pairwise preference ratings but also structured error categorization and clinical severity annotations across all three cardiothoracic radiologists, enabling direct evaluation of both preference alignment and clinically significant error modeling. MedGemmaCRIMSON achieves comparable mean absolute error to GPT-5.2 across false findings, missing findings, and attribute errors, with similar behavior particularly on attribute-level discrepancies. For clinical significance labeling, MedGemmaCRIMSON closely mirrors GPT-5.2 in reproducing radiologist-assigned categories across all three radiologists, achieving agreement rates that are slightly lower but within a narrow margin of GPT-5.2 (Radiologist 1: 80.3% vs 81.6%; Radiologist 2: 76.7% vs 80.5%; Radiologist 3: 73.5% vs 75.4%), with most disagreements occurring between adjacent severity levels rather than extreme misclassifications.

Figure 4: MedGemmaCRIMSON vs GPT-5.2. A) Mean absolute error across false findings, missing findings, and attribute errors per radiologist. B) Severity categorization confusion matrices between three radiologists and CRIMSON, computed only on matched errors (i.e., findings for which both the radiologist and CRIMSON identified an error in the same category). Titles show the percentage of cases for which the radiologist and CRIMSON agree on error category. Color intensity represents the within-row percentage.

5 Discussion

We introduce CRIMSON, a clinically grounded and severity-aware framework for fine-grained radiology report evaluation that explicitly models patient context, diagnostic consequence, and structured attribute-level errors. By incorporating clinician-defined clinical significance weights and score normalization, CRIMSON aligns automated evaluation more closely with real-world radiologist reasoning than prior approaches. While CRIMSON leverages GPT-5.2 for structured evaluation, we additionally demonstrate that a fine-tuned open-weight model (MedGemmaCRIMSON) can closely approximate its behavior, enabling privacy-preserving, fully local deployment.

A core motivation of CRIMSON is the principle that generated reports should be evaluated according to how they would function under radiologist oversight, rather than solely through aggregate accuracy or raw error counts. Instead of treating all discrepancies equally or as binary (significant vs. not significant), CRIMSON explicitly models whether an error would be clinically consequential or potentially dangerous. Missing a life-threatening abnormality should dominate the evaluation, whereas minor descriptor differences should not.

This principle also motivates CRIMSON's partial-credit design for attribute errors. When a model correctly identifies a clinically important finding but misstates a secondary attribute (e.g., mild severity mismatch or imprecise localization), it may still provide value by directing the radiologist's attention to the relevant abnormality. CRIMSON therefore rewards correct detection while penalizing clinically meaningful attribute mistakes in a severity-aware manner, reflecting that some errors increase downstream review effort without necessarily creating the same patient-safety risk as a complete omission or major hallucination.
Across three complementary validation settings: (1) correlation with radiologist-annotated clinically significant error counts, (2) the RadJudge clinical judgment suite, and (3) the RadPref radiologist preference benchmark, CRIMSON consistently demonstrates stronger agreement with expert judgment than existing metrics. Notably, severity-weighted modeling further improves alignment, highlighting the importance of distinguishing clinically consequential errors from benign ones.

A limitation of this work is that much of CRIMSON's prompting framework, severity rubric, and structured evaluation guidelines were developed specifically for chest X-ray reports. The clinical significance taxonomy, attribute rules, and measurement thresholds were designed in collaboration with cardiothoracic radiologists and tailored to common CXR findings and reporting conventions. Although the underlying evaluation framework is modality-agnostic in principle, applying CRIMSON to other imaging domains will require adaptation of prompts, finding ontologies, and severity criteria to align with modality-specific clinical standards. Future work will extend CRIMSON beyond chest X-ray to additional imaging modalities where anatomical detail, multimodal context, and diagnostic complexity are considerably greater.

Acknowledgments

This research was supported in part by the Harvard Medical School Dean's Innovation Award for Accelerating Foundation Model Research.

References

[1] Baharoon, M., Ma, J., Fang, C., Toma, A., Wang, B.: Exploring the design space of 3D MLLMs for CT report generation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 237–246. Springer (2025)

[2] Baharoon, M., Raissi, S., Jun, J.S., Heintz, T., Alabbad, M., Alburkani, A., Kim, S.E., Kleinschmidt, K., Alhumaydhi, A.O., Alghamdi, M.M.G., et al.: RadGame: An AI-powered platform for radiology education. arXiv preprint arXiv:2509.13270 (2025)

[3] Banerjee, S., Lavie, A.: METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In: Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. pp. 65–72 (2005)

[4] Bannur, S., Bouzid, K., Castro, D.C., Schwaighofer, A., Thieme, A., Bond-Taylor, S., Ilse, M., Pérez-García, F., Salvatelli, V., Sharma, H., et al.: MAIRA-2: Grounded radiology report generation. arXiv preprint arXiv:2406.04449 (2024)

[5] Chaves, J.M.Z., Huang, S.C., Xu, Y., Xu, H., Usuyama, N., Zhang, S., Wang, F., Xie, Y., Khademi, M., Yang, Z., et al.: Towards a clinically accessible radiology foundation model: open-access and lightweight, with automated evaluation. arXiv preprint arXiv:2403.08002 (2024)

[6] Dua, R., Joon, Y., Dogra, S., Freedman, D., Ruan, D., Nashawaty, M., Rigau, D., Alber, D.A., Zhang, K., Cho, K., et al.: Clinically grounded agent-based report evaluation: An interpretable metric for radiology report generation. arXiv preprint arXiv:2508.02808 (2025)

[7] Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al.: LoRA: Low-rank adaptation of large language models. ICLR 1(2), 3 (2022)

[8] Huang, A., Banerjee, O., Wu, K., Reis, E.P., Rajpurkar, P.: FineRadScore: A radiology report line-by-line evaluation technique generating corrections with severity scores. arXiv preprint arXiv:2405.20613 (2024)
[9] Jain, S., Agrawal, A., Saporta, A., Truong, S.Q., Duong, D.N., Bui, T., Chambon, P., Zhang, Y., Lungren, M.P., Ng, A.Y., et al.: RadGraph: Extracting clinical entities and relations from radiology reports. arXiv preprint arXiv:2106.14463 (2021)

[10] Jiang, Y., Chen, C., Wang, S., Li, F., Tang, Z., Mervak, B.M., Chelala, L., Straus, C.M., Chahine, R., Armato III, S.G., et al.: CLEAR: A clinically-grounded tabular framework for radiology report evaluation. arXiv preprint arXiv:2505.16325 (2025)

[11] Li, R., Li, J., Jian, B., Yuan, K., Zhu, Y.: ReEvalMed: Rethinking medical report evaluation by aligning metrics with real-world clinical judgment. In: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. pp. 11823–11837 (2025)

[12] Li, Y., Liu, Y., Wang, Z., Liang, X., Liu, L., Wang, L., Zhou, L.: S-RRG-Bench: Structured radiology report generation with fine-grained evaluation framework. Meta-Radiology, 100171 (2025)

[13] Lin, C.Y.: ROUGE: A package for automatic evaluation of summaries. In: Text Summarization Branches Out. pp. 74–81 (2004)

[14] MacMahon, H., Naidich, D.P., Goo, J.M., Lee, K.S., Leung, A.N., Mayo, J.R., Mehta, A.C., Ohno, Y., Powell, C.A., Prokop, M., et al.: Guidelines for management of incidental pulmonary nodules detected on CT images: from the Fleischner Society 2017. Radiology 284(1), 228–243 (2017)

[15] Ostmeier, S., Xu, J., Chen, Z., Varma, M., Blankemeier, L., Bluethgen, C., Md, A.E.M., Moseley, M., Langlotz, C., Chaudhari, A.S., et al.: GREEN: Generative radiology report evaluation and error notation. In: Findings of the Association for Computational Linguistics: EMNLP 2024. pp. 374–390 (2024)

[16] Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. pp. 311–318 (2002)

[17] Sellergren, A., Kazemzadeh, S., Jaroensri, T., Kiraly, A., Traverse, M., Kohlberger, T., Xu, S., Jamil, F., Hughes, C., Lau, C., et al.: MedGemma technical report. arXiv preprint arXiv:2507.05201 (2025)

[18] Singh, A., Fry, A., Perelman, A., Tart, A., Ganesh, A., El-Kishky, A., McLaughlin, A., Low, A., Ostrow, A., Ananthram, A., et al.: OpenAI GPT-5 system card. arXiv preprint (2025)

[19] Smit, A., Jain, S., Rajpurkar, P., Pareek, A., Ng, A.Y., Lungren, M.P.: CheXbert: combining automatic labelers and expert annotations for accurate radiology report labeling using BERT. arXiv preprint arXiv:2004.09167 (2020)

[20] Tian, K., Hartung, S.J., Li, A.A., Jeong, J., Behzadi, F., Calle-Toro, J., Adithan, S., Pohlen, M., Osayande, D., Rajpurkar, P.: ReFiSco: Report fix and score dataset for radiology report generation. PhysioNet (2023)

[21] Wang, Z., Luo, X., Jiang, X., Li, D., Qiu, L.: LLM-RadJudge: Achieving radiologist-level evaluation for X-ray report generation. arXiv preprint arXiv:2404.00998 (2024)

[22] Xu, J., Zhang, X., Abderezaei, J., Bauml, J., Boodoo, R., Haghighi, F., Ganjizadeh, A., Brattain, E., Van Veen, D., Meng, Z., et al.: RadEval: A framework for radiology text evaluation. In: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. pp. 546–557 (2025)
[23] Yu, F., Endo, M., Krishnan, R., Pan, I., Tsai, A., Reis, E.P., Fonseca, E.K.U.N., Lee, H.M.H., Abad, Z.S.H., Ng, A.Y., et al.: Evaluating progress in automatic chest X-ray radiology report generation. Patterns 4(9) (2023)

[24] Yu, F., Endo, M., Krishnan, R., Pan, I., Tsai, A., Reis, E.P., Fonseca, E., Lee, H., Shakeri, Z., Ng, A., et al.: Radiology report expert evaluation (ReXVal) dataset (2023)

[25] Zhang, T., Kishore, V., Wu, F., Weinberger, K.Q., Artzi, Y.: BERTScore: Evaluating text generation with BERT. arXiv preprint arXiv:1904.09675 (2019)

[26] Zhang, X., Acosta, J.N., Miller, J., Huang, O., Rajpurkar, P.: ReXGradient-160K: A large-scale publicly available dataset of chest radiographs with free-text reports. arXiv preprint arXiv:2505.00228 (2025)

[27] Zhang, X., Acosta, J.N., Yang, X., Adithan, S., Luo, L., Zhou, H.Y., Miller, J., Huang, O., Zhou, Z., Hamamci, I.E., et al.: Automated chest X-ray report generation remains unsolved. In: Biocomputing 2026: Proceedings of the Pacific Symposium. pp. 236–250. World Scientific (2025)

[28] Zhang, Z., Lee, K., Deng, W., Zhou, H., Jin, Z., Huang, J., Gao, Z., Marshall, D.C., Fang, Y., Yang, G.: GEMA-Score: Granular explainable multi-agent score for radiology report evaluation. arXiv preprint arXiv:2503.05347 (2025)

[29] Zhao, W., Wu, C., Zhang, X., Zhang, Y., Wang, Y., Xie, W.: RaTEScore: A metric for radiology report generation. arXiv preprint arXiv:2406.16845 (2024)

[30] Zhu, Q., Chen, X., Jin, Q., Hou, B., Mathai, T.S., Mukherjee, P., Gao, X., Summers, R.M., Lu, Z.: Leveraging professional radiologists' expertise to enhance LLMs' evaluation for radiology reports. arXiv preprint arXiv:2401.16578 (2024)