MediX-R1: Open Ended Medical Reinforcement Learning

Sahal Shaji Mullappilly *1, Mohammed Irfan Kurpath *1, Omair Mohamed 2, Mohamed Zidan 3, Fahad Khan 1, Salman Khan 1, Rao Anwer 1, Hisham Cholakkal 1

1 Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI), 2 Jubilee Mission Medical College and Research Institute, 3 JJM Medical College

[Figure 1: scatter plot of average accuracy (%) against training dataset size (10K to 30M) for MediX-R1, MedGemma, MedMO, HuatuoGPT, BiMediX2, and MedVLM. Marker size indicates parameter scale (~2B, ~8B, ~30B); marker style indicates whether the training data is available. Plotted values: MedGemma 27B 68.4%, MedMO 8B 62.1%, HuatuoGPT-V 7B 55.8%, BiMediX2 8B 55.6%, MedGemma 4B 56.6%, MedVLM-R1 2B 42.7%, MediX-R1 30B 73.6%, MediX-R1 8B 68.8%, MediX-R1 2B 55.4%. In-plot annotation: "Less data, higher accuracy".]

Figure 1. Average accuracy across multimodal medical benchmarks vs. training dataset size for recent medical VLMs. Colors denote model families; marker shape/size indicates parameter scale (∼2B, 8B, 30B). × denotes open-source availability of training data (*as of 25/02/2026). MediX-R1 8B (68.8%) surpasses MedGemma 27B (68.4%) while using significantly less training data, and MediX-R1 30B achieves the highest overall accuracy (73.6%). All training and evaluation resources are available at MediX-R1.

Abstract

We introduce MediX-R1, an open-ended Reinforcement Learning (RL) framework for medical multimodal large language models (MLLMs) that enables clinically grounded, free-form answers beyond multiple-choice formats.
MediX-R1 fine-tunes a baseline vision-language backbone with Group Based RL and a composite reward tailored for medical reasoning: an LLM-based accuracy reward that judges semantic correctness with a strict YES/NO decision, a medical embedding-based semantic reward to capture paraphrases and terminology variants, and lightweight format and modality rewards that enforce interpretable reasoning and modality recognition. This multi-signal design provides stable, informative feedback for open-ended outputs where traditional verifiable or MCQ-only rewards fall short. To measure progress, we propose a unified evaluation framework for both text-only and image+text tasks that uses a Reference-based LLM-as-judge in place of brittle string-overlap metrics, capturing semantic correctness, reasoning, and contextual alignment. Despite using only ∼51K instruction examples, MediX-R1 achieves excellent results across standard medical LLM (text-only) and VLM (image+text) benchmarks, outperforming strong open-source baselines and delivering particularly large gains on open-ended clinical tasks. Our results demonstrate that open-ended RL with comprehensive reward signals and LLM-based evaluation is a practical path toward reliable medical reasoning in multimodal models. Our trained models, curated datasets and source code are available at MediX-R1.

* Equal contribution.

| Model | Diverse Medical Modalities | Single-Stage RL | Interpretable Reasoning | Open-Ended Responses | Annotation-Free Reasoning | Composite RL Reward |
|---|---|---|---|---|---|---|
| MedVLM-R1 (Pan et al., 2025) | ✗ | ✓ | ✓ | ✗ | ✓ | ✗ |
| BiMediX2 (Mullappilly et al., 2024) | ✓ | ✗ | ✗ | ✓ | ✗ | ✗ |
| HuatuoGPT-V (Chen et al., 2024b) | ✓ | ✗ | ✗ | ✓ | ✗ | ✗ |
| MedGemma (Sellergren et al., 2025) | ✓ | ✗ | ✓ | ✓ | ✗ | ✗ |
| MedMO (Deria et al., 2026) | ✓ | ✗ | ✓ | ✓ | ✗ | ✗ |
| MediX-R1 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |

Table 1. Differences with existing Medical VLMs: MediX-R1 integrates diverse modalities, interpretable reasoning, and composite RL rewards, enabling practical clinical use.

1. Introduction

Large medical language and vision-language models are increasingly deployed for clinical question answering, triage support, report drafting, and education (Chen et al., 2024a; Sellergren et al., 2025; Pieri et al., 2024). Many of these tasks are inherently open-ended: clinicians expect concise but free-form answers that can flexibly incorporate context, uncertainty, and multimodal evidence. However, most training and evaluation pipelines remain tailored to Multiple Choice Questions (MCQ) or string-matching regimes, which (i) under-reward valid clinical paraphrases, (ii) fail to measure reasoning quality or modality recognition, and (iii) do not provide reliable signals for reinforcement learning (RL) in open-ended settings. As a result, models trained only with supervised objectives or MCQ-style rewards often struggle to produce faithful, interpretable, and robust clinical responses across diverse modalities.

RL has improved reasoning in domains with verifiable rewards (e.g., math and code), as shown by DeepSeek models (Shao et al., 2024; Guo et al., 2025), but medical tasks rarely admit executable checks. Binary exact match is too brittle for clinical phrasing; BLEU/ROUGE can mis-score semantically correct answers; and free-form VLM outputs complicate visual inference. Moreover, using a single reward signal can induce instability or reward hacking, especially when the signal is noisy or overly permissive. Hence, it is desirable to have a principled approach for training medical MLLMs with open-ended RL that integrates semantic correctness with structural and modality constraints, while remaining data- and compute-efficient.

We present MediX-R1, an open-ended medical RL framework that fine-tunes a baseline multimodal backbone with Group Based RL (GRPO/GSPO/DAPO) using a composite reward tailored for clinical reasoning.
Our design combines: (1) an LLM-based accuracy reward that enforces a strict YES/NO decision on semantic correctness, (2) a medical embedding-based semantic reward that captures paraphrases and terminology variants, (3) a lightweight format reward that elicits interpretable reasoning traces, and (4) a modality recognition reward that discourages cross-modality hallucinations by requiring explicit modality tags. This multi-signal objective stabilizes optimization and supplies informative feedback where traditional verifiable or MCQ-only rewards fall short, enabling single-stage, open-ended RL directly on clinical tasks.

Differences with existing Medical VLMs: Table 1 contrasts MediX-R1 with strong open models across key clinical capabilities. First, on Diverse Medical Modalities, MediX-R1 supports diverse medical modalities including X-Ray, CT, MRI, Microscopy/Histopathology, Ultrasound, Fluoroscopy, Endoscopy, Angiography, Mammography, Clinical Photography, SPECT (Single Photon Emission Computed Tomography), OCT (Optical Coherence Tomography), and Fundus imaging, whereas MedVLM-R1 (Pan et al., 2025) is limited to radiology images. Models like MedGemma (Sellergren et al., 2025), HuatuoGPT-Vision (Chen et al., 2024b), MedMO (Deria et al., 2026), and BiMediX2 (Mullappilly et al., 2024) provide coverage of clinical modalities, but they require extensive multi-stage training. On Single-Stage RL, most baselines rely on multi-stage pipelines (pretraining → SFT → RL), whereas MediX-R1 is trained end-to-end with a single RL stage using our composite reward (Sec. 2.2). This simplifies training and, importantly, enables open-ended RL directly (unlike MedVLM-R1), because the Reference-based LLM-as-judge accuracy signal and medical embeddings provide reliable feedback beyond MCQ exact match. The composite design (format + LLM judge + embeddings + modality recognition) stabilizes optimization and reduces reward hacking (Fig. 5), translating into the best average performance in Table 2. For Interpretable Reasoning, MediX-R1 emits explicit reasoning traces enclosed in ... , enforced by a format reward, making the decision path auditable. Several baselines do not reliably produce structured clinical rationales. While multiple models support Open-Ended Responses, MediX-R1 is explicitly optimized for free-form clinical answering with modality recognition, which curbs cross-modality hallucinations and improves VLM robustness. Finally, MediX-R1 achieves Annotation-Free Reasoning: it does not require human-curated rationales or verified chain-of-thought. The RL rewards operate on the final answer only (via the Reference-based LLM judge and embeddings), significantly lowering data curation cost while still encouraging faithful, interpretable reasoning. Together, these properties explain the consistent gains across both text-only and image+text benchmarks and the practical advantages of MediX-R1 for clinical use.

[Figure 2: architecture diagram. A medical image and the question "Does the patient have brain tumor?" are fed to MediX-R1, which is trained with reinforcement learning (GRPO, DAPO, GSPO) using four rewards: Embedding Reward, LLM Reward, Modality Reward, and Format Reward. Example output. Think: "The image appears to be a T2-weighted MRI scan of the brain. The presence of a mass with a well-defined margin, along with surrounding edema, could suggest the presence of a brain tumor. However, further imaging with contrast and a detailed clinical history would be necessary for a definitive diagnosis." Modality: <MRI_SCAN>. Answer: "Yes, the patient may have a brain tumor." Modality icons shown: MRI, XRAY, CT.]

Figure 2. MediX-R1: Overall Architecture. The MediX-R1 reinforcement learning framework for open-ended medical reasoning. An input of a medical image and a natural language question is processed by MediX-R1. The model's policy is trained using Group Based RL, which leverages a multi-faceted reward signal. This reward is composed of: a) an LLM-based reward for evaluating the overall quality and correctness of the output; b) an embedding-based reward to ensure semantic alignment; c) a format reward to enforce the desired output structure ( and blocks); and d) a modality reward to ensure the response is grounded in the specified imaging modality. This reward-guided approach encourages the model to generate accurate and interpretable reasoning paths.

To measure progress, we introduce a unified, 3-stage Reference-based LLM-as-judge evaluation framework that supports both text-only and image+text tasks under a common protocol. By replacing brittle string-overlap metrics with instruction-tuned judges served via vLLM (Kwon et al., 2023), our evaluation captures semantic correctness, reasoning adequacy, and contextual alignment, and scales from short-form QA to long-form report generation. This reduces the mismatch between evaluation scores and clinical utility. Despite using only ∼51K instruction examples, MediX-R1 achieves strong results across diverse medical benchmarks. We find that composite rewards not only improve accuracy but also mitigate reward hacking and reduce volatility, yielding stable training and interpretable outputs. Compared to open-source medical models (e.g., BiMediX2, MedGemma, HuatuoGPT-V, MedVLM-R1, MedMO), MediX-R1 combines broad modality coverage with single-stage RL and structured reasoning.

Contributions: (i) We introduce open-ended medical reinforcement learning by extending Group based RL with tailored rewards for clinical reasoning.
(ii) We design a composite reward with LLM-based accuracy and medical semantic signals that for the first time enables open-ended responses with RL in the medical domain and stabilizes training. (iii) We propose a three-stage Reference-based LLM-as-judge evaluation framework that unifies benchmarking for both LLM (text-only) and VLM (image+text) tasks in the medical setting. (iv) MediX-R1 achieves excellent LLM and VLM results with a single-stage RL recipe using ∼51K instructions, validated through both Reference-based LLM-as-judge and human expert evaluations. (v) Finally, we demonstrate the effectiveness of the proposed composite reward on Group based RL algorithms, achieving consistent performance gains with GRPO (Shao et al., 2024), DAPO (Yu et al., 2025) and GSPO (Zheng et al., 2025a). Moreover, we have conducted experiments on different baseline VLMs, including Qwen2.5-VL, Qwen3-VL (Team, 2025), and SmolVLM2 (Marafioti et al., 2025), and achieved consistent performance gains across backbones.

2. Open Ended Medical RL

MediX-R1 fine-tunes a baseline multimodal backbone for open-ended medical reasoning using Group Based RL. Given an image $I$ and question $q$, the vision encoder produces visual tokens that are fused with text tokens and fed to the LLM policy $\pi_\theta$. The model generates structured outputs of the form: [modality tag], free-form clinical reasoning, final concise answer.

2.1. Group-based RL with Composite Rewards

Setup: Given an input $v$ (image $I$ + question $q$) drawn from $P(V)$, we sample a group of $G$ candidate completions $\{o_i\}_{i=1}^{G}$ from the frozen behavior policy $\pi_{\theta_{\text{old}}}(\cdot \mid v)$. Each completion receives a scalar reward $r_i$ computed by our composite reward (Sec. 2.2). We then compute a standardized group-relative advantage:

$$
A_i = \frac{r_i - \operatorname{mean}(\{r_j\}_{j=1}^{G})}{\operatorname{std}(\{r_j\}_{j=1}^{G})}.
$$

This removes the need for a learned value function while preserving a stable relative learning signal within each group.

[Figure 3: "Evaluation Framework: three-stage pipeline for assessing medical AI model performance". Stage 1, Generation (batched inference via vLLM): the model under test processes the input, generates structured responses, and the full output is persisted per sample. Example input: "What does the dark blue color on the laser speckle contrast analysis perfusion image represent?"; model output: reasoning followed by "Low blood flow or less perfusion". Stage 2, Evaluation (Reference-based LLM-as-judge): BASE template for QA/MCQ, MIMIC template for reports, binary decisions and rubric scores. Ground truth reference (GT): "Low perfusion"; candidate answer (CA): "Low blood flow or less perfusion"; judge comparison: GT and CA are semantically aligned, CA is clinically appropriate to GT. Stage 3, Scoring (aggregate to dataset metrics): mean accuracy for binary evaluation, average rubric scores, macro averages across benchmarks. Binary decision: CORRECT; final score: 1.0.]

Figure 3. Evaluation Framework. Our three-stage evaluation pipeline: (1) Generation via vLLM inference on the model under test, (2) Evaluation using Reference-based LLM-as-judge with BASE and MIMIC templates, and (3) Scoring through aggregation of judgment outputs. The framework supports both binary decisions for QA/MCQ tasks and rubric-based scoring for long-form reports, ensuring robust evaluation across diverse medical benchmarks.

GRPO objective: GRPO (Shao et al., 2024) updates $\pi_\theta$ using PPO-style clipping on an importance ratio and a KL regularizer to a fixed reference policy $\pi_{\text{ref}}$:

$$
J_{\text{GRPO}}(\theta) = \mathbb{E}_{v,\{o_i\}} \left[ \frac{1}{G} \sum_{i=1}^{G} \min\Big( \rho_i(\theta) A_i,\ \operatorname{clip}\big(\rho_i(\theta), 1-\epsilon, 1+\epsilon\big) A_i \Big) - \beta\, D_{\text{KL}}\big(\pi_\theta \,\|\, \pi_{\text{ref}}\big) \right] \tag{1}
$$

where $\rho_i(\theta) = \frac{\pi_\theta(o_i \mid v)}{\pi_{\theta_{\text{old}}}(o_i \mid v)}$, and $\epsilon, \beta \ge 0$ control clipping and regularization strength.

Composite reward with different optimizers: In addition to GRPO, we run the same composite reward with two recent Group based RL family optimizers, DAPO (Yu et al., 2025) and GSPO (Zheng et al., 2025a), and report their comparison in Table 6.

DAPO (efficiency-focused refinements): DAPO keeps the GRPO/PPO clipped structure but improves token efficiency (as summarized in Yu et al. (2025)): it (i) uses asymmetric clipping ("Clip-Higher") with a larger upper bound to avoid prematurely zeroing gradients for rare-but-good tokens, and (ii) averages the loss over all generated tokens rather than per-sample averaging (reducing gradient dilution for long responses).
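The two DAPO refinements above, together with the group-relative advantage, can be illustrated with a small pure-Python sketch. It is not the authors' training code: the function names, the `eps` stabilizer added to the standard deviation, the use of the population standard deviation, and the clipping values `eps_low=0.2` and `eps_high=0.28` are assumptions chosen for illustration, and the per-token ratio is taken to be the usual ratio of current-policy to behavior-policy token probabilities.

```python
import math
from statistics import fmean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Standardize scalar rewards within one group of G completions.

    Assumption: population standard deviation plus a small `eps` to avoid
    division by zero when all rewards in the group are equal.
    """
    mean = fmean(rewards)
    std = pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def dapo_objective(logp_new, logp_old, advantages, eps_low=0.2, eps_high=0.28):
    """Token-averaged clipped surrogate over one group (to be maximized).

    logp_new / logp_old: one list of per-token log-probabilities per
    completion, under the current and the behavior policy respectively.
    advantages: one scalar per completion, shared by all of its tokens.
    """
    total, n_tokens = 0.0, 0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        for new, old in zip(lp_new, lp_old):
            ratio = math.exp(new - old)  # per-token importance ratio
            # Asymmetric clipping: the upper bound is looser than the lower one.
            clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
            total += min(ratio * adv, clipped * adv)
            n_tokens += 1
    # Normalize by the total token count of the group, not per completion.
    return total / n_tokens
```

With identical log-probabilities under both policies every ratio equals 1, so the objective reduces to the token-weighted mean of the advantages; longer completions contribute more terms, which is the token-level averaging described in (ii).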
A compact form is:

$$
J_{\text{DAPO}}(\theta) = \mathbb{E}_{v,\{o_i\}} \left[ \frac{1}{\sum_{i=1}^{G} |o_i|} \sum_{i=1}^{G} \sum_{t=1}^{|o_i|} \min\Big( r_{i,t}(\theta) A_i,\ \operatorname{clip}\big(r_{i,t}(\theta), 1-\epsilon_{\text{low}}, 1+\epsilon_{\text{high}}\big) A_i \Big) \right] \tag{2}
$$

where r_{i,t}(θ) = π_θ(o_{i,t} | v, o_{i,... [...]

[...] and compare it to the reference answer using a compact Reference-based LLM-as-judge prompt that forces a strict YES/NO decision. Concretely, a local vLLM endpoint (Qwen3-4B) returns YES if the candidate semantically answers the reference, and NO otherwise; we map YES ↦ 1, NO ↦ 0. This captures correctness and robustness to paraphrasing while keeping the signal discrete.

Embedding-Based Reward ($R_{\text{embed}}$): To further encourage semantic alignment, we compute cosine similarity between the predicted answer and the reference using a medical embedding model, MedEmbed-large (Balachandran, 2024). We convert it to a binary reward via a threshold $\tau$ (default 0.8):

$$
R_{\text{embed}} = \mathbb{1}\big[\cos(e_{\text{pred}}, e_{\text{ref}}) \ge \tau\big].
$$

This complements the LLM judge and helps capture terminological variants.

Format Reward ($R_{\text{format}}$): We enforce structured outputs by matching the regex for the exact pattern ... ... after normalizing stray whitespace around angle brackets. Outputs that match receive 1, else 0. This stabilizes training and improves interpretability of the medical reasoning.

Modality Recognition Reward ($R_{\text{modality}}$): We encourage explicit grounding to the imaging modality by requiring the model to emit the predicted modality tag before the block (case-insensitive). We compare it to the reference modality tag and assign 1 on match, 0 otherwise. This reduces cross-modality hallucinations (e.g., describing CT findings on an X-ray).

3. Evaluation Framework

Our evaluation pipeline has three stages: Generation, Evaluation, and Scoring. We evaluate across both text-only (LLM) and image+text (VLM) tasks covering QA, MCQ, and long-form report tasks.

Generation: We run batched inference via vLLM on the model under test and persist the full response per sample.
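The structured output format is used in two places: by the format and modality rewards during training, and when a persisted response is reduced to its final answer for scoring. A minimal parsing sketch follows. The tag names `<think>` and `<answer>`, the modality tag written as an upper-case name such as `<MRI_SCAN>`, and all function names are hypothetical, since the exact pattern is not legible in the text above.

```python
import re

# Hypothetical layout: <MODALITY><think>...</think><answer>...</answer>
PATTERN = re.compile(
    r"^\s*<(?P<modality>[A-Za-z_]+)>\s*"
    r"<think>(?P<think>.*?)</think>\s*"
    r"<answer>(?P<answer>.*?)</answer>\s*$",
    re.DOTALL,
)


def normalize(text):
    """Remove stray whitespace just inside angle brackets, e.g. '< think >'."""
    return re.sub(r"<\s*(/?)\s*([A-Za-z_]+)\s*>", r"<\1\2>", text)


def parse_response(text):
    """Return the modality tag and final answer, or None if the format is violated."""
    match = PATTERN.match(normalize(text))
    if match is None:
        return None
    return {
        "modality": match["modality"].upper(),  # case-insensitive comparison
        "answer": match["answer"].strip(),      # reasoning is discarded here
    }


def format_reward(text):
    """1.0 if the whole output matches the structured pattern, else 0.0."""
    return 1.0 if parse_response(text) is not None else 0.0


def modality_reward(text, reference_modality):
    """1.0 if the emitted modality tag equals the reference tag, else 0.0."""
    parsed = parse_response(text)
    if parsed is None:
        return 0.0
    return 1.0 if parsed["modality"] == reference_modality.upper() else 0.0
```

Under these assumptions, `parse_response(...)["answer"]` is the only part of a response that would be passed on to the judge, while the full raw text can still be stored alongside it.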
For models that emit structured reasoning, we retain the entire output but, for scoring, discard internal chains-of-thought by stripping content up to and including the closing tag, evaluating only the final answer block.

Evaluation: We employ a separate Reference-based LLM-as-judge, Qwen3-14B (Team, 2025), served with vLLM for throughput and stability on modest GPUs. Two prompt families are used: a BASE template (§ A.8) for open-ended, one-word, and MCQ-style questions that yields a binary decision, and a MIMIC template (§ A.9) for long-form report generation that scores along clinical criteria. For example, on a visual question answering item asking "What does the dark blue color on the laser speckle contrast analysis perfusion image represent?" with ground truth "Low perfusion", a model response that includes hidden reasoning and the final answer "Low blood flow or less perfusion" is judged correct and assigned a score of 1. The judge compares predicted answers against references, accounting for paraphrase and clinically equivalent phrasing.

Scoring: We aggregate judgment outputs to dataset-level metrics. For binary evaluations, we report mean accuracy over samples. For long-form tasks, we average the scalar rubric scores across samples, optionally normalizing for comparability. We also compute macro averages across benchmarks.

Why Reference-based LLM-as-judge (via vLLM): Traditional string-overlap metrics (BLEU, ROUGE, F1) for reference (ground truth) comparison often under-reward correct, clinically appropriate paraphrases and cannot assess justification quality or contextual alignment. A Reference-based LLM judge captures semantic correctness, clinical reasoning, and adherence to task-specific criteria through carefully designed prompts, while vLLM serving ensures consistent, fast, and reproducible evaluations.

4. Experiments and Results

4.1. State-of-the-art Comparisons

We evaluate MediX-R1 on a comprehensive suite of medical language and vision-language benchmarks, covering both text-only (LLM) and image+text (VLM) tasks. The evaluation includes standard medical QA, multiple-choice, and open-ended report generation, as well as visual question answering and clinical image interpretation. The datasets used for evaluation are as follows:

LLM (text-only) benchmarks: MMLU-Clinical, MMLU-Bio, MMLU-Med, MMLU-Genetics, MMLU-ProfMed, MMLU-Anat (Hendrycks et al., 2020), MedMCQA (Pal et al., 2022), MedQA (Jin et al., 2021), USMLE-SA (Han et al., 2023), PubMedQA (Jin et al., 2019), MIMIC-CXR-Summarization (Johnson et al., 2016).

VLM (image+text) benchmarks: SLAKE-VQA (Liu et al., 2021), RadVQA (Lau et al., 2018), PathVQA (He et al., 2020), PMC-VQA (Zhang et al., 2024), PMC-VQA-Hard, MIMIC-CXR-Report Generation (Johnson et al., 2019).

For each dataset, we follow the evaluation protocol described in the previous section, using Reference-based LLM-as-judge scoring for both short-form and long-form responses. Table 2 summarizes performance on our unified benchmark suite across several open-source medical models, including MedVLM-R1 (2B), BiMediX2 (8B), HuatuoGPT-V (7B), MedGemma (4B/27B), MedMO (8B) (Deria et al., 2026) and MediX-R1 (2B/8B/30B).

| Benchmarks | MedVLM-R1 2B | BiMediX2 8B | HuatuoGPT-V 7B | MedGemma 4B | MedGemma 27B | MedMO 8B | MediX-R1 2B | MediX-R1 8B | MediX-R1 30B |
|---|---|---|---|---|---|---|---|---|---|
| MMLU-Clinical | 0.540 | 0.732 | 0.721 | 0.708 | 0.879 | 0.864 | 0.660 | 0.845 | 0.894 |
| MMLU-Bio | 0.549 | 0.792 | 0.708 | 0.706 | 0.972 | 0.951 | 0.806 | 0.951 | 0.993 |
| MMLU-Med | 0.451 | 0.694 | 0.653 | 0.605 | 0.866 | 0.827 | 0.699 | 0.879 | 0.890 |
| MMLU-Genetics | 0.560 | 0.790 | 0.710 | 0.820 | 0.940 | 0.900 | 0.680 | 0.900 | 0.980 |
| MMLU-ProfMed | 0.500 | 0.695 | 0.625 | 0.713 | 0.912 | 0.890 | 0.581 | 0.868 | 0.974 |
| MMLU-Anat | 0.519 | 0.659 | 0.600 | 0.556 | 0.793 | 0.785 | 0.563 | 0.763 | 0.874 |
| MedMCQA | 0.408 | 0.572 | 0.511 | 0.570 | 0.727 | 0.662 | 0.492 | 0.683 | 0.781 |
| MedQA | 0.400 | 0.583 | 0.534 | 0.621 | 0.866 | 0.848 | 0.497 | 0.796 | 0.929 |
| USMLE-SA | 0.378 | 0.591 | 0.538 | 0.639 | 0.895 | 0.742 | 0.505 | 0.822 | 0.951 |
| PubMedQA | 0.520 | 0.520 | 0.542 | 0.470 | 0.414 | 0.586 | 0.472 | 0.482 | 0.490 |
| MIMIC-CXR-Sum | 0.704 | 0.672 | 0.707 | 0.692 | 0.767 | 0.709 | 0.786 | 0.746 | 0.765 |
| SLAKE-VQA | 0.434 | 0.468 | 0.545 | 0.678 | 0.634 | 0.479 | 0.654 | 0.703 | 0.683 |
| RadVQA | 0.404 | 0.530 | 0.614 | 0.659 | 0.585 | 0.419 | 0.539 | 0.596 | 0.625 |
| PathVQA | 0.239 | 0.323 | 0.374 | 0.317 | 0.322 | 0.272 | 0.428 | 0.455 | 0.445 |
| PMC-VQA | 0.398 | 0.482 | 0.532 | 0.444 | 0.478 | 0.360 | 0.491 | 0.554 | 0.571 |
| PMC-VQA-Hard | 0.020 | 0.229 | 0.261 | 0.214 | 0.354 | 0.177 | 0.284 | 0.317 | 0.307 |
| MIMIC-CXR-Gen | 0.240 | 0.124 | 0.316 | 0.205 | 0.224 | 0.084 | 0.280 | 0.328 | 0.350 |
| AVG | 0.427 | 0.556 | 0.558 | 0.566 | 0.684 | 0.621 | 0.554 | 0.688 | 0.736 |

Table 2. Evaluation Benchmark. The top section (MMLU-Clinical through MIMIC-CXR-Sum) lists LLM (text-only) tasks and the bottom section (SLAKE-VQA through MIMIC-CXR-Gen) lists VLM (image+text) tasks. Our three-stage evaluation setting evaluates both tasks in a unified framework. MediX-R1 achieves the highest average score across this diverse suite, demonstrating state-of-the-art performance among open models. Best and second best results are bold and underlined.

MediX-R1 achieves the highest average score across all benchmarks, outperforming prior models on both language and vision-language tasks.
Notably , it demonstrates strong gains on open-ended and clinically complex tasks such as MIMIC-CXR summarization and report generation, as well as robust performance on standard QA and VQA datasets. These results highlight the ef fectiv eness of our open-ended RL training and composite rew ard design, which enable MediX-R1 to generate accurate, semantically aligned, and clinically grounded responses beyond the capabilities of models trained only with supervised or MCQ objectiv es. W e additionally report results on the Massiv e Multi- discipline Multimodal Understanding and Reasoning (MMMU) benchmark ( Y ue et al. , 2024 ). W e select the Health & Medical validation subset (MMMU-Med-V al), cov ering Basic Medical Science, Clinical Medicine, Diag- nostics and Laboratory Medicine, Pharmacy , and Public Health. Results are shown in T able 3 . 4.2. Ablation Experiments Composite Reward across RL algorithms : Our RL- method ablation in T able 6 shows that the proposed composite-rew ard training transfers across different RL frame works, consistently improving over the baseline across both LLM and VLM tasks. DAPO ( Y u et al. , 2025 ) achie ves the best overall a verage (0.610), compared to GRPO (0.59), GSPO(0.600), and the baseline Qwen2.5-VL (0.570). Reward Design Ablation: T able 4 compares reward v ari- Model MMMU Medical V alidation MedVLM-R1 2B ( Pan et al. , 2025 ) 39.33 MedGemma 4B ( Sellergren et al. , 2025 ) 50.00 HuatuoGPT -V 7B ( Chen et al. , 2024b ) 47.33 BiMediX2 8B ( Mullappilly et al. , 2024 ) 39.33 Qwen3-VL 8B ( Y ang et al. , 2025 ) 62.66 MedMO 8B ( Deria et al. , 2026 ) 62.66 Qwen3-VL 30B ( Y ang et al. , 2025 ) 68.66 MedGemma 27B ( Sellergren et al. , 2025 ) 56.66 Lingshu 32B ( T eam et al. , 2025 ) 62.66 MediX-R1 2B 44.66 MediX-R1 8B 64.00 MediX-R1 30B 75.33 T able 3. MMMU Medical V al results. Accuracy on the Health & Medical validation subset of MMMU (image-te xt). 
ants that dif fer in which non-format signals are acti ve (all set- tings include the same R format ). The Def ault re ward baseline that uses basic string matching against the ground truth is brittle to paraphrases and clinical synon ymy . Using only the embedding re ward underperforms on text-only e valuations (0.640) and provides gains on VLM (0.409), suggesting that thresholded cosine similarity alone lacks discriminativ e power for nuanced clinical reasoning. Using only the LLM judge impro ves te xt-only accuracy (0.666) b ut does not help VLM (0.400), indicating that the judge alone is insufﬁcient to enforce modality grounding. All rew ard design models are compared with checkpoints before rew ard hacking. Combining LLM + embedding increases robustness to para- phrase and terminology v ariants, improving te xt-only scores (0.686) and yielding a small VLM lift (0.410), which raises the overall av erage to 0.589. Finally , the full MediX-R1 composite (LLM accurac y + embedding semantics + modal- 6 MediX-R1: Open Ended Medical Reinforcement Lear ning T h e ima g e a p p e a rs to be a s e ri e s of mi c r o s c o p i c s e c tio ns , l ike l y of a de v e l o p in g br a in , g i v e n th e p r e s e n c e of l a b e l e d s tru c t u re s s u c h as th e o p ti c n e r v e ( op . n ) a n d o th e r br a in r e g i o n s . S e c tio n G s h o w s a c r o s s - s e c ti o n of th e br a in w ith th e o p ti c tr a c t ( ot ) a n d th e s u p e ri o r th a l a mi c n u c l e u s ( s t m ) v isibl e . T h e re d s t a in in g l ik e l y repr e s e n ts a s p e c if i c ma rke r or p r o t e in , w h ile th e blu e s t a in in g c o u l d be a n u c l e a r c o u n te rst a in or a dif f e r e nt ma rk e r . 
T h e p r e s e n c e of th e s e s tru c t u re s s u g g e s ts th a t th is is a h ist o l o g i c a l s e c ti o n , p o s s ibly s t a in e d w ith immu n o flu o re s c e n c e to h ig h l ig h t s p e c if ic c e l l u l a r c o mp o n e n ts or p ro te in s . T h e a re a s h o w n in s e c tio n G is th e o p tic tra c t . W h ic h a re a is s h o w n in s e c tio n G o f th e ima g e ? mo d a l i ty MIC RO S C O P Y thi n k a n sw er T h e im a g e p rovide d is a p o s t e roant e ri o r (P A ) c h e s t X - r a y . In a PA v i e w , th e h e a rt is ty p i c a l l y s m a l l e r c o m p a r e d to an a n t e r o p o s t e ri o r ( A P) v i e w be c a u s e th e h e a rt is mo re dire c t l y in th e X - ra y b e a m, w h ic h c a n mak e it a p p ea r l a r g e r . H o w e ve r, in th is ima g e , th e h e a r t s ize a p p e a rs to be w ith in n o r ma l l imits for a PA v ie w . T h e h e a rt s iz e in a PA v i e w is g e n e r a l l y s m a l l e r th a n in an AP v i e w , bu t th e e x a c t c o mp a ri s o n w o ul d re q u ir e a s ide - by - s ide c o mp a ri s o n w ith an AP v ie w fo r a de fi n itiv e a s s e s s me n t . T h e h e a rt s ize in th is ima g e is s ma l l e r th a n if th e ima g e was ta ke n AP Is t h e h e a rt s ize in th is ima g e s ma l l e r o r l a rg e r th a n if th e ima g e w a s ta ke n A P? mo d a l i ty X_ RA Y thi n k a n sw er F igure 4. Qualitative examples of MediX-R1 . (T op, Microscopy) Correctly identiﬁes the optic tract in section G with interpretable reasoning. (Bottom, X-ray) Explains why heart size appears smaller in P A vs. AP view . MediX-R1 generates clinically grounded, open-ended answers across modalities. ity recognition, with shared format control) produces the strongest image+text performance (0.431) while matching the best text-only result (0.687), achie ving the best ov erall av erage (0.597). T ogether with Fig. 
5, which shows reduced volatility and fewer signs of reward hacking, these results suggest that the composite reward both improves aggregate performance and stabilizes optimization. Key takeaways: (i) The LLM judge is the strongest single signal for text correctness (0.666 on LLM tasks, versus 0.660 for default string matching and 0.640 for embedding-only), and embeddings complement it by reducing false negatives from paraphrases. (ii) Default string matching degrades substantially on open-ended multimodal evaluations (0.382 on VLM tasks, the lowest of all reward designs). (iii) Modality recognition is important for VLM tasks and drives the gain on image+text tasks (0.410 to 0.431 when added to LLM + embedding); the full composite delivers the best overall results.

Evaluations    Default Reward   Embedding Reward   LLM Reward   LLM + Embedding   MediX-R1
LLM Tasks      0.660            0.640              0.666        0.686             0.687
VLM Tasks      0.382            0.409              0.400        0.410             0.431
Overall AVG    0.562            0.558              0.572        0.589             0.597

Table 4. Reward ablation across validation benchmarks. Single signals such as default string matching, embedding-only, or LLM-only are weaker. Combining LLM + embedding improves robustness, and the MediX-R1 composite (LLM accuracy + embedding-based semantics + modality recognition) yields the best overall average.

Performance across VLM backbones: We observe consistent improvements across VLM backbones after training them in the MediX-R1 framework with our composite rewards, as shown in Table 5. These results show that MediX-R1 enhances open-ended medical reasoning ability across backbone models.

Model                                      Baseline   + Composite Rewards
SmolVLM2-2.2B (Marafioti et al., 2025)     0.410      0.432
Qwen3-VL-2B (Yang et al., 2025)            0.529      0.554
Qwen3-VL-8B (Yang et al., 2025)            0.666      0.688
Qwen3-VL-30B (Yang et al., 2025)           0.698      0.736

Table 5. Baseline comparison across backbones (overall AVG). "Baseline" is the original backbone; "+ Composite Rewards" applies our RL with composite rewards on the same backbone.

Evaluations                       Baseline   GRPO    GSPO    DAPO
LLM Evaluations (text only)       0.675      0.687   0.689   0.701
VLM Evaluations (image + text)    0.376      0.431   0.439   0.445
Overall AVG                       0.570      0.597   0.600   0.610

Table 6. RL method ablation across benchmarks. Using the same composite reward, different RL algorithms (GRPO/GSPO/DAPO) consistently outperform the baseline across both LLM and VLM tasks, with DAPO achieving the highest overall average.

Judge Robustness and Evaluator Sensitivity. To ensure robustness, we perform controlled evaluations using deterministic generation settings (temperature=0, top_p=1) and report averages over three runs, observing only $\pm 0.002$ variation. For additional validation, we replaced Qwen3-14B with GPT-5.1 and GPT-5 mini as evaluators, which resulted in a deviation of only $\pm 0.005$, indicating high consistency across judge models.

4.3. Reward Hacking and Mitigation

In reinforcement learning, reward hacking occurs when a model maximises its reward in unintended ways, often bypassing the true objective. It arises when the policy exploits imperfections in a single reward signal to earn high scores without producing clinically correct answers. We observed two concrete modes (examples abbreviated):

Embedding model exploit. When using embedding models like MedEmbed-large (Balachandran, 2024), short or non-semantic tokens can spuriously yield high cosine similarity. For instance, a candidate that outputs "-" for "What does the white arrow point to in image B?" received $R_{\text{embed}} = 1.0$ against the ground truth "Renal artery," despite being incorrect.

LLM judge exploit. When using LLMs like Qwen3-4B (Team, 2025) as a rewarder, template-like placeholders can confuse the judge when the reference is provided for comparison. For example, "The largest organ in the picture is [insert your answer here based on the medical reasoning provided above]." was judged correct ($R_{\text{llm}} = 1.0$) against the reference "Lung."

[Figure 5 plot; legend: MediX-R1, LLM Reward, LLM + Embedding, Embedding Reward.]
Figure 5. Overall validation reward vs. training step across reward designs. Training with individual signals (LLM-only or embedding-only) shows volatility and reward hacking, while LLM + embedding reduces but does not eliminate instability. MediX-R1 uses a composite reward, which stabilizes learning and delivers the highest final reward and best overall performance.

Mitigation in MediX-R1. To curb these failures, MediX-R1 employs a composite reward and input/output constraints: (i) Composite objective: $R_{\text{llm}} + R_{\text{embed}} + R_{\text{modality}}$ (with shared $R_{\text{format}}$) reduces reliance on any single brittle signal and penalizes mismatches in content or modality recognition (Table 4). (ii) Embedding gating: set $R_{\text{embed}} = 0$ for answers below a minimum character/word length or with a high punctuation or non-alphanumeric ratio; strip punctuation before embedding; calibrate the similarity threshold. (iii) Modality recognition: $R_{\text{modality}}$ requires a correct modality tag, curbing visually ungrounded shortcuts that might still fool text-only rewards. (iv) Structural control and regularization: $R_{\text{format}}$ enforces parseable outputs; the group-relative advantage and a KL penalty to the reference policy reduce collapse to degenerate hacks by discouraging outlier behaviors. (v) Reward coefficient selection (ablation-driven): To make the reward design transparent, we selected coefficients via a staged procedure rather than an exhaustive hyperparameter search. Concretely, we fixed $w_{\text{fmt}} = 0.10$ in all experiments and allocated the remaining $0.90$ mass to the task-facing signals. This yields the ablation settings in Table 4, e.g., embedding-only $r = 0.1\,R_{\text{format}} + 0.9\,R_{\text{embed}}$ and LLM-only $r = 0.1\,R_{\text{format}} + 0.9\,R_{\text{llm}}$.
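The embedding gating in item (ii) can be sketched as follows. This is a minimal illustration under stated assumptions, not the released implementation: the function names `gate_embedding_answer` and `gated_embed_reward` are hypothetical, the thresholds `min_chars`, `min_words`, and `max_non_alnum_ratio` are placeholder values, and the similarity model is passed in as a callable so that the snippet runs without any embedding library. The 0.8 similarity threshold mirrors the default shown in Appendix A.4.

```python
import string
from typing import Callable, Optional


def gate_embedding_answer(
    answer: str,
    min_chars: int = 2,                 # hypothetical threshold
    min_words: int = 1,                 # hypothetical threshold
    max_non_alnum_ratio: float = 0.5,   # hypothetical threshold
) -> Optional[str]:
    """Return a cleaned answer to embed, or None if R_embed should be 0."""
    text = answer.strip()
    if not text:
        return None
    # Share of characters that are neither alphanumeric nor whitespace.
    non_alnum = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
    if non_alnum / len(text) > max_non_alnum_ratio:
        return None
    # Strip ASCII punctuation before embedding and collapse whitespace.
    cleaned = text.translate(str.maketrans("", "", string.punctuation))
    cleaned = " ".join(cleaned.split())
    if len(cleaned) < min_chars or len(cleaned.split()) < min_words:
        return None
    return cleaned


def gated_embed_reward(
    answer: str,
    reference: str,
    similarity_fn: Callable[[str, str], float],
    threshold: float = 0.8,
) -> float:
    """Binary embedding reward that is forced to 0 for gated answers."""
    cleaned = gate_embedding_answer(answer)
    if cleaned is None:
        return 0.0
    return float(similarity_fn(cleaned, reference) >= threshold)


# Stand-in similarity that always reports a perfect match, mimicking the exploit.
always_similar = lambda a, b: 1.0
print(gated_embed_reward("-", "Renal artery", always_similar))  # 0.0
```

With the gate in place, the "-" candidate from the exploit above receives zero embedding reward even if the similarity model would have scored it highly.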
For the combined semantic reward, we evaluated a small set of intuitive splits between $R_{\text{llm}}$ and $R_{\text{embed}}$ while keeping $w_{\text{fmt}}$ fixed; performance was similar across these choices on our validation benchmark, and we selected the configuration that slightly favored the LLM judge. Finally, when adding modality grounding, we reserved 5% of the non-format budget for $R_{\text{modality}}$ and renormalized the remaining non-format weights. (See Appendix §A.3 for full coefficient-selection details, including all $R_{\text{llm}}$ / $R_{\text{embed}}$ splits.)

Together, these measures mitigate reward hacking and stabilize training, leading to smoother reward trajectories and higher final performance (see Fig. 5).

4.4. Human Expert Evaluation

To assess the clinical quality of model outputs, we conducted a human expert evaluation using a blind review setup (see Evaluation Protocol in §A.5). For a randomly selected subset of questions from our evaluation benchmark, responses were generated by four models: MediX-R1, Llama3.2-Vision, MedGemma, and HuatuoGPT-Vision. The outputs were anonymized and labeled as Model A, Model B, Model C, and Model D, with no identifiers provided to the reviewers. Medical experts were asked to evaluate the responses against the provided ground truth descriptions for each question. The evaluation focused on determining which model produced the most accurate, clinically relevant response along with interpretable reasoning traces.

The results demonstrate a strong preference for MediX-R1, which was selected as the best response in 72.7% of the cases. In comparison, Llama3.2-Vision was preferred in 13.6% of the cases, MedGemma in 9.2%, and HuatuoGPT-Vision in 4.5%. Additional details on the human expert evaluation are available in §A.5 and §A.6.

4.5. Evaluation on Real World Clinical Data

To further assess the generalization ability of our model, we conducted an additional evaluation on MedPix 2.0 (Siragusa et al., 2025), a publicly available real-world clinical VQA dataset derived from the original MedPix (Henigman & Kennedy, 2025) database maintained by the U.S. National Library of Medicine (NLM), part of the NIH. MedPix comprises over 12,000 anonymized, crowdsourced clinical cases containing medical images and corresponding textual information such as findings, diagnoses, and treatments. Because the data are public and anonymized, the evaluation is reproducible and complies with NIH privacy standards.

On MedPix 2.0, MediX-R1 outperforms the other medical vision-language models we compared. Specifically, MediX-R1 achieves a score of 51.11%, surpassing strong baselines and previous SOTA medical models, as shown in Table 7. These results further support the robustness and adaptability of MediX-R1 on diverse real-world clinical data and its capability to generalize beyond controlled experimental environments.

Model              Score (%)
MedVLM-R1          27.57
MedGemma           43.18
LLaVA-Med          44.29
BiMediX2           46.51
HuatuoGPT          48.81
MediX-R1 (Ours)    51.11

Table 7. Performance comparison on the MedPix 2.0 dataset.

4.6. Qualitative Examples

Fig. 4 illustrates how MediX-R1's structured outputs and composite reward translate into clinically grounded behavior across modalities.

Microscopy (top). Given a multi-panel histological image and the question "Which area is shown in section G of the image?," the model (i) correctly emits the modality tag (MICROSCOPY), (ii) provides interpretable reasoning inside <think> that references recognizable neuroanatomical markers (e.g., optic tract "ot," superior thalamic nucleus "stm"), stain patterns, and panel context, and (iii) produces a concise final answer: "the optic tract." The modality recognition and format rewards ensure the answer is localized to the requested panel and presented cleanly in the <answer> block, while the LLM and embedding rewards bias the policy toward semantically correct identification despite diverse phrasing in the reasoning.

X-ray (bottom). For "Is the heart size in this image smaller or larger than if the image was taken AP?," the model tags the modality as X_RAY and reasons about projection geometry: PA views reduce cardiac magnification relative to AP due to a shorter heart-to-detector distance and standard source-to-image distance. The model explains this in <think> and answers "smaller" in <answer>. This example shows the model using domain knowledge rather than superficial pattern matching, with the final answer isolated for scoring (the judge ignores <think> during evaluation).

5. Conclusion

We presented MediX-R1, an open-ended reinforcement learning framework for medical multimodal reasoning that trains a baseline VLM with Group Based RL using a composite reward. By coupling an LLM-judge accuracy signal with medical embedding-based semantic alignment, lightweight format control, and modality recognition, MediX-R1 learns to produce concise, clinically faithful answers with interpretable reasoning traces. A unified vLLM-based evaluation pipeline enables consistent, paraphrase-robust scoring across both text-only and image+text tasks. Empirically, MediX-R1 achieves strong results across diverse medical benchmarks and shows improved stability and resistance to reward hacking compared to single-signal RL variants. Human expert preference studies further corroborate its clinical answer quality, while qualitative examples illustrate faithful grounding and interpretable reasoning traces. Reward ablations validate that the multi-signal design enhances stability and semantic alignment beyond single-signal configurations.
Altogether, the framework demonstrates that carefully composed, structure-aware rewards plus standardized LLM-judge evaluation provide a practical path to scalable and interpretable medical multimodal RL fine-tuning.

Impact Statement

This work advances methods for open-ended RL of medical MLLMs, with the goal of improving semantic correctness under paraphrase, format compliance, and modality grounding in medical question answering and report-style tasks. If used appropriately (e.g., as a research and education aid), the proposed composite-reward RL and unified Reference-based LLM-as-judge evaluation framework may reduce reliance on brittle string-match metrics, enable more realistic benchmarking of free-form clinical responses, and improve transparency through structured outputs (modality tags and separated reasoning/answer blocks).

At the same time, the societal and ethical risks are non-trivial. MediX-R1 is a research prototype and is not intended for clinical or commercial deployment. Like other generative models, it may hallucinate findings, omit key differentials, or overstate certainty; the Reference-based LLM-as-judge reward and evaluation could also reinforce subtle biases or false positives. Misuse risks include self-diagnosis, generation of misleading medical narratives, and adversarial prompting with malicious intent. While we used only publicly available, de-identified datasets under their respective licenses and did not conduct a prospective human-subjects study, residual privacy risks can remain; downstream users should perform auditing prior to redistribution or deployment. We also highlight risks of demographic and dataset bias and potential amplification of health disparities; future work should include fairness analyses where legally and ethically permissible, uncertainty calibration, bias-aware reward shaping, and clinician-in-the-loop evaluation.
To support responsible use and scrutiny, we commit to releasing training/inference code, configurations, checkpoints, curated datasets, and RL/evaluation prompt templates, alongside a model card and clear usage restrictions, under a CC-BY-NC-SA 4.0 license. We also disclose our use of generative AI tools: assisted coding was limited to boilerplate scaffolding and minor refactors, with all algorithmic logic authored and reviewed manually; writing-support models were used for grammar and style, while all technical claims and results were verified by the authors. These steps aim to ensure transparency, auditability, and reliable reproduction of the published results.

Acknowledgments

This work was partially supported by the NVIDIA Academic Grant 2025 and the MBZUAI-IITD Research Collaboration Seed Grant.

References

Balachandran, A. Medembed: Medical-focused embedding models, 2024. URL https://github.com/abhinand5/MedEmbed.

Chen, J., Cai, Z., Ji, K., Wang, X., Liu, W., Wang, R., Hou, J., and Wang, B. Huatuogpt-o1, towards medical complex reasoning with llms, 2024a. URL https://arxiv.org/abs/2412.18925.

Chen, J., Gui, C., Ouyang, R., Gao, A., Chen, S., Chen, G. H., Wang, X., Zhang, R., Cai, Z., Ji, K., Yu, G., Wan, X., and Wang, B. Huatuogpt-vision, towards injecting medical visual knowledge into multimodal llms at scale, 2024b. URL 2406.19280.

Deria, A., Kumar, K., Dukre, A. M., Segal, E., Khan, S., and Razzak, I. Medmo: Grounding and understanding multimodal large language model for medical images, 2026.

Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.

Han, T., Adams, L. C., Papaioannou, J.-M., Grundmann, P., Oberhauser, T., Löser, A., Truhn, D., and Bressem, K. K. Medalpaca–an open-source collection of medical conversational ai models and training data. arXiv preprint arXiv:2304.08247, 2023.

He, X., Zhang, Y., Mou, L., Xing, E., and Xie, P. Pathvqa: 30000+ questions for medical visual question answering, 2020. URL 10286.

Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.

Henigman, A. and Kennedy, B. Medpix®: database of medical images, teaching cases, and clinical topics. Medical Reference Services Quarterly, 44(3):328–333, 2025.

Jin, D., Pan, E., Oufattole, N., Weng, W.-H., Fang, H., and Szolovits, P. What disease does this patient have? a large-scale open domain question answering dataset from medical exams. Applied Sciences, 11(14):6421, 2021.

Jin, Q., Dhingra, B., Liu, Z., Cohen, W. W., and Lu, X. Pubmedqa: A dataset for biomedical research question answering. arXiv preprint arXiv:1909.06146, 2019.

Johnson, A. E., Pollard, T. J., Shen, L., Lehman, L.-w. H., Feng, M., Ghassemi, M., Moody, B., Szolovits, P., Anthony Celi, L., and Mark, R. G. Mimic-iii, a freely accessible critical care database. Scientific data, 3(1):1–9, 2016.

Johnson, A. E., Pollard, T. J., Berkowitz, S. J., Greenbaum, N. R., Lungren, M. P., Deng, C.-y., Mark, R. G., and Horng, S. Mimic-cxr, a de-identified publicly available database of chest radiographs with free-text reports. Scientific data, 6(1):317, 2019.

Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., Gonzalez, J. E., Zhang, H., and Stoica, I. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023.

Lau, J. J., Gayen, S., Ben Abacha, A., and Demner-Fushman, D. A dataset of clinically generated visual questions and answers about radiology images. Scientific data, 5(1):1–10, 2018.
Liu, B., Zhan, L.-M., Xu, L., Ma, L., Yang, Y., and Wu, X.-M. Slake: A semantically-labeled knowledge-enhanced dataset for medical visual question answering. In 2021 IEEE 18th international symposium on biomedical imaging (ISBI), pp. 1650–1654. IEEE, 2021.

Marafioti, A., Zohar, O., Farré, M., Noyan, M., Bakouch, E., Cuenca, P., Zakka, C., Allal, L. B., Lozhkov, A., Tazi, N., Srivastav, V., Lochner, J., Larcher, H., Morlon, M., Tunstall, L., von Werra, L., and Wolf, T. Smolvlm: Redefining small and efficient multimodal models, 2025.

Mullappilly, S. S., Kurpath, M. I., Pieri, S., Alseiari, S. Y., Cholakkal, S., Aldahmani, K., Khan, F., Anwer, R., Khan, S., Baldwin, T., and Findings), H. C. E. Bimedix2: Bio-medical expert lmm for diverse medical modalities, 2024.

Pal, A., Umapathi, L. K., and Sankarasubbu, M. Medmcqa: A large-scale multi-subject multi-choice dataset for medical domain question answering. In Conference on Health, Inference, and Learning, pp. 248–260. PMLR, 2022.

Pan, J., Liu, C., Wu, J., Liu, F., Zhu, J., Li, H. B., Chen, C., Ouyang, C., and Rueckert, D. Medvlm-r1: Incentivizing medical reasoning capability of vision-language models (vlms) via reinforcement learning. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 337–347. Springer, 2025.

Pieri, S., Mullappilly, S. S., Khan, F. S., Anwer, R. M., Khan, S., Baldwin, T., and Cholakkal, H. BiMediX: Bilingual medical mixture of experts LLM. In Al-Onaizan, Y., Bansal, M., and Chen, Y.-N. (eds.), Findings of the Association for Computational Linguistics: EMNLP 2024, pp. 16984–17002, Miami, Florida, USA, November 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.findings-emnlp.989. URL https://aclanthology.org/2024.findings-emnlp.989/.
Sellergren, A., Kazemzadeh, S., Jaroensri, T., Kiraly, A., Traverse, M., Kohlberger, T., Xu, S., Jamil, F., Hughes, C., Lau, C., et al. Medgemma technical report. arXiv preprint arXiv:2507.05201, 2025.

Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.

Siragusa, I., Contino, S., Ciura, M. L., Alicata, R., and Pirrone, R. Medpix 2.0: A comprehensive multimodal biomedical data set for advanced ai applications with retrieval augmented generation and knowledge graphs. Data Science and Engineering, pp. 1–17, 2025.

Team, L., Xu, W., Chan, H. P., Li, L., Aljunied, M., Yuan, R., Wang, J., Xiao, C., Chen, G., Liu, C., Li, Z., Sun, Y., Shen, J., Wang, C., Tan, J., Zhao, D., Xu, T., Zhang, H., and Rong, Y. Lingshu: A generalist foundation model for unified multimodal medical understanding and reasoning, 2025. URL 07044.

Team, Q. Qwen3 technical report, 2025. URL https://arxiv.org/abs/2505.09388.

Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., Zheng, C., Liu, D., Zhou, F., Huang, F., Hu, F., Ge, H., Wei, H., Lin, H., Tang, J., Yang, J., Tu, J., Zhang, J., Yang, J., Yang, J., Zhou, J., Zhou, J., Lin, J., Dang, K., Bao, K., Yang, K., Yu, L., Deng, L., Li, M., Xue, M., Li, M., Zhang, P., Wang, P., Zhu, Q., Men, R., Gao, R., Liu, S., Luo, S., Li, T., Tang, T., Yin, W., Ren, X., Wang, X., Zhang, X., Ren, X., Fan, Y., Su, Y., Zhang, Y., Zhang, Y., Wan, Y., Liu, Y., Wang, Z., Cui, Z., Zhang, Z., Zhou, Z., and Qiu, Z. Qwen3 technical report, 2025. URL https://arxiv.org/abs/2505.09388.
Yu, Q., Zhang, Z., Zhu, R., Yuan, Y., Zuo, X., Yue, Y., Dai, W., Fan, T., Liu, G., Liu, L., Liu, X., Lin, H., Lin, Z., Ma, B., Sheng, G., Tong, Y., Zhang, C., Zhang, M., Zhang, W., Zhu, H., Zhu, J., Chen, J., Chen, J., Wang, C., Yu, H., Song, Y., Wei, X., Zhou, H., Liu, J., Ma, W.-Y., Zhang, Y.-Q., Yan, L., Qiao, M., Wu, Y., and Wang, M. Dapo: An open-source llm reinforcement learning system at scale, 2025. URL abs/2503.14476.

Yue, X., Ni, Y., Zhang, K., Zheng, T., Liu, R., Zhang, G., Stevens, S., Jiang, D., Ren, W., Sun, Y., Wei, C., Yu, B., Yuan, R., Sun, R., Yin, M., Zheng, B., Yang, Z., Liu, Y., Huang, W., Sun, H., Su, Y., and Chen, W. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi, 2024. URL https://arxiv.org/abs/2311.16502.

Zhang, X., Wu, C., Zhao, Z., Lin, W., Zhang, Y., Wang, Y., and Xie, W. Pmc-vqa: Visual instruction tuning for medical visual question answering, 2024. URL https://arxiv.org/abs/2305.10415.

Zheng, C., Liu, S., Li, M., Chen, X.-H., Yu, B., Gao, C., Dang, K., Liu, Y., Men, R., Yang, A., Zhou, J., and Lin, J. Group sequence policy optimization, 2025a. URL https://arxiv.org/abs/2507.18071.

Zheng, Y., Lu, J., Wang, S., Feng, Z., Kuang, D., and Xiong, Y. Easyr1: An efficient, scalable, multi-modality rl training framework. https://github.com/hiyouga/EasyR1, 2025b.

A. Appendix

A.1. Training Data and Modality Distribution

We trained MediX-R1 on 51335 multimodal medical instruction samples spanning 16 modality tags. All samples were drawn from the official train splits of the source datasets: PMC-VQA subset (Zhang et al., 2024), SLAKE (Liu et al., 2021), RadVQA (Lau et al., 2018), and PathVQA (He et al., 2020).
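The counts reported in Table 8 can be checked for internal consistency with a short script. This is only an illustrative sketch: the numbers are transcribed from Table 8, the dictionary keys follow its modality and dataset names (written here with underscores), and the variable names are hypothetical.

```python
# Per-modality sample counts, transcribed from Table 8.
MODALITY_COUNTS = {
    "X_RAY": 5964, "MICROSCOPY": 16399, "CLINICAL_PHOTOGRAPHY": 8979,
    "CT_SCAN": 7646, "GRAPHICS": 2205, "ANGIOGRAPHY": 522, "PET_SCAN": 406,
    "ULTRASOUND": 1227, "MRI_SCAN": 6224, "FUNDUS_PHOTOGRAPHY": 314,
    "OCT_SCAN": 236, "ENDOSCOPY": 611, "MAMMOGRAPHY": 106,
    "FLUOROSCOPY": 321, "OTHER": 64, "SPECT": 111,
}

# Per-dataset sample counts, transcribed from Table 8.
DATASET_COUNTS = {
    "PMC_VQA_SUBSET": 25000, "SLAKE": 4919, "RAD_VQA": 1793, "PATH": 19623,
}

TOTAL = 51335  # total stated in the text

# Both breakdowns should add up to the stated total.
assert sum(MODALITY_COUNTS.values()) == TOTAL
assert sum(DATASET_COUNTS.values()) == TOTAL

# Modalities sorted by sample count, largest first.
top_modalities = sorted(MODALITY_COUNTS.items(), key=lambda kv: kv[1], reverse=True)
for name, count in top_modalities[:3]:
    print(f"{name}: {count} samples ({count / TOTAL:.1%} of training data)")
```

Both breakdowns sum to 51335, and the sorted view makes the skew toward microscopy, clinical photography, and CT visible at a glance.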
Medical Modality         Samples
X_RAY                    5964
MICROSCOPY               16399
CLINICAL_PHOTOGRAPHY     8979
CT_SCAN                  7646
GRAPHICS                 2205
ANGIOGRAPHY              522
PET_SCAN                 406
ULTRASOUND               1227
MRI_SCAN                 6224
FUNDUS_PHOTOGRAPHY       314
OCT_SCAN                 236
ENDOSCOPY                611
MAMMOGRAPHY              106
FLUOROSCOPY              321
OTHER                    64
SPECT                    111
Total                    51335

Dataset                  Samples
PMC_VQA_SUBSET           25000
SLAKE                    4919
RAD_VQA                  1793
PATH                     19623
Total                    51335

Table 8. Modality breakdown and source dataset composition.

A.2. Training Configuration

We list below the GRPO training configuration used for MediX-R1. Core settings include (i) data filtering and batching, (ii) actor optimization and rollout sampling, (iii) KL-regularized GRPO advantage computation, and (iv) trainer settings. We train our models using the EasyR1 (Zheng et al., 2025b) GitHub repository. MediX-R1 was trained on 8×A100 (80 GB) NVIDIA GPUs for approximately 25 hours.

Training Configuration

    {
      "data": {
        "max_prompt_length": 4352,
        "max_response_length": 4096,
        "rollout_batch_size": 512,
        "val_batch_size": 1024,
        "shuffle": true,
        "seed": 1,
        "min_pixels": 262144,
        "max_pixels": 4194304,
        "filter_overlong_prompts": true,
        "filter_overlong_prompts_workers": 16
      },
      "worker": {
        "hybrid_engine": true,
        "actor": {
          "strategy": "fsdp",
          "global_batch_size": 128,
          "micro_batch_size_per_device_for_update": 1,
          "micro_batch_size_per_device_for_experience": 2,
          "max_grad_norm": 1.0,
          "clip_ratio_low": 0.2,
          "clip_ratio_high": 0.3,
          "clip_ratio_dual": 3.0,
          "loss_avg_mode": "token",
          "padding_free": true,
          "dynamic_batching": true,
          "use_torch_compile": true,
          "optim": {
            "lr": 1e-6,
            "betas": [0.9, 0.999],
            "weight_decay": 0.01,
            "strategy": "adamw",
            "lr_scheduler_type": "constant",
            "training_steps": 200
          },
          "fsdp": {
            "enable_full_shard": true,
            "enable_rank0_init": true,
            "mp_param_dtype": "bf16",
            "mp_reduce_dtype": "fp32",
            "mp_buffer_dtype": "fp32"
          },
          "offload": {
            "offload_params": true,
            "offload_optimizer": true
          },
          "use_kl_loss": true,
          "kl_penalty": "low_var_kl",
          "kl_coef": 0.01
        },
        "rollout": {
          "name": "vllm",
          "n": 5,
          "temperature": 1.0,
          "top_p": 1.0,
          "seed": 1,
          "tensor_parallel_size": 2,
          "max_num_batched_tokens": 8448,
          "gpu_memory_utilization": 0.6,
          "val_override_config": {
            "temperature": 0.6,
            "top_p": 0.95,
            "n": 1
          },
          "prompt_length": 4352,
          "response_length": 4096
        }
      },
      "algorithm": {
        "adv_estimator": "grpo",
        "gamma": 1.0,
        "lam": 1.0,
        "use_kl_loss": true,
        "kl_penalty": "low_var_kl",
        "kl_coef": 0.01,
        "kl_type": "fixed",
        "kl_target": 0.1,
        "kl_horizon": 10000.0
      },
      "trainer": {
        "total_epochs": 2,
        "nnodes": 1,
        "n_gpus_per_node": 8,
        "val_freq": 5,
        "val_before_train": true,
        "save_freq": 5,
        "save_limit": 3
      }
    }

A.3. Reward Coefficient Selection Details

This section details how we selected the composite reward coefficients used throughout the paper. Our goal was to (i) keep outputs reliably parseable (format control), while (ii) allocating most of the reward budget to task-facing correctness signals, and (iii) avoiding an expensive hyperparameter search.

Fixed format budget. Across all experiments (including Table 4), we fixed the format reward weight to $w_{\text{fmt}} = 0.10$ to enforce consistently parseable outputs. The remaining $0.90$ of the reward mass was allocated among the task-facing signals.

Single-signal ablations. The embedding-only and LLM-only settings in Table 4 correspond to:

$$r_{\text{emb-only}} = 0.1\,R_{\text{format}} + 0.9\,R_{\text{embed}}, \qquad r_{\text{llm-only}} = 0.1\,R_{\text{format}} + 0.9\,R_{\text{llm}}.$$

Combining semantic rewards ($R_{\text{llm}} + R_{\text{embed}}$). Next, we combined the LLM judge and embedding signals while keeping $w_{\text{fmt}} = 0.10$ fixed, and evaluated three intuitive splits of the remaining $0.90$ mass:

v1 (equal split): $r = 0.1\,R_{\text{format}} + (0.5 \cdot 0.9)\,R_{\text{llm}} + (0.5 \cdot 0.9)\,R_{\text{embed}}$,
v2 (LLM-favored): $r = 0.1\,R_{\text{format}} + (0.6 \cdot 0.9)\,R_{\text{llm}} + (0.4 \cdot 0.9)\,R_{\text{embed}}$,
v3 (embed-favored): $r = 0.1\,R_{\text{format}} + (0.4 \cdot 0.9)\,R_{\text{llm}} + (0.6 \cdot 0.9)\,R_{\text{embed}}$.
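The weight arithmetic in this section can be reproduced with a few lines. This is a minimal sketch with hypothetical helper names (`semantic_split`, `add_modality`, `composite_reward`). The final step assumes that the modality share is deducted equally from the LLM and embedding weights of v2, which is the reading that reproduces the composite weights reported at the end of this section (0.5175 and 0.3375); a strictly proportional 60/40 rescaling of the remaining 0.855 would instead give 0.513 and 0.342.

```python
W_FMT = 0.10                 # fixed format weight (from the text)
NON_FORMAT = 1.0 - W_FMT     # 0.90 left for the task-facing signals


def semantic_split(llm_share: float) -> dict:
    """Split the non-format budget between R_llm and R_embed."""
    return {
        "format": W_FMT,
        "llm": llm_share * NON_FORMAT,
        "embed": (1.0 - llm_share) * NON_FORMAT,
    }


def add_modality(weights: dict, modality_share: float = 0.05) -> dict:
    """Reserve a share of the non-format budget for R_modality.

    Assumption: the reserved mass is taken equally from the LLM and
    embedding weights (see the lead-in above).
    """
    w_mod = modality_share * NON_FORMAT
    return {
        "format": weights["format"],
        "modality": w_mod,
        "llm": weights["llm"] - w_mod / 2,
        "embed": weights["embed"] - w_mod / 2,
    }


def composite_reward(weights: dict, scores: dict) -> float:
    """Weighted sum of per-component rewards; missing components count as 0."""
    return sum(w * scores.get(name, 0.0) for name, w in weights.items())


v1, v2, v3 = semantic_split(0.5), semantic_split(0.6), semantic_split(0.4)
final = add_modality(v2)
for name, weight in final.items():
    print(f"{name}: {weight:.4f}")
# format: 0.1000
# modality: 0.0450
# llm: 0.5175
# embed: 0.3375
```

As a usage example, `composite_reward(final, {"format": 1.0, "modality": 1.0, "llm": 0.0, "embed": 1.0})` scores a well-formatted answer with the right modality tag that the embedding model accepts but the LLM judge rejects.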
On our validation benchmark suite, these three variants yielded similar overall averages (v1: 0.582, v2: 0.589, v3: 0.579), indicating low sensitivity within this coarse range. We therefore selected v2 as the default combined-semantic configuration because it slightly improved aggregate performance while placing more weight on the stricter correctness signal ($R_{\text{llm}}$).

Adding modality grounding. Finally, we introduced modality grounding by reserving 5% of the non-format budget for $R_{\text{modality}}$ (i.e., $0.05 \times 0.90 = 0.045$ of total reward mass), and renormalizing the remaining non-format weights according to v2:

$$w_{\text{fmt}} = 0.10, \quad w_{\text{mod}} = 0.045, \quad w_{\text{llm}} = 0.5175, \quad w_{\text{emb}} = 0.3375,$$

which sums to $1.0$ and matches the MediX-R1 composite setting used in Table 4. Relative to v2, the LLM and embedding weights ($0.54$ and $0.36$) are each reduced by $0.0225$, i.e., half of the modality share.

Compute limitations. We emphasize that we did not run an exhaustive hyperparameter search over reward coefficients due to computational constraints. The above staged procedure was intended to make coefficient selection transparent and reproducible; future work may further improve performance by exploring a broader coefficient grid.

A.4. Reward Function Source Code

Below are the Python implementations of the four reward components used in MediX-R1. Each function operates on a predicted model output string and a ground truth string containing the modality tag and reference answer.

Format reward

    import re

    def format_reward(predict: str) -> float:
        # Ignore anything before the reasoning block (e.g., the modality tag).
        idx = predict.find("<think>")
        if idx == -1:
            return 0.0
        predict_new = predict[idx:].strip()
        # The remainder must be exactly one reasoning block followed by one answer block.
        pattern = re.compile(r"<think>.*?</think>\s*<answer>.*?</answer>", re.DOTALL)
        format_match = re.fullmatch(pattern, predict_new)
        return 1.0 if format_match else 0.0

LLM-based accuracy reward

    def accuracy_reward_llm(predict: str, ground_truth: str) -> float:
        try:
            # Grade only the content of the answer block; fall back to the full output.
            content_match = re.search(r"<answer>(.*?)</answer>", predict, re.DOTALL)
            given_answer = content_match.group(1).strip() if content_match else predict.strip()
            given_answer = given_answer.strip('.')
            # The ground truth is "<MODALITY_TAG> reference answer"; drop the tag.
            ground_truth = ground_truth.split('>', maxsplit=1)[1].strip()
            ground_truth = ground_truth.strip('.')
            # Reject empty or single-character answers.
            if given_answer == '' or len(given_answer) == 1:
                return 0.0
            if given_answer == ground_truth:
                return 1.0
            llm_score = llm_answer_match(given_answer, ground_truth)  # external helper
            return llm_score
        except Exception:
            return 0.0

Embedding-based semantic reward

    # Assumed import: `util` is taken to be sentence_transformers.util;
    # `embed_model` is an embedding model loaded elsewhere.
    from sentence_transformers import util

    def accuracy_reward_embed(predict: str, ground_truth: str, threshold: float = 0.8) -> float:
        try:
            content_match = re.search(r"<answer>(.*?)</answer>", predict, re.DOTALL)
            given_answer = content_match.group(1).strip() if content_match else predict.strip()
            given_answer = given_answer.strip('.')
            ground_truth = ground_truth.split('>', maxsplit=1)[1].strip()
            ground_truth = ground_truth.strip('.')
            # Reject empty or single-character answers.
            if given_answer == '' or len(given_answer) == 1:
                return 0.0
            if given_answer == ground_truth:
                return 1.0
            embeddings = embed_model.encode([given_answer, ground_truth], convert_to_tensor=True)
            similarity = util.pytorch_cos_sim(embeddings[0], embeddings[1]).item()
            # Binary reward: 1.0 only if cosine similarity reaches the threshold.
            return float(similarity >= threshold)
        except Exception:
            return 0.0

Modality recognition reward

    def modality_reward(predict: str, ground_truth: str) -> float:
        idx = predict.find("<think>")
        if idx == -1:
            return 0.0
        predict_new = predict[:idx].strip()  # modality tag before <think>
        modality = ground_truth.split('>', maxsplit=1)[0] + '>'
        return 1.0 if predict_new.upper() == modality.upper() else 0.0

A.5. Human Expert Comparative Evaluation Protocol

For a sampled set of multimodal questions, four anonymized model outputs (A-D) plus a reference description are shown; experts pick the single best response based on clinical correctness, relevance (no hallucinations), and clarity of reasoning. Votes are aggregated into preference percentages reported in the main text.
Evaluation Protocol for Medical Experts

    Instructions for Evaluation

    Your task is to evaluate the responses provided by three AI models based on a given medical image description (Ground Truth). Follow these steps to make your selection:

    1) Read the Ground Truth: Carefully review the provided description of the medical image. This serves as the reference for an accurate and detailed response.

    2) Assess the Model Responses: Examine the three model-generated responses (Model A, Model B, and Model C, Model D). Compare their content with the Ground Truth, focusing on the accuracy, completeness, and relevance of the clinical reasoning

    3) Select the Best Response: Choose the model response that best aligns with the Ground Truth in terms of:
       > Clinical Accuracy: Does the response correctly describe the key findings in the image?
       > Reasoning Traces: Does the models reasoning traces correct and well explained

    4) Submit Your Choice: After evaluating the responses, select the one that provides the most accurate and comprehensive explanation.

A.6. Human Evaluation: Model Reasoning

We extend the human expert study detailed in Sec. 4.4 to evaluate the reasoning quality of MediX-R1 against MedGemma with the help of medical doctors. Experts assessed outputs for clinical accuracy, reasoning soundness, and practical usefulness in a medical setting. MediX-R1's reasoning was preferred over MedGemma's in 74.2% of cases, indicating stronger clinical coherence. Furthermore, in 92.4% of the cases the model's reasoning steps were rated as acceptable and often comparable to a medical doctor's thought process, while only 7.6% of the cases were rated as having poor reasoning quality. Moreover, in fewer than 5% of the cases the model produced flawed reasoning despite generating the correct final answer, indicating that such inconsistencies are rare and that MediX-R1 generally maintains a robust and coherent reasoning process.
Revie wers comprised ﬁve certiﬁed medical e xperts (MBBS/MD) with specialties in Radiology , General Medicine, and Forensic Medicine, with an inter -rater agreement of 63%. 16 MediX-R1: Open Ended Medical Reinforcement Lear ning A.7. Reinfor cement Learning T raining Prompt The RL training prompt enforces (i) an explicit modality tag, (ii) structured reasoning in ... , and (iii) a concise ﬁnal answer in ... . These structures align with the format re ward ( R format ) and modality rew ard ( R modality ) in our composite objectiv e. During training, only the block is graded by the Reference-based LLM-as-judge ( R llm ) and the embedding-based semantic rew ard ( R embed ); the content is ignored for scoring but improv es interpretability . Ke y points: - Modality tag must be one of the ﬁxed set and appear before . - The ﬁnal decision is ev aluated solely from for R llm and R embed . - Structural compliance (tags present and ordered) is required for R format . Reinfor cement Learning T raining Prompt You are a Medical AI Assistant with advanced reasoning capabilities Your task: 1. First output the image modality tag from this set: , , , , , , , , , , , , , , , (Only output the tag, nothing else.) 2. Then output the thinking and medical reasoning process in ... tags. 3. Finally, provide the correct answer inside ... tags. 4. Do not include any extra information or text outside of these tags. Question: {{ content | trim }} A.8. Evaluation B ASE T emplate (Short-Form QA/MCQ) This judge prompt yields a binary score (0/1) for short-form QA and MCQ-style tasks. It compares the predicted against the reference, allowing paraphrases and option-label matches. Inference is performed with a separate LLM judge (served via vLLM) to reduce ev aluation-training coupling. W e use deterministic settings (e.g., temperature 0) for reproducibility and parse the returned JSON strictly . Evaluation B ASE template Prompt You are a medical expert. 
Your task is to evaluate whether the Predicted Answer correctly answers the Medical Question, based on the Ground Truth (Correct Answer) provided. Question: {question} Correct Answer: {correct_answer} Predicted Answer: {predicted_answer} Score 1 if the predicted answer matches the correct answer either fully in text or by indicating the correct option label (e.g., "B", "Option B", or a paraphrased version that clearly identifies the correct choice). Score 0 if the predicted answer is incorrect or points to the wrong option. Respond strictly in the following JSON format: ‘‘‘json 17 MediX-R1: Open Ended Medical Reinforcement Lear ning {{ "score": }} ‘‘‘ A.9. Evaluation T emplate for Report Generation For long-form outputs (e.g., report generation or summarization), the judge assigns a rubric score in [0, 5] reﬂecting clinical accuracy , completeness, and relev ance. W e request strict JSON for reliable parsing and average scores across items for dataset-level metrics. Only the model’ s ﬁnal report text is provided to the judge; any hidden reasoning (e.g., within ) is stripped before ev aluation. Evaluation Pr ompt for Report Generation You are a medical expert evaluating the clinical accuracy, completeness, and relevance of a generated medical report or summary. Your task is to compare an AI-generated report or summary to a reference (gold standard) report or summary, based on a clinical instruction or question. Assess the generated output on how well it preserves key clinical information, factual correctness, and clinical reasoning relevant to the task. Assign a score between 0 and 5 using the following scale: 0 - Completely incorrect: Clinically irrelevant, misleading, or factually wrong. No meaningful alignment with the instruction or reference. 1 - Poor match: Barely relevant or mostly incorrect. Contains significant clinical misinformation or omits nearly all critical details. 
2 - Weak match: Some fragments of relevant content are present, but major clinical errors or omissions exist. Clinical utility is low.
3 - Fair match: Contains several relevant points, but includes notable errors, missing findings, or misinterpretations that affect clinical reliability.
4 - Good match: Mostly accurate and clinically sound. Minor issues or missing details, but the overall meaning and purpose are preserved.
5 - Perfect or near-perfect match: Clinically accurate, complete, and faithful to the instruction and reference. No significant omissions or errors.

Respond only in the following example JSON format:
Example JSON format:
```json
{{ "score": }}
```

Now, evaluate the following:

### Clinical Instruction or Question::
{question}

### Reference Report or Summary:
{correct_answer}

### AI-Generated Report or Summary:
{predicted_answer}

Prompt: Generate a detailed report based on the scan

think: The image is a radiograph, specifically a chest X-ray, which is used to assess the lungs, heart, and mediastinum. The view appears to be posteroanterior (PA), as indicated by the relative heart size and orientation of the structures. The diaphragm is in a normal position, suggesting the patient was upright during the scan. On inspection, there is increased opacification noted in the right lung field, which may represent an abnormal finding. This could be due to several potential causes such as infection (pneumonia), pulmonary edema, mass lesion, or other infiltrative processes. The left lung field appears relatively clear in comparison. The heart size and mediastinum are within normal limits, showing no evidence of enlargement or shift. The diaphragm is intact with clear costophrenic angles, indicating no pleural effusion. The bony structures, including the ribs, clavicles, and scapulae, are visible and intact without evidence of fracture or deformity. Overall, the primary observation is the increased opacity in the right lung, which warrants further clinical correlation and radiological evaluation to determine the underlying cause. A detailed review by a qualified radiologist is recommended for confirmation and diagnosis.

The image provided is a chest X-ray. Here is a detailed report based on the visible findings:

General Observations:
- The X-ray is a posteroanterior (PA) view, as indicated by the orientation of the heart and the position of the diaphragm.
- The patient's left side is on the right side of the image, and the right side is on the left side of the image, which is typical for a PA view.

Lung Fields:
- The lung fields appear to have increased opacification, particularly in the right lung field. This increased opacification could be due to several reasons, including but not limited to:
  - Infiltration (e.g., pneumonia, pulmonary edema, or consolidation).
  - Mass lesion.
  - Pleural effusion.
  - Other pathologies such as fibrosis or tumor.
- The left lung field appears relatively clear compared to the right lung field.

Heart:
- The heart size appears to be within normal limits on this view.
- The mediastinum appears to be within normal limits.

Diaphragm:
- The diaphragm appears to be intact and without any obvious abnormalities.
- The costophrenic angles appear clear, which is a normal finding.

Bones:
- The bony structures of the thorax, including the ribs, clavicles, and scapulae, appear intact without any obvious fractures or lytic lesions.

Pleural Spaces:
- There appears to be no evidence of pleural effusion on this view.

Summary:
The X-ray shows increased opacification in the right lung field, which could be indicative of a variety of conditions. Further clinical correlation with the patient's history, symptoms, and other diagnostic tests (such as a CT scan or sputum analysis) would be necessary to determine the exact cause of the opacification. A radiologist's interpretation is recommended for a definitive diagnosis.

Recommendations:
- The patient should be evaluated by a healthcare provider for further investigation and management.
- Additional imaging or tests may be required to determine the cause of the opacification in the right lung field.

This report is for informational purposes only and should not be considered a substitute for a professional medical opinion.

Figure 6. MediX-R1 - Report Generation: Case 2

Prompt: Generate a detailed report based on the scan

think: The image appears to be a chest radiograph (X-ray), likely taken in a posterior-anterior (PA) view, as suggested by the positioning of the heart and spine. The patient seems to be upright during the examination, with the diaphragm in a normal position. The bony structures, including the ribs, clavicles, scapulae, and vertebrae, are clearly visible and show no evidence of fractures, dislocations, or deformities. The lung fields appear clear, with no signs of consolidation, masses, or abnormal opacities. The lung markings are normal, and there is no visible pleural effusion or pneumothorax. The heart size is within normal limits, and the mediastinum appears central without widening or abnormal contours. The diaphragm and costophrenic angles are sharp and well-defined, indicating normal lung expansion and no fluid accumulation

The image provided appears to be a radiograph (X-ray) of the chest. Here is a detailed analysis based on the visible features:

General Observations:
- The image is a posterior-anterior (PA) view of the chest, as indicated by the orientation of the spine and the position of the heart shadow.
- The patient is likely standing or sitting upright during the scan, as the diaphragm appears to be in a normal position.

Key Features:
1. Bony Structures:
- The ribs are visible on both sides, and the clavicles are also visible.
- The spine is well-defined, with the vertebrae appearing intact and without any obvious fractures or dislocations.
- The scapulae are visible on both sides, and the shoulder joints appear to be in a normal position.
2. Lungs:
- The lung fields appear to be clear without any visible opacities, masses, or infiltrates.
- The lung markings are visible and appear to be within normal limits.
- There is no evidence of pleural effusion or pneumothorax.
3. Heart and Mediastinum:
- The heart size appears to be within normal limits.
- The mediastinum is not widened, and there are no visible masses or abnormalities in this region.
4. Diaphragm:
- The diaphragm appears to be intact and in a normal position.
- The costophrenic angles (where the diaphragm meets the ribs) appear sharp and clear, indicating no pleural effusion.

Potential Concerns:
- There are no obvious fractures or dislocations in the bony structures.
- The lung fields appear clear, and there are no visible masses or infiltrates.
- The heart size and mediastinum appear normal.
- The diaphragm is intact and in a normal position.

Conclusion:
Based on the radiographic findings, this chest X-ray appears to be within normal limits. There are no immediate concerns noted in this image. However, a definitive interpretation should be made by a qualified radiologist who can consider the patient's clinical history and symptoms. If there are any specific symptoms or concerns, it is recommended to consult a healthcare provider for further evaluation.

Figure 7. MediX-R1 - Report Generation: Case 1
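The strict JSON parsing and score averaging described for the judge prompts in A.8 and A.9 could look roughly like the sketch below. It is only an illustration under stated assumptions: the function names `parse_judge_score` and `dataset_metric` are hypothetical, rubric scores are assumed to be integers, and rejecting malformed replies with an error (rather than scoring them 0) is an assumed policy that the paper does not specify.

```python
import json
import re
from statistics import mean

# Matches an optional ```json ... ``` fence around the judge's reply.
_FENCE = re.compile(r"```(?:json)?\s*(.*?)\s*```", re.DOTALL)


def parse_judge_score(reply: str, lo: int, hi: int) -> int:
    """Return the integer "score" from a judge reply, or raise ValueError.

    Hypothetical helper: lo/hi are the allowed bounds, e.g. (0, 1) for the
    short-form template and (0, 5) for the report-generation rubric.
    """
    match = _FENCE.search(reply)
    payload = match.group(1) if match else reply.strip()
    try:
        obj = json.loads(payload)
    except json.JSONDecodeError as exc:
        raise ValueError(f"judge reply is not valid JSON: {payload!r}") from exc
    # Strict: exactly one key, named "score".
    if not isinstance(obj, dict) or set(obj) != {"score"}:
        raise ValueError(f"unexpected JSON structure: {obj!r}")
    score = obj["score"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(score, bool) or not isinstance(score, int):
        raise ValueError(f"score must be an integer, got {score!r}")
    if not lo <= score <= hi:
        raise ValueError(f"score {score} outside [{lo}, {hi}]")
    return score


def dataset_metric(replies: list[str], lo: int, hi: int) -> float:
    """Average the parsed scores over all items of a dataset."""
    return mean(parse_judge_score(r, lo, hi) for r in replies)
```

For example, `dataset_metric(['{"score": 1}', '{"score": 0}'], 0, 1)` averages two binary judgments, while the same call with bounds `(0, 5)` would serve the report-generation rubric. Failing loudly on a malformed reply keeps parsing errors from being silently counted as wrong answers; whether the original pipeline does this is not stated.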
