Recent reasoning-based medical MLLMs have made progress in generating step-by-step textual reasoning chains. However, they still struggle with complex tasks that necessitate dynamic and iterative focusing on fine-grained visual regions to achieve precise grounding and diagnosis. We introduce Ophiuchus, a versatile, tool-augmented framework that equips an MLLM to (i) decide when additional visual evidence is needed, (ii) determine where to probe and ground within the medical image, and (iii) seamlessly weave the relevant sub-image content back into an interleaved, multimodal chain of thought. In contrast to prior approaches limited by the performance ceiling of specialized tools, Ophiuchus integrates the model's inherent grounding and perception capabilities with external tools, thereby fostering higher-level reasoning. The core of our method is a three-stage training strategy: cold-start training with tool-integrated reasoning data to achieve basic tool selection and adaptation for inspecting key regions; self-reflection fine-tuning to strengthen reflective reasoning and encourage revisiting tool outputs; and Agentic Tool Reinforcement Learning to directly optimize task-specific rewards and emulate expert-like diagnostic behavior. Extensive experiments show that Ophiuchus consistently outperforms both closed-source and open-source SOTA methods across diverse medical benchmarks, including VQA, detection, and reasoning-based segmentation. Our approach illuminates a path toward medical AI agents that can genuinely "think with images" through tool-integrated reasoning. Datasets, code, and trained models will be released publicly.
Multimodal Large Language Models (MLLMs) have exhibited remarkable performance across a wide range of medical image analysis tasks such as visual question answering (VQA) (Li et al., 2023; Chen et al., 2024a), disease diagnosis (Sun et al., 2025b; Liu & Song, 2025), segmentation (Bai et al., 2024; Wang et al., 2025a), and report generation (Bassi et al., 2025; Wang et al., 2025b). Recent breakthroughs in chain-of-thought (CoT) techniques (Wei et al., 2022) have further advanced the reasoning capabilities of medical MLLMs. Numerous studies employ supervised fine-tuning (SFT) (Sun et al., 2025a) or reinforcement learning (RL) (Lai et al., 2025b; Pan et al., 2025; Su et al., 2025b) to push these models beyond the direct-prediction paradigm, enabling step-by-step reasoning to address complex clinical challenges and improve diagnostic decision support. Despite these advancements, current medical MLLMs still show critical limitations in how they interact with visual information during reasoning. First, they often attend to irrelevant regions while missing fine-grained evidence, such as tiny structures, intricate abnormalities, subtle lesion cues, and nuanced pathological semantics (Wang et al., 2025a). Such omissions severely degrade diagnostic performance.
[Figure 1 panel examples: a text-only CoT answer on an axial abdominal CT (Image-1), and an interleaved reasoning trace on a breast histopathology field (Image-2) that invokes BiomedParse via a <tool_call> block to segment inflammatory cells, inspects the returned masks, and then answers.]
Figure 1: Overview of the tool-augmented "thinking with images" paradigm. Compared with (a) text-only CoT, which fails to analyze fine-grained, task-critical image regions and thus limits understanding, increases hallucinations, and produces false positives, (b) our tool-augmented, interleaved vision-language reasoning adaptively generates effective tool-invocation strategies, inspects fine-grained regions, and integrates the resulting evidence into subsequent reasoning, yielding more accurate diagnostic cognition across diverse medical tasks.

In essence, these weaknesses stem from a static, global-perception paradigm: the models primarily rely on image-level representations and lack the key ability to actively and adaptively probe and explore localized, fine-grained visual details. Moreover, prevailing medical MLLMs express intermediate reasoning steps exclusively in text and lack a "look-again" mechanism during thinking, leading to the loss of critical visual information. Ideally, MLLMs should autonomously perform dynamic, iterative, fine-grained interactions with task-relevant image regions throughout the reasoning process, revisiting and revising earlier answers in light of emerging visual cues to support more accurate perceptual decision-making, as exemplified in Figure 1.
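For concreteness, the <tool_call> block illustrated in Figure 1 can be handled by a lightweight parser and dispatcher along the following lines. This is a minimal sketch under stated assumptions, not the authors' implementation: the TOOL_REGISTRY contents, the keyword-argument convention, and the exact regular expression are illustrative assumptions.

# Minimal sketch: parse a <tool_call> block (Figure 1 format) and dispatch it
# to a registered tool. Registry entries and call signatures are assumptions.
import json
import re

TOOL_REGISTRY = {}  # e.g. {"BiomedParse": run_biomedparse, "zoom-in": run_zoom_in} (hypothetical)

def extract_tool_call(model_output: str):
    """Return (name, arguments) if the output contains a <tool_call> block, else None."""
    match = re.search(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", model_output, re.DOTALL)
    if match is None:
        return None
    call = json.loads(match.group(1))
    return call["name"], call.get("arguments", {})

def dispatch(model_output: str, images: dict):
    """Execute the requested tool, or return None when no tool was requested."""
    parsed = extract_tool_call(model_output)
    if parsed is None:
        return None
    name, args = parsed
    # e.g. name="BiomedParse", args={"description": "inflammatory cells", "Image": "Image-1"}
    return TOOL_REGISTRY[name](images=images, **args)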
The aforementioned challenges motivate a fundamental rethinking of how medical MLLMs engage more seamlessly with fine-grained visual information during reasoning. This leads us to propose Ophiuchus, a novel and versatile framework that performs interleaved vision-language reasoning by thinking with tools across diverse medical tasks, including VQA, detection, and segmentation.
Ophiuchus can decide whether to actively integrate external visual tools (e.g., SAM2 (Ravi et al., 2024), BiomedParse (Zhao et al., 2024), and zoom-in) directly into the reasoning loop to iteratively manipulate and interpret key visual content. This marks a paradigm shift from pure text-based reasoning to grounded, interpretable, tool-augmented visual cognition, a new frontier where reasoning is continuously intertwined with ongoing visual perception. To develop Ophiuchus, we construct a high-quality dataset of 64k samples with annotations for grounding key visual regions and explicit tool-invocation trajectories, and we propose a three-stage training protocol (Figure 2). First, we introduce cold-start SFT with tool-integrated reasoning data to equip the model with basic tool selection and adaptation for inspecting key regions.
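As a rough illustration of this reasoning loop, the sketch below alternates between model generation and tool execution until no further tool call is emitted, folding each tool output back into the context so later steps can condition on it. The helper names (generate, render_observation), the message schema, and the tool-call budget are hypothetical; dispatch refers to the parsing sketch above.

# Minimal sketch of an interleaved, tool-augmented reasoning loop.
# `generate` is an assumed MLLM decoding call; `render_observation` is an
# assumed helper that converts a tool result (mask, crop, boxes, ...) into a
# message the model can condition on.
def interleaved_reasoning(question, images, generate, render_observation, max_tool_calls=4):
    context = [{"role": "user", "content": question, "images": list(images)}]
    for _ in range(max_tool_calls):
        output = generate(context)              # may end with a <tool_call> block
        context.append({"role": "assistant", "content": output})
        observation = dispatch(output, images)  # parser/dispatcher sketched above
        if observation is None:                 # no tool requested: treat output as the final answer
            return output
        # Feed the tool output back as a new observation so the model can "look again"
        # at fine-grained evidence before continuing its reasoning.
        context.append({"role": "tool", "content": render_observation(observation)})
    return generate(context)                    # force a final answer once the tool budget is spent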