MediRound: Multi-Round Entity-Level Reasoning Segmentation in Medical Images
📝 Abstract
Despite the progress in medical image segmentation, most existing methods remain task-specific and lack interactivity. Although recent text-prompt-based segmentation approaches enhance user-driven and reasoning-based segmentation, they remain confined to single-round dialogues and fail to perform multi-round reasoning. In this work, we introduce Multi-Round Entity-Level Medical Reasoning Segmentation (MEMR-Seg), a new task that requires generating segmentation masks through multi-round queries with entity-level reasoning. To support this task, we construct MR-MedSeg, a large-scale dataset of 177K multi-round medical segmentation dialogues, featuring entity-based reasoning across rounds. Furthermore, we propose MediRound, an effective baseline model designed for multi-round medical reasoning segmentation. To mitigate the inherent error propagation in the chain-like pipeline of multi-round segmentation, we introduce a lightweight yet effective Judgment & Correction Mechanism during model inference. Experimental results demonstrate that our method effectively addresses the MEMR-Seg task and outperforms conventional medical referring segmentation methods.
📄 Content
Figure 1. A demo dialogue of our proposed MediRound. Our model can comprehend user queries that refer to the mask results from previous rounds (e.g., the Round 2 query refers to the Round 1 mask result), enabling cross-round entity-level reasoning in multi-round medical conversations. In contrast, conventional text-prompt-based medical segmentation methods struggle with this complex task.
Despite the progress in medical image segmentation, most existing methods remain task-specific and lack interactivity. Although recent text-prompt-based segmentation approaches enhance user-driven and reasoning-based segmentation, they remain confined to single-round dialogues and fail to perform multi-round reasoning. In this work, we introduce Multi-Round Entity-Level Medical Reasoning Segmentation (MEMR-Seg), a new task that requires generating segmentation masks through multi-round queries with entity-level reasoning. To support this task, we construct MR-MedSeg, a large-scale dataset of 177K multi-round medical segmentation dialogues, featuring entity-based reasoning across rounds. Furthermore, we propose MediRound, an effective baseline model designed for multi-round medical reasoning segmentation. To mitigate the inherent error propagation in the chain-like pipeline of multi-round segmentation, we introduce a lightweight yet effective Judgment & Correction Mechanism during model inference. Experimental results demonstrate that our method effectively tackles the MEMR-Seg task, surpassing conventional medical referring segmentation approaches. The project is available at https://github.com/Edisonhimself/MediRound.
Medical image segmentation seeks to precisely delineate regions of interest, such as organs, lesions, and tissues, across diverse medical imaging modalities [28,31,41]. This task supports a wide range of clinical applications [4,15,43,45] and represents a fundamental enabler for continuous advancements in medical research [27]. Most existing studies focus predominantly on task-specific segmentation [9,26,34], commonly referred to as “specialist models”. Despite their remarkable performance, these models generally suffer from limited adaptability and lack the interactive capabilities necessary for practical, real-world deployment.
Recent advancements in text-prompt-based medical image segmentation [6,11,20,50] overcome this limitation by allowing users to actively guide the model via simple textual queries, thereby enabling more diverse and user-driven segmentation outcomes. Building upon these developments, several MLLM-based approaches [13,39,46] further enhance interactivity and practicality by enabling reasoning-driven medical segmentation, allowing users to obtain the desired masks through implicit queries that do not necessarily contain specialized medical terminology. However, these models primarily focus on single-round medical segmentation and are incapable of supporting multi-round and continuous text-based interactions. Moreover, although they can effectively infer implicit medical queries within a single dialogue round, they fail to perform cross-round, entity-level reasoning in multi-round conversations. As illustrated in Figure 1, in practical medical segmentation scenarios, users tend to formulate new segmentation queries based on the mask results obtained from previous rounds. For example, in Conversation 2, the user's query is based on the mask generated in Conversation 1, and the results of Conversation 2 in turn serve as a reference for Conversation 4. Such iterative questioning not only ensures the continuity and integrity of the entire conversation but also enables users to express their desired segmentation targets more efficiently, accurately, and conveniently. Unfortunately, due to the lack of multi-round entity-level reasoning capabilities, existing methods often fail to produce the expected responses when dealing with inputs that involve cross-round logical dependencies.
In this work, we define a new medical vision task, termed MEMR-Seg (Multi-Round Entity-Level Medical Reasoning Segmentation), which entails generating binary segmentation masks based on multi-round medical image queries, involving entity-level reasoning across dialogue rounds. Notably, the queries are not only multi-round in nature, but each round is also a derivative and extended inquiry based on the entity results from the previous round. To successfully perform this task, the model should possess two essential capabilities: (1) supporting multi-round dialogue and cross-round entity-level reasoning; and (2) generating accurate segmentation results in response to user queries.
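To make the task structure concrete, the following is a minimal sketch of how one multi-round dialogue with cross-round entity references could be represented. All field and class names here are illustrative assumptions, not the actual MR-MedSeg schema or MediRound API.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical record for one MEMR-Seg dialogue. Each round's query may
# refer back to the mask produced in an earlier round (entity-level link).

@dataclass
class Round:
    query: str                       # user's text query for this round
    refers_to: Optional[int] = None  # index of the earlier round whose mask
                                     # this query reasons about, if any
    mask: Optional[object] = None    # predicted binary mask, filled at inference

@dataclass
class Dialogue:
    image_path: str
    rounds: List[Round] = field(default_factory=list)

    def context_for(self, i: int) -> List[Round]:
        """Return the chain of earlier rounds the i-th query depends on,
        ordered from the earliest reference to the most recent."""
        chain, j = [], self.rounds[i].refers_to
        while j is not None:
            chain.append(self.rounds[j])
            j = self.rounds[j].refers_to
        return list(reversed(chain))

# Example mirroring Figure 1: Round 2's query refers to Round 1's mask.
dlg = Dialogue(image_path="ct_scan.png")
dlg.rounds.append(Round(query="Segment the liver."))
dlg.rounds.append(Round(query="Now segment the lesion inside it.", refers_to=0))
print([r.query for r in dlg.context_for(1)])  # → ['Segment the liver.']
```

A chain-like structure of this kind also makes the error-propagation problem visible: a wrong mask at any round contaminates every later round that references it, which is what the Judgment & Correction Mechanism at inference time is meant to mitigate.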
Given that data scarcity represents a major bottleneck in accomplishing MEMR-Seg, we first introduce a Multi-Round Entity-Level Medical Reasoning Segmentation dataset, termed MR-MedSeg, which contains a large number of medical conversations centered on entity-level reasoning segmentation.