Leveraging Hallucinations to Reduce Manual Prompt Dependency in Promptable Segmentation
Promptable segmentation typically requires instance-specific manual prompts to guide the segmentation of each desired object. To minimize such a need, task-generic promptable segmentation has been introduced, which employs a single task-generic prompt to segment various images of different objects in the same task. Current methods use Multimodal Large Language Models (MLLMs) to reason detailed instance-specific prompts from a task-generic prompt to improve segmentation accuracy. The effectiveness of this segmentation heavily depends on the precision of these derived prompts. However, MLLMs often suffer from hallucinations during reasoning, resulting in inaccurate prompting. While existing methods focus on eliminating hallucinations to improve a model, we argue that MLLM hallucinations can reveal valuable contextual insights when leveraged correctly, as they represent pre-trained large-scale knowledge beyond individual images. In this paper, we utilize hallucinations to mine task-related information from images and verify its accuracy, enhancing the precision of the generated prompts. Specifically, we introduce an iterative Prompt-Mask Cycle generation framework (ProMaC) with a prompt generator and a mask generator. The prompt generator uses multi-scale chain-of-thought prompting, initially exploring hallucinations to extract extended contextual knowledge from a test image. These hallucinations are then reduced to formulate precise instance-specific prompts, directing the mask generator to produce masks that are consistent with task semantics via mask semantic alignment. The generated masks iteratively induce the prompt generator to focus more on task-relevant image areas and reduce irrelevant hallucinations, resulting jointly in better prompts and masks. Experiments on 5 benchmarks demonstrate the effectiveness of ProMaC. Code is available at https://lwpyh.github.io/ProMaC/.
💡 Research Summary
The paper tackles the problem of manual prompt dependency in promptable segmentation by turning the traditionally problematic hallucinations of multimodal large language models (MLLMs) into a source of useful prior knowledge. In task‑generic promptable segmentation, a single coarse prompt (e.g., “camouflaged animal”) is used for all images of a given task, but this generic cue is often ambiguous and leads to poor segmentation if applied directly. Existing approaches (e.g., GenSAM) rely on MLLMs to convert the generic prompt into detailed instance‑specific prompts, yet the quality of those prompts is limited by hallucinations—spurious predictions driven by object co‑occurrence priors learned during pre‑training.
ProMaC (Prompt‑Mask Cycle) introduces a training‑free, test‑time adaptation loop that jointly refines instance‑specific prompts and segmentation masks. The system consists of a prompt generator and a mask generator that interact iteratively. The prompt generator first splits the input image into patches at multiple scales (whole image, horizontal/vertical cuts, smaller crops) and feeds each patch together with the task‑generic prompt into an MLLM. Using a multi‑scale chain‑of‑thought prompting strategy, the model is deliberately allowed to hallucinate, producing candidate object names (foreground and background) and bounding boxes for each patch. These hallucinations exploit the MLLM’s extensive world knowledge, surfacing plausible objects that may be hidden, camouflaged, or otherwise ambiguous in the visual data.
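The multi-scale patching step described above can be illustrated with a small, self-contained sketch. This is not the authors' implementation; it simply shows one plausible way to produce the whole image, its horizontal/vertical halves, and smaller quadrant crops, each of which would then be paired with the task-generic prompt and sent to the MLLM.

```python
import numpy as np

def multi_scale_patches(image: np.ndarray) -> list:
    """Split an image (H, W, C) into patches at several scales:
    the whole image, horizontal/vertical halves, and four quadrants.
    A sketch of the multi-scale input strategy, not ProMaC's exact code."""
    h, w = image.shape[:2]
    patches = [image]  # scale 1: the whole image
    # scale 2: horizontal halves and vertical halves
    patches += [image[: h // 2], image[h // 2 :],
                image[:, : w // 2], image[:, w // 2 :]]
    # scale 3: four quadrant crops
    for ys in (slice(0, h // 2), slice(h // 2, h)):
        for xs in (slice(0, w // 2), slice(w // 2, w)):
            patches.append(image[ys, xs])
    return patches  # 1 + 4 + 4 = 9 patches in total
```

Each of the nine patches would be queried independently, so objects that are small or camouflaged at full-image scale can still be surfaced by the model at a finer crop.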
To prune irrelevant hallucinations, ProMaC employs Visual Contrastive Reasoning (VCR). The mask generator takes the current set of instance‑specific prompts and produces a segmentation mask. This mask is then used as a visual marker: an inpainting model removes the masked region, yielding a contrastive image that contains only background. Feeding this contrastive image back to the MLLM reveals which hallucinated candidates disappear when the object is absent, allowing the system to discard those that were purely prior‑driven. By comparing responses on the original and contrastive images, the prompt generator refines its candidate lists, converging toward accurate, task‑relevant prompts.
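The pruning logic of Visual Contrastive Reasoning can be sketched as a score comparison: a candidate survives only if the model's evidence for it drops once the masked object is inpainted away, indicating the candidate was grounded in the image rather than in co-occurrence priors. Here `score_fn` is a hypothetical stand-in for an MLLM confidence query; the function name and `margin` threshold are illustrative assumptions, not part of the paper.

```python
def prune_by_visual_contrast(candidates, score_fn, original, contrastive,
                             margin=0.1):
    """Keep candidates whose score falls by more than `margin` when the
    object region is removed from the image. `score_fn(image, name)` is a
    placeholder for an MLLM likelihood query (hypothetical interface)."""
    kept = []
    for name in candidates:
        drop = score_fn(original, name) - score_fn(contrastive, name)
        if drop > margin:
            # Evidence disappeared with the object: image-grounded candidate.
            kept.append(name)
    return kept
```

Prior-driven hallucinations (e.g., a "reef" guessed purely because crabs co-occur with reefs) score similarly on both images and are discarded, while genuinely visible objects lose support on the object-free contrastive image and are kept.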
The refined prompts are fed again to the mask generator, which now performs mask semantic alignment: the generated mask is encouraged to be semantically consistent with the task‑generic prompt (e.g., focusing on camouflaged animals rather than arbitrary background). The mask itself becomes a new visual marker for the next VCR step, creating a closed feedback loop that progressively improves both prompts and masks.
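The closed feedback loop above can be summarized in a short control-flow sketch. All four callables (`propose`, `prune`, `segment`, `inpaint`) are placeholders for the MLLM proposer, VCR pruning, the mask generator, and the inpainting model respectively; the loop structure is a reading of the paper's cycle, assuming a fixed iteration budget.

```python
def promac_cycle(image, generic_prompt, propose, prune, segment, inpaint,
                 n_iters=3):
    """A sketch of the prompt-mask cycle. Each iteration: propose
    hallucination-rich candidates, prune them against a contrastive
    (object-inpainted) image when a mask exists, then regenerate the mask."""
    mask, candidates = None, []
    for _ in range(n_iters):
        # Hallucination-rich instance-specific proposals from the MLLM.
        candidates = propose(image, generic_prompt, mask)
        if mask is not None:
            contrastive = inpaint(image, mask)          # remove masked object
            candidates = prune(candidates, image, contrastive)
        mask = segment(image, candidates)               # semantic-alignment step
    return candidates, mask
```

Because the mask from one iteration seeds the contrastive image of the next, prompts and masks refine each other without any parameter updates, matching the training-free, test-time character of the method.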
Key contributions are: (1) Reframing MLLM hallucinations as a beneficial knowledge source rather than a flaw; (2) Designing a multi‑scale chain‑of‑thought prompting combined with VCR to iteratively optimize prompts and masks without any additional training; (3) Demonstrating the approach on five distinct segmentation tasks across twelve public datasets, outperforming 22 state‑of‑the‑art baselines by 4–7 percentage points in mean IoU. The method works with open‑source MLLMs such as LLaVA, avoiding the need for proprietary models like GPT‑4V, and requires no pixel‑level visual markers beyond the masks produced by the segmentation model itself.
Limitations include the potential need for multiple cycles when initial hallucinations are overly abundant, which can increase inference time. Future work may integrate lightweight learning‑based hallucination suppression to reduce the number of iterations. The authors release code and models at https://lwpyh.github.io/ProMaC/, providing a practical, cost‑effective solution for large‑scale, manual‑free promptable segmentation.