GenSeg-R1: RL-Driven Vision-Language Grounding for Fine-Grained Referring Segmentation

Notice: This research summary and analysis were generated automatically using AI. For accuracy, please refer to the original arXiv source.

We study fine-grained referring image segmentation via a decoupled reason-then-segment pipeline. A vision-language model (VLM) receives an image and a natural-language query, reasons about the scene, and emits structured spatial prompts: a bounding box plus two interior keypoints for every referred instance. A frozen promptable segmenter (SAM 2) converts these prompts into high-quality masks. Within our GenSeg-R1 framework we finetune Qwen3-VL models (4B and 8B parameters) using Group Relative Policy Optimization (GRPO), requiring no supervised reasoning-chain annotations. On RefCOCOg validation our best model (GenSeg-R1-8B) achieves 0.7127 cIoU and 0.7382 mIoU, substantially outperforming the corresponding Qwen3-VL Instruct baselines (+15.3 and +21.9 points, respectively) and surpassing Seg-Zero-7B [3] by +3.3 cIoU under identical evaluation. We further introduce GenSeg-R1-G, a variant trained on GRefCOCO [9] with a SAM 2 in-the-loop reward that directly optimizes mask quality. On GRefCOCO validation GenSeg-R1-G achieves 76.69% target mIoU with 82.40% accuracy on negative (no-target) prompts, substantially outperforming Seg-R1-7B and Seg-Zero-7B, which lack no-target detection capability. On ReasonSeg test, GenSeg-R1-4B reaches 68.40% mIoU, surpassing Seg-Zero-7B by +7.0 and Seg-R1-7B by +10.7 points.


💡 Research Summary

GenSeg-R1 introduces a decoupled "reason-then-segment" architecture for fine-grained referring image segmentation. The system consists of a vision-language model (VLM) based on Qwen3-VL (available in 4B and 8B parameter versions) and a frozen promptable segmenter, SAM 2. Given an image and a natural-language query, the VLM first reasons about the scene (optionally emitting a reasoning chain) and then outputs structured spatial prompts inside a dedicated answer tag: a bounding box together with two interior keypoints for each referred instance, or a "no_target" flag for negative queries. SAM 2 consumes these prompts and produces high-quality binary masks.
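The parsing step between the two stages can be sketched as follows. This is a minimal illustration, not the authors' code: the exact JSON schema and field names (`box`, `points`, `no_target`) are assumptions, but the shape matches the paper's description of one box plus two interior keypoints per instance, or a no-target flag.

```python
import json

def parse_spatial_prompts(vlm_output: str):
    """Turn the VLM's structured answer into SAM 2-style prompts.

    Hypothetical schema: either {"no_target": true} for negative
    queries, or a JSON list with one entry per referred instance,
    each holding a bounding box and two interior keypoints.
    """
    data = json.loads(vlm_output)
    if isinstance(data, dict) and data.get("no_target"):
        return []  # negative query: no prompts, so no mask is produced
    prompts = []
    for inst in data:
        prompts.append({
            "box": inst["box"],        # [x1, y1, x2, y2]
            "points": inst["points"],  # two interior keypoints [[x, y], [x, y]]
            "labels": [1, 1],          # SAM-style foreground point labels
        })
    return prompts

example = '[{"box": [10, 20, 110, 220], "points": [[50, 60], [70, 150]]}]'
prompts = parse_spatial_prompts(example)
```

Each prompt dictionary would then be passed to the frozen SAM 2 predictor, which never sees the text query at all; the decoupling keeps the segmenter entirely prompt-driven.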

Training is performed with Group Relative Policy Optimization (GRPO), a reinforcement-learning method that samples a group of candidate responses from the current policy, scores each with a composite reward, and updates the policy toward higher-reward samples. Two reward configurations are used. The lightweight distance-based reward checks JSON format and box IoU, measures keypoint distance, and penalizes duplicate outputs, enabling fast training without running SAM 2 at all. The more expensive SAM 2-in-the-loop reward runs SAM 2 on each candidate, computes mask IoU against the ground-truth mask, validates negative keypoints, and rewards correct "no_target" predictions. This loop ties the VLM's output directly to the final segmentation quality and endows the model with explicit handling of empty-target queries.

Two training data regimes are explored. GenSeg-R1-4B/8B are fine-tuned on VisionReasoner-MultiObjects-7K (≈7k samples), which provides bounding boxes and a single point per object. GenSeg-R1-G is trained on GRefCOCO (≈15k samples), which includes multi-object references and a substantial set of negative prompts; each sample supplies two positive interior points, two background points, and polygon masks for SAM 2 reward computation. The larger, richer GRefCOCO dataset enables the model to learn precise interior prompting and reliable no-target detection.

Evaluation is conducted on three benchmarks under a unified protocol (same SAM 2 checkpoint, same evaluation code): RefCOCOg validation (2,573 expressions), GRefCOCO validation (1,000 samples: 642 with targets, 358 no-target), and ReasonSeg test (779 compositional queries). GenSeg-R1-8B achieves 0.7127 cumulative IoU (cIoU) and 0.7382 mean IoU (mIoU) on RefCOCOg, outperforming the Qwen3-VL-Instruct baseline by +15.3 cIoU and +21.9 mIoU and surpassing Seg-Zero-7B by +3.3 cIoU and +2.3 mIoU. Bounding-box detection metrics are also superior (AP 0.7277, AP@0.5 0.8717). An ablation on SAM 2 prompting shows that adding two interior keypoints yields monotonic improvements, especially at strict IoU thresholds (a +0.0148 gain in P@0.9).
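The two headline metrics differ in how they aggregate: cIoU pools intersections and unions over the whole dataset (so large objects dominate), while mIoU averages per-sample IoU (so every expression counts equally). A minimal sketch of both, assuming boolean mask arrays:

```python
import numpy as np

def ciou_miou(pred_masks, gt_masks):
    """Compute cumulative IoU and mean IoU over paired boolean masks.

    cIoU = (sum of intersections) / (sum of unions) across the dataset;
    mIoU = average of per-sample IoU. An empty-vs-empty pair scores 1.0
    per sample (a convention assumed here, not taken from the paper).
    """
    inters, unions, ious = 0, 0, []
    for p, g in zip(pred_masks, gt_masks):
        i = np.logical_and(p, g).sum()
        u = np.logical_or(p, g).sum()
        inters += i
        unions += u
        ious.append(i / u if u else 1.0)
    return inters / unions, sum(ious) / len(ious)
```

The gap between the two numbers (0.7127 cIoU vs. 0.7382 mIoU for GenSeg-R1-8B) is consistent with the model doing relatively better on small objects, which mIoU weights more heavily.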

On GRefCOCO, the SAM 2-in-the-loop variant GenSeg-R1-G reaches 76.69% target mIoU and 82.40% no-target accuracy, while prior models (Seg-R1-7B, Seg-Zero-7B) fail to detect negative queries (0% accuracy) and hallucinate masks. Even the non-GRefCOCO-trained GenSeg-R1-4B attains 79.05% no-target accuracy, indicating that the GRPO reward alone confers some negative-query awareness. False-negative rates remain below 1% for both variants.

ReasonSeg results further confirm reasoning capability: GenSeg-R1-4B obtains 68.40% mIoU, beating Seg-Zero-7B by +7.0 points and Seg-R1-7B by +10.7 points. The model's emergent reasoning traces reveal coherent multi-step reasoning despite the absence of explicit chain-of-thought supervision.

From an efficiency standpoint, GRPO avoids a separate value network by normalizing rewards within each sampled group, making training feasible on moderate hardware. The authors employ ZeRO‑3 FSDP across 2–4 H200 GPUs, vLLM for batched rollout sampling, and achieve convergence within a few days.
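The group-normalization trick that lets GRPO drop the value network can be shown in a few lines. This is a generic sketch of the standard GRPO advantage computation, not the authors' implementation; the epsilon guard is my addition for numerical safety.

```python
def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages for one prompt's sampled rollouts.

    Each candidate's reward is normalized against the mean and standard
    deviation of its own group, so the group itself serves as the
    baseline and no learned critic/value network is needed.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

Rollouts that score above their group's mean get positive advantages and are reinforced; below-mean rollouts are suppressed. Since the baseline is free, the memory budget goes entirely to the policy, which is what makes 4B/8B training tractable on a handful of GPUs.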

In summary, GenSeg‑R1 combines a powerful Qwen3‑VL backbone, policy‑gradient fine‑tuning via GRPO, a SAM 2‑in‑the‑loop reward that directly optimizes mask quality, and explicit negative‑prompt handling. This synergy yields state‑of‑the‑art performance on standard referring expression benchmarks, robust handling of empty‑target queries, and demonstrable compositional reasoning. The approach opens avenues for interactive robotics, assistive vision tools, and any application demanding real‑time, high‑precision segmentation guided by natural language.

