Revisiting Salient Object Detection from an Observer-Centric Perspective
Salient object detection is inherently a subjective problem, as observers with different priors may perceive different objects as salient. However, existing methods predominantly formulate it as an objective prediction task with a single ground-truth segmentation map for each image, which renders the problem under-determined and fundamentally ill-posed. To address this issue, we propose Observer-Centric Salient Object Detection (OC-SOD), where salient regions are predicted by considering not only the visual cues but also observer-specific factors such as preferences or intents. As a result, this formulation captures the intrinsic ambiguity and diversity of human perception, enabling personalized and context-aware saliency prediction. By leveraging multi-modal large language models, we develop an efficient data annotation pipeline and construct the first OC-SOD dataset, named OC-SODBench, comprising 33k training, validation, and test images with 152k textual prompt-object pairs. Built upon this new dataset, we further design OC-SODAgent, an agentic baseline which performs OC-SOD via a human-like “Perceive-Reflect-Adjust” process. Extensive experiments on OC-SODBench demonstrate the effectiveness of our contributions. Through this observer-centric perspective, we aim to bridge the gap between human perception and computational modeling, offering a more realistic and flexible understanding of what makes an object truly “salient.” Code and dataset are publicly available at: https://github.com/Dustzx/OC_SOD
💡 Research Summary
The paper introduces Observer‑Centric Salient Object Detection (OC‑SOD), a paradigm that explicitly incorporates observer‑specific factors—such as long‑term preferences and short‑term intents—into the saliency prediction process. Traditional SOD treats saliency as an objective property, providing a single ground‑truth mask per image, which fails to capture the inherent subjectivity of human visual attention and leads to an ill‑posed problem, especially in complex scenes where different observers may focus on different objects.
OC‑SOD reformulates saliency as a conditional generation task: given an image I and a textual instruction T that describes the observer’s cognitive state, a model P predicts a mask M (and optionally an intermediate reasoning trace D). Three representative observer modes are defined: (1) Free‑Viewing (no explicit prior), (2) Preference‑Driven (stable long‑term interests, e.g., “food lover”), and (3) Intent‑Driven (momentary goals, e.g., “check email”). By varying T, the same image can yield multiple plausible masks, turning the problem into a well‑posed one.
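The conditional formulation above can be written compactly as follows (the notation reuses the summary's symbols I, T, P, M, and D; the subscript k is added here only to index distinct observer instructions for the same image):

```latex
\[
(M_k,\, D_k) \;=\; P(I,\, T_k), \qquad k = 1, \dots, K,
\]
```

so a single image $I$ paired with $K$ different observer instructions $\{T_k\}$ yields $K$ plausible masks $\{M_k\}$, each optionally accompanied by a reasoning trace $D_k$.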
To enable research on OC‑SOD, the authors devise an efficient annotation pipeline powered by multimodal large language models (MLLMs). Starting from existing saliency and segmentation datasets that already contain pixel‑level masks, the pipeline performs: (i) rule‑based filtering to discard tiny or semantically meaningless objects, (ii) MLLM‑driven captioning and categorization to assign each image to one of the three observer modes, (iii) automatic generation of observer portraits (for Preference‑Driven) or intents (for Intent‑Driven) as natural‑language prompts, (iv) automated verification using a reasoning‑oriented MLLM, and (v) final manual curation by expert annotators. The resulting OC‑SODBench dataset comprises 33 000 images and 152 000 instruction‑mask pairs, covering a wide variety of objects, preferences, and intents, with detailed statistics on object frequency, mask area distribution, and word clouds.
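The five pipeline stages above can be sketched in code. This is a minimal illustration, not the authors' implementation: all function names are hypothetical, the MLLM-driven stages (ii)-(iv) are stubbed out, and only the rule-based filtering of step (i) is implemented concretely with an assumed minimum-area threshold.

```python
"""Illustrative sketch of an OC-SODBench-style annotation pipeline.
Step (i) is implemented; steps (ii)-(iii) are stubs standing in for
MLLM calls; steps (iv)-(v) (verification and manual curation) are omitted."""

def rule_based_filter(objects, min_area_ratio=0.01):
    """Step (i): discard tiny or semantically meaningless objects.
    The 0.01 area-ratio threshold is an assumption for illustration."""
    return [
        o for o in objects
        if o["area"] / o["image_area"] >= min_area_ratio
        and o["label"] not in {"background", "texture"}
    ]

def assign_observer_mode(image_caption):
    """Step (ii): stub for MLLM-driven categorization of an image into
    one of the three observer modes. A real MLLM call goes here."""
    return "free_viewing"  # placeholder decision

def generate_prompt(mode, obj_label):
    """Step (iii): stub for generating an observer portrait or intent
    as a natural-language prompt."""
    if mode == "preference":
        return f"As someone who loves {obj_label}, what catches your eye?"
    if mode == "intent":
        return f"You need to use the {obj_label} right now."
    return "Describe what draws your attention first."

# Toy example: a large bread mask is kept, a tiny crumb mask is dropped.
objects = [
    {"label": "bread", "area": 9000, "image_area": 100000},
    {"label": "crumb", "area": 40, "image_area": 100000},
]
kept = rule_based_filter(objects)
print([o["label"] for o in kept])  # → ['bread']
```

In a real pipeline, the stubbed functions would wrap MLLM inference, and the surviving instruction-mask pairs would then pass through the automated verification and manual curation stages described above.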
On the modeling side, the paper proposes OC‑SOD Agent, an agentic baseline that integrates an MLLM with the Segment‑Anything Model v2 (SAMv2). The agent first lets the MLLM interpret the instruction and generate a high‑level reasoning plan (“Perceive”). It then calls SAMv2 to obtain an initial segmentation. The MLLM reviews the result, reflects on any mismatch with the stated intent or preference, and, if needed, adjusts the prompt or requests a refined segmentation from SAMv2 (“Reflect‑Adjust”). This iterative loop mimics human perception‑reasoning‑adjustment behavior. Notably, even without any fine‑tuning, OC‑SOD Agent outperforms existing MLLM‑based segmentation frameworks (e.g., LISA, LLM‑Seg) on the new benchmark. Fine‑tuning on OC‑SODBench further improves performance across all metrics, including traditional SOD scores (F‑measure, MAE) and observer‑specific measures (intent alignment, preference alignment).
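The iterative loop just described can be sketched as follows. This is a toy illustration under stated assumptions: `mllm_perceive`, `mllm_reflect`, and `samv2_segment` are hypothetical stubs standing in for MLLM inference and SAMv2 calls, and the keyword matching inside them is a placeholder for real model reasoning.

```python
"""Minimal sketch of the Perceive-Reflect-Adjust agent loop.
All three model calls are stubs for illustration only."""

def mllm_perceive(image, instruction):
    # "Perceive": the MLLM turns the observer instruction into a
    # concrete target description for the segmenter (stubbed here
    # as simple keyword matching).
    target = "laptop" if "email" in instruction else "bread"
    return {"target": target}

def samv2_segment(image, plan):
    # Stub for SAMv2: returns a mask tagged with what was segmented.
    return {"mask": f"mask({plan['target']})", "label": plan["target"]}

def mllm_reflect(mask, instruction):
    # "Reflect": check whether the mask matches the stated intent.
    wants_laptop = "email" in instruction
    return wants_laptop == (mask["label"] == "laptop")

def run_agent(image, instruction, max_rounds=3):
    plan = mllm_perceive(image, instruction)          # Perceive
    mask = samv2_segment(image, plan)
    for _ in range(max_rounds):
        if mllm_reflect(mask, instruction):           # Reflect
            return mask                               # intent satisfied
        plan = mllm_perceive(image, instruction)      # Adjust
        mask = samv2_segment(image, plan)
    return mask

result = run_agent("desk.jpg", "I need to check email")
print(result["label"])  # → laptop
```

The key design point, reflected even in this toy loop, is that segmentation is not a one-shot call: the MLLM audits SAMv2's output against the observer's instruction and re-prompts the segmenter until the two agree or a round budget is exhausted.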
Extensive experiments demonstrate that OC‑SOD Agent can correctly disambiguate scenes where multiple objects compete for attention. For example, in an image containing both bread and a laptop, the model produces a bread mask when given a “food lover” preference, and a laptop mask when given a “check email” intent, whereas conventional SOD models always output a single, often arbitrary, mask. This validates the claim that incorporating observer context resolves the ill‑posedness of classic SOD.
The authors acknowledge limitations: the current textual prompts rely on predefined templates, and the MLLM‑SAM pipeline incurs substantial computational cost, limiting real‑time deployment. Future directions include richer, user‑generated observer profiles, lightweight inference architectures, extension to video or AR scenarios, and interactive intent updating.
In summary, the work establishes a new research direction that aligns saliency detection with human subjective perception. By releasing the OC‑SODBench dataset and code, the authors provide a solid foundation for subsequent studies in personalized vision, human‑computer interaction, robotics, and context‑aware content recommendation.