IntRec: Intent-based Retrieval with Contrastive Refinement
Retrieving user-specified objects from complex scenes remains a challenging task, especially when queries are ambiguous or involve multiple similar objects. Existing open-vocabulary detectors operate in a one-shot manner, lacking the ability to refine predictions based on user feedback. To address this, we propose IntRec, an interactive object retrieval framework that refines predictions based on user feedback. At its core is an Intent State (IS) that maintains dual memory sets for positive anchors (confirmed cues) and negative constraints (rejected hypotheses). A contrastive alignment function ranks candidate objects by maximizing similarity to positive cues while penalizing rejected ones, enabling fine-grained disambiguation in cluttered scenes. Our interactive framework provides substantial improvements in retrieval accuracy without additional supervision. On LVIS, IntRec achieves 35.4 AP, outperforming OVMR, CoDet, and CAKE by +2.3, +3.7, and +0.5, respectively. On the challenging LVIS-Ambiguous benchmark, it improves performance by +7.9 AP over its one-shot baseline after a single corrective feedback, with less than 30 ms of added latency per interaction.
💡 Research Summary
IntRec addresses the longstanding challenge of retrieving a user‑specified object from cluttered scenes, where traditional open‑vocabulary detectors operate in a stateless, one‑shot manner. Such detectors encode a textual query and match it against region features, returning the region with the highest similarity. This approach fails when the query is ambiguous or when multiple visually similar objects satisfy the same description, leading to frequent mis‑retrievals and offering no mechanism for correction.
The core contribution of IntRec is the Intent State (IS), a dynamic memory that stores two exemplar sets: positive anchors (Z⁺) representing confirmed cues and negative constraints (Z⁻) representing rejected hypotheses. At turn 0 the user supplies a text prompt T₀ and optionally a reference image Iᵣ. Both are encoded with CLIP (or any compatible vision‑language encoder) to obtain embeddings z_T and z_I, which are linearly fused (α·z_T + (1‑α)·z_I) into an initial positive exemplar z₀ᵖ. This vector seeds Z⁺, while Z⁻ starts empty.
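This seeding step can be sketched as follows. A minimal NumPy sketch, assuming the embeddings are L2‑normalized CLIP‑style vectors and the Intent State is kept as a dict of two exemplar lists; the function name and the default α = 0.7 are illustrative assumptions, not the paper's API:

```python
import numpy as np

def init_intent_state(z_text, z_image=None, alpha=0.7):
    """Seed the Intent State at turn 0.
    z_text / z_image: L2-normalized embeddings from a CLIP-style encoder;
    alpha is the text/image fusion weight (default here is assumed)."""
    z_text = np.asarray(z_text, dtype=float)
    if z_image is None:
        z0 = z_text                                   # text-only query
    else:
        # Linear fusion: alpha * z_T + (1 - alpha) * z_I
        z0 = alpha * z_text + (1 - alpha) * np.asarray(z_image, dtype=float)
    z0 = z0 / np.linalg.norm(z0)                      # re-normalize the fused exemplar
    return {"pos": [z0], "neg": []}                   # Z+ seeded with z0, Z- empty
```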
All candidate regions r₁…r_M are extracted from the target image using a pre‑trained detector (e.g., Faster R‑CNN + CLIP). For each interaction turn t, IntRec computes a contrastive alignment score for every candidate:
S(r_j | IS_t) = max_{z⁺∈Z⁺} cos(r_j, z⁺) − λ · max_{z⁻∈Z⁻} cos(r_j, z⁻).
The first term rewards similarity to any positive exemplar; the second term penalizes similarity to any negative exemplar, with λ controlling the strength of the penalty. The scores are calculated in a single matrix operation, enabling real‑time performance (≈30 ms per interaction). The top‑k regions are presented to the user.
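The scoring rule above can be written in vectorized form. A sketch assuming all embeddings are L2‑normalized (so dot products equal cosine similarities); `lam` plays the role of λ, and the function name is ours:

```python
import numpy as np

def score_candidates(R, Z_pos, Z_neg, lam=0.6):
    """Score M candidate regions against the Intent State.
    R: (M, d) array of L2-normalized region embeddings.
    Z_pos / Z_neg: lists of L2-normalized exemplar vectors of shape (d,)."""
    pos = np.stack(Z_pos)                          # (P, d)
    scores = (R @ pos.T).max(axis=1)               # reward: best match to any positive exemplar
    if Z_neg:                                      # penalty applies only once Z- is non-empty
        neg = np.stack(Z_neg)                      # (N, d)
        scores -= lam * (R @ neg.T).max(axis=1)    # penalize best match to any rejected exemplar
    return scores
```

The two matrix products `R @ pos.T` and `R @ neg.T` are the "single matrix operation" that keeps per‑turn latency low.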
User feedback comes in two forms: (i) positive confirmation of a region (or a new textual refinement), which is added to Z⁺, and (ii) negative rejection, which adds the region’s embedding to Z⁻. The IS is updated accordingly, and the next round re‑ranks candidates using the updated exemplar sets. This loop continues until the user confirms the correct object.
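The update rule for one round of feedback is simple. A sketch assuming the Intent State is a dict of two embedding lists (a hypothetical representation, not the paper's data structure):

```python
def update_intent(intent, emb, accepted):
    """Fold one piece of user feedback into the Intent State:
    a confirmation appends emb to Z+ ("pos"), a rejection to Z- ("neg")."""
    intent["pos" if accepted else "neg"].append(emb)
    return intent
```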
Theoretically, the method resolves the “ambiguity condition,” in which a distractor region r_d has similarity to the query at least as high as the true target r_*. A one‑shot model would select r_d and has no way to recover. IntRec, after a single negative feedback on r_d, inserts r_d into Z⁻, sharply lowering its score, while a subsequent positive feedback on r_* enriches Z⁺, guaranteeing that r_* outranks r_d in subsequent rounds.
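This recovery argument can be checked numerically. A toy sketch with hypothetical 3‑D embeddings and λ = 0.6; in this toy the distractor and target are dissimilar enough that the negative feedback alone already flips the ranking:

```python
import numpy as np

def unit(v):
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

def score(r, Z_pos, Z_neg, lam=0.6):
    # Contrastive alignment: best positive match minus penalized best negative match
    s = max(float(r @ z) for z in Z_pos)
    if Z_neg:
        s -= lam * max(float(r @ z) for z in Z_neg)
    return s

z0  = unit([1, 0, 0])                       # initial query exemplar
r_d = unit([np.cos(0.4), np.sin(0.4), 0])   # distractor: closer to the query...
r_t = unit([np.cos(0.5), 0, np.sin(0.5)])   # ...than the true target

Z_pos, Z_neg = [z0], []
s0_d, s0_t = score(r_d, Z_pos, Z_neg), score(r_t, Z_pos, Z_neg)  # one-shot: distractor wins

Z_neg.append(r_d)                           # user rejects the distractor -> it enters Z-
s1_d, s1_t = score(r_d, Z_pos, Z_neg), score(r_t, Z_pos, Z_neg)  # now the target outranks it
```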
Empirical evaluation was performed on LVIS and the LVIS‑Ambiguous benchmark, which emphasizes queries with multiple similar candidates. IntRec achieved 35.4 AP on LVIS, surpassing OVMR (+2.3 AP), CoDet (+3.7 AP), and CAKE (+0.5 AP). On LVIS‑Ambiguous, a single corrective feedback yielded a +7.9 AP gain over the one‑shot baseline, demonstrating the framework’s ability to disambiguate in challenging settings.
Ablation studies revealed that using only Z⁺ or only Z⁻ degrades performance by 3–4 AP, confirming the necessity of both positive and negative memories. Varying λ showed optimal performance around 0.5–0.7; setting λ = 0 eliminates the benefit of negative feedback, while λ = 1 over‑penalizes and harms recall. The system’s latency remains low (≤30 ms per turn), making it suitable for interactive applications such as human‑robot collaboration, AR/VR visual search, and assistive interfaces.
Limitations include dependence on the initial detector’s proposal set—if the true object is never proposed, feedback cannot recover it—and the current focus on single‑image retrieval, leaving video or multi‑image extensions to future work.
In summary, IntRec introduces a principled, memory‑based contrastive refinement mechanism that leverages minimal user feedback to dramatically improve object retrieval in open‑world, ambiguous scenes, while maintaining real‑time responsiveness and requiring no additional supervision beyond the standard vision‑language pre‑training.