ReCALL: Recalibrating Capability Degradation for MLLM-based Composed Image Retrieval
Composed Image Retrieval (CIR) aims to retrieve target images based on a hybrid query comprising a reference image and a modification text. Early dual-tower Vision-Language Models (VLMs) struggle with cross-modality compositional reasoning required for this task. Recently, adapting generative Multimodal Large Language Models (MLLMs) for retrieval offers a promising direction. However, we identify that this adaptation strategy overlooks a fundamental issue: adapting a generative MLLM into a single-embedding discriminative retriever triggers a paradigm conflict, which leads to Capability Degradation - the deterioration of native fine-grained reasoning after retrieval adaptation. To address this challenge, we propose ReCALL (Recalibrating Capability Degradation), a model-agnostic framework that follows a diagnose-generate-refine pipeline: Firstly, we diagnose cognitive blind spots of the retriever via self-guided informative instance mining. Next, we generate corrective instructions and triplets by CoT prompting the foundation MLLM and conduct quality control with VQA-based consistency filtering. Finally, we refine the retriever through continual training on these triplets with a grouped contrastive scheme, thereby internalizing fine-grained visual-semantic distinctions and realigning the discriminative embedding space of retriever with intrinsic compositional reasoning within the MLLM. Extensive experiments on CIRR and FashionIQ show that ReCALL consistently recalibrates degraded capabilities and achieves state-of-the-art performance. Code will be released soon.
💡 Research Summary
Composed Image Retrieval (CIR) requires a system to find a target image given a hybrid query that consists of a reference image and a textual modification. Early dual‑tower Vision‑Language Models (VLMs) such as CLIP struggle with the fine‑grained cross‑modal reasoning needed for this task because their alignment between image and text is shallow. Recent work therefore adapts Multimodal Large Language Models (MLLMs) – models that fuse visual and textual information deeply and are trained to follow natural‑language instructions – to CIR. The common practice is to fine‑tune an MLLM on a set of (reference image, modification text, target image) triplets using contrastive learning, thereby turning the generative model into a discriminative single‑embedding retriever (denoted R_base).
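The contrastive adaptation described above can be sketched as a batch-wise InfoNCE objective, where each fused (reference image + modification text) query embedding is pulled toward its target-image embedding and pushed from all other targets in the batch. This is a minimal numpy sketch under that assumption, not the paper's implementation; the temperature value is illustrative.

```python
import numpy as np

def info_nce(query_emb, target_emb, temperature=0.07):
    """Batch-wise InfoNCE: each query's positive is the target at the same
    index; all other targets in the batch act as in-batch negatives."""
    # L2-normalize so dot products are cosine similarities.
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    t = target_emb / np.linalg.norm(target_emb, axis=1, keepdims=True)
    logits = q @ t.T / temperature  # (B, B) similarity matrix
    # Row-wise log-softmax; the correct target sits on the diagonal.
    logits -= logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```

When queries and targets are perfectly aligned the loss approaches zero; mismatched pairs drive it toward log(batch size), which is what makes the single-embedding retriever discriminative.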
The authors identify a fundamental problem they call Capability Degradation. When an MLLM is repurposed from its native generative, step‑by‑step reasoning paradigm to a similarity‑based retrieval paradigm, the model’s intrinsic fine‑grained compositional reasoning deteriorates. Empirically, the foundation model (F) can answer a VQA version of the query correctly 100% of the time on a selected subset, while the fine‑tuned retriever (R_base) only achieves 62% (CIRR) and 56% (FashionIQ) recall@1 on the same subset. Qualitative examples show that R_base fails to distinguish subtle visual differences (e.g., two dogs of the same breed) that F can reason about through a chain of thought.
To recover the lost abilities, the paper proposes ReCALL – a model‑agnostic, three‑stage pipeline: Diagnose → Generate → Refine.
- Diagnose (Self‑Guided Informative Instance Mining). R_base is run on the whole training set, and queries that fail to retrieve the ground truth at rank 1 are collected as failure cases. For each failure, the top‑K images ranked above the ground truth are taken as hard negatives (I_h). These hard negatives are visually and semantically close to the correct target, precisely exposing the decision boundaries where R_base’s reasoning has degraded.
- Generate (Generative Calibration). The foundation model F is prompted with a Chain‑of‑Thought (CoT) style instruction that asks it to (a) decompose the original modification text T_m into atomic intents, (b) verify each intent against the reference image I_r and a hard negative I_h, and (c) synthesize a minimal edit ˜T_m that makes the query correctly describe I_h. This yields a corrective triplet (I_r, ˜T_m, I_h). To filter out noisy generations, a VQA‑based consistency check is applied: the model is asked targeted questions about key attributes in ˜T_m, and a triplet is retained only if the answers are high‑confidence and internally consistent.
- Refine (Targeted Refinement). Starting from R_base, the model is fine‑tuned again with a grouped contrastive loss. Each mini‑batch contains both the original positive triplet (I_r, T_m, I_t) and its corrective counterpart (I_r, ˜T_m, I_h). The loss combines the standard InfoNCE term over the whole batch (preserving global retrieval structure) with an explicit push‑apart term for the hard negative within the same group. This forces the model to learn the subtle visual‑semantic distinctions encoded in the minimal textual edits, effectively realigning its embedding space with the compositional reasoning of F.
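The Diagnose stage reduces to a ranking pass over the training set. A minimal sketch, assuming a precomputed query-gallery similarity matrix from R_base (the function name and inputs are illustrative, not the paper's code):

```python
import numpy as np

def mine_hard_negatives(sim, gt_index, top_k=3):
    """Self-guided informative instance mining (sketch).
    sim: (num_queries, num_gallery) similarity matrix from R_base.
    gt_index: ground-truth gallery index for each query.
    Returns {query_idx: [hard-negative gallery indices ranked above GT]}."""
    failures = {}
    for qi, row in enumerate(sim):
        ranking = np.argsort(-row)  # gallery indices, best first
        gt_rank = int(np.where(ranking == gt_index[qi])[0][0])
        if gt_rank == 0:
            continue  # ground truth already retrieved at rank 1
        # Images ranked above the ground truth are the hard negatives I_h.
        failures[qi] = ranking[:min(top_k, gt_rank)].tolist()
    return failures
```

Only queries whose ground truth falls below rank 1 contribute, so the mined set concentrates exactly on the decision boundaries where the retriever's reasoning has degraded.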
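For the Generate stage, the CoT instruction can be assembled as a simple template passed to F alongside I_r and I_h. The wording below is a hypothetical illustration of the three-step structure, not the paper's actual prompt:

```python
def build_cot_prompt(modification_text):
    """CoT-style calibration prompt (sketch; phrasing is illustrative).
    The foundation model F receives the reference image I_r and a mined
    hard negative I_h alongside this text."""
    return (
        "You are given a reference image and a candidate image.\n"
        f'1. Decompose the modification text into atomic intents: "{modification_text}".\n'
        "2. Verify each intent against both images, step by step.\n"
        "3. Output a minimally edited modification text that correctly "
        "describes the candidate image relative to the reference."
    )
```

The model's output ˜T_m is then paired with (I_r, I_h) to form a corrective triplet, which passes to the VQA-based consistency filter before training.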
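The Refine stage's grouped objective can be sketched as batch-wise InfoNCE plus a hinge-style push-apart term on each group's hard negative. This is a minimal numpy sketch; the margin, weight, and hinge form are assumptions, since the paper's exact formulation is not reproduced here:

```python
import numpy as np

def grouped_contrastive_loss(q, pos, hard_neg, temperature=0.07,
                             margin=0.2, alpha=1.0):
    """Grouped contrastive scheme (sketch).
    q, pos, hard_neg: (B, D) embeddings of the fused query, the target
    I_t, and the mined hard negative I_h within the same group."""
    def normalize(x):
        return x / np.linalg.norm(x, axis=1, keepdims=True)
    q, pos, hard_neg = normalize(q), normalize(pos), normalize(hard_neg)
    # Standard InfoNCE over the whole batch preserves global structure.
    logits = q @ pos.T / temperature
    logits -= logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    info_nce = -np.mean(np.diag(log_probs))
    # Push-apart term: the hard negative's similarity must trail the
    # positive's by at least `margin` within each group.
    s_pos = np.sum(q * pos, axis=1)
    s_neg = np.sum(q * hard_neg, axis=1)
    push_apart = np.mean(np.maximum(0.0, margin + s_neg - s_pos))
    return info_nce + alpha * push_apart
```

Because the push-apart term only activates when I_h crowds the positive, it sharpens fine-grained discrimination without distorting the global retrieval geometry maintained by the InfoNCE term.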
Key technical contributions include:
- A self‑supervised method to discover the most informative failure cases without extra annotation.
- Leveraging the generative reasoning of an MLLM to produce discriminative supervision, bridging the paradigm gap.
- A VQA‑based quality filter that ensures only reliable corrective signals are used.
- A grouped contrastive training scheme that preserves overall retrieval performance while sharpening fine‑grained discrimination.
Experiments on the two major CIR benchmarks, CIRR and FashionIQ, demonstrate that ReCALL consistently improves recall@1 by +8.7% and +9.5% respectively over the strong baseline R_base, and outperforms all previously reported methods. Ablation studies confirm that each component (informative mining, CoT generation, VQA filtering, grouped contrastive loss) contributes meaningfully to the final gain. The approach is also parameter‑efficient: LoRA adapters modify only a tiny fraction of the model’s weights, leading to modest training overhead.
The paper discusses broader implications: the paradigm conflict identified here likely affects other downstream tasks where generative MLLMs are forced into discriminative roles (e.g., image‑text matching, video retrieval). ReCALL offers a general recipe for self‑improvement loops that can be adapted to those settings. Limitations include dependence on the quality of VQA questions and the need for sufficient hard negatives, which may be challenging in very small datasets.
In summary, ReCALL provides a principled solution to the Capability Degradation problem by diagnosing a model’s blind spots, generating minimal corrective instructions with the foundation model’s reasoning power, and refining the retriever through a targeted contrastive regime. This restores the fine‑grained compositional reasoning lost during standard fine‑tuning and sets a new state‑of‑the‑art on CIR benchmarks.