Enhancing Supervised Composed Image Retrieval via Reasoning-Augmented Representation Engineering
Composed Image Retrieval (CIR) presents a significant challenge: it requires jointly understanding a reference image and a modification instruction to find relevant target images. Some existing methods adopt a two-stage approach to refine retrieval results, but this typically requires training an additional ranking model. Despite the success of Chain-of-Thought (CoT) techniques in reducing training costs for language models, their application to CIR remains limited: existing approaches either compress visual information into text or rely on elaborate prompt designs. Moreover, existing works apply CoT only to zero-shot CIR, as it is difficult to achieve satisfactory gains over a well-trained model in the supervised setting. In this work, we propose a framework called the Pyramid Matching Model with Training-Free Refinement (PMTFR) to address these challenges. Through a simple but effective module called the Pyramid Patcher, we enhance the Pyramid Matching Model's understanding of visual information at different granularities. Inspired by representation engineering, we extract representations from CoT data and inject them into LVLMs. This allows us to obtain refined retrieval scores in the Training-Free Refinement paradigm without relying on explicit textual reasoning, further enhancing performance. Extensive experiments on CIR benchmarks demonstrate that PMTFR surpasses state-of-the-art methods in supervised CIR tasks. The code will be made public.
💡 Research Summary
The paper tackles the challenging task of Composed Image Retrieval (CIR), where a system must locate a target image given a reference image together with a textual modification instruction. Existing approaches fall into three broad categories: (1) sophisticated multimodal fusion architectures that increase computational cost, (2) two‑stage pipelines that first retrieve candidates and then train a separate ranking model to re‑order them, and (3) zero‑shot methods that employ Chain‑of‑Thought (CoT) prompting to generate richer textual descriptions. The authors argue that (1) often loses fine‑grained visual details, (2) requires additional labeled data and training resources, and (3) relies on explicit textual reasoning, which is both costly and prone to information loss when converting visual content to text.
To address these limitations, the authors propose a novel framework called PMTFR (Pyramid Matching Model with Training‑Free Refinement). PMTFR consists of two complementary modules:
- Pyramid Matching Model – This module uses a pre-trained Large Vision-Language Model (LVLM) as a frozen image encoder. The key innovation is the Pyramid Patcher, a lightweight preprocessing step that creates multi-scale visual tokens without altering the underlying transformer architecture. An input image is duplicated M times; each copy is patched with a different patch size (P, 2·P, 4·P, …, 2^{M-1}·P). The resulting token sequences, each representing a different receptive field, are embedded and concatenated into a single long sequence. Small-scale tokens capture fine details (e.g., a dog's tongue), while large-scale tokens encode global context (e.g., "on the grass"). The concatenated tokens are fed to the LVLM, and the representation of the last token is taken as the query or target embedding. Training proceeds with a contrastive InfoNCE loss that pulls together embeddings of matching query-target pairs and pushes apart mismatched pairs. This yields a fast, high-quality initial retrieval set.
- Training-Free Refinement – Instead of training an additional ranking network, the authors exploit CoT data to extract Reasoning-Augmented Representations (RAug-Rep). They feed each (reference image, candidate image) pair into the same LVLM and capture hidden states from a chosen intermediate layer (e.g., layer N-1). These hidden vectors, derived from CoT-generated reasoning paths, encode the model's latent "yes/no" judgment about whether the candidate satisfies the composed query. During inference, for the top candidates returned by the Pyramid Matching Model, the RAug-Rep is injected back into the LVLM's forward pass, producing a refinement score (e.g., probability of correctness). The final ranking combines the original contrastive similarity score with the refinement score, typically via a weighted sum or learned fusion, without any additional gradient updates.
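The multi-scale patching step of the Pyramid Patcher can be sketched as follows. This is a minimal NumPy illustration of the idea (duplicate the image, patch each copy at a doubled patch size, collect the per-scale token sequences); the image shape, base patch size, and function names are illustrative assumptions, and in the actual model each scale's tokens would be projected into the LVLM's embedding space before concatenation.

```python
import numpy as np

def patchify(image: np.ndarray, patch: int) -> np.ndarray:
    """Split an H x W x C image into non-overlapping patch x patch tokens,
    each flattened to a vector of length patch * patch * C."""
    H, W, C = image.shape
    rows, cols = H // patch, W // patch
    image = image[: rows * patch, : cols * patch]  # drop any remainder pixels
    tokens = (
        image.reshape(rows, patch, cols, patch, C)
        .transpose(0, 2, 1, 3, 4)                  # group by (row, col) patch grid
        .reshape(rows * cols, patch * patch * C)
    )
    return tokens

def pyramid_patch(image: np.ndarray, base_patch: int, num_scales: int):
    """Duplicate the image num_scales times and patch copy m with size
    base_patch * 2**m, giving token sequences at different receptive fields."""
    return [patchify(image, base_patch * (2 ** m)) for m in range(num_scales)]

img = np.random.rand(64, 64, 3)
scales = pyramid_patch(img, base_patch=8, num_scales=3)  # patch sizes 8, 16, 32
print([s.shape for s in scales])
# → [(64, 192), (16, 768), (4, 3072)]: finer scales yield more, smaller tokens
```

Because the raw token dimensionality grows with patch size, a per-scale linear projection to a shared embedding dimension would be needed before the sequences can be concatenated and fed to the LVLM.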
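The contrastive objective mentioned above can be sketched as a standard InfoNCE loss over a batch, where each query's positive is the target at the same index and all other targets serve as negatives. The temperature value and batch construction here are illustrative assumptions, not values from the paper:

```python
import numpy as np

def info_nce(query_emb: np.ndarray, target_emb: np.ndarray,
             temperature: float = 0.07) -> float:
    """InfoNCE loss: cross-entropy over cosine similarities, with the
    matching (diagonal) pair as the correct class for each query."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    t = target_emb / np.linalg.norm(target_emb, axis=1, keepdims=True)
    logits = q @ t.T / temperature                      # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)         # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))          # diagonal = positives

ids = np.eye(4)                                  # toy one-hot embeddings
print(info_nce(ids, ids))                        # near zero: pairs aligned
print(info_nce(ids, np.roll(ids, 1, axis=0)))    # large: positives misaligned
```

Minimizing this loss pulls each query embedding toward its matching target and pushes it away from the other targets in the batch, which is what makes the initial retrieval stage fast: ranking reduces to a nearest-neighbor search over precomputed target embeddings.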
Key insights and contributions:
- Multi‑granular visual tokens via the Pyramid Patcher improve the model’s ability to reason about both macro (background) and micro (object detail) aspects, a crucial requirement for CIR.
- Representation engineering applied to LVLMs: rather than generating explicit reasoning text, the method extracts internal representations from CoT data and injects them, achieving a training‑free boost.
- Zero‑training refinement eliminates the need for a separate ranking model, saving data preparation, compute, and training time while still delivering measurable performance gains.
- Empirical validation on two standard CIR benchmarks (FashionIQ and CIRR) shows that PMTFR outperforms prior state‑of‑the‑art methods, achieving higher Recall@K and Top‑1 accuracy. Ablation studies confirm that both the Pyramid Patcher and the RAug‑Rep injection contribute additively.
- Broad compatibility: the framework works with multiple LVLM backbones (e.g., BLIP‑2, LLaVA, Qwen‑VL), indicating that the approach is not tied to a specific architecture.
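The training-free fusion described above (combining the contrastive similarity with the refinement score) can be sketched as a weighted sum. The `alpha` weight and the min-max normalization below are hypothetical choices for illustration, since the summary leaves the exact fusion rule open ("a weighted sum or learned fusion"):

```python
import numpy as np

def refine_ranking(sim_scores, yes_probs, alpha: float = 0.5) -> np.ndarray:
    """Re-rank top candidates by fusing first-stage similarities with
    refinement probabilities; returns candidate indices, best first."""
    sim = np.asarray(sim_scores, dtype=float)
    prob = np.asarray(yes_probs, dtype=float)
    # min-max normalize similarities so both terms share a [0, 1] range
    sim = (sim - sim.min()) / (sim.max() - sim.min() + 1e-8)
    final = alpha * sim + (1 - alpha) * prob
    return np.argsort(-final)

# Candidate 2 has the best first-stage score, but candidate 1's high
# refinement probability narrows the gap after fusion.
order = refine_ranking([0.92, 0.88, 0.95], [0.10, 0.90, 0.40], alpha=0.5)
print(order)  # → [2 1 0]
```

No gradient updates are involved: both inputs come from forward passes of the already-trained models, which is what makes the refinement stage training-free.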
The paper also outlines future directions: automated selection of the number of scales M, extending RAug‑Rep to other domains (medical imaging, satellite imagery), and exploring hybrid refinements that combine textual CoT prompts with representation injection.
In summary, PMTFR offers a simple yet powerful solution for supervised CIR: a multi‑scale visual tokenization stage that enriches the base LVLM’s perception, followed by a training‑free refinement that leverages latent reasoning representations. The method achieves state‑of‑the‑art results while reducing computational overhead and eliminating the need for extra ranking‑model training, and the authors plan to release code and pretrained models to facilitate further research.