Bi-Modal Textual Prompt Learning for Vision-Language Models in Remote Sensing
Prompt learning (PL) has emerged as an effective strategy for adapting vision-language models (VLMs), such as CLIP, to downstream tasks under limited supervision. While PL has demonstrated strong generalization on natural image datasets, its transferability to remote sensing (RS) imagery remains underexplored. RS data present unique challenges, including multi-label scenes, high intra-class variability, and diverse spatial resolutions, which hinder the direct applicability of existing PL methods. In particular, current prompt-based approaches often struggle to identify dominant semantic cues and fail to generalize to novel classes in RS scenarios. To address these challenges, we propose BiMoRS, a lightweight bi-modal prompt learning framework tailored for RS tasks. BiMoRS employs a frozen image captioning model (e.g., BLIP-2) to extract textual semantic summaries from RS images. These captions are tokenized using a BERT tokenizer and fused with high-level visual features from the CLIP encoder. A lightweight cross-attention module then conditions a learnable query prompt on the fused textual-visual representation, yielding contextualized prompts without altering the CLIP backbone. We evaluate BiMoRS on four RS datasets across three domain generalization (DG) tasks and observe consistent performance gains, outperforming strong baselines by up to 2% on average. Code is available at https://github.com/ipankhi/BiMoRS.
💡 Research Summary
The paper introduces BiMoRS, a lightweight bi‑modal prompt learning framework designed to improve the transferability of vision‑language models (VLMs), particularly CLIP, to remote sensing (RS) tasks where data are scarce and domain shifts are common. Traditional prompt learning methods such as CoOp and CoCoOp rely on static or class‑conditioned prompts, which struggle with the multi‑label, spatially complex nature of RS imagery. Moreover, fine‑tuning CLIP on large RS‑caption datasets (e.g., RemoteCLIP, RSCLIP) incurs high annotation costs and still lacks image‑specific adaptability at inference time.
BiMoRS addresses these issues by leveraging a frozen image captioning model (BLIP‑2) to generate natural‑language descriptions of each input RS image. The captions are tokenized with a pretrained BERT tokenizer, preserving rich contextual information. Simultaneously, high‑level visual features are extracted from the penultimate layer of the frozen CLIP image encoder. Both textual and visual embeddings are projected into a shared 512‑dimensional space via lightweight projection heads (P_t and P_v). The two modalities are concatenated to form a bi‑modal token matrix B′, which serves as key‑value pairs for a multi‑head cross‑attention (CA) module. A set of learnable query prompt tokens Q′ (four tokens) attends over B′, producing contextualized prompts Q that replace the static "a photo of a [CLASS]" template used by CLIP.
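The projection-and-attention pipeline above can be sketched in PyTorch. This is a minimal illustration assuming standard feature dimensions (768-d BERT caption embeddings, 1024-d CLIP visual features); the class name `BiModalPromptGenerator`, the head count, and the dummy inputs are hypothetical, not taken from the paper's released code.

```python
import torch
import torch.nn as nn

class BiModalPromptGenerator(nn.Module):
    """Sketch of the described module: project caption-token and visual
    embeddings into a shared 512-d space, concatenate them into a bi-modal
    matrix B', and let learnable query prompts Q' attend over B'."""

    def __init__(self, text_dim=768, vis_dim=1024, shared_dim=512,
                 num_queries=4, num_heads=8):
        super().__init__()
        self.P_t = nn.Linear(text_dim, shared_dim)  # textual projection head
        self.P_v = nn.Linear(vis_dim, shared_dim)   # visual projection head
        # Q': four learnable query prompt tokens
        self.Q = nn.Parameter(torch.randn(num_queries, shared_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(shared_dim, num_heads,
                                                batch_first=True)

    def forward(self, caption_tokens, visual_feats):
        # caption_tokens: (B, L_t, text_dim) BERT-tokenized caption embeddings
        # visual_feats:   (B, L_v, vis_dim) penultimate-layer CLIP features
        batch = caption_tokens.size(0)
        b_prime = torch.cat([self.P_t(caption_tokens),
                             self.P_v(visual_feats)], dim=1)  # bi-modal B'
        queries = self.Q.unsqueeze(0).expand(batch, -1, -1)   # Q' per sample
        ctx, _ = self.cross_attn(queries, b_prime, b_prime)   # keys = values = B'
        return ctx  # contextualized prompts Q: (B, num_queries, shared_dim)

# Dummy forward pass with random features standing in for BERT/CLIP outputs
gen = BiModalPromptGenerator()
prompts = gen(torch.randn(2, 32, 768), torch.randn(2, 50, 1024))
print(prompts.shape)  # torch.Size([2, 4, 512])
```

The CLIP backbone stays frozen; only the two projection heads, the query tokens, and the cross-attention weights would be trained, which is what keeps the framework lightweight.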