PMPGuard: Catching Pseudo-Matched Pairs in Remote Sensing Image-Text Retrieval
Remote sensing (RS) image-text retrieval faces significant challenges on real-world datasets due to Pseudo-Matched Pairs (PMPs): semantically mismatched or weakly aligned image-text pairs that hinder the learning of reliable cross-modal alignments. To address this issue, we propose a novel retrieval framework that leverages Cross-Modal Gated Attention and a Positive-Negative Awareness Attention mechanism to mitigate the impact of such noisy associations. The gated module dynamically regulates cross-modal information flow, while the awareness mechanism explicitly distinguishes informative (positive) cues from misleading (negative) ones during alignment learning. Extensive experiments on three benchmark RS datasets (RSICD, RSITMD, and RS5M) demonstrate that our method consistently achieves state-of-the-art performance, highlighting its robustness and effectiveness in handling real-world mismatches and PMPs in RS image-text retrieval.
💡 Research Summary
The paper tackles a pervasive problem in remote‑sensing (RS) image‑text retrieval: the presence of “Pseudo‑Matched Pairs” (PMPs). These are image‑caption pairs that are only partially aligned or outright mismatched, a situation that arises frequently because large‑scale RS datasets are often built using automated captioning pipelines, web‑crawled metadata, or crowdsourced annotations. Traditional retrieval models assume that every training pair labeled as positive is truly semantically matched; when this assumption is violated, noisy supervision degrades the learned cross‑modal embeddings and harms retrieval performance.
To address this, the authors propose PMPGuard, a retrieval framework that does not merely suppress noisy pairs but actively extracts useful semantic cues from them. PMPGuard consists of two complementary attention mechanisms: Cross‑Modal Gated Attention (CGA) and Positive‑Negative Awareness Attention (PNAA).
Cross‑Modal Gated Attention (CGA)
CGA first computes a standard cross-attention matrix between textual token embeddings $U = \{u_i\}$ and image region embeddings $V = \{v_j\}$. The attention scores $a_{ij} = u_i^\top W_a v_j / (\|u_i\| \|v_j\|)$ are normalized to obtain forward ($\alpha_{ij}$) and backward ($\beta_{ji}$) attention weights. Using these weights, cross-modal context vectors $\tilde{v}_i = \sum_j \alpha_{ij} v_j$ and $\tilde{u}_j = \sum_i \beta_{ji} u_i$ are generated.
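The attention step above can be sketched in NumPy as follows. This is a minimal illustration of the described computation, not the paper's implementation; shapes, the temperature, and the softmax normalization choice are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(U, V, W_a, temperature=1.0):
    """Cross-attention between text tokens U (n x d) and image regions V (m x d).

    W_a is a learned d x d bilinear weight; names follow the text's notation.
    Returns per-token image context V_tilde, per-region text context U_tilde,
    and the forward/backward attention weights alpha and beta.
    """
    # a_ij = u_i^T W_a v_j / (||u_i|| ||v_j||)  (cosine-normalized scores)
    scores = (U @ W_a @ V.T) / (
        np.linalg.norm(U, axis=1, keepdims=True)
        * np.linalg.norm(V, axis=1)[None, :]
    )
    # forward weights alpha_ij: normalize over image regions for each token
    alpha = softmax(scores / temperature, axis=1)          # (n, m)
    # backward weights beta_ji: normalize over text tokens for each region
    beta = softmax(scores.T / temperature, axis=1)         # (m, n)
    # cross-modal context vectors from the weighted sums in the text
    V_tilde = alpha @ V                                    # (n, d)
    U_tilde = beta @ U                                     # (m, d)
    return V_tilde, U_tilde, alpha, beta
```

Each row of `alpha` (and of `beta`) sums to one, so every context vector is a convex combination of features from the opposite modality.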
A gating function then decides how much of the original modality-specific feature versus the cross-modal context should be retained. The gate for each token takes the standard sigmoid form $g_{u_i} = \sigma(W_{ug}[u_i; \tilde{v}_i] + b_{ug})$, where $[\cdot;\cdot]$ denotes concatenation, and the gated text representation is the blend $\hat{u}_i = g_{u_i} \odot \tilde{v}_i + (1 - g_{u_i}) \odot u_i$; an analogous gate is applied on the image side.
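The gating idea can be sketched as below, assuming a standard concatenation-based sigmoid gate and an elementwise convex blend; the exact parameterization (`W_ug`, `b_ug`, the concatenation order) is an assumption for illustration, not taken from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(U, V_tilde, W_ug, b_ug):
    """Per-token gate deciding how much cross-modal context to retain.

    U       : (n, d) original text-token features u_i
    V_tilde : (n, d) cross-modal image context per token
    W_ug    : (2d, d) gate weight; b_ug : (d,) gate bias  (assumed shapes)
    """
    # g_i = sigma(W_ug [u_i; v~_i] + b_ug), elementwise gate in (0, 1)
    concat = np.concatenate([U, V_tilde], axis=1)   # (n, 2d)
    g = sigmoid(concat @ W_ug + b_ug)               # (n, d)
    # blend: keep cross-modal context where g is high, original feature where low
    return g * V_tilde + (1.0 - g) * U
```

Because the gate is strictly inside (0, 1), each output coordinate lies between the corresponding entries of the original feature and the cross-modal context, which is what lets the model soften the influence of a mismatched (pseudo-matched) pairing.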