QA-ReID: Quality-Aware Query-Adaptive Convolution Leveraging Fused Global and Structural Cues for Clothes-Changing ReID
Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Unlike conventional person re-identification (ReID), clothes-changing ReID (CC-ReID) poses severe challenges due to the substantial appearance variations introduced by clothing changes. In this work, we propose QA-ReID, a quality-aware dual-branch matching framework that jointly leverages RGB-based features and parsing-based representations to model both global appearance and clothing-invariant structural cues. These heterogeneous features are adaptively fused through a multi-modal attention module. At the matching stage, we further design the Quality-Aware Query Adaptive Convolution (QAConv-QA), which incorporates pixel-level importance weighting and bidirectional consistency constraints to enhance robustness against clothing variations. Extensive experiments demonstrate that QA-ReID achieves state-of-the-art performance on multiple benchmarks, including PRCC, LTCC, and VC-Clothes, and significantly outperforms existing approaches under cross-clothing scenarios.


💡 Research Summary

The paper addresses the challenging problem of clothes‑changing person re‑identification (CC‑ReID), where traditional ReID methods that rely heavily on appearance cues such as color and texture fail when a subject changes garments. The authors propose QA‑ReID, a novel framework that combines two complementary streams of information and a quality‑aware matching module.

Dual‑branch feature extraction: The first branch processes the raw RGB image to capture global appearance, while the second branch feeds a human‑parsing mask‑derived image (obtained by masking out clothing regions) into an identical ResNet‑50 backbone to extract clothing‑invariant structural cues (head, face, limbs). Both branches output feature maps of size C × H × W after the third ResNet stage.
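The dual-branch extraction can be sketched with a toy stand-in for the shared trunk. Everything below is hypothetical scaffolding (the `toy_backbone` projection, tensor sizes, and the row-band "clothing" mask are illustrative, not the paper's actual ResNet-50 or parsing model); it only shows the data flow: the same backbone sees the raw RGB crop and a copy with clothing regions masked out, yielding two C × H × W feature maps.

```python
import numpy as np

def toy_backbone(img, out_channels=8):
    """Hypothetical stand-in for the shared ResNet-50 trunk: a fixed
    random projection over 2x2 patches, producing a C x H x W map."""
    rng = np.random.default_rng(0)  # fixed seed -> both branches share weights
    c_in, h, w = img.shape
    patches = img.reshape(c_in, h // 2, 2, w // 2, 2).mean(axis=(2, 4))  # 2x downsample
    weights = rng.standard_normal((out_channels, c_in))
    feat = np.einsum('oc,chw->ohw', weights, patches)  # 1x1 "conv" per location
    return np.maximum(feat, 0.0)  # ReLU

rgb = np.random.rand(3, 16, 8)                 # toy person crop, C x H x W
parsing_mask = np.ones((16, 8))
parsing_mask[6:12, :] = 0.0                    # pretend these rows are clothing
masked = rgb * parsing_mask                    # clothing regions zeroed out

f_rgb = toy_backbone(rgb)                      # global-appearance branch
f_parse = toy_backbone(masked)                 # clothing-invariant structural branch
assert f_rgb.shape == f_parse.shape == (8, 8, 4)
```

In the actual framework both branches would be trained ResNet-50 stages and the mask would come from a human-parsing network; the point here is just that the two branches produce shape-aligned feature maps ready for fusion.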

Multi‑modal attention fusion: The RGB and parsing feature maps are concatenated and passed through separate channel‑wise and spatial attention branches. Their outputs are multiplied to produce a joint attention map ω.
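A minimal NumPy sketch of this fusion step, under stated assumptions: the channel branch is a squeeze-and-project with an untrained random weight matrix, and the spatial branch is a channel-mean squeeze; the paper's trained attention layers would replace both, but the shapes and the channel-times-spatial product are the same.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fuse(f_rgb, f_parse, seed=1):
    """Sketch of multi-modal attention fusion: concatenate the two C x H x W
    maps, compute channel-wise and spatial attention separately, and multiply
    them into a joint attention map omega that reweights the fused features."""
    rng = np.random.default_rng(seed)
    f = np.concatenate([f_rgb, f_parse], axis=0)   # 2C x H x W
    c, h, w = f.shape
    # channel attention: squeeze spatially, then an (untrained) projection
    ch_desc = f.mean(axis=(1, 2))                  # 2C
    w_ch = rng.standard_normal((c, c))             # hypothetical weight matrix
    a_ch = sigmoid(w_ch @ ch_desc)[:, None, None]  # 2C x 1 x 1
    # spatial attention: squeeze channels
    sp_desc = f.mean(axis=0, keepdims=True)        # 1 x H x W
    a_sp = sigmoid(sp_desc)                        # 1 x H x W
    omega = a_ch * a_sp                            # joint map, 2C x H x W
    return omega * f

fused = fuse(np.random.rand(8, 8, 4), np.random.rand(8, 8, 4))
assert fused.shape == (16, 8, 4)
```

The broadcast of the (2C, 1, 1) channel weights against the (1, H, W) spatial weights is what yields the full joint attention map ω over every channel-location pair.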
