DRFormer: A Dual-Regularized Bidirectional Transformer for Person Re-identification


Both fine-grained discriminative details and global semantic features contribute to solving person re-identification challenges such as occlusion and pose variation. Vision foundation models (*e.g.*, DINO) excel at mining local textures, while vision-language models (*e.g.*, CLIP) capture strong global semantics. Existing methods predominantly rely on a single paradigm, neglecting the potential benefits of their integration. In this paper, we analyze the complementary roles of these two architectures and propose a framework that synergizes their strengths via a **D**ual-**R**egularized Bidirectional **Transformer** (**DRFormer**). The dual-regularization mechanism ensures diverse feature extraction and better balances the contributions of the two models. Extensive experiments on five benchmarks show that our method effectively harmonizes local and global representations, achieving competitive performance against state-of-the-art methods.


💡 Research Summary

DRFormer tackles the long‑standing challenge of person re‑identification (Re‑ID) by jointly exploiting the complementary strengths of two powerful pre‑trained vision models: DINO, a vision foundation model (VFM) that excels at extracting fine‑grained local textures, and CLIP, a vision‑language model (VLM) that captures strong global semantic cues. While prior Re‑ID works typically rely on either a VFM or a VLM, this paper demonstrates that integrating both can substantially improve robustness to occlusion, pose variation, and illumination changes.

The core of the method is a bidirectional cross‑attention transformer. After splitting an input image into non‑overlapping patches, the patches are fed separately into DINO and CLIP encoders. Each encoder is augmented with a small set of learnable tokens (N ≈ 4‑8). These tokens serve two purposes: (1) they act as queries for the cross‑attention module, and (2) they are regularized to encourage diverse attention patterns. The bidirectional module consists of one cross‑attention layer followed by two self‑attention layers. In the first pass, DINO tokens query CLIP tokens, allowing global semantic information to guide the selection of fine‑grained details. In the second pass, CLIP tokens query DINO tokens, enabling local details to enrich the global representation. The two resulting feature vectors (H_D→C and H_C→D) are concatenated and fed to a simple linear classifier.
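The two-pass exchange described above can be sketched with plain scaled dot-product attention. This is a minimal, single-head illustration in numpy: it omits the learned query/key/value projections, the two self-attention layers, and multi-head splitting that the actual module uses, and the token count and embedding dimension are illustrative, not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values, d):
    # Scaled dot-product attention: one branch's tokens attend to the other's.
    scores = queries @ keys_values.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ keys_values

rng = np.random.default_rng(0)
N, d = 4, 16                                 # illustrative: N learnable tokens, dim d
dino_tokens = rng.standard_normal((N, d))    # learnable tokens of the DINO branch
clip_tokens = rng.standard_normal((N, d))    # learnable tokens of the CLIP branch

# First pass: DINO tokens query CLIP tokens (global semantics guide local detail).
h_d2c = cross_attention(dino_tokens, clip_tokens, d)
# Second pass: CLIP tokens query DINO tokens (local detail enriches global view).
h_c2d = cross_attention(clip_tokens, dino_tokens, d)

# Pool each pass and concatenate; the paper feeds this to a linear classifier.
fused = np.concatenate([h_d2c.mean(axis=0), h_c2d.mean(axis=0)])
print(fused.shape)  # (32,)
```

The key design point survives even in this stripped-down form: each branch's tokens are queries in one pass and keys/values in the other, so information flows in both directions before fusion.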

Two novel regularizers are introduced to address two practical problems that arise when fusing heterogeneous models:

  1. Intra‑model Token Diversity Regularizer – Without additional constraints, multiple learnable tokens within the same encoder tend to attend to nearly identical regions, limiting the richness of the extracted features. The authors maximize the cosine distance (i.e., minimize cosine similarity) between the first token (augmented with camera‑specific side‑information embeddings) and the remaining tokens. This forces each token to capture complementary visual cues, such as accessories or background elements that are often decisive in Re‑ID. Empirically, this regularizer improves mean average precision (mAP) by about 2.5 % on DukeMTMC‑reID.

  2. Inter‑model Bias Regularizer – When the concatenated features are passed through a linear layer, the final logits are effectively a weighted sum of the DINO‑derived and CLIP‑derived scores. Because CLIP typically converges faster during training, its contribution can dominate, leading to an imbalance. The authors formulate a bias‑variance decomposition of the expected generalization error and derive closed‑form expressions for the optimal contribution weights (w₀, w₁) based on the bias of each branch. By adding a regularization term that drives the learned weights toward these optimal values, the training curves of the two branches become more synchronized, and the overall accuracy on Market‑1501 improves by roughly 1.8 percentage points.
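The two regularizers above can be sketched as simple penalty terms. This is a hedged illustration, not the paper's exact formulation: the squared-cosine penalty and the specific values of `w_opt` are assumptions for demonstration (the paper derives w₀, w₁ in closed form from a bias-variance decomposition, which is not reproduced here).

```python
import numpy as np

def cosine_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def intra_diversity_loss(tokens):
    # Penalize similarity between the first (side-information) token and the
    # rest, pushing each learnable token toward a distinct attention pattern.
    # Squared cosine similarity is an illustrative choice of penalty.
    first, rest = tokens[0], tokens[1:]
    return float(np.mean([cosine_sim(first, t) ** 2 for t in rest]))

def inter_bias_loss(w, w_opt):
    # Pull the learned branch-contribution weights toward the derived optimal
    # values, keeping the DINO and CLIP branches balanced during training.
    return float(np.sum((np.asarray(w) - np.asarray(w_opt)) ** 2))

rng = np.random.default_rng(1)
tokens = rng.standard_normal((4, 16))  # N learnable tokens of one encoder
r_intra = intra_diversity_loss(tokens)
# Hypothetical learned weights vs. hypothetical optimal weights:
r_inter = inter_bias_loss(w=[0.7, 0.3], w_opt=[0.55, 0.45])
print(r_intra, r_inter)
```

Both terms are differentiable in the token embeddings and branch weights, so in the real model they are simply added to the training objective and minimized by backpropagation.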

The overall loss combines the standard Re‑ID objectives (cross‑entropy ID loss and triplet loss) with the two regularization terms, weighted by hyper‑parameters λ_intra and λ_inter. The model is trained end‑to‑end on five widely used Re‑ID benchmarks: Market‑1501, DukeMTMC‑reID, MSMT17, CUHK‑03, and VeRi‑776. DRFormer consistently matches or exceeds state‑of‑the‑art methods such as TransReID, CLIP‑ReID, and PersonViT, especially on datasets with severe pose variation and occlusion where the synergy between local detail and global semantics is most beneficial.
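Putting the pieces together, the training objective described above can be written compactly (notation mirrors the summary; the exact symbols in the paper may differ):

```latex
\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{ID}} + \mathcal{L}_{\text{tri}}
  + \lambda_{\text{intra}}\,\mathcal{L}_{\text{intra}}
  + \lambda_{\text{inter}}\,\mathcal{L}_{\text{inter}}
```

where the ID and triplet losses are the standard Re‑ID objectives and the two λ hyper‑parameters trade off the intra‑model diversity and inter‑model bias terms.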

From a computational standpoint, the authors discard the image‑token outputs of the two encoders and keep only the learnable‑token embeddings, dramatically reducing memory and FLOPs compared with naïve feature concatenation. The bidirectional transformer adds only a modest overhead (one cross‑attention and two self‑attention layers), making the approach compatible with real‑time inference requirements.
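The saving from discarding the image-token outputs is easy to see from the shapes involved. The patch count and embedding dimension below are illustrative (a 224×224 input with 16×16 patches gives 196 patches; the paper's exact configuration may differ):

```python
import numpy as np

P, N, d = 196, 4, 768  # illustrative: image patches, learnable tokens, embedding dim
encoder_out = np.zeros((P + N, d), dtype=np.float32)  # patch + learnable tokens

# Keep only the learnable-token embeddings for fusion; the P patch-token
# outputs of each backbone are discarded, as described above.
kept = encoder_out[P:]

ratio = kept.size / encoder_out.size  # fraction of features retained per backbone
print(kept.shape, ratio)              # (4, 768) 0.02
```

With only a handful of learnable tokens per backbone entering the bidirectional module, the fusion stage operates on roughly 2% of the features a naive concatenation would carry, which is where the memory and FLOPs savings come from.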

In summary, DRFormer provides a principled framework for fusing heterogeneous pre‑trained vision models in a Re‑ID context. By introducing token‑level diversity and branch‑level bias regularization, it ensures that each model contributes uniquely and proportionally to the final representation. The paper not only achieves strong empirical results but also opens a new research direction: the systematic combination of multiple foundation models (VFM, VLM, masked‑image‑modeling networks, etc.) through bidirectional attention and dual regularization, potentially benefiting a wide range of vision tasks beyond person re‑identification.

