Beyond the Encoder: Joint Encoder-Decoder Contrastive Pre-Training Improves Dense Prediction
Contrastive learning methods in self-supervised settings have primarily focused on pre-training encoders, while decoders are typically introduced and trained separately for downstream dense prediction tasks. However, this conventional approach overlooks the potential benefits of jointly pre-training both encoder and decoder. In this paper, we propose DeCon, an efficient encoder-decoder self-supervised learning (SSL) framework that supports joint contrastive pre-training. We first extend existing SSL architectures to accommodate diverse decoders and their corresponding contrastive losses. Then, we introduce a weighted encoder-decoder contrastive loss with non-competing objectives to enable the joint pre-training of encoder-decoder architectures. By adapting a contrastive SSL framework for dense prediction, DeCon establishes consistent state-of-the-art performance on most of the evaluated tasks when pre-trained on ImageNet-1K, COCO, and COCO+. Notably, when pre-training a ResNet-50 encoder on the COCO dataset, DeCon improves COCO object detection and instance segmentation compared to the baseline framework by +0.37 AP and +0.32 AP, respectively, and boosts semantic segmentation by +1.42 mIoU on Pascal VOC and by +0.50 mIoU on Cityscapes. These improvements generalize across recent backbones, decoders, datasets, and dense tasks beyond segmentation and object detection, and persist in out-of-domain scenarios, including limited-data settings, demonstrating that joint pre-training significantly enhances representation quality for dense prediction. Code is available at https://github.com/sebquetin/DeCon.git.
💡 Research Summary
The paper introduces DeCon (Decoder‑aware Contrastive learning), a self‑supervised learning (SSL) framework that jointly pre‑trains both the encoder and decoder using contrastive objectives. Traditional contrastive SSL methods (e.g., SimCLR, MoCo, SlotCon) focus exclusively on encoder pre‑training; decoders are added later and trained from scratch for downstream dense prediction tasks such as object detection, instance segmentation, and semantic segmentation. This separation neglects the low‑level spatial information that decoders require and limits the transferability of encoder representations to dense tasks.
DeCon addresses this gap by extending any existing contrastive SSL pipeline with decoder branches and corresponding auxiliary layers (projectors, predictors, etc.). Two variants are proposed:
- DeCon‑SL (Single‑Level) – Mirrors the encoder's contrastive loss at the decoder level. The total loss is a weighted sum: Loss = α·L_enc + (1‑α)·L_dec, where L_enc is the original encoder contrastive loss and L_dec is computed from decoder‑projected features. α balances the contribution of each term.
- DeCon‑ML (Multi‑Level) – Applies deep supervision across multiple decoder stages. For each decoder level i, a contrastive loss L_dec_i is computed; the decoder loss L_dds is the average of these levels. The overall loss becomes Loss = α·L_enc + (1‑α)·L_dds. Additionally, DeCon‑ML introduces channel dropout on the skip‑connection features passed from encoder to decoder. Entire channels are randomly zeroed, forcing the network to avoid over‑reliance on a few channels and encouraging richer utilization of encoder parameters at all depths.
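The weighted objective shared by both variants can be sketched in a few lines of PyTorch. This is a minimal illustration, not the authors' implementation; the helper name `decon_loss` is hypothetical, and the individual contrastive terms are assumed to be precomputed scalars:

```python
import torch


def decon_loss(l_enc, l_dec_levels, alpha=0.5):
    """Weighted DeCon objective (hypothetical helper).

    l_enc: scalar encoder contrastive loss (the original SSL loss).
    l_dec_levels: list of per-level decoder contrastive losses.
        A single element corresponds to DeCon-SL; several elements
        correspond to DeCon-ML, whose decoder term L_dds is the
        average over levels.
    alpha: weight balancing encoder vs. decoder objectives.
    """
    l_dec = torch.stack(list(l_dec_levels)).mean()  # L_dec or L_dds
    return alpha * l_enc + (1.0 - alpha) * l_dec
```

With one decoder level this reduces to the DeCon‑SL formula Loss = α·L_enc + (1‑α)·L_dec; with several levels the decoder term is the deep‑supervision average described above.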
Implementation details:
- Backbone: ResNet‑50 (also ConvNeXt‑Small in some experiments).
- Decoders: Fully Convolutional Network (FCN) and Feature Pyramid Network (FPN) with skip connections.
- The framework is built on top of SlotCon, but also adapted to DenseCL and PixPro, showing that DeCon is framework‑agnostic.
- Pre‑training datasets: COCO 2017, COCO+ (train2017 + unlabeled2017), and ImageNet‑1K. Training schedules follow the original SSL methods (800 epochs on COCO/COCO+, 200 epochs on ImageNet‑1K).
- Hyper‑parameters (learning rate, weight decay, optimizer) are kept identical to the base SSL method; only α and the dropout probability are tuned.
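The framework‑agnostic extension described above — attaching a decoder branch and its own projector to an existing SSL encoder so that both produce features for contrastive losses — can be sketched as follows. All class and attribute names here are hypothetical; the actual DeCon code organizes this differently:

```python
import torch
import torch.nn as nn


class DeConWrapper(nn.Module):
    """Minimal sketch of wrapping an SSL encoder with a decoder branch.

    Both the encoder features and the decoder output pass through their
    own projection heads, yielding embeddings for L_enc and L_dec.
    """

    def __init__(self, encoder, decoder, enc_proj, dec_proj):
        super().__init__()
        self.encoder, self.decoder = encoder, decoder
        self.enc_proj, self.dec_proj = enc_proj, dec_proj

    def forward(self, x):
        feats = self.encoder(x)                      # bottleneck features
        z_enc = self.enc_proj(feats)                 # embeddings for L_enc
        z_dec = self.dec_proj(self.decoder(feats))   # embeddings for L_dec
        return z_enc, z_dec
```

Because the decoder heads are lightweight relative to the backbone, this wrapping adds little parameter or memory overhead, consistent with the cost analysis reported below.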
Key empirical findings:
- Object detection & instance segmentation (COCO): Using a ResNet‑50 encoder pre‑trained on COCO, DeCon‑ML‑L improves detection AP by +0.37 and instance segmentation AP by +0.32 over the SlotCon baseline.
- Semantic segmentation: Gains of +1.42 mIoU on Pascal VOC and +0.50 mIoU on Cityscapes are reported. Similar improvements appear on ADE20K.
- The improvements persist across backbones (ResNet‑50, ConvNeXt‑Small) and pre‑training datasets (COCO, COCO+, ImageNet‑1K).
- In low‑data regimes (e.g., 1 % labeled COCO) and out‑of‑domain transfer (training on COCO, testing on Pascal/Cityscapes), DeCon maintains its advantage, indicating better representation generalization.
- Parameter overhead is modest. DeCon‑ML‑S (using only two decoder levels) matches the original SlotCon parameter count while still delivering ~0.3 AP boost; the full DeCon‑ML‑L adds more parameters but remains comparable in GPU memory consumption because the decoder heads are lightweight.
- Training cost is similar to the baseline SSL (same batch size, optimizer, epochs). Experiments on NVIDIA H100 and A100 GPUs confirm that adding decoder branches does not dramatically increase wall‑clock time.
Technical insights:
- Joint contrastive training forces the encoder to produce features that are simultaneously useful for global image discrimination (the classic SSL goal) and for the spatially detailed, dense predictions that decoders must produce. This dual pressure yields richer multi‑scale embeddings.
- Channel dropout on skip connections prevents the decoder from “cheating” by relying on a few high‑energy channels, thereby encouraging the encoder to distribute semantic information more evenly across its feature maps.
- Multi‑level decoder supervision propagates gradient signals to earlier encoder layers via the bottleneck representation, reinforcing the encoder’s ability to encode both high‑level semantics and fine‑grained spatial cues.
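The channel dropout on skip connections mentioned above maps naturally onto PyTorch's `nn.Dropout2d`, which zeroes whole feature channels (and rescales survivors by 1/(1−p)) during training only. A minimal sketch, assuming a hypothetical wrapper class; the dropout probability is a tuned hyper‑parameter in the paper:

```python
import torch
import torch.nn as nn


class SkipChannelDropout(nn.Module):
    """Hypothetical skip-connection wrapper for DeCon-ML-style training.

    Randomly zeroes entire channels of the encoder feature map before it
    reaches the decoder, preventing the decoder from over-relying on a
    few high-energy channels. Inactive at inference time.
    """

    def __init__(self, p=0.2):
        super().__init__()
        self.drop = nn.Dropout2d(p)  # drops whole (batch, channel) planes

    def forward(self, skip_feat):
        return self.drop(skip_feat)
```

At evaluation time the module is a no‑op, so the downstream decoder sees the full skip features exactly as in a standard FPN/FCN head.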
Limitations and future directions:
- The current work focuses on CNN‑based backbones; extending DeCon to Vision Transformers (ViT) or hybrid CNN‑Transformer architectures remains an open question.
- Hyper‑parameter sensitivity (α, dropout rate) is explored only empirically; automated tuning or adaptive weighting could further improve robustness.
- Application to video, 3D medical imaging, or other modalities with inherent temporal/spatial complexity is suggested but not evaluated.
In summary, DeCon demonstrates that jointly pre‑training encoder‑decoder pairs with contrastive objectives is a practical and effective way to boost dense prediction performance without incurring substantial computational overhead. By integrating decoder‑aware losses, channel dropout, and multi‑level supervision, the framework enriches the learned representations, leading to consistent gains across a wide spectrum of downstream tasks and datasets. This work paves the way for more holistic self‑supervised pre‑training strategies that treat the entire vision pipeline—not just the encoder—as a learnable, contrastive system.