Self-Supervised Learning Based on Transformed Image Reconstruction for Equivariance-Coherent Feature Representation
Self-supervised learning (SSL) methods have achieved remarkable success in learning image representations that are invariant to augmentations, but in doing so they discard transformation information that some computer vision tasks actually require. While recent approaches attempt to address this limitation by learning equivariant features via linear operators in feature space, they impose restrictive assumptions that constrain flexibility and generalization. We introduce a weaker definition of the transformation relation between image and feature space, denoted equivariance-coherence. We propose a novel SSL auxiliary task that learns equivariance-coherent representations through intermediate transformation reconstruction and can be integrated with existing joint embedding SSL methods. Our key idea is to reconstruct images at intermediate points along transformation paths; e.g., when training on 30-degree rotations, we reconstruct the 10-degree and 20-degree rotation states. Reconstructing intermediate states requires the transformation information used in augmentations rather than suppressing it, and therefore fosters features that retain the augmented transformation information. Our method decomposes feature vectors into invariant and equivariant parts, training them with standard SSL losses and reconstruction losses, respectively. We demonstrate substantial improvements on synthetic equivariance benchmarks while maintaining competitive performance on downstream tasks requiring invariant representations. The approach seamlessly integrates with existing SSL methods (iBOT, DINOv2) and consistently enhances performance across diverse tasks, including segmentation, detection, depth estimation, and video dense prediction. Our framework provides a practical way to augment SSL methods with equivariant capabilities while preserving invariant performance.
💡 Research Summary
Self‑supervised learning (SSL) has become a cornerstone for pre‑training vision models, yet most state‑of‑the‑art joint‑embedding methods (e.g., DINOv2, iBOT) are deliberately designed to be invariant to data augmentations. This invariance bias is beneficial for classification, where labels are agnostic to rotations, color jitter, or translations, but it discards precisely the transformation information needed for dense prediction tasks such as object detection, segmentation, pose estimation, and depth prediction. Recent work such as the Split Invariant‑Equivariant (SIE) framework attempts to recover equivariant features by learning a linear operator that maps transformed inputs to transformed features. However, the linearity assumption restricts the expressive power of the model and requires additional hyper‑networks, limiting scalability to complex or non‑group transformations.
The paper introduces a weaker notion called equivariance‑coherence: a representation need not obey a strict group homomorphism, but it must retain enough information to reconstruct any intermediate transformation applied in the input space. To enforce this property, the authors propose an auxiliary reconstruction task that is seamlessly integrated into any joint‑embedding SSL pipeline. The key idea is to generate a sequence of intermediate transformed views along a transformation trajectory (e.g., for a 30° rotation, also create 10° and 20° rotated images) and require the model to reconstruct these intermediate images from a dedicated equivariant feature subspace.
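The sampling of intermediate transformation parameters along a trajectory can be sketched in a few lines. This is a minimal illustration of evenly spaced steps toward a final parameter (matching the 30° → {10°, 20°} example above); the function name and the assumption of uniform spacing are ours, not the paper's.

```python
def intermediate_params(theta_final, K):
    """Return K evenly spaced intermediate transformation parameters
    along the trajectory from the identity (0) to theta_final.

    For theta_final=30 and K=2, yields [10.0, 20.0]: the intermediate
    rotation states to be reconstructed during training.
    Assumes a uniform parameterization of the trajectory."""
    return [theta_final * k / (K + 1) for k in range(1, K + 1)]
```

The same schedule applies to other parameterized augmentations (e.g., blur strength or color-jitter magnitude), with `theta_final` interpreted as the final augmentation parameter.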
The method works as follows. Two views of an image are created: view v₁ using the standard augmentation set A₁, and view v₂ using a second augmentation set A₂ followed by a chain of K intermediate transformations g_{θ₁}, …, g_{θ_K} ending with the final transformation g_θ. The encoder f processes both views, producing patch‑wise features z₁ and z₂. Each feature vector is split along the channel dimension into an invariant part (z_inv) and an equivariant part (z_equi). The invariant part is trained with any conventional SSL loss (e.g., VICReg, iBOT), preserving the usual invariance properties. The equivariant part is fed to K lightweight decoders (a single linear layer plus four convolutional layers) that output reconstructions \hat{u}_k of the intermediate transformed images u_k. A simple L₂ reconstruction loss L_recon = (1/K) Σ_k ||\hat{u}_k – u_k||² is combined with the SSL loss via a weighting λ: L_total = L_SSL + λ·L_recon.
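The channel-wise split and the combined objective can be sketched in NumPy. This is a simplified sketch under our own assumptions: the 0.2 equivariant fraction follows the ratio reported in the hyperparameter study, and the per-image L₂ term is normalized by a pixel mean, a common implementation choice the paper does not specify.

```python
import numpy as np

def split_features(z, equi_frac=0.2):
    """Split the channel (last) dimension into an invariant part z_inv
    and an equivariant part z_equi. equi_frac ~ d_equi / d_total."""
    d = z.shape[-1]
    d_equi = int(round(equi_frac * d))
    return z[..., : d - d_equi], z[..., d - d_equi:]

def total_loss(l_ssl, recons, targets, lam=1.0):
    """L_total = L_SSL + lambda * L_recon, where L_recon averages the
    squared reconstruction error over the K intermediate images."""
    K = len(recons)
    l_recon = sum(np.mean((r - t) ** 2) for r, t in zip(recons, targets)) / K
    return l_ssl + lam * l_recon
```

Only `z_inv` feeds the standard SSL loss, while `z_equi` feeds the decoders, so gradients from the two objectives act on disjoint feature channels.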
Crucially, the decoders are deliberately shallow so that most learning pressure falls on the encoder, ensuring that the equivariant subspace learns to encode transformation cues rather than relying on a powerful decoder. Hyper‑parameter studies show that K = 2 intermediate steps, λ ≈ 1.0, and an equivariant dimension d_equi ≈ 0.2·d_total provide a good trade‑off between reconstruction quality and downstream performance.
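A decoder of this shape (one linear layer plus four convolutional layers) might look as follows in PyTorch. The channel widths, the 8×8 seed resolution, and the nearest-neighbor upsampling step are illustrative assumptions of ours; the paper only specifies the layer count and that the decoder is deliberately shallow.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShallowDecoder(nn.Module):
    """Lightweight decoder: a single linear layer lifts the equivariant
    feature vector to a small spatial map, followed by four conv layers.
    Kept shallow so reconstruction pressure falls on the encoder."""
    def __init__(self, d_equi=128, seed=8, img=32):
        super().__init__()
        self.seed, self.img = seed, img
        self.fc = nn.Linear(d_equi, 64 * seed * seed)   # the single linear layer
        self.convs = nn.Sequential(                      # the four conv layers
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 3, 3, padding=1),              # RGB reconstruction
        )

    def forward(self, z_equi):
        x = self.fc(z_equi).view(-1, 64, self.seed, self.seed)
        x = F.interpolate(x, size=(self.img, self.img))  # upsample seed map
        return self.convs(x)
```

One such decoder per intermediate step k (K of them in total) maps `z_equi` to the reconstruction of the k-th intermediate image.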
Empirical evaluation covers two fronts. First, synthetic equivariance benchmarks (measured by R² between predicted and ground‑truth transformed images) demonstrate that the proposed approach outperforms SIE across all tested transformations—rotation, color jitter, Gaussian blur, translation, and SE(2) rigid motions—often by a margin of 5–10%. Second, when the method is added to strong baselines (iBOT, DINOv2) and the resulting models are fine‑tuned on real‑world tasks, consistent gains are observed: modest improvements (≈1–2%) on COCO object detection, ADE20K semantic segmentation, NYU depth estimation, and video dense prediction on Kinetics. Importantly, classification accuracy on ImageNet remains on par with the original baselines, confirming that invariant performance is not sacrificed.
The paper also discusses limitations and future directions. Because reconstruction is performed in pixel space with a simple L₂ loss, high‑frequency details or complex textures may not be perfectly recovered, potentially limiting the richness of the equivariant signal for very high‑resolution data. The current experiments focus on 2‑D transformations; extending the framework to 3‑D rotations, non‑rigid deformations, or multi‑modal inputs would require additional investigation. Moreover, exploring perceptual or adversarial reconstruction losses could further strengthen the equivariant subspace without inflating computational cost.
In summary, this work presents a practical, architecture‑agnostic augmentation to existing SSL methods that injects equivariant capabilities through intermediate image reconstruction. By relaxing the strict linear equivariance assumption and leveraging transformation metadata already available during augmentation, the approach delivers richer feature representations that are simultaneously invariant where needed and transformation‑aware where beneficial, thereby narrowing the gap between self‑supervised pre‑training and the diverse demands of downstream computer‑vision applications.