Semi-Supervised Masked Autoencoders: Unlocking Vision Transformer Potential with Limited Data
We address the challenge of training Vision Transformers (ViTs) when labeled data is scarce but unlabeled data is abundant. We propose the Semi-Supervised Masked Autoencoder (SSMAE), a framework that jointly optimizes masked image reconstruction and classification using both unlabeled and labeled samples with dynamically selected pseudo-labels. SSMAE introduces a validation-driven gating mechanism that activates pseudo-labeling only after the model achieves reliable, high-confidence predictions that are consistent across both weakly and strongly augmented views of the same image, reducing confirmation bias. On CIFAR-10 and CIFAR-100, SSMAE consistently outperforms supervised ViT and fine-tuned MAE, with the largest gains in low-label regimes (+9.24% over ViT on CIFAR-10 with 10% labels). Our results demonstrate that when pseudo-labels are introduced is as important as how they are generated for data-efficient transformer training. Code is available at https://github.com/atik666/ssmae.
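The gating idea in the abstract, accepting an unlabeled sample only when its prediction is both high-confidence and consistent across weakly and strongly augmented views, can be sketched as follows. This is an illustrative sketch, not the paper's implementation: the function name, the fixed threshold `tau`, and the array-based interface are assumptions (the paper's gate is validation-driven rather than a hard-coded constant).

```python
import numpy as np

def gate_pseudo_labels(p_weak, p_strong, tau=0.95):
    """Select unlabeled samples whose predictions are confident and
    consistent across two augmented views (illustrative sketch).

    p_weak, p_strong: (N, C) softmax probabilities for the weakly and
                      strongly augmented views of the same N images.
    tau: confidence threshold (hypothetical value, for illustration).
    Returns (mask, labels): boolean acceptance mask and the pseudo-labels
    of the accepted samples.
    """
    confident = p_weak.max(axis=1) >= tau                       # high confidence
    consistent = p_weak.argmax(axis=1) == p_strong.argmax(axis=1)  # view agreement
    mask = confident & consistent
    return mask, p_weak.argmax(axis=1)[mask]
```

Only samples passing both tests would contribute a pseudo-labeled classification loss; the rest participate in training through the reconstruction objective alone.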
💡 Research Summary
The paper introduces Semi-Supervised Masked Autoencoders (SSMAE), a framework that unifies masked image reconstruction and classification to train Vision Transformers (ViTs) efficiently when labeled data is scarce but unlabeled data is abundant. SSMAE builds on the Masked Autoencoder (MAE) paradigm: an input image is split into non-overlapping patches, a high masking ratio (75%) is applied, and only the visible tokens are fed into a ViT encoder. The encoder output serves two purposes. First, a lightweight decoder reconstructs the masked patches, and a mean-squared error loss (L_recon) is computed only on the masked region. Second, the encoder representation is passed to a classification head, trained with a supervised loss on labeled samples and, once the validation-driven gate activates, on high-confidence pseudo-labeled samples as well.
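The reconstruction side of the pipeline, an MSE loss computed only over the randomly masked 75% of patches, can be sketched in a few lines. This is a minimal illustration of the MAE-style objective SSMAE builds on; the function name, flattened patch interface, and RNG handling are assumptions, not the paper's code.

```python
import numpy as np

def masked_recon_loss(patches, recon, mask_ratio=0.75, rng=None):
    """MSE reconstruction loss restricted to the masked patches,
    in the spirit of the MAE objective (illustrative sketch).

    patches, recon: (N, D) ground-truth and reconstructed patch vectors.
    mask_ratio: fraction of patches that were masked (75% in SSMAE).
    """
    rng = np.random.default_rng(rng)
    n = patches.shape[0]
    n_mask = int(round(n * mask_ratio))
    masked_idx = rng.permutation(n)[:n_mask]   # randomly chosen masked patches
    diff = recon[masked_idx] - patches[masked_idx]
    return (diff ** 2).mean()                  # L_recon over masked tokens only
```

In the full model, the same masking pattern would be shared with the encoder so that only visible tokens are encoded, and the decoder predicts the masked ones.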