Understanding Self-Supervised Learning via Gaussian Mixture Models

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

Self-supervised learning attempts to learn representations from unlabeled data; it does so via a loss function that encourages the embedding of a point to be close to those of its augmentations. This simple idea performs remarkably well, yet why it works is not precisely understood theoretically. In this paper we analyze self-supervised learning in a natural context: dimensionality reduction in Gaussian Mixture Models. Crucially, we define an augmentation of a data point as another independent draw from the same underlying mixture component. We show that vanilla contrastive learning (specifically, the InfoNCE loss) is able to find the optimal lower-dimensional subspace even when the Gaussians are not isotropic – something that vanilla spectral techniques cannot do. We also prove a similar result for “non-contrastive” self-supervised learning (i.e., the SimSiam loss). We further extend our analyses to multi-modal contrastive learning algorithms (e.g., CLIP). In this setting we show that contrastive learning learns a subset of the Fisher-optimal subspace, effectively filtering out all the noise from the learned representations. Finally, we corroborate our theoretical findings through synthetic-data experiments.


💡 Research Summary

This paper provides a rigorous theoretical analysis of self-supervised learning (SSL), specifically contrastive and non-contrastive methods, by examining them through the lens of Gaussian Mixture Models (GMMs). The core objective is to understand why the simple idea of bringing an embedding close to its augmentation while pushing it away from others (or just bringing it close in non-contrastive cases) leads to powerful learned representations.

The authors establish a natural and analytically tractable framework. They assume data is generated from a K-component GMM with a shared covariance matrix (SharedGMM). The key conceptual innovation is the mathematical formalization of “augmentation.” An augmented pair (x, x̂) is defined such that x̂ is an independent draw from the same mixture component as x with probability δ, and from a random component otherwise. This models realistic, noisy augmentations that preserve semantic class identity on average.
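The augmentation model above can be sketched in a few lines of numpy. This is an illustrative sampler, not the paper's code; the dimensions, number of components, and value of δ below are arbitrary placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters (not from the paper): K components in d dimensions,
# a shared covariance Sigma across components (SharedGMM), and fidelity delta.
d, K, delta = 10, 3, 0.9
means = 3.0 * rng.normal(size=(K, d))
Sigma = np.eye(d)

def sample_augmented_pair():
    """Draw (x, x_hat): x comes from a random component; x_hat comes from the
    SAME component with probability delta, else from a fresh random component."""
    k = rng.integers(K)
    x = rng.multivariate_normal(means[k], Sigma)
    k_hat = k if rng.random() < delta else rng.integers(K)
    x_hat = rng.multivariate_normal(means[k_hat], Sigma)
    return x, x_hat

x, x_hat = sample_augmented_pair()
print(x.shape, x_hat.shape)  # (10,) (10,)
```

With δ close to 1, most pairs share a component, which is how the model captures augmentations that preserve class identity only on average.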

Within this setup, the goal of SSL is framed as learning a linear projection matrix A that maps high-dimensional data to a lower-dimensional subspace. The quality of a projection is measured by the Fisher discriminant, which maximizes the ratio of between-class variance to within-class variance in the projected space. The subspace that achieves this optimum is the Fisher subspace, which is known to be the optimal projection for classification and is learned by supervised Linear Discriminant Analysis (LDA).
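The Fisher subspace described above is what supervised LDA computes: the top generalized eigenvectors of the between-class scatter against the within-class scatter. The sketch below computes it on synthetic labeled data with numpy; the data, sizes, and isotropic noise are placeholder assumptions, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(1)
K, d, n = 3, 6, 500
means = 4.0 * rng.normal(size=(K, d))
labels = rng.integers(K, size=n)
X = means[labels] + rng.normal(size=(n, d))  # placeholder within-class noise

mu = X.mean(axis=0)
S_w = np.zeros((d, d))  # within-class scatter
S_b = np.zeros((d, d))  # between-class scatter
for k in range(K):
    Xk = X[labels == k]
    mk = Xk.mean(axis=0)
    S_w += (Xk - mk).T @ (Xk - mk)
    S_b += len(Xk) * np.outer(mk - mu, mk - mu)

# The Fisher subspace is spanned by the top K-1 generalized eigenvectors of
# S_w^{-1} S_b (the directions maximizing between- over within-class variance).
eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_w, S_b))
order = np.argsort(eigvals.real)[::-1]
A = eigvecs[:, order[: K - 1]].real  # d x (K-1) projection matrix
print(A.shape)  # (6, 2)
```

The paper's point is that SSL reaches (a subset of) this same subspace without ever seeing the `labels` array used here.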

The paper’s main contributions are threefold:

  1. Analysis of Contrastive Learning (InfoNCE): The authors prove that minimizing the InfoNCE loss on augmented pairs from their AeD model leads to learning a linear projector whose column space converges to the Fisher subspace. This is a significant result: it shows that purely self-supervised contrastive learning can, in theory, match the performance of fully supervised LDA in finding the optimal representation subspace, even when the Gaussian components have non-isotropic (general) covariance structures. This capability surpasses standard spectral methods like PCA/SVD, which fail in such non-isotropic settings.

  2. Analysis of Non-Contrastive Learning (SimSiam): The analysis is extended to non-contrastive SSL methods like SimSiam, which do not use negative samples. The authors prove a similar result, showing that these methods also recover a subspace closely related to the Fisher subspace, providing a theoretical justification for their effectiveness and their connection to contrastive methods.

  3. Extension to Multi-Modal Learning (CLIP): The framework is generalized to multi-modal contrastive learning, such as in CLIP, where paired data (e.g., images and text captions) come from two different GMMs. The authors demonstrate that the CLIP loss learns linear projectors for each modality that map data onto a subset of their respective Fisher subspaces. This implies that the learned representations effectively filter out noise directions, retaining only the semantically aligned information shared across modalities, which helps explain CLIP’s strong zero-shot capabilities.
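The InfoNCE objective from contribution 1 can be written down concretely for a linear projector. The numpy sketch below is an illustrative implementation of the standard in-batch InfoNCE loss (with unit-normalized embeddings and a temperature), not the authors' code; the batch, projector, and temperature are placeholder assumptions.

```python
import numpy as np

def info_nce(A, X, X_hat, tau=0.5):
    """InfoNCE for a linear projector A: the positive for row i of X is row i
    of X_hat; every other row of X_hat in the batch serves as a negative."""
    Z, Z_hat = X @ A, X_hat @ A
    Z /= np.linalg.norm(Z, axis=1, keepdims=True)        # unit-norm embeddings
    Z_hat /= np.linalg.norm(Z_hat, axis=1, keepdims=True)
    sim = Z @ Z_hat.T / tau                  # (n, n); diagonal = positive pairs
    sim -= sim.max(axis=1, keepdims=True)    # stabilize the log-sum-exp
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))       # cross-entropy on the positives

rng = np.random.default_rng(2)
X = rng.normal(size=(64, 5))
A = rng.normal(size=(5, 3))                  # arbitrary linear projector
loss_pos = info_nce(A, X, X + 0.01 * rng.normal(size=X.shape))  # true pairs
loss_rand = info_nce(A, X, rng.normal(size=X.shape))            # broken pairs
print(loss_pos < loss_rand)
```

Matched pairs yield a lower loss than mismatched ones, which is the signal that, per the analysis above, drives the column space of A toward the Fisher subspace.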

Finally, the theoretical findings are corroborated by experiments on synthetic data, where representations learned via InfoNCE on non-isotropic GMMs enable k-means clustering performance that matches supervised LDA.
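The flavor of that synthetic experiment can be reproduced in miniature. The sketch below builds a deliberately non-isotropic two-component SharedGMM, projects onto the Fisher direction (computed with labels here, as LDA would; the paper's point is that InfoNCE reaches the same subspace without labels), and clusters the 1-D projection by thresholding. All parameters are illustrative, not the paper's exact setup.

```python
import numpy as np

rng = np.random.default_rng(3)
K, d, n_per = 2, 8, 200
means = np.zeros((K, d)); means[1, 0] = 4.0   # separation along axis 0 only
Sigma = np.diag([1.0] + [25.0] * (d - 1))     # large noise off the signal axis
labels = np.repeat(np.arange(K), n_per)
X = means[labels] + rng.multivariate_normal(np.zeros(d), Sigma, size=K * n_per)

# For a shared covariance, the Fisher direction is Sigma^{-1} (mu_1 - mu_0);
# PCA would instead latch onto the high-variance noise axes.
w = np.linalg.solve(Sigma, means[1] - means[0])
Z = X @ w  # 1-D Fisher projection

# Threshold-at-midpoint clustering of the projection (2-component k-means
# on a line reduces to this), scored against the true components.
pred = (Z > Z.mean()).astype(int)
acc = max((pred == labels).mean(), ((1 - pred) == labels).mean())
print(round(acc, 2))
```

On this data the Fisher projection separates the components almost perfectly, while the leading principal components are dominated by the 25x-variance noise directions.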

In summary, this work offers a fundamental theoretical insight: self-supervised learning via augmentations is not merely a heuristic but a principled method for discovering the underlying latent structure (the mixture components) of data. It achieves this by learning projections that maximize class separability (the Fisher subspace), rivaling supervised methods, all without access to any labels. The paper bridges the gap between the empirical success of SSL and its theoretical foundations, using the simplicity and rigor of the GMM framework.

