Dual Perspectives on Non-Contrastive Self-Supervised Learning
The {\em stop gradient} and {\em exponential moving average} iterative procedures are commonly used in non-contrastive approaches to self-supervised learning to avoid representation collapse, with excellent performance in downstream applications in practice. This presentation investigates these procedures from the dual viewpoints of optimization and dynamical systems. We show that, in general, although they {\em do not} optimize the original objective, or {\em any} other smooth function, they {\em do} avoid collapse. Following~\citet{Tian21}, but without any of the extra assumptions used in their proofs, we then show using a dynamical system perspective that, in the linear case, minimizing the original objective function without the use of a stop gradient or exponential moving average {\em always} leads to collapse. Conversely, we characterize explicitly the equilibria of the dynamical systems associated with these two procedures in this linear setting as algebraic varieties in their parameter space, and show that they are, in general, {\em asymptotically stable}. Our theoretical findings are illustrated by empirical experiments with real and synthetic data.
💡 Research Summary
This paper examines, from the two theoretical viewpoints of optimization and dynamical systems, the two most widely used mechanisms in non-contrastive self-supervised learning (SSL): the stop-gradient (SG) operation and the exponential moving average (EMA) of the teacher network. Both techniques are employed in state-of-the-art methods such as BYOL and SimSiam to prevent representation collapse, yet their relationship to a well-defined optimization objective has remained unclear.
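To make the two mechanisms concrete, here is a minimal sketch in a scalar toy model. It is an illustration under stated assumptions, not the paper's formulation: the scalar weights `w` (encoder) and `p` (predictor), the views `x1` and `x2`, the squared loss, and all function names are hypothetical choices made for this example.

```python
# Toy scalar model (an assumption for illustration, not the paper's setting):
#   online branch: p * w * x1      target branch: w * x2
#   loss = 0.5 * (p*w*x1 - target)**2

def grads_stop_gradient(w, p, x1, x2):
    """Gradients w.r.t. (w, p) when the target branch w*x2 is held constant (SG)."""
    residual = p * w * x1 - w * x2
    grad_w = residual * p * x1   # no contribution from the target branch
    grad_p = residual * w * x1
    return grad_w, grad_p


def grads_full(w, p, x1, x2):
    """Gradients w.r.t. (w, p) of the same loss without SG: the target also depends on w."""
    residual = p * w * x1 - w * x2
    grad_w = residual * (p * x1 - x2)   # extra -x2 term from the target branch
    grad_p = residual * w * x1
    return grad_w, grad_p


def ema_update(target_w, online_w, tau=0.99):
    """EMA rule: the target weight slowly tracks the online weight."""
    return tau * target_w + (1.0 - tau) * online_w


if __name__ == "__main__":
    print(grads_stop_gradient(2.0, 0.5, 1.0, 1.0))  # (-0.5, -2.0)
    print(grads_full(2.0, 0.5, 1.0, 1.0))           # (0.5, -2.0)
    print(ema_update(0.0, 1.0, tau=0.9))
```

At this particular point the two rules give gradients for `w` of opposite sign, which shows that the stop-gradient update is not simply the gradient of the original loss. The sketch makes no claim about collapse or stability, which the paper analyzes in the linear case.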
The authors first formalize the standard SSL objective.