Canonicalizing Multimodal Contrastive Representation Learning

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

As models and data scale, independently trained networks often induce analogous notions of similarity. But matching similarities is weaker than establishing an explicit correspondence between the representation spaces, especially for multimodal models, where consistency must hold not only within each modality but also for the learned image-text coupling. We therefore ask: given two independently trained multimodal contrastive models (with encoders $(f, g)$ and $(\widetilde{f},\widetilde{g})$) – trained on different distributions and with different architectures – does a systematic geometric relationship exist between their embedding spaces? If so, what form does it take, and does it hold uniformly across modalities? In this work, we show that across model families such as CLIP, SigLIP, and FLAVA, this geometric relationship is well approximated by an orthogonal map (up to a global mean shift): there exists an orthogonal map $Q$ (satisfying $Q^\top Q = I$) such that $\widetilde{f}(x)\approx Q f(x)$ for paired images $x$. Strikingly, the same $Q$ simultaneously aligns the text encoders, i.e., $\widetilde{g}(y)\approx Q g(y)$ for texts $y$. Theoretically, we prove that if the multimodal kernel agrees across models on a small anchor set, i.e., $\langle f(x), g(y)\rangle \approx \langle \widetilde{f}(x), \widetilde{g}(y)\rangle$, then the two models must be related by a single orthogonal map $Q$, and the same $Q$ maps both images and text across models. More broadly, this finding enables backward-compatible model upgrades, avoiding costly re-embedding, and has implications for the privacy of learned representations. Our project page: https://canonical-multimodal.github.io/


💡 Research Summary

The paper investigates whether independently trained multimodal contrastive models—specifically dual‑encoder systems that map images and texts onto a shared unit hypersphere—share a systematic geometric relationship. The authors focus on pairs of models (f, g) and (˜f, ˜g) that may differ in architecture, embedding dimension, and training data. Their central claim is that a single orthogonal transformation Q (i.e., an orthogonal matrix satisfying QᵀQ = I) can simultaneously align both the image and text embedding spaces of the two models: ˜f(x) ≈ Q f(x) and ˜g(y) ≈ Q g(y).
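In practice, such a Q can be estimated from a set of paired embeddings by solving the orthogonal Procrustes problem. The sketch below is a plausible reconstruction, not the paper's actual fitting code, and all names are illustrative; it recovers a known rotation from synthetic unit-sphere embeddings via the classic SVD solution:

```python
import numpy as np

def fit_orthogonal_map(A, B):
    """Orthogonal Procrustes: find Q with Q.T @ Q = I minimizing ||A @ Q.T - B||_F.

    A: (n, d) embeddings from model 1, B: (n, d) embeddings from model 2,
    with rows paired. Returns Q such that B ≈ A @ Q.T, i.e., b_i ≈ Q a_i.
    """
    M = B.T @ A                      # (d, d) cross-covariance of the paired rows
    U, _, Vt = np.linalg.svd(M)
    return U @ Vt                    # closest orthogonal matrix to M

# Synthetic sanity check: model 2 is an exact rotation of model 1.
rng = np.random.default_rng(0)
A = rng.normal(size=(500, 64))
A /= np.linalg.norm(A, axis=1, keepdims=True)   # unit-sphere embeddings
R, _ = np.linalg.qr(rng.normal(size=(64, 64)))  # ground-truth orthogonal map
B = A @ R.T

Q = fit_orthogonal_map(A, B)
print(np.allclose(Q, R))  # True: exact recovery in the noiseless case
```

With real model pairs the relationship only holds approximately, so Q is the best orthogonal fit rather than an exact match.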

The theoretical contribution proceeds in several steps. First, they show that the population‑level optimum of the symmetric InfoNCE loss is achieved when the scoring function equals the pointwise mutual information (PMI) up to a constant. Consequently, any two globally optimal models trained on the same joint distribution—or on distributions related by a bijective re‑parameterization—produce multimodal kernels ⟨f, g⟩ and ⟨˜f, ˜g⟩ that differ only by a constant offset. Second, they prove that if this kernel agreement holds on a modest anchor set of image‑text pairs, the map ψ that transports embeddings from one model to the other must be linear. Because the embeddings lie on a unit sphere, linearity collapses to an isometry, i.e., an orthogonal matrix Q. Third, they derive stability bounds: when the kernel agreement is approximate (within ε), the recovered Q deviates from the true orthogonal map by at most O(ε), guaranteeing that even imperfect alignment yields useful transformations.
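The easy (forward) direction of this argument can be checked numerically: applying any orthogonal Q to both encoders leaves the multimodal kernel ⟨f(x), g(y)⟩ unchanged, since Q preserves inner products. The paper's theorem is the harder converse. A minimal sketch with synthetic unit-sphere embeddings (names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, m = 32, 200, 300
F = rng.normal(size=(n, d)); F /= np.linalg.norm(F, axis=1, keepdims=True)  # stand-in image embeddings f(x)
G = rng.normal(size=(m, d)); G /= np.linalg.norm(G, axis=1, keepdims=True)  # stand-in text embeddings g(y)
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))  # arbitrary orthogonal map

K = F @ G.T                      # multimodal kernel <f(x), g(y)>
K_rot = (F @ Q.T) @ (G @ Q.T).T  # kernel after rotating both encoders by Q
print(np.abs(K - K_rot).max())   # numerically ~0: the kernel is invariant
```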

Empirically, the authors evaluate model pairs across three major families: CLIP (both OpenAI and LAION variants), SigLIP, and FLAVA. They learn Q using only image embeddings (mean‑centered) and then apply the same Q to text embeddings without any further supervision. Results show that after applying Q, the mean cosine similarity between matched image embeddings rises to >0.98, and the mean similarity between matched text embeddings improves dramatically, despite Q being learned without any text data. Prompt‑to‑class retrieval accuracy, nearest‑neighbor image classification, and zero‑shot cross‑modal retrieval all remain essentially unchanged, indicating that semantic geometry is preserved. The map can be learned with as little as ~30% of the data, and a Q trained on one dataset transfers directly to other datasets, confirming its data‑efficiency and transferability.
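This evaluation protocol can be sketched end to end on synthetic data. The code below is an assumption-laden mock-up of the procedure the summary describes, not the authors' implementation: Q is fit on mean-centered "image" embeddings only, then the same map is applied to "text" embeddings and the matched cosine similarity is measured. The two fake models are related by a rotation plus small noise.

```python
import numpy as np

def align_and_eval(img1, txt1, img2, txt2):
    """Fit Q on mean-centered image embeddings (no text supervision), then
    transport model-1 text embeddings with the same Q and mean shifts, and
    return the mean cosine similarity to the matched model-2 text embeddings."""
    mu1, mu2 = img1.mean(axis=0), img2.mean(axis=0)
    U, _, Vt = np.linalg.svd((img2 - mu2).T @ (img1 - mu1))  # Procrustes fit
    Q = U @ Vt
    txt_mapped = (txt1 - mu1) @ Q.T + mu2
    cos = np.sum(txt_mapped * txt2, axis=1) / (
        np.linalg.norm(txt_mapped, axis=1) * np.linalg.norm(txt2, axis=1))
    return Q, cos.mean()

# Two synthetic "models": model 2 is a rotation of model 1 plus small noise.
rng = np.random.default_rng(2)
d, n = 64, 1000
R, _ = np.linalg.qr(rng.normal(size=(d, d)))
img1 = rng.normal(size=(n, d)); txt1 = rng.normal(size=(n, d))
img2 = img1 @ R.T + 0.01 * rng.normal(size=(n, d))
txt2 = txt1 @ R.T + 0.01 * rng.normal(size=(n, d))

Q, mean_cos = align_and_eval(img1, txt1, img2, txt2)
print(mean_cos)  # close to 1: Q fit on images alone also aligns the text
```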

From a practical standpoint, the findings enable backward‑compatible model upgrades: a new model can be deployed without re‑embedding the entire corpus, simply by applying the learned orthogonal map to existing embeddings. Because the same Q works for both modalities, cross‑modal pipelines can interoperate seamlessly, facilitating model stitching and ensemble methods. The authors also discuss privacy implications: the existence of a universal Q suggests that an adversary with a small set of paired examples could infer the alignment between a proprietary model and a public reference, potentially exposing internal representation geometry.
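The upgrade path described above can be sketched as follows, assuming Q has already been fit as in the paper (function and variable names are hypothetical): the legacy index is mapped once through Q, and queries embedded by the new model are searched against it directly.

```python
import numpy as np

def upgrade_index(old_emb, Q):
    """Map a legacy corpus index into the new model's space with the learned
    orthogonal map Q, instead of re-embedding the corpus with the new encoder."""
    X = old_emb @ Q.T
    return X / np.linalg.norm(X, axis=1, keepdims=True)  # back onto the unit sphere

def search(new_query, index, k=5):
    """Cosine retrieval: a query embedded by the *new* model, searched
    against the upgraded legacy index."""
    q = new_query / np.linalg.norm(new_query)
    return np.argsort(-(index @ q))[:k]

# Synthetic check where the "new model" is an exact rotation of the old one.
rng = np.random.default_rng(3)
d, n = 64, 1000
old = rng.normal(size=(n, d)); old /= np.linalg.norm(old, axis=1, keepdims=True)
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
index = upgrade_index(old, Q)
query = old[7] @ Q.T            # new-model embedding of corpus item 7
print(search(query, index)[0])  # 7: the upgraded index retrieves the right item
```

Because the mapping is a single matrix multiply per stored vector, upgrading even a large index is far cheaper than running the new encoder over the raw corpus.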

In summary, the paper provides both a rigorous theoretical framework and extensive empirical evidence that multimodal contrastive models converge to a shared representation up to a single orthogonal transformation. This insight advances our understanding of representation convergence, offers a lightweight method for model interoperability, and raises important considerations for the security of large‑scale multimodal systems.

