CyIN: Cyclic Informative Latent Space for Bridging Complete and Incomplete Multimodal Learning
Multimodal machine learning, which mimics the human brain's ability to integrate various modalities, has seen rapid growth. Most previous multimodal models are trained on perfectly paired multimodal input to reach optimal performance. In real-world deployments, however, modality availability is highly variable and unpredictable, causing pre-trained models to suffer significant performance drops and lose robustness under dynamic missing-modality conditions. In this paper, we present a novel Cyclic INformative Learning framework (CyIN) to bridge the gap between complete and incomplete multimodal learning. Specifically, we first build an informative latent space by applying token- and label-level Information Bottleneck (IB) objectives cyclically across modalities. By capturing task-related features with variational approximation, the informative bottleneck latents are purified for more efficient cross-modal interaction and multimodal fusion. Moreover, to supplement the information lost under incomplete multimodal input, we propose cross-modal cyclic translation, which reconstructs the missing modalities from the remaining ones through a forward and reverse propagation process. With the help of the extracted and reconstructed informative latents, CyIN jointly optimizes complete and incomplete multimodal learning in one unified model. Extensive experiments on 4 multimodal datasets demonstrate the superior performance of our method in both complete and diverse incomplete scenarios.
💡 Research Summary
The paper introduces CyIN (Cyclic Informative Learning), a unified framework that simultaneously handles complete and incomplete multimodal learning. Traditional multimodal models assume that all modalities present during training will also be available at inference time; when this assumption is violated, performance drops dramatically, especially for transformer‑based architectures. Existing solutions either align modalities (contrastive learning, CCA, data augmentation) or generate missing modalities (autoencoders, VAEs, graph networks, diffusion models). However, they often require separate models for each missing‑modality pattern, suffer from insufficient exploitation of the missing information, and sacrifice performance on the fully‑paired case.
CyIN addresses these issues by constructing an “informative latent space” using two complementary Information Bottleneck (IB) mechanisms applied cyclically: (1) token‑level IB and (2) label‑level IB.
Token‑level IB operates on the low‑level token embeddings produced by modality‑specific encoders. For a source modality (S) and a target modality (T), each token embedding (f_i^S) is mapped to a Gaussian latent (b_i^S \sim \mathcal{N}(\mu_i, \sigma_i^2)) via an IB encoder. The KL divergence between this posterior and a standard normal prior penalizes unnecessary information, while a reconstruction term forces the decoder to predict the corresponding target token (f_i^T). By iterating over all modality pairs and also over the case (S=T) (intra‑modal compression), the model learns both intra‑modal dynamics and inter‑modal shared features. The final token‑level loss aggregates intra‑modal and bidirectional inter‑modal terms.
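The token-level objective described above can be sketched numerically. The snippet below is a minimal NumPy illustration, not the paper's implementation: the IB encoder and decoder are stood in for by hypothetical linear weights (`W_mu`, `W_logvar`, `W_dec`), and modality names and dimensions are toy choices. It computes the closed-form KL divergence between the per-token Gaussian posterior and a standard normal prior, plus the cross-modal reconstruction term, then cycles over all modality pairs including the intra-modal case S = T:

```python
import numpy as np

rng = np.random.default_rng(0)

def token_ib_loss(f_src, f_tgt, W_mu, W_logvar, W_dec):
    """Token-level IB for one (source, target) modality pair (sketch)."""
    mu = f_src @ W_mu           # posterior mean for each token latent b_i^S
    logvar = f_src @ W_logvar   # posterior log-variance
    # Reparameterization trick: sample b_i^S ~ N(mu_i, sigma_i^2)
    b = mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)
    # Closed-form KL( N(mu, sigma^2) || N(0, I) ) penalizes kept information
    kl = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=-1)
    # The decoder must predict the corresponding target token f_i^T
    recon = np.sum((f_tgt - b @ W_dec) ** 2, axis=-1)
    return (kl + recon).mean()

# Toy setup: 4 tokens, 8-dim embeddings, 3-dim bottleneck (shapes hypothetical)
d, k = 8, 3
f = {m: rng.standard_normal((4, d)) for m in ("text", "audio")}
W_mu = rng.standard_normal((d, k)) * 0.1
W_logvar = rng.standard_normal((d, k)) * 0.1
W_dec = rng.standard_normal((k, d)) * 0.1

# Cycle over all modality pairs, including the intra-modal case S == T
loss_tok = sum(token_ib_loss(f[s], f[t], W_mu, W_logvar, W_dec)
               for s in f for t in f)
```

In a real model each modality pair would have its own encoder/decoder parameters; a single shared set is used here only to keep the sketch short.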
Label‑level IB injects high‑level semantic supervision. The same latent (B^S) is fed to a predictor (P^S) that outputs the task label (regression score or class probabilities). A variational IB loss combines the KL regularizer with a negative log‑likelihood term for the ground‑truth label. This forces the latent space to retain only task‑relevant information, aligning low‑level perception with high‑level semantics.
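To make the label-level term concrete, here is a hedged NumPy sketch for the classification case. The predictor P^S is approximated as a hypothetical linear softmax head (`W_pred`), and the pooled per-sample latents, shapes, and labels are toy assumptions; the loss is the negative log-likelihood of the ground-truth label plus the same Gaussian KL regularizer used at the token level:

```python
import numpy as np

rng = np.random.default_rng(1)

def label_ib_loss(b, y, W_pred, mu, logvar):
    """Label-level variational IB (sketch): label NLL plus KL regularizer."""
    logits = b @ W_pred
    # Numerically stable log-softmax, then NLL of the ground-truth class
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    nll = -log_probs[np.arange(len(y)), y].mean()
    # Same KL( q(b|x) || N(0, I) ) term as in the token-level objective
    kl = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=-1).mean()
    return nll + kl

# Toy batch: 5 samples, 3-dim latents, 2 classes (all hypothetical)
n, k, c = 5, 3, 2
mu = rng.standard_normal((n, k)) * 0.1
logvar = rng.standard_normal((n, k)) * 0.1
b = mu + np.exp(0.5 * logvar) * rng.standard_normal((n, k))
W_pred = rng.standard_normal((k, c))
y = rng.integers(0, c, size=n)

loss_lab = label_ib_loss(b, y, W_pred, mu, logvar)
```

For the regression variant mentioned in the text, the NLL term would be replaced by a Gaussian log-likelihood of the score.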
Both IB losses are weighted and summed to form the overall bottleneck objective (L_{\text{tib}}). The resulting latent space is compact, noise‑filtered, and enriched with task semantics.
Cross‑modal Cyclic Translation leverages this purified latent space to reconstruct missing modalities. A Cascaded Residual Autoencoder (CRA) acts as a translator (\Gamma_{S\rightarrow T}) that maps latent (B^S) to an estimate of the target latent (\hat{B}^{T}). A forward reconstruction loss (|B^T - \hat{B}^{T}|^2) encourages accurate translation. To further regularize the process, a reverse translation (\Gamma_{T\rightarrow S}) is applied to the forward output, and a cyclic consistency loss (|B^S - \Gamma_{T\rightarrow S}(\hat{B}^{T})|^2) is added. This bidirectional scheme ensures that the translation does not discard essential information and improves robustness when modalities are missing.
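The forward and cyclic losses can be sketched as follows. This is an illustrative stand-in, not the paper's CRA: the cascaded residual structure is approximated by hypothetical residual `tanh` blocks, and the latents are random toy tensors:

```python
import numpy as np

rng = np.random.default_rng(2)

def translator(b, layers):
    """Stand-in for the CRA translator: cascaded residual blocks (sketch)."""
    h = b
    for W in layers:
        h = h + np.tanh(h @ W)  # each cascade stage adds a residual correction
    return h

k = 3  # toy bottleneck dimension
G_st = [rng.standard_normal((k, k)) * 0.1 for _ in range(2)]  # Gamma_{S->T}
G_ts = [rng.standard_normal((k, k)) * 0.1 for _ in range(2)]  # Gamma_{T->S}

B_s = rng.standard_normal((4, k))  # source-modality latents B^S
B_t = rng.standard_normal((4, k))  # target-modality latents B^T

B_t_hat = translator(B_s, G_st)            # forward translation to B_hat^T
loss_fwd = np.mean((B_t - B_t_hat) ** 2)   # ||B^T - B_hat^T||^2
B_s_cyc = translator(B_t_hat, G_ts)        # reverse translation of the output
loss_cyc = np.mean((B_s - B_s_cyc) ** 2)   # ||B^S - Gamma_{T->S}(B_hat^T)||^2
```

At inference time, when modality T is missing, `B_t_hat` would be used in place of the absent `B_t` for fusion.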
The total training objective combines the task loss, the IB loss, the forward reconstruction loss, and the cyclic consistency loss: (L = L_{\text{task}} + \lambda_{1} L_{\text{tib}} + \lambda_{2} L_{\text{rec}} + \lambda_{3} L_{\text{cyc}}), where the (\lambda_{i}) are trade-off weights.