BrokenBind: Universal Modality Exploration beyond Dataset Boundaries
Multi-modal learning combines various modalities to provide a comprehensive understanding of real-world problems. A common strategy is to bind different modalities directly in a shared joint embedding space. However, existing methods are restricted to the modalities present in a given dataset, and are therefore biased when generalizing to unseen modalities in downstream tasks. This inflexibility, combined with the cost of acquiring fully aligned multi-modal datasets, seriously hinders the viability of previous methods. In this paper, we introduce BrokenBind, which binds modalities drawn from different datasets. To achieve this, BrokenBind simultaneously leverages multiple datasets, each containing a modality of interest and one shared modality. Although the datasets do not correspond to each other because of distribution mismatch, we capture their relationship to generate pseudo embeddings that fill in the missing modalities of interest, enabling flexible and generalized multi-modal learning. Under our framework, any two modalities can be bound together, free from dataset limitations, achieving universal modality exploration. Further, to probe the limits of our method, we study intensified scenarios in which more than two datasets are needed for modality binding, and show the effectiveness of BrokenBind in low-data regimes. Through extensive evaluation, we carefully justify the superiority of BrokenBind over well-known multi-modal baselines.
💡 Research Summary
The paper tackles a fundamental limitation in current multimodal learning: most existing modality‑binding methods (e.g., ImageBind, LanguageBind, FreeBind) require that all modalities of interest appear together in a single dataset with instance‑level correspondence. In practice, collecting such fully aligned multimodal data is prohibitively expensive, and datasets often come from different domains (indoor vs. outdoor, visual vs. tactile, etc.). The authors name this situation a “broken dataset” problem and propose BrokenBind, a framework that can bind arbitrary modalities even when they are never observed together.
Core idea
BrokenBind leverages a pivot modality that is shared across multiple datasets. By simultaneously training on two (or more) datasets that each contain the pivot and one target modality, the method learns two kinds of relationships: (1) cross‑modal transition from the pivot to each target modality, and (2) cross‑dataset transition between the pivot representations of the different datasets. Both relationships are modeled as linear transition matrices W, computed via a pseudo‑inverse of the pivot embedding matrix (Moore‑Penrose inverse). Applying these matrices yields two synthetic embeddings for a missing modality:
- X‑mod: pivot → target transition (e.g., vision → audio)
- X‑data: dataset‑to‑dataset pivot transition followed by target mapping
A regularization term R_Fro enforces consistency between X‑mod and X‑data, ensuring that the synthetic embeddings are coherent across both modalities and datasets.
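The construction above can be sketched in a few lines of NumPy. This is a minimal illustration of the idea, not the paper's implementation: variable names, the batch size, and the embedding dimension are invented for the example, and the embeddings are random stand-ins for encoder outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 64, 16  # illustrative batch size and embedding dimension

# Pivot embeddings from two datasets (e.g. vision), plus the target
# modality (e.g. audio) that is present only in dataset A.
P_a = rng.standard_normal((n, d))
P_b = rng.standard_normal((n, d))
T_a = rng.standard_normal((n, d))

# Cross-modal transition: pivot -> target within dataset A,
# via the Moore-Penrose pseudo-inverse (least-squares fit).
W_mod = np.linalg.pinv(P_a) @ T_a       # P_a @ W_mod ≈ T_a

# Cross-dataset transition: pivot of B -> pivot of A.
W_data = np.linalg.pinv(P_b) @ P_a

# Two pseudo embeddings for the target modality missing from dataset B:
X_mod = P_b @ W_mod                     # direct pivot -> target route
X_data = (P_b @ W_data) @ W_mod         # route through dataset A's pivot space

# Frobenius consistency regularizer between the two routes.
R_fro = np.linalg.norm(X_mod - X_data, ord="fro") ** 2
```

Because the transitions are plain least-squares maps, each one is a single matrix product at training time, which is what makes keeping the pseudo-inverse fixed cheap.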
Modality Extrapolation (MOX) loss
The synthetic embeddings are used in a contrastive loss that aligns them with the real pivot embeddings, effectively teaching the model how the missing modality should behave. This MOX loss is combined with a CyCLIP loss (a CLIP‑style contrastive objective plus a symmetric loss) that preserves the usual cross‑modal alignment for the modalities that are actually present.
The overall objective is:
L = L_MOX + L_CyCLIP
where L_MOX contains two contrastive terms (pivot‑target and target‑pivot) plus the Frobenius regularizer, and L_CyCLIP is the standard CLIP loss extended to multiple datasets.
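A rough sketch of this objective, assuming a standard one-directional InfoNCE as the contrastive building block. Note that plain symmetric CLIP terms stand in for the full CyCLIP loss here, and the function names, temperature, and weighting `lam` are illustrative choices, not values from the paper.

```python
import numpy as np

def contrastive(z_q, z_k, temp=0.07):
    """One-directional InfoNCE: match row i of z_q to row i of z_k."""
    z_q = z_q / np.linalg.norm(z_q, axis=1, keepdims=True)
    z_k = z_k / np.linalg.norm(z_k, axis=1, keepdims=True)
    logits = z_q @ z_k.T / temp
    logits -= logits.max(axis=1, keepdims=True)        # numerical stability
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(len(logits))
    return -logp[idx, idx].mean()                      # matched pairs on diagonal

def total_loss(pivot, target, x_mod, x_data, lam=1.0):
    """Sketch of L = L_MOX + L_CyCLIP."""
    # L_MOX: pivot->pseudo-target and pseudo-target->pivot contrastive
    # terms, plus the Frobenius consistency regularizer.
    l_mox = (contrastive(pivot, x_mod) + contrastive(x_mod, pivot)
             + lam * np.linalg.norm(x_mod - x_data) ** 2)
    # Stand-in for L_CyCLIP over the modalities actually present.
    l_cyclip = contrastive(pivot, target) + contrastive(target, pivot)
    return l_mox + l_cyclip
```

The key point the sketch captures is that the pseudo embeddings `x_mod` and `x_data` enter the same kind of contrastive term as the real pairs, so the missing modality is trained as if it were observed.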
Training and inference
During training, the pseudo‑inverse of the pivot matrix is kept fixed, avoiding instability. The transition matrices are updated so that the pivot embeddings from different datasets become mutually aligned, and the target encoders learn to generate embeddings that are compatible with both datasets. At test time, the model can accept any modality as input and retrieve representations of any other modality, even if that pair never co‑occurred in the training data.
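Once all encoders map into the joint space, test-time cross-modal retrieval reduces to nearest-neighbour search by cosine similarity. A minimal sketch (the function name and signature are invented for illustration):

```python
import numpy as np

def retrieve(query_emb, gallery_embs, k=5):
    """Rank gallery items (any modality) by cosine similarity to a query
    embedding (any other modality) in the shared joint space."""
    q = query_emb / np.linalg.norm(query_emb)
    g = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
    return np.argsort(-(g @ q))[:k]    # indices of the top-k matches
```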
Experiments
The authors evaluate BrokenBind on several realistic scenarios:
- Two‑dataset binding: ImageNet (vision) + AudioSet (audio) with vision as pivot, and ShapeNet (point clouds) + a tactile dataset with point clouds as pivot.
- Multi‑dataset binding: Combining three or more datasets to bind four modalities (vision, audio, point cloud, tactile) simultaneously.
- Low‑data regime: Training with as little as 1 % of the original samples.
Across all settings, BrokenBind outperforms prior methods by a large margin (10–25 % absolute improvement in mAP or Recall@1). The gains are especially pronounced when the target modality is missing from a dataset, confirming that the synthetic embeddings effectively fill the “broken” gaps. Ablation studies show that both the cross‑modal and cross‑data transition matrices are necessary, and that the Frobenius consistency regularizer stabilizes training.
Insights and implications
- Pivot flexibility – The framework works with either visual or textual pivots, indicating that any well‑pretrained modality can serve as the bridge.
- Linear transition sufficiency – Although the transition matrices assume linear relationships, multi‑extrapolation (using many source points) allows the model to approximate non‑linear manifolds, as demonstrated empirically.
- Scalability – Adding a new modality only requires a dataset that shares the pivot; no pairwise alignment with every existing modality is needed, dramatically reducing data collection cost.
- Generalization – By aligning distributions across datasets, BrokenBind mitigates domain shift, enabling robust performance even when the new modality comes from a different environment (e.g., indoor vs. outdoor).
Conclusion
BrokenBind introduces a principled way to bind arbitrary modalities across disparate datasets by exploiting a shared pivot modality, transition‑matrix based extrapolation, and contrastive learning. It removes the restrictive requirement of fully aligned multimodal datasets, opening the door to large‑scale, cost‑effective multimodal AI systems that can seamlessly incorporate new sensors or data sources. The extensive experiments validate its superiority over existing binding methods, especially in low‑resource and multi‑dataset scenarios.