Deep Dive into Quantifying Modality Contributions via Disentangling Multimodal Representations
arXiv:2511.19470v1 [cs.LG] 22 Nov 2025

Abstract

Quantifying modality contributions in multimodal models remains a challenge, as existing approaches conflate the notion of contribution itself. Prior work relies on accuracy-based approaches, interpreting performance drops after removing a modality as indicative of its influence. However, such outcome-driven metrics fail to distinguish whether a modality is inherently informative or whether its value arises only through interaction with other modalities. This distinction is particularly important in cross-attention architectures, where modalities influence each other's representations. In this work, we propose a framework based on Partial Information Decomposition (PID) that quantifies modality contributions by decomposing predictive information in internal embeddings into unique, redundant, and synergistic components. To enable scalable, inference-only analysis, we develop an algorithm based on the Iterative Proportional Fitting Procedure (IPFP) that computes layer and dataset-level contributions without retraining. This provides a principled, representation-level view of multimodal behavior, offering clearer and more interpretable insights than outcome-based metrics.
Recent advances in multimodal learning have enabled models to process and align information across sensory modalities such as vision and language. However, despite impressive empirical results, current methods still struggle with true multimodal integration. They often exhibit modality imbalance, a tendency to over-rely on one modality while underutilizing the other (Peng et al., 2022; Huang et al., 2022; Fan et al., 2023; Wei et al., 2024). As illustrated in Figure 1, this imbalance appears asymmetrically across both vision and text. Existing works attempt to quantify modality contribution by perturbing or masking inputs (Gat et al., 2021a; Wenderoth et al., 2025; Wang and Wang, 2025) or by employing gradient- and attention-based explanations, such as Integrated Gradients (IG) (Sundararajan et al., 2017) and Grad-CAM (Selvaraju et al., 2017). Recent works use Shapley-based approaches, such as MM-SHAP (Parcalabescu and Frank, 2022) and its extensions (Wang and Wang, 2025; Goldshmidt and Horovicz, 2024; Goldshmidt, 2025). However, these methods largely treat modalities as independent sources of information, overlooking the cross-modal interactions that emerge within the model's internal feature space.
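To make the outcome-based baseline concrete, a minimal sketch of the accuracy-drop style of attribution these perturbation works build on is given below. The `predict` function and masking baseline are hypothetical placeholders, not any specific method from the cited papers:

```python
import numpy as np

def accuracy(preds, labels):
    return float(np.mean(preds == labels))

def modality_drop_contribution(predict, image, text, labels, mask_value=0.0):
    """Outcome-based contribution: accuracy drop when one modality is masked.

    `predict(image, text)` is a hypothetical inference function returning
    class predictions; masking replaces a modality with a constant baseline.
    """
    base = accuracy(predict(image, text), labels)
    no_image = accuracy(predict(np.full_like(image, mask_value), text), labels)
    no_text = accuracy(predict(image, np.full_like(text, mask_value)), labels)
    # A larger drop suggests that modality is more influential -- but this
    # conflates unique information with cross-modal interaction effects,
    # which is precisely the ambiguity discussed above.
    return {"image": base - no_image, "text": base - no_text}
```

Note that a large drop for one modality cannot tell us whether that modality carried unique information or merely participated in a synergistic interaction; the score collapses both into a single number.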
While the methods above often treat modalities as independent sources of information, another line of research highlights that multimodal fusion is inherently interactional. Methods such as EMAP (Hessel and Lee, 2020), DIME (Lyu et al., 2022), and MultiViz (Liang et al., 2022) demonstrate that meaningful cross-modal dependencies play a critical role in shaping the final prediction.
To bridge these perspectives, we introduce a unified information-theoretic framework that captures both modality attribution and cross-modal interactions through PID (Williams and Beer, 2010). PID decomposes total predictive information into unique, redundant, and synergistic components. This principled decomposition offers clear insights into how individual modalities contribute both independently and collectively, enabling a deeper understanding of how multimodal fusion arises within the model’s internal feature representations.
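Concretely, following the standard Williams and Beer identities (the notation here is illustrative; $X_1$ and $X_2$ denote the two modality embeddings and $Y$ the prediction target), the decomposition satisfies:

```latex
% Partial Information Decomposition of the total predictive information
I(Y; X_1, X_2) = U_1 + U_2 + R + S
% Each unimodal information splits into a unique and a redundant part
I(Y; X_1) = U_1 + R, \qquad I(Y; X_2) = U_2 + R
```

Here $U_1, U_2$ are the unique contributions of each modality, $R$ the redundant information available in both, and $S$ the synergy that emerges only from their combination. An accuracy-drop test cannot separate these terms: removing $X_1$ destroys both $U_1$ and $S$ at once.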
Unlike previous PID-based approaches that rely on computationally intensive conic solvers or require auxiliary network training (Liang et al., 2023), we present a scalable, inference-only approximation leveraging the Iterative Proportional Fitting Procedure (IPFP) (Bacharach, 1965). This makes our approach parameter-free, computationally efficient, and directly applicable for post-hoc multimodal analysis.
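The core fitting step can be illustrated with a generic IPFP routine. This is a sketch of classical IPFP on a two-way contingency table, not the paper's exact algorithm; the variable names and iteration budget are assumptions:

```python
import numpy as np

def ipfp(joint, row_marginal, col_marginal, n_iters=200, eps=1e-12):
    """Iterative Proportional Fitting: rescale a positive joint table so its
    row/column sums match target marginals, preserving its interaction
    structure (odds ratios) from the initial table.
    """
    q = joint.astype(float).copy()
    for _ in range(n_iters):
        q *= (row_marginal / (q.sum(axis=1) + eps))[:, None]  # match row sums
        q *= (col_marginal / (q.sum(axis=0) + eps))[None, :]  # match col sums
    return q
```

Because each step is a closed-form rescaling, the procedure needs no gradient-based training or conic optimization, which is what makes a parameter-free, inference-only analysis feasible at scale.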
Our key contributions are as follows:
• We propose the first modality contribution metric that jointly captures modality-specific and interactional effects.
• We develop a quantification approach based on PID and propose a novel, computationally efficient metric derived using the IPFP algorithm.
• We evaluate our method across diverse VLMs and datasets, supported by synthetic experiments and ablation studies.
2 Related work
Gradient-based approaches, such as Integrated Gradients (IG) (Sundararajan et al., 2017), attribute contributions through path-integrated gradients but suffer from instability, baseline sensitivity, and high attribution noise (Zhuo and Ge, 2024). By relying solely on explicand gradients and neglecting counterfactual ones, IG fails to capture higher-order effects, such as redundant or synergistic contributions where features jointly influence predictions in non-additive ways, and violates key Shapley axioms, yielding spurious attributions. More broadly, gradient saliency often reflects the model's implicit density induced by the softmax layer rather than genuine discriminative reasoning (Srinivas and Fleuret, 2020). Attention-based explanations share similar pitfalls: attention weights may misrepresent causal polarity (Liu et al., 2022), fail to correlate with true feature importance (Wiegreffe and Pinter, 2019), and can be adversarially altered without affecting predictions (Serrano and Smith, 2019). White-box visualization methods such as attention maps and Grad-CAM (Selvaraju et al., 2017) offer coarse interpretability but remain architecture-dependent and cannot disentangle synergistic from suppressive cross-modal effects.
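For reference, the IG attribution being critiqued here is simply a Riemann approximation of the path integral from a baseline to the input. The sketch below uses a hypothetical `grad_fn` standing in for the model's gradient; it illustrates why IG is inherently additive along a single path and thus blind to the joint-feature effects discussed above:

```python
import numpy as np

def integrated_gradients(grad_fn, x, baseline, steps=64):
    """Integrated Gradients via a midpoint Riemann sum along the straight
    path from `baseline` to `x`.

    `grad_fn(z)` is a hypothetical function returning d(output)/dz at z.
    """
    alphas = (np.arange(steps) + 0.5) / steps  # midpoints in (0, 1)
    total = np.zeros_like(x, dtype=float)
    for a in alphas:
        total += grad_fn(baseline + a * (x - baseline))
    # Per-feature attribution: (x - baseline) * average path gradient.
    return (x - baseline) * total / steps
```

By the completeness axiom, the attributions sum to approximately f(x) - f(baseline); each feature's score is computed along one shared path, so redundant or synergistic structure between features is never represented separately.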
Early studies on quantifying modality contribution focused on uncovering dataset biases that enable strong unimodal performance (Goyal et al., 2017a). Subsequent work introduced perturbation-based tests to quantify modality importance by observing performance degradation when one modality is removed or altered. These approaches fall into two categories: deletion methods, which suppress or mask modality features (Shekhar et al., 2017a; Madhyastha et al., 2018; Frank et al., 2021), and contradiction or foiling methods, which inject misleading inputs such as swapped captions or textual foils (Gat et al., 2021b; Shekhar et al., 2019; Parcalabescu et al., 2022). However, their reliance on accuracy-based ev
…(Full text truncated)…