Feature-level Interaction Explanations in Multimodal Transformers
Multimodal Transformers often produce predictions without clarifying how different modalities jointly support a decision. Most existing multimodal explainable AI (MXAI) methods extend unimodal saliency to multimodal backbones, highlighting important tokens or patches within each modality, but they rarely pinpoint which cross-modal feature pairs provide complementary evidence (synergy) or serve as reliable backups (redundancy). We present Feature-level I2MoE (FL-I2MoE), a structured Mixture-of-Experts layer that operates directly on token/patch sequences from frozen pretrained encoders and explicitly separates unique, synergistic, and redundant evidence at the feature level. We further develop an expert-wise explanation pipeline that combines attribution with top-K% masking to assess faithfulness, and we introduce Monte Carlo interaction probes to quantify pairwise behavior: the Shapley Interaction Index (SII) to score synergistic pairs and a redundancy-gap score to capture substitutable (redundant) pairs. Across three benchmarks (MM-IMDb, ENRICO, and MMHS150K), FL-I2MoE yields more interaction-specific and concentrated importance patterns than a dense Transformer with the same encoders. Finally, pair-level masking shows that removing pairs ranked by SII or redundancy-gap degrades performance more than masking randomly chosen pairs under the same budget, indicating that the identified interactions are causally relevant.
💡 Research Summary
The paper tackles a fundamental gap in multimodal explainable AI (MXAI): existing methods typically highlight salient tokens or patches within each modality but fail to reveal how cross‑modal feature pairs jointly contribute to a decision, whether as complementary (synergistic) evidence or as interchangeable (redundant) backups. To address this, the authors extend the Interaction‑aware Mixture‑of‑Experts (I²MoE) architecture from pooled modality vectors to raw token‑ and patch‑level sequences, creating Feature‑level I²MoE (FL‑I2MoE).
FL‑I2MoE keeps pretrained text and image encoders frozen, extracting token‑level (text) and patch‑level (image) embeddings without any pooling. These sequences are concatenated and fed into a transformer‑based fusion module that contains three dedicated experts: uniqueness, synergy, and redundancy. Each expert produces its own class logits; a gating network learns instance‑specific mixture weights (wᵢ) that combine the expert predictions. The training objective mirrors the original I²MoE: a standard cross‑entropy loss for the downstream task plus an interaction loss (L_int) that encourages each expert to specialize in its intended information type.
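The expert-mixture head described above can be sketched as follows. This is a minimal numpy illustration of the gating mechanism only, not the paper's implementation: dimensions are arbitrary, random weights stand in for trained parameters, and the fused sequence is assumed to have already been reduced to a single vector per instance.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Illustrative sizes (not from the paper).
d_model, n_classes, n_experts = 16, 4, 3   # experts: uniqueness, synergy, redundancy

# Random weights stand in for the trained expert heads and gating network.
W_experts = rng.normal(size=(n_experts, d_model, n_classes))
W_gate = rng.normal(size=(d_model, n_experts))

def fl_i2moe_head(h):
    """Each expert emits its own class logits; a gating network produces
    instance-specific mixture weights w_i that combine them."""
    expert_logits = np.stack([h @ W for W in W_experts])   # (n_experts, n_classes)
    w = softmax(h @ W_gate)                                # (n_experts,), sums to 1
    return (w[:, None] * expert_logits).sum(axis=0), w

h = rng.normal(size=d_model)        # fused representation of one instance
logits, w = fl_i2moe_head(h)
```

In training, the combined logits would feed the cross-entropy task loss, while the per-expert logits would additionally feed the interaction loss L_int that pushes each expert toward its intended information type.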
For post‑hoc interpretation, the authors propose an expert‑wise attribution pipeline. They compute importance scores for every token/patch using Gradient × Attention rollout (Grad × AttnRoll), which proved most faithful in masking experiments. The resulting importance vectors are split per modality and per expert, yielding three sets of high‑importance features. These sets define a candidate pool for interaction analysis, dramatically reducing the combinatorial explosion of possible cross‑modal pairs.
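One plausible form of the Grad × AttnRoll attribution is sketched below; the paper's exact formulation may differ. Here attention rollout (multiplying head-averaged attention maps across layers, mixing in the identity for residual connections) yields per-token relevance from the classification position, which is then weighted by input-gradient magnitude. All inputs are synthetic stand-ins for a trained model's attention maps and gradients.

```python
import numpy as np

rng = np.random.default_rng(0)

def attention_rollout(attns):
    """Multiply head-averaged attention maps layer by layer, adding the
    identity at each layer to account for residual connections."""
    n = attns[0].shape[-1]
    rollout = np.eye(n)
    for A in attns:                         # A: (n_tokens, n_tokens), rows sum to 1
        A_res = 0.5 * A + 0.5 * np.eye(n)
        A_res /= A_res.sum(axis=-1, keepdims=True)
        rollout = A_res @ rollout
    return rollout

def grad_x_attnroll(attns, grads):
    """Assumed variant: relevance of each token to position 0 (e.g. [CLS]),
    scaled elementwise by per-token gradient magnitude."""
    relevance = attention_rollout(attns)[0]
    return relevance * np.abs(grads)

# Toy attention maps (row-stochastic) and token gradients.
n_tokens, n_layers = 6, 3
attns = [rng.random((n_tokens, n_tokens)) for _ in range(n_layers)]
attns = [A / A.sum(axis=-1, keepdims=True) for A in attns]
grads = rng.normal(size=n_tokens)
scores = grad_x_attnroll(attns, grads)
```

The resulting score vector would then be split per modality and per expert to build the candidate pool for interaction analysis.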
Interaction quantification is performed via Monte Carlo estimation. For each expert, the top‑ρ % of its important features are selected. Then, for every cross‑modal pair (u, v) in this reduced set, the authors mask u, v, and both together across many random subsets of the remaining features, measuring the change in the target logit. Two scores are derived: (1) the Shapley Interaction Index (SII), which captures synergistic contribution (positive SII indicates that the pair adds more predictive power together than the sum of its parts); and (2) a Redundancy‑Gap metric, defined as the difference between the sum of individual masking effects and the joint masking effect, quantifying how substitutable the two features are.
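The two scores can be made concrete with a toy surrogate. The sketch below replaces the model's target logit with a hand-built function over binary feature masks, containing one explicitly synergistic pair (0, 5) and one explicitly redundant pair (1, 6); the Monte Carlo loop then estimates SII and the redundancy gap exactly as described, by masking u, v, and both over random subsets of the remaining features. All names and values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for "target logit given a feature mask": linear terms plus
# one synergistic pair (0, 5) and one redundant pair (1, 6).
n_feat = 10
w = rng.normal(size=n_feat)

def f(mask):
    """mask[i] = 1 means feature i is present (unmasked)."""
    val = float(w @ mask)
    val += 2.0 * mask[0] * mask[5]          # synergy: only pays off jointly
    val += 1.5 * max(mask[1], mask[6])      # redundancy: either one suffices
    return val

def mc_pair_scores(u, v, n_samples=2000):
    """Monte Carlo estimates of the Shapley Interaction Index and the
    redundancy gap for the cross-modal pair (u, v)."""
    others = [i for i in range(n_feat) if i not in (u, v)]
    sii = red_gap = 0.0
    for _ in range(n_samples):
        S = np.zeros(n_feat)
        S[others] = rng.integers(0, 2, size=len(others))   # random context subset
        Su, Sv, Suv = S.copy(), S.copy(), S.copy()
        Su[u] = 1; Sv[v] = 1; Suv[[u, v]] = 1
        d_u, d_v = f(Su) - f(S), f(Sv) - f(S)              # individual effects
        d_uv = f(Suv) - f(S)                               # joint effect
        sii += d_uv - d_u - d_v            # positive when the pair is synergistic
        red_gap += (d_u + d_v) - d_uv      # positive when effects overlap
    return sii / n_samples, red_gap / n_samples

sii_syn, gap_syn = mc_pair_scores(0, 5)    # synergistic pair: high SII
sii_red, gap_red = mc_pair_scores(1, 6)    # redundant pair: high redundancy gap
```

On this surrogate the synergistic pair scores a large positive SII and a negative redundancy gap, while the redundant pair shows the reverse pattern, matching the intended interpretation of the two metrics.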
Experiments span three diverse benchmarks: MM‑IMDb (movie‑genre classification from posters and plot summaries), ENRICO (mobile UI screen design classification), and MMHS150K (multimodal hate‑speech detection). All baselines share the same frozen encoders and comparable parameter budgets; the primary comparison is a dense transformer that fuses pooled modality vectors. Results show that FL‑I2MoE achieves modest but consistent accuracy gains (≈1–2 % absolute) while producing far more interpretable expert‑wise importance maps. Crucially, masking the top‑5 % of pairs ranked by SII degrades performance by an average of 8 % absolute, whereas random pair masking under the same budget yields only ~2 % degradation. A similar pattern holds for the Redundancy‑Gap ranking, confirming that the identified pairs are causally linked to model predictions.
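The ranked-versus-random masking protocol can be mimicked with a fully synthetic toy: each candidate pair is assigned a "true" additive effect on the target logit, interaction scores are noisy estimates of those effects, and masking the top-scored pairs is compared against random masking under the same budget. Every number here is synthetic; nothing reproduces the paper's reported 8 % vs ~2 % gap.

```python
import numpy as np

rng = np.random.default_rng(1)

# 20 candidate cross-modal pairs with additive "true" causal effects;
# scores (e.g. SII estimates) are noisy observations of those effects.
n_pairs, budget = 20, 5
true_effect = rng.exponential(1.0, size=n_pairs)
scores = true_effect + rng.normal(0.0, 0.05, size=n_pairs)

def logit_drop(masked_pairs):
    """Drop in the target logit when the given pairs are masked (additive toy)."""
    return float(true_effect[list(masked_pairs)].sum())

top_ranked = np.argsort(scores)[-budget:]      # mask the highest-scored pairs
drop_ranked = logit_drop(top_ranked)

# Random baseline under the same budget, averaged over many draws.
rand_drops = [logit_drop(rng.choice(n_pairs, size=budget, replace=False))
              for _ in range(200)]
drop_random = float(np.mean(rand_drops))
```

If the scores track genuine causal effects, the ranked drop exceeds the random baseline, which is the signature the paper's pair-level masking experiment looks for.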
Analysis of the learned mixture weights reveals dataset‑level trends: for text‑heavy datasets, the uniqueness and redundancy experts allocate higher weights to textual features, whereas image‑centric tasks see more balanced allocations. Instance‑level weights vary dynamically, reflecting the model’s ability to route each input toward the most informative interaction type.
The authors acknowledge computational limitations: Monte Carlo sampling requires 500–1000 draws per instance to obtain stable estimates, increasing GPU memory and runtime roughly twofold. Moreover, the current implementation is limited to two modalities; extending to three or more would necessitate additional expert designs and more sophisticated pair‑selection heuristics.
In summary, FL‑I2MoE provides a principled, feature‑level framework for disentangling unique, synergistic, and redundant information in multimodal transformers. By coupling an interaction‑aware architecture with Shapley‑based pairwise metrics and rigorous masking validation, the work advances both model performance and interpretability, offering a valuable tool for high‑stakes domains where understanding cross‑modal reasoning is essential.