DCER: Dual-Stage Compression and Energy-Based Reconstruction
Multimodal fusion faces two robustness challenges: noisy inputs degrade representation quality, and missing modalities cause prediction failures. We propose DCER, a unified framework addressing both challenges through dual-stage compression and energy-based reconstruction. The compression stage operates at two levels: within-modality frequency transforms (wavelet for audio, DCT for video) remove noise while preserving task-relevant patterns, and cross-modality bottleneck tokens force genuine integration rather than modality-specific shortcuts. For missing modalities, energy-based reconstruction recovers representations via gradient descent on a learned energy function, with the final energy providing intrinsic uncertainty quantification (ρ > 0.72 correlation with prediction error). Experiments on CMU-MOSI, CMU-MOSEI, and CH-SIMS demonstrate state-of-the-art performance across all benchmarks, with a U-shaped robustness pattern favoring multimodal fusion at both complete and high-missing conditions. The code will be available on GitHub.
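The reconstruction idea in the abstract, recovering a missing-modality representation by descending a learned energy function and reading the final energy as a confidence signal, can be sketched as follows. This is a minimal illustration, not the paper's model: the learned energy network is replaced by a fixed quadratic energy E(z) = ½‖Az − b‖² so the minimizer can be checked against the closed-form least-squares solution.

```python
import numpy as np

# Stand-in for the learned energy network: a quadratic whose minimum
# is known in closed form (all names and shapes here are illustrative).
rng = np.random.default_rng(0)
A = rng.standard_normal((8, 4))
b = rng.standard_normal(8)

def energy(z):
    r = A @ z - b
    return 0.5 * float(r @ r)

def grad_energy(z):
    return A.T @ (A @ z - b)

# Recover the missing representation by gradient descent on the energy,
# starting from a neutral (zero) initialization.
eta = 1.0 / np.linalg.norm(A, 2) ** 2   # stable step size for this quadratic
z = np.zeros(4)
for _ in range(500):
    z -= eta * grad_energy(z)

z_star = np.linalg.lstsq(A, b, rcond=None)[0]  # closed-form minimizer
final_energy = energy(z)  # in the paper, low final energy ~ high confidence
```

In DCER the analogue of `final_energy` is what correlates (ρ > 0.72) with prediction error, turning the optimization residual into a free uncertainty estimate.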
💡 Research Summary
The paper introduces DCER (Dual‑stage Compression with Energy‑based Reconstruction), a unified framework that tackles two major robustness issues in multimodal sentiment analysis: noisy inputs and missing modalities. The authors argue that both problems can be mitigated by forcing information through capacity‑limited channels, i.e., compression.
In the first compression stage, each modality is processed in its native signal domain. Audio signals undergo a three‑level discrete wavelet transform (DWT) with learnable wavelet bases (initialized from Daubechies‑4) to separate coarse approximations from fine‑scale details. The wavelet coefficients are then passed through a cross‑scale attention module that emphasizes frequency bands known to carry emotional cues while suppressing uniformly distributed sensor noise. Video frames are transformed with a 2‑D discrete cosine transform (DCT) and split into four frequency bands (low to high). A learnable band‑boundary attention selects the most informative bands, effectively discarding high‑frequency components that are mostly noise. Text, already a symbolic representation, is encoded with a pre‑trained RoBERTa model followed by a linear projection.
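The video branch described above can be sketched with off-the-shelf transforms. The band boundaries below are fixed along the diagonal frequency index, whereas the paper learns them via band-boundary attention; function names and the choice of four bands with the highest one dropped are illustrative assumptions.

```python
import numpy as np
from scipy.fft import dctn, idctn

def dct_band_masks(h, w, n_bands=4):
    # Partition DCT coefficients into bands by diagonal frequency index u + v.
    u = np.arange(h)[:, None] + np.arange(w)[None, :]
    edges = np.linspace(0, u.max() + 1, n_bands + 1)
    return [(u >= lo) & (u < hi) for lo, hi in zip(edges[:-1], edges[1:])]

def compress_frame(frame, keep_bands=3):
    # 2-D DCT, keep the lowest `keep_bands` bands, invert back to pixels.
    coeffs = dctn(frame, norm="ortho")
    masks = dct_band_masks(*frame.shape)
    kept = sum(coeffs * m for m in masks[:keep_bands])
    return idctn(kept, norm="ortho")

frame = np.random.default_rng(1).random((32, 32))
recon = compress_frame(frame)
```

Because the DCT is orthonormal, discarding the top band can only remove energy, which is the sense in which high-frequency noise is filtered while low-frequency structure survives.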
The second stage introduces a cross-modal bottleneck. A small set of K learnable query tokens (K = 4 in the experiments) attends simultaneously to the concatenated modality features H, distilling them into a fixed-size representation that all downstream predictions must pass through.
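The bottleneck step can be sketched as single-head cross-attention from the K query tokens to the concatenated features. Shapes, the single-head form, and the per-modality sequence lengths are illustrative assumptions; the paper's exact layer may differ.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 16
# Concatenated audio / video / text features H (sequence lengths illustrative).
H = np.concatenate([rng.standard_normal((n, d)) for n in (10, 12, 8)])
Q = rng.standard_normal((4, d))  # K = 4 learnable bottleneck query tokens

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

attn = softmax(Q @ H.T / np.sqrt(d))  # (K, N): each query attends over all tokens
Z = attn @ H                          # (K, d): fixed-size bottleneck output
```

Whatever the input sequence lengths, `Z` is always K × d, which is what forces the model to integrate across modalities rather than routing around the bottleneck.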