Deep Multimodal Learning with Missing Modality: A Survey
During multimodal model training and testing, certain data modalities may be absent due to sensor limitations, cost constraints, privacy concerns, or data loss, negatively affecting performance. Multimodal learning techniques designed to handle missing modalities can mitigate this by ensuring model robustness even when some modalities are unavailable. This survey reviews recent progress in Multimodal Learning with Missing Modality (MLMM), focusing on deep learning methods. It provides the first comprehensive survey that covers the motivation and distinctions between MLMM and standard multimodal learning setups, followed by a detailed analysis of current methods, applications, and datasets, concluding with challenges and future directions.
💡 Research Summary
This paper presents the first comprehensive survey of deep multimodal learning methods that explicitly address the problem of missing modalities (MLMM). The authors begin by distinguishing MLMM from conventional multimodal learning, emphasizing that real‑world systems often encounter situations where one or more modalities are unavailable due to sensor failures, cost constraints, privacy regulations, or transmission errors. They argue that robustness to such incomplete inputs is essential for reliable deployment.
A novel taxonomy is introduced, organized along two orthogonal axes: Data Processing and Strategy Design. Under Data Processing, methods are split into Modality Imputation and Representation‑Focused Models. Imputation operates at the raw data level and includes (i) modality composition (zero/random filling, K‑nearest‑neighbor retrieval of similar samples) and (ii) modality generation using generative models such as auto‑encoders, GANs, and diffusion networks. Representation‑Focused models address missing data at the feature level instead: they enforce coordinated latent spaces (co‑aligned embeddings), generate missing‑modality embeddings from the available ones, or fuse the existing embeddings to compensate for gaps.
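To make the composition-based imputation branch concrete, here is a minimal sketch of zero filling and K‑nearest‑neighbor retrieval at the raw-feature level. The function name, the dictionary layout, and the use of one available modality as the retrieval key are illustrative assumptions, not an API from any surveyed method.

```python
import numpy as np

def impute_missing_modality(samples, missing_idx, modality, strategy="zero"):
    """Fill in a missing modality for one sample.

    samples:     dict mapping modality name -> (N, D) feature array
    missing_idx: index of the sample whose `modality` is missing
    modality:    name of the missing modality
    strategy:    "zero" fills with zeros; "knn" copies the modality from
                 the nearest neighbor, found via an available reference modality
    """
    target = samples[modality]
    if strategy == "zero":
        return np.zeros_like(target[missing_idx])
    if strategy == "knn":
        # Use another (available) modality to find the most similar sample,
        # then borrow that neighbor's version of the missing modality.
        ref_name = next(m for m in samples if m != modality)
        ref = samples[ref_name]
        dists = np.linalg.norm(ref - ref[missing_idx], axis=1)
        dists[missing_idx] = np.inf  # exclude the sample itself
        nearest = int(np.argmin(dists))
        return target[nearest]
    raise ValueError(f"unknown strategy: {strategy}")
```

Zero filling is cheap but can bias the fused representation toward the origin, which is one reason the survey contrasts it with generative imputation.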
Strategy Design is divided into Architecture‑Focused Models and Model Combinations. Architecture‑Focused approaches modify the network structure to adapt dynamically to the set of present modalities. The survey highlights four main techniques: (a) attention mechanisms that re‑weight available modalities, (b) knowledge distillation where a full‑modality “teacher” transfers knowledge to a partial‑modality “student,” (c) graph‑based fusion that exploits relational structures among modalities, and (d) multimodal large language models (MLLMs) that ingest arbitrary numbers of modality embeddings via textual prompts. Model Combination strategies, on the other hand, keep multiple specialized models and select or ensemble them at inference time. Examples include dedicated training schedules for each missing‑modality pattern, ensemble voting, and discrete schedulers that orchestrate sub‑modules based on modality availability.
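The attention-based branch of the architecture-focused family can be illustrated with a small sketch: logits for absent modalities are masked out before the softmax, so the fusion weights over whatever is present always sum to one. The function name, scalar per-modality logits, and NumPy implementation are simplifying assumptions for illustration; real methods learn these weights end to end.

```python
import numpy as np

def fuse_available(embeddings, available, logits):
    """Attention-style fusion over whichever modalities are present.

    embeddings: dict modality -> (D,) embedding vector
    available:  set of modality names observed for this sample
    logits:     dict modality -> scalar attention logit (learned in practice)
    """
    names = [m for m in embeddings if m in available]
    raw = np.array([logits[m] for m in names])
    # Softmax over present modalities only: missing ones get zero weight.
    weights = np.exp(raw - raw.max())
    weights /= weights.sum()
    fused = sum(w * embeddings[m] for w, m in zip(weights, names))
    return fused, dict(zip(names, weights))
```

Because the normalization is restricted to the observed set, the same network produces a valid fused embedding for any missing-modality pattern, which is the dynamic-adaptation property the survey attributes to these architectures.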
The authors collected 354 relevant papers published between 2012 and August 2025 from top conferences and journals across AI, computer vision, NLP, audio processing, data mining, multimedia, medical imaging, remote sensing, and pervasive computing. They provide a detailed table of representative works for each taxonomy branch, together with the domains where they have been applied: sentiment analysis, medical diagnosis, remote sensing, robotic vision, information retrieval, and multi‑view clustering. A curated list of more than 20 public datasets (e.g., CMU-MOSI, AV-MNIST, MIMIC‑IV multimodal, etc.) is also included, illustrating the breadth of evaluation resources.
In the discussion of challenges, the survey identifies several open issues: (1) integration of missing‑modality techniques with large‑scale pre‑trained multimodal models, (2) handling temporally evolving missing patterns (e.g., streaming video with intermittent sensor drop‑outs), (3) computational and memory efficiency for real‑time deployment, (4) bias and distortion introduced by imputed or generated modalities, and (5) lack of standardized benchmarks that jointly evaluate robustness, accuracy, and efficiency. Future research directions suggested include learning modality‑invariant representations, dynamic graph‑based fusion, prompt engineering for MLLMs, and multi‑task/multi‑domain transfer learning to improve generalization under missing data.
Overall, the paper offers a clear, structured overview of the state‑of‑the‑art in deep MLMM, a taxonomy that helps researchers locate relevant methods, a synthesis of applications and datasets, and a forward‑looking agenda that highlights where the field should head next.