MARIA: a Multimodal Transformer Model for Incomplete Healthcare Data


In healthcare, the integration of multimodal data is pivotal for developing comprehensive diagnostic and predictive models. However, managing missing data remains a significant challenge in real-world applications. We introduce MARIA (Multimodal Attention Resilient to Incomplete datA), a novel transformer-based deep learning model designed to address these challenges through an intermediate fusion strategy. Unlike conventional approaches that depend on imputation, MARIA utilizes a masked self-attention mechanism, which processes only the available data without generating synthetic values. This approach enables it to effectively handle incomplete datasets, enhancing robustness and minimizing biases introduced by imputation methods. We evaluated MARIA against 10 state-of-the-art machine learning and deep learning models across 8 diagnostic and prognostic tasks. The results demonstrate that MARIA outperforms existing methods in terms of performance and resilience to varying levels of data incompleteness, underscoring its potential for critical healthcare applications.


💡 Research Summary

The paper addresses a pervasive problem in clinical artificial intelligence: multimodal healthcare datasets are often incomplete due to sensor failures, patient non‑compliance, privacy constraints, or logistical issues. Traditional approaches either discard records with missing values or rely on imputation techniques (k‑NN, MIA, GAN‑based synthesis, etc.) that can introduce bias, obscure true relationships, and increase computational overhead. To overcome these limitations, the authors propose MARIA (Multimodal Attention Resilient to Incomplete datA), a transformer‑based architecture that processes only the observed data without generating synthetic values.

MARIA’s design consists of three main components. First, modality‑specific encoders transform raw inputs (clinical notes, imaging, laboratory tests, etc.) into latent vectors r_i. These encoders can be tailored to the statistical properties of each data type (e.g., CNNs for images, RNNs or Transformers for sequential records). Second, an intermediate‑fusion layer aggregates the modality‑specific representations into a shared multimodal embedding r_shared. Unlike early fusion, which concatenates raw features and suffers heavily from missing modalities, MARIA’s intermediate fusion preserves modality‑level information while still enabling cross‑modal interaction learning. Third, a masked self‑attention mechanism extends the standard transformer padding mask to a “missing‑data mask.” During attention score computation, any token corresponding to a missing feature or an entirely absent modality is completely masked out, ensuring that no gradient flows through absent information. This eliminates the need for any form of imputation and prevents the model from learning spurious correlations based on artificially filled values.
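The core of the third component, masking missing tokens out of the attention computation, can be sketched in a few lines. This is an illustrative single-head sketch in numpy, not the authors' implementation: the identity projections stand in for learned W_q/W_k/W_v matrices, and `present` plays the role of MARIA's missing-data mask.

```python
import numpy as np

def masked_self_attention(x, present):
    """Single-head self-attention that ignores missing tokens.

    x:       (n_tokens, d) latent vectors from the modality encoders
    present: (n_tokens,) boolean mask, False where a feature or an
             entire modality is missing (stand-in for the missing-data mask)
    """
    d = x.shape[1]
    # Hypothetical identity projections for brevity; a real model
    # would apply learned query/key/value projections here.
    q, k, v = x, x, x
    scores = q @ k.T / np.sqrt(d)          # (n, n) attention logits
    scores[:, ~present] = -np.inf          # missing tokens receive zero attention
    scores -= scores.max(axis=1, keepdims=True)   # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ v                     # masked columns contribute nothing

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
present = np.array([True, True, False, True])   # third token is missing
out = masked_self_attention(x, present)
```

Because the masked logits are set to negative infinity before the softmax, the attention weights on missing tokens are exactly zero, so whatever placeholder values sit in those positions can never influence the output (and hence never receive gradient).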

The authors evaluate MARIA on eight diagnostic and prognostic tasks, including COVID‑19 detection, Alzheimer’s disease progression, cardiovascular risk stratification, and several others. For each task, they compare MARIA against ten state‑of‑the‑art baselines: traditional machine‑learning classifiers (random forests, SVMs), multimodal deep networks (early, late, and intermediate fusion CNN/RNN hybrids), and recent transformer‑based multimodal models. They systematically vary the proportion of missing data from 0 % to 70 % and consider both random missingness and modality‑specific missingness (e.g., all imaging data missing).
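The two missingness regimes described above are straightforward to simulate. The sketch below (helper names and modality sizes are hypothetical, chosen only for illustration) builds boolean observation masks for both random, feature-level missingness and modality-specific missingness:

```python
import numpy as np

def random_missingness(shape, rate, rng):
    """Feature-level mask: each entry is missing independently with prob `rate`.
    True = observed, False = missing."""
    return rng.random(shape) >= rate

def modality_missingness(n_samples, modality_sizes, dropped):
    """Modality-level mask: whole feature blocks absent for every sample.

    modality_sizes: ordered dict-like, e.g. {"clinical": 10, "imaging": 64, "labs": 20}
    dropped:        names of modalities that are entirely missing
    """
    cols = [np.full((n_samples, size), name not in dropped)
            for name, size in modality_sizes.items()]
    return np.concatenate(cols, axis=1)

rng = np.random.default_rng(1)
# ~30 % of individual features missing at random
m_rand = random_missingness((100, 94), rate=0.3, rng=rng)
# all imaging data missing, clinical and lab features intact
m_mod = modality_missingness(100, {"clinical": 10, "imaging": 64, "labs": 20},
                             dropped={"imaging"})
```

Masks like these can be fed directly to a masked-attention layer, which is how a benchmark sweep from 0 % to 70 % missingness could be driven.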

Results show that MARIA consistently outperforms all baselines across all missing‑data regimes. When missingness is ≤30 %, MARIA achieves an average accuracy gain of 4.2 percentage points over the best baseline. Even at extreme missingness (≥50 %), performance degradation is modest (≤1.8 pp), whereas most baselines suffer drops of 5–15 pp. Notably, MARIA maintains robustness when an entire modality is absent, leveraging the remaining modalities through its shared attention space. Computationally, the masked attention incurs the same O(N²) complexity as a standard transformer but reduces memory usage by roughly 12 % because masked tokens are excluded from the attention matrix. Training incorporates dynamic mask updates so that the model learns to handle the exact missing‑data patterns present during inference.
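The "dynamic mask updates" mentioned above can be read as resampling a fresh missingness pattern for every training batch, so the model is exposed to many patterns rather than one fixed one. A minimal generator sketch, assuming per-batch uniform sampling of the missingness rate (the paper's exact schedule may differ):

```python
import numpy as np

def train_epoch_masks(n_batches, batch_shape, rate_range, rng):
    """Yield a fresh boolean observation mask per batch (True = observed).

    Illustrative sketch of dynamic mask updates: both the missingness
    rate and the pattern are resampled for every batch.
    """
    for _ in range(n_batches):
        rate = rng.uniform(*rate_range)        # vary severity per batch
        yield rng.random(batch_shape) >= rate  # vary pattern per batch

masks = list(train_epoch_masks(n_batches=5, batch_shape=(8, 20),
                               rate_range=(0.1, 0.6),
                               rng=np.random.default_rng(2)))
```

At inference time no resampling is needed: the mask is simply derived from which features and modalities are actually present in the record.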

The paper also discusses limitations. In fully complete datasets, MARIA’s performance is marginally lower than a specialized early‑fusion model, likely because the intermediate fusion introduces a small overhead when no missing data exists. Additionally, the modality‑specific encoders require domain expertise to design optimally, which could limit out‑of‑the‑box deployment. The authors propose future work on automated encoder architecture search, meta‑learning of missing‑data patterns, and real‑world deployment in hospital information systems for prospective validation.

In summary, MARIA introduces a principled, transformer‑based solution to the missing‑data challenge in multimodal healthcare AI. By combining modality‑specific encoders, intermediate fusion, and a novel masked self‑attention mechanism, it avoids the pitfalls of imputation, delivers superior predictive performance, and remains computationally efficient—making it a compelling candidate for integration into clinical decision‑support pipelines.
