M3OOD: Automatic Selection of Multimodal OOD Detectors
Out-of-distribution (OOD) robustness is a critical challenge for modern machine learning systems, particularly as they increasingly operate in multimodal settings involving inputs like video, audio, and sensor data. Many OOD detection methods have been proposed, each with a different design targeting particular distribution shifts. A single OOD detector may not prevail across all scenarios; therefore, how can we automatically select an ideal OOD detection model for different distribution shifts? Due to the inherently unsupervised nature of the OOD detection task, it is difficult to predict model performance and find a universally best model. Also, systematically comparing models on new, unseen data is costly or even impractical. To address this challenge, we introduce M3OOD, a meta-learning-based framework for OOD detector selection in multimodal settings. Meta-learning offers a solution by learning from historical model behaviors, enabling rapid adaptation to new data distribution shifts with minimal supervision. Our approach combines multimodal embeddings with handcrafted meta-features that capture distributional and cross-modal characteristics to represent datasets. By leveraging historical performance across diverse multimodal benchmarks, M3OOD can recommend suitable detectors for a new data distribution shift. Experimental evaluation demonstrates that M3OOD consistently outperforms 10 competitive baselines across 12 test scenarios with minimal computational overhead.
💡 Research Summary
The paper tackles the problem of automatically selecting the most suitable out‑of‑distribution (OOD) detector for multimodal machine‑learning systems, where inputs may include video, audio, optical flow, and other sensor streams. While many OOD detection methods exist, each is tailored to specific distribution shifts and no single detector works well across all scenarios. Moreover, OOD detection is inherently unsupervised—ground‑truth OOD labels are unavailable at test time—making it impractical to evaluate every candidate detector on a new dataset. To address this, the authors propose M3OOD, a meta‑learning framework that learns from historical performance of a pool of OOD detectors on a diverse set of multimodal benchmark datasets and then predicts which detector will perform best on a previously unseen dataset without any OOD labels.
Meta‑learning formulation
M3OOD treats each historical dataset‑detector pair (D_i, M_j) as a training example with a known performance score P_{i,j} (e.g., AUROC). The goal is to learn a function f that maps a representation of the dataset and a representation of the detector to the expected performance. The function f is instantiated as an XGBoost regression model because of its strong feature‑selection capability and interpretability.
Dataset representations
Two complementary types of meta‑features are extracted for each dataset:
- Hand‑crafted statistical and distributional descriptors (means, variances, dimensionalities, modality‑specific counts, etc.).
- Learned multimodal embeddings obtained from a SlowFast network pre‑trained on Kinetics‑400. The Slow pathway processes video frames, while a separate SlowFast branch processes optical flow; the resulting embeddings for all modalities are concatenated to form a unified multimodal vector E_data.
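The two feature families above can be sketched as follows. This is a minimal numpy illustration with synthetic arrays; the actual SlowFast extraction is omitted, and the specific statistics shown are plausible examples of the handcrafted descriptors rather than the paper's exact feature set.

```python
import numpy as np

def handcrafted_meta_features(modalities):
    """Compute simple distributional descriptors per modality.

    `modalities` maps a modality name to an (n_samples, dim) array.
    The statistics below (dimensionality, global mean/variance) are
    illustrative of the handcrafted features M3OOD describes.
    """
    feats = [float(len(modalities))]          # modality count
    for name in sorted(modalities):           # fixed order for stability
        X = modalities[name]
        feats += [
            float(X.shape[1]),                # dimensionality
            float(X.mean()),                  # global mean
            float(X.var()),                   # global variance
        ]
    return np.array(feats)

def dataset_embedding(modality_embeddings):
    """Concatenate per-modality deep embeddings into one vector E_data."""
    return np.concatenate([modality_embeddings[k]
                           for k in sorted(modality_embeddings)])

rng = np.random.default_rng(0)
mods = {"video": rng.normal(size=(100, 64)),
        "flow": rng.normal(size=(100, 32))}
emb = {"video": rng.normal(size=128), "flow": rng.normal(size=128)}

meta = handcrafted_meta_features(mods)                   # 1 + 3*2 = 7 features
E_data = np.concatenate([meta, dataset_embedding(emb)])  # 7 + 256 = 263 dims
```

In practice the handcrafted block stays small and interpretable, while the learned embedding block carries most of the dimensionality.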
Detector representations
Each OOD detector (e.g., Maximum Softmax Probability, ODIN, Energy‑based, Mahalanobis, CLIP‑based methods) is encoded via a lightweight embedding ϕ(M_j) that captures its algorithmic family, scoring function, and any modality‑specific components. These embeddings are concatenated with the dataset embeddings to form the input to f.
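A lightweight detector encoding might look like the sketch below. The summary does not specify the exact form of ϕ(M_j), so the field names and one-hot scheme here are assumptions chosen for illustration.

```python
# Illustrative detector encoding; the concrete fields of ϕ(M_j) in
# M3OOD are not specified here, so these vocabularies are assumptions.
FAMILIES = ["softmax", "logit", "distance", "clip"]
SCORES = ["msp", "odin", "energy", "mahalanobis", "cosine"]

def detector_embedding(family, score, uses_multimodal_fusion):
    """One-hot encode a detector's algorithmic family and scoring
    function, plus a flag for any modality-specific components."""
    vec = [1.0 if f == family else 0.0 for f in FAMILIES]
    vec += [1.0 if s == score else 0.0 for s in SCORES]
    vec.append(1.0 if uses_multimodal_fusion else 0.0)
    return vec

phi_msp = detector_embedding("softmax", "msp", False)
phi_maha = detector_embedding("distance", "mahalanobis", False)
```

Because the meta-learner downstream is tree-based, such sparse categorical encodings are handled naturally without scaling or embedding layers.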
Training phase (offline)
Given n historical multimodal datasets and m candidate detectors, the authors compute the performance matrix P ∈ ℝ^{n×m}. They then train f to minimize the squared error between f(E_data_i, E_model_j) and P_{i,j} across all (i, j) pairs. Because XGBoost can handle heterogeneous features, the model automatically learns which meta‑features are most predictive of detector success.
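The offline phase amounts to flattening the performance matrix into supervised regression pairs. The sketch below uses synthetic representations and scikit-learn's GradientBoostingRegressor as a stand-in for XGBoost (the paper's actual choice); shapes and values are made up for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n_datasets, n_detectors = 8, 5
d_data, d_model = 16, 6

E_data = rng.normal(size=(n_datasets, d_data))     # dataset representations
E_model = rng.normal(size=(n_detectors, d_model))  # detector representations
P = rng.uniform(0.5, 1.0, size=(n_datasets, n_detectors))  # e.g. AUROC

# Flatten P into one (dataset ⊕ detector) → score example per (i, j) pair.
X = np.array([np.concatenate([E_data[i], E_model[j]])
              for i in range(n_datasets) for j in range(n_detectors)])
y = P.ravel()

# Gradient-boosted trees as a stand-in for the XGBoost regressor f.
f = GradientBoostingRegressor(random_state=0).fit(X, y)
```

With n·m training examples, the meta-learner sees every historical dataset-detector combination exactly once.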
Selection phase (online)
When a new dataset D_new arrives, its meta‑features and multimodal embedding are computed. The trained f is queried for each detector, producing predicted performance scores P̂_j. The detector with the highest P̂_j is selected and deployed, all without requiring any OOD ground‑truth labels.
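The online step is a single argmax over predicted scores. The sketch below substitutes a trivial stub for the trained regressor so it runs standalone; in practice f would be the fitted meta-learner from the offline phase.

```python
import numpy as np

class _StubMetaLearner:
    """Stand-in for the trained regressor f; its predict just sums
    features so this example runs without a real fitted model."""
    def predict(self, X):
        return X.sum(axis=1)

def select_detector(f, e_new, E_model):
    """Query f once per candidate detector on the new dataset's
    representation and pick the highest predicted score."""
    X_query = np.array([np.concatenate([e_new, e_m]) for e_m in E_model])
    scores = f.predict(X_query)
    return int(np.argmax(scores)), scores

f = _StubMetaLearner()
e_new = np.zeros(4)                                    # new dataset's features
E_model = np.array([[0.1, 0.2], [0.9, 0.1], [0.3, 0.3]])  # 3 candidate detectors
best, scores = select_detector(f, e_new, E_model)      # best → detector index 1
```

The cost is m regressor queries per new dataset, which explains the seconds-scale selection time reported in the evaluation.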
Experimental evaluation
The authors assemble 12 multimodal OOD test scenarios covering video‑audio, video‑optical‑flow, and spectrogram modalities. They evaluate 10 baseline selection strategies, including prior meta‑learning-based methods (MetaOD, ELECT, ADGym), simple heuristics (always pick MSP or ODIN), and similarity‑based approaches. Across all scenarios, M3OOD consistently outperforms the baselines, achieving average AUROC improvements of 4–7 percentage points. Selection time is on the order of seconds, demonstrating low computational overhead. Statistical tests confirm the significance of the ranking gains, and ablation studies show that both the handcrafted meta‑features and the SlowFast embeddings contribute meaningfully to performance.
Contributions
- Introduces the first meta‑learning framework for zero‑shot selection of OOD detectors in multimodal settings.
- Proposes a hybrid representation that combines domain‑specific statistical meta‑features with deep multimodal embeddings.
- Demonstrates superior selection accuracy over ten baselines on a diverse benchmark, with efficient runtime.
- Releases code and benchmark datasets to facilitate reproducibility.
Limitations and future work
The current meta‑training corpus is dominated by video and optical‑flow data; extending to text, raw audio, LiDAR, or other sensor modalities remains an open question. While XGBoost works well, exploring more expressive meta‑learners such as transformer‑based models could further improve performance. Finally, the handcrafted meta‑features rely on expert knowledge; automating their discovery via neural architecture search or self‑supervised representation learning is a promising direction.
In summary, M3OOD provides a practical, data‑driven solution for automatically picking the most appropriate OOD detector in complex multimodal environments, bridging a critical gap between OOD detection research and real‑world deployment.