Zero-shot anomaly detection (ZSAD) has gained increasing attention in medical imaging as a way to identify abnormalities without task-specific supervision, but most advances remain limited to 2D datasets. Extending ZSAD to 3D medical images has proven challenging, with existing methods relying on slice-wise features and vision-language models, which fail to capture volumetric structure. In this paper, we introduce a fully training-free framework for ZSAD in 3D brain MRI that constructs localized volumetric tokens by aggregating multi-axis slices processed by 2D foundation models. These 3D patch tokens restore cubic spatial context and integrate directly with distance-based, batch-level anomaly detection pipelines. The framework provides compact 3D representations that are practical to compute on standard GPUs and require no fine-tuning, prompts, or supervision. Our results show that training-free, batch-based ZSAD can be effectively extended from 2D encoders to full 3D MRI volumes, offering a simple and robust approach for volumetric anomaly detection.
Anomaly detection is essential in medical imaging, where early identification of abnormal structures supports diagnosis and treatment planning. Conventional unsupervised anomaly detection (UAD) methods rely on large sets of clean, domain-specific training data, which are costly to obtain for volumetric medical imaging. Zero-shot anomaly detection (ZSAD) offers an appealing alternative by removing the need for supervised training, yet most advances remain limited to 2D images (Jeong et al., 2023; Zhou et al., 2023; Chen et al., 2023; Li et al., 2024; Gia and Ahn, 2025). Extending ZSAD to 3D MRI volumes is nontrivial: there are no general-purpose 3D foundation models, and simple slice-wise features cannot capture full volumetric structure. Recent CLIP-based attempts (Marzullo et al., 2025) highlight this difficulty, showing that naive extensions of 2D ZSAD pipelines to 3D yield unstable performance.
Existing ZSAD approaches for 2D images follow two main directions. Text-based methods (Jeong et al., 2023; Zhou et al., 2023; Chen et al., 2023) use vision-language models to score abnormalities via text prompts, but often require prompt tuning or additional training to achieve satisfactory performance. Batch-based methods (Li et al., 2024; Gia and Ahn, 2025) instead operate purely on visual tokens extracted by vision transformers and exploit the intrinsic structure and statistics of these tokens across a batch of images.
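To make the text-based recipe concrete, the following is a minimal sketch of prompt-based anomaly scoring with CLIP, assuming the open_clip package; the model choice and prompt wording are illustrative, not the tuned prompts used by the cited methods.

```python
# Minimal sketch of text-based ZSAD scoring with CLIP (illustrative prompts).
import torch
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-16", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-16")

# Hypothetical normal/abnormal prompt pair; real methods use prompt ensembles.
prompts = ["a photo of a normal brain MRI slice",
           "a photo of a brain MRI slice with a lesion"]

@torch.no_grad()
def anomaly_score(image):
    """Return the softmax mass assigned to the abnormal prompt."""
    img = preprocess(image).unsqueeze(0)               # (1, 3, H, W)
    img_feat = model.encode_image(img)
    txt_feat = model.encode_text(tokenizer(prompts))
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    sims = (100.0 * img_feat @ txt_feat.T).softmax(dim=-1)
    return sims[0, 1].item()
```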
Batch-based methods exploit a key statistical observation: normal patches consistently find similar counterparts in other images, whereas anomalous patches are rare and distinctive. By performing cross-sample similarity searches, batch-based approaches isolate such outliers without any prompts or supervision. However, extending this paradigm directly to 3D MRI is not straightforward. First, there are currently no general-purpose foundation models (akin to DINOv2 or CLIP) for volumetric data. Second, volumetric images generate orders of magnitude more tokens than 2D images, so naive tokenization causes extreme memory demands and renders mutual similarity computations computationally intractable.
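The core of this cross-sample search can be written compactly. Below is a simplified PyTorch sketch of batch-level mutual scoring, where each patch token is scored by its distance to its nearest neighbors among tokens from other images; MuSc and CoDeGraph add substantial machinery on top of this (multi-level features, interval averaging, graph-based re-weighting), so the function and its details are our own illustration.

```python
import torch

def batch_mutual_scores(tokens: torch.Tensor, k: int = 5) -> torch.Tensor:
    """Score each patch token by its distance to tokens of *other* images.

    tokens: (B, N, D) -- B images, N patch tokens each, D-dim features.
    Returns (B, N) anomaly scores: mean distance to the k nearest
    cross-image neighbors. Normal patches find close matches elsewhere
    in the batch; anomalous patches do not.
    """
    B, N, D = tokens.shape
    flat = tokens.reshape(B * N, D)
    # Pairwise Euclidean distances between all tokens in the batch.
    dist = torch.cdist(flat, flat)                     # (B*N, B*N)
    # Mask out same-image comparisons so each token is scored only
    # against tokens from other images.
    owner = torch.arange(B).repeat_interleave(N)
    same = owner[:, None] == owner[None, :]
    dist.masked_fill_(same, float("inf"))
    scores = dist.topk(k, largest=False).values.mean(dim=1)
    return scores.reshape(B, N)
```

Note the (BN)^2 distance matrix: this quadratic cost is precisely why naive 3D tokenization, which inflates N by an order of magnitude or more, becomes intractable.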
In this paper, we propose a training-free batch-based ZSAD framework for 3D brain MRI to address these limitations. Our approach constructs localized volumetric tokens by aggregating multi-axis 2D foundation-model features into cubic 3D patches. This strategy preserves essential 3D spatial context while drastically reducing the number of tokens per volume, bringing the token count back into a regime where batch-based methods are feasible. To further reduce computational and memory burden, we apply a random projection to the token features, compressing their dimensionality while approximately preserving neighborhood geometry. The resulting compact 3D tokens can then be processed with standard batch-based approaches such as MuSc (Li et al., 2024) and CoDeGraph (Gia and Ahn, 2025), without any fine-tuning, prompts, or task-specific supervision.
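As a rough illustration of this pipeline, the sketch below aggregates per-slice features from three orthogonal axes into cubic patch tokens and compresses them with a Gaussian random projection. Here encode_slices is a hypothetical stand-in for a 2D foundation-model extractor (e.g., DINOv2 patch features upsampled to voxel resolution), and the cube size and output dimension are illustrative, not the paper's exact settings.

```python
import torch
import torch.nn.functional as F

def volumetric_tokens(vol, encode_slices, cube=8, out_dim=128, seed=0):
    """Build cubic 3D patch tokens from multi-axis 2D features (a sketch).

    vol: (X, Y, Z) MRI volume.
    encode_slices: hypothetical callable mapping a stack of 2D slices
        (S, H, W) to dense per-pixel features (S, H, W, D).
    Returns (num_patches, out_dim) compressed 3D patch tokens.
    """
    feats = []
    for axis in range(3):                        # axial, coronal, sagittal
        slices = vol.movedim(axis, 0)            # (S, H, W)
        f = encode_slices(slices)                # (S, H, W, D)
        feats.append(f.movedim(0, axis))         # back to (X, Y, Z, D)
    f3d = torch.stack(feats).mean(dim=0)         # fuse the three axes

    # Average-pool per-voxel features into cube^3 patches.
    f3d = f3d.permute(3, 0, 1, 2).unsqueeze(0)   # (1, D, X, Y, Z)
    pooled = F.avg_pool3d(f3d, kernel_size=cube)
    tokens = pooled.flatten(2).squeeze(0).T      # (num_patches, D)

    # Gaussian random projection: compress D -> out_dim while roughly
    # preserving pairwise distances (Johnson-Lindenstrauss).
    g = torch.Generator().manual_seed(seed)
    D = tokens.shape[1]
    R = torch.randn(D, out_dim, generator=g) / out_dim ** 0.5
    return tokens @ R
```

The projected tokens can then be fed directly to a batch-level scorer such as the mutual-scoring sketch above.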
Our main contributions are:
• We introduce the first practical batch-based ZSAD framework for 3D brain MRI, extending the fully training-free paradigm from 2D images to volumetric data.
• We propose a multi-axis volumetric tokenization and random projection pipeline that preserves cubic spatial context while enabling tractable mutual similarity computations for 3D volumes.
• Through extensive experiments, we show that our method outperforms representative CLIP-based ZSAD baselines and, in some cases, matches or exceeds the performance of supervised methods.
Anomaly Detection in Brain MRI. Most unsupervised anomaly detection methods for brain MRI rely on reconstruction-based models, such as Autoencoders (Atlason et al., 2019; Baur et al., 2021; Cai et al., 2024), VQ-VAEs (Pinaya et al., 2022b), GANs (Schlegl et al., 2019), or diffusion models (Pinaya et al., 2022a; Wu et al., 2024), that must be trained on large collections of normal 3D MRI volumes to learn a representation of healthy anatomy. These approaches operate on the assumption that a model trained exclusively on healthy data will fail to accurately reconstruct unseen pathological features, allowing anomalies to be segmented via the residual error. While effective in controlled settings, their behavior depends heavily on the specific distribution of the training set. Consequently, these models are often brittle when deployed across different scanners or acquisition protocols (domain shift).
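For reference, the residual-error heuristic these methods share can be sketched in a few lines, assuming a pretrained autoencoder ae trained only on healthy volumes is given:

```python
import torch

@torch.no_grad()
def residual_anomaly_map(ae, vol):
    """Residual-error scoring used by reconstruction-based UAD (a sketch).

    ae: autoencoder trained exclusively on healthy data (assumed given).
    vol: (1, 1, X, Y, Z) normalized MRI volume.
    Returns a per-voxel anomaly map: large residuals flag regions the
    healthy-data model cannot reconstruct.
    """
    recon = ae(vol)
    return (vol - recon).abs().squeeze()
```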
Zero-Shot Anomaly Detection. To eliminate the need for training data, Zero-Shot Anomaly Detection (ZSAD) leverages representations from pre-trained foundation models. The dominant approach aligns visual features with textual descriptions of normality and pathology using vision-language models like CLIP (Jeong et al., 2023; Zhou et al., 2023; Chen et al., 2023). However, applying this strategy to medical imaging is problematic due to the significant domain gap and the difficulty of crafting robust clinical text prompts (Marzullo et al., 2025). A parallel, emerging direction is batch-based ZSAD (Li et al., 2024; Gia and Ahn, 2025), which dispenses with text prompts and instead scores visual tokens against one another across a batch; our framework extends this direction to 3D volumes.