PromptMAD: Cross-Modal Prompting for Multi-Class Visual Anomaly Localization
Visual anomaly detection in multi-class settings poses significant challenges due to the diversity of object categories, the scarcity of anomalous examples, and the presence of camouflaged defects. In this paper, we propose PromptMAD, a cross-modal prompting framework for unsupervised visual anomaly detection and localization that integrates semantic guidance through vision-language alignment. By leveraging CLIP-encoded text prompts describing both normal and anomalous class-specific characteristics, our method enriches visual reconstruction with semantic context, improving the detection of subtle and textural anomalies. To further address the challenge of class imbalance at the pixel level, we incorporate a focal loss, which emphasizes hard-to-detect anomalous regions during training. Our architecture also includes a supervised segmentor that fuses multi-scale convolutional features with Transformer-based spatial attention and iterative diffusion refinement, yielding precise and high-resolution anomaly maps. Extensive experiments on the MVTec-AD dataset demonstrate that our method achieves state-of-the-art pixel-level performance, improving mean AUC to 98.35% and AP to 66.54%, while maintaining efficiency across diverse categories.
💡 Research Summary
PromptMAD tackles the challenging problem of unsupervised visual anomaly detection and localization in a multi‑class industrial setting. While existing reconstruction‑based methods such as OneNIP can detect anomalies by measuring reconstruction errors, they often struggle with subtle, camouflaged defects and suffer from pixel‑level class imbalance. PromptMAD introduces three key innovations to overcome these limitations.
First, it enriches the visual pipeline with cross‑modal textual prompts. For each product category, human‑crafted descriptions of normal appearance and possible defect types (e.g., “bent lead”, “crack”) are encoded using the frozen CLIP text encoder. The resulting 256‑dimensional embeddings are fused element‑wise with visual prompt embeddings derived from a single normal reference image. This semantic guidance leverages CLIP’s pre‑trained vision‑language alignment, allowing the model to differentiate between genuine anomalies and benign texture variations, especially in texture‑rich categories.
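The element-wise fusion step can be sketched as follows. This is a minimal NumPy illustration: the summary only states that the 256-dimensional text and visual prompt embeddings are "fused element-wise", so element-wise addition is assumed here, and `fuse_prompts` is a hypothetical name, not the authors' API.

```python
import numpy as np

def fuse_prompts(text_emb: np.ndarray, visual_emb: np.ndarray) -> np.ndarray:
    """Element-wise fusion of a CLIP text embedding with a visual
    prompt embedding derived from one normal reference image.

    Both embeddings are 256-dimensional per the summary; addition is
    an assumption (the exact element-wise operator is not specified).
    """
    assert text_emb.shape == visual_emb.shape == (256,)
    return text_emb + visual_emb

# Illustrative usage with dummy embeddings
text_emb = np.ones(256)          # stand-in for a CLIP-encoded prompt
visual_emb = np.full(256, 2.0)   # stand-in for the visual prompt
fused = fuse_prompts(text_emb, visual_emb)
```

The fused embedding then conditions the reconstruction pipeline, so each visual token can attend to class-specific semantic context.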
Second, PromptMAD replaces the uniform reconstruction loss (MSE) and Dice loss with a focal loss (α = 0.75, γ = 2). By down‑weighting well‑classified (easy) pixels and amplifying the loss for hard, low‑confidence pixels, the focal loss forces the network to focus on the sparse anomalous regions that drive the pixel‑level evaluation metrics.
Third, the authors design a text‑guided diffusion segmentor that refines the raw reconstruction error map. The segmentor consists of a multi‑scale residual CNN for local feature extraction, a Transformer encoder for global context modeling, and a 10‑step diffusion denoiser. The diffusion process follows a linear β schedule (1e‑4 → 0.02) and incorporates sinusoidal timestep embeddings into residual blocks. Crucially, the CLIP‑encoded text embedding is injected as a conditioning signal, improving boundary precision and average precision for small or camouflaged defects.
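The schedule and timestep embedding described above follow standard DDPM conventions; a minimal sketch under that assumption (function names are illustrative, not from the paper):

```python
import numpy as np

def linear_beta_schedule(steps=10, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule from 1e-4 to 0.02 over the 10 denoising
    steps reported in the summary."""
    return np.linspace(beta_start, beta_end, steps)

def sinusoidal_timestep_embedding(t, dim=128):
    """Standard sinusoidal embedding of an integer timestep t, as used
    in DDPM-style residual blocks (dim=128 is an assumed width)."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    angles = t * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

betas = linear_beta_schedule()
emb = sinusoidal_timestep_embedding(t=3)
```

In the segmentor, an embedding like `emb` would be added inside each residual block, while the CLIP text embedding enters as a separate conditioning signal.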
The backbone is EfficientNet‑B4, and the bidirectional Transformer decoder processes tokenized visual features together with the cross‑modal prompts. Training uses AdamW (lr = 1e‑4, weight decay = 1e‑5) for 1,500 epochs on four NVIDIA A100 GPUs.
Experiments on the MVTec‑AD benchmark (15 classes, 3,629 normal training images, 1,725 test images) show that PromptMAD achieves state‑of‑the‑art pixel‑level performance: mean pixel‑AUC improves from 97.81 % to 98.35 % and pixel‑AP from 63.52 % to 66.54 %. Notable gains appear in texture‑heavy categories such as pill, hazelnut, and tile, where semantic prompts help disambiguate defects from normal variations. The focal loss especially benefits sparse‑defect classes like bottle and wood. Image‑level metrics also see modest improvements, with maximum AUC rising from 97.20 % to 99.23 %.
Ablation studies confirm that each component contributes positively: the text‑guided segmentor alone yields modest gains; cross‑modal prompts alone improve texture categories; focal loss alone boosts rare defects but slightly hurts image‑level scores; combining prompts with the segmentor yields the largest synergy; the full PromptMAD model attains the best results.
In terms of efficiency, PromptMAD processes a test image in 5.2 ms (≈193 FPS), only slightly slower than OneNIP’s 4.5 ms (≈219 FPS), demonstrating that the added semantic and diffusion modules do not compromise real‑time applicability.
Limitations include the reliance on manually crafted textual prompts and occasional false positives on extremely thin defects. Future work may explore automatic prompt generation via large language models, lightweight diffusion architectures, and deployment on edge devices. Overall, PromptMAD presents a compelling integration of vision‑language alignment, loss re‑balancing, and diffusion refinement to achieve precise, multi‑class anomaly localization in industrial inspection scenarios.