LADMIM: Logical Anomaly Detection with Masked Image Modeling in Discrete Latent Space
Detecting anomalies such as an incorrect combination of objects or deviations in their positions is a challenging problem in unsupervised anomaly detection (AD). Since conventional AD methods mainly focus on local patterns of normal images, they struggle to detect logical anomalies that appear in global patterns. To effectively detect these challenging logical anomalies, we introduce Logical Anomaly Detection with Masked Image Modeling (LADMIM), a novel unsupervised AD framework that harnesses masked image modeling and discrete representation learning. Our core insight is that predicting a missing region forces the model to learn long-range dependencies between patches. Specifically, we formulate AD as a mask completion task that predicts the distribution of discrete latents in the masked region. Because a distribution of discrete latents is invariant to low-level variation in pixel space, the model can focus on the logical dependencies in the image, which improves accuracy in logical AD. We evaluate AD performance on five benchmarks and show that our approach achieves comparable performance without any pre-trained segmentation models. We also conduct comprehensive experiments to reveal the key factors that influence logical AD performance.
💡 Research Summary
The paper introduces LADMIM, a novel unsupervised anomaly-detection framework that simultaneously addresses structural and logical defects in industrial images. Structural anomalies are local irregularities such as scratches or stains, while logical anomalies involve incorrect relationships among multiple objects (misalignment, wrong combinations). Existing unsupervised methods focus on local feature reconstruction or memory-bank similarity, which works well for structural defects but struggles with global relational inconsistencies.
LADMIM combines two recent advances: (1) Hierarchical Vector‑Quantized Transformer (HVQ‑Trans) and (2) Masked Image Modeling (MIM) with a Vision Transformer (ViT). HVQ‑Trans learns a hierarchical codebook that quantizes feature maps into discrete latent tokens at multiple granularities. This hierarchical quantization mitigates the “dead‑code” problem of flat VQ‑VAE, ensures balanced code usage, and provides a rich token vocabulary that serves both as a reconstruction target for structural anomaly detection and as a tokenizer for the MIM stage.
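The tokenization step above rests on vector quantization: each patch feature is mapped to the index of its nearest codebook entry, turning a continuous feature map into discrete tokens. The following is a minimal sketch of that flat quantization step (HVQ-Trans additionally organizes codes hierarchically, which this toy version omits); all shapes and values here are illustrative, not from the paper.

```python
import numpy as np

def quantize(features, codebook):
    """Map each patch feature to the index of its nearest codebook entry.

    features: (N, D) array of patch features
    codebook: (K, D) array of code vectors
    returns:  (N,) array of discrete token indices
    """
    # Squared Euclidean distance between every feature and every code.
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)

# Toy example: 6 patch features built from 4 codes plus small noise,
# so quantization should recover the generating code indices.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(4, 8))
features = codebook[[0, 0, 2, 3, 1, 2]] + 0.01 * rng.normal(size=(6, 8))
tokens = quantize(features, codebook)
```

In the full framework these token indices serve both as reconstruction targets for structural AD and as the vocabulary over which the MIM stage predicts distributions.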
During training, a random subset of image patches is masked. Instead of predicting raw pixels (which leads to blurry reconstructions), the ViT is trained to predict the probability distribution (histogram) of the discrete latent tokens within the masked region, as supplied by HVQ‑Trans. Because the target is a distribution over unordered tokens, the prediction is invariant to the exact spatial arrangement of features, forcing the model to capture long‑range dependencies and object‑level relationships. Structural anomalies are detected by measuring the reconstruction error of HVQ‑Trans (the difference between original and reconstructed quantized features). Logical anomalies are detected by measuring the divergence between the predicted token histogram and the ground‑truth histogram from HVQ‑Trans. The two scores are linearly combined to produce a final anomaly score, eliminating the need for any pre‑trained segmentation network.
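The logical score described above compares a predicted token distribution against the ground-truth histogram over the masked region. A minimal sketch, assuming KL divergence as the histogram distance and an equal-weight linear combination (both choices are illustrative; the paper's exact divergence and weighting may differ):

```python
import numpy as np

def token_histogram(tokens, vocab_size):
    """Normalized histogram of discrete tokens over a (masked) region."""
    counts = np.bincount(tokens, minlength=vocab_size).astype(float)
    return counts / counts.sum()

def kl_divergence(p, q, eps=1e-8):
    """KL(p || q), with a small epsilon for numerical stability."""
    p = p + eps
    q = q + eps
    return float((p * np.log(p / q)).sum())

# Ground-truth tokens from the tokenizer in the masked region (toy values).
gt_tokens = np.array([0, 0, 1, 2, 2, 2])
gt_hist = token_histogram(gt_tokens, vocab_size=4)

# A hypothetical predicted distribution from the ViT prediction head.
pred_hist = np.array([0.30, 0.20, 0.45, 0.05])
logical_score = kl_divergence(pred_hist, gt_hist)

# Final anomaly score: linear combination with the structural score.
structural_score = 0.1  # placeholder for the HVQ-Trans reconstruction error
alpha = 0.5             # assumed weighting, not specified in this summary
final_score = alpha * structural_score + (1 - alpha) * logical_score
```

Because the histogram discards the spatial order of tokens, a prediction is penalized only when the *composition* of the masked region is wrong, which is exactly the invariance the summary attributes to the distribution target.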
The authors evaluate LADMIM on five benchmarks, notably MVTec LOCO (which contains both structural and logical anomalies) and MVTec AD (purely structural). Compared with prior MIM-based methods, reconstruction-based autoencoders, and memory-bank approaches, LADMIM achieves state-of-the-art AUROC on logical anomaly detection (improvements of 3–5 percentage points) while remaining competitive on structural detection. Ablation studies reveal that (i) predicting token histograms rather than token indices is crucial for positional invariance, (ii) a moderate masking ratio (≈25%) balances context and difficulty, and (iii) larger hierarchical codebooks improve both reconstruction fidelity and MIM prediction accuracy. Importantly, because HVQ-Trans prevents the identity shortcut common in autoencoders, the framework scales to larger ViT backbones without degrading performance.
In summary, LADMIM’s contributions are threefold: (1) introducing distribution‑based MIM targets to capture global logical relations, (2) leveraging hierarchical vector quantization as a unified token source for both structural reconstruction and logical prediction, and (3) providing a fully unsupervised pipeline that outperforms existing methods on logical anomaly detection without relying on external segmentation models. The work opens avenues for future research on more complex relational anomalies and large‑scale industrial deployment.