FiLo++: Zero-/Few-Shot Anomaly Detection by Fused Fine-Grained Descriptions and Deformable Localization
Anomaly detection methods typically require extensive normal samples from the target class for training, limiting their applicability in scenarios that require rapid adaptation, such as cold start. Zero-shot and few-shot anomaly detection do not require labeled samples from the target class in advance, making them a promising research direction. Existing zero-shot and few-shot approaches often leverage powerful multimodal models to detect and localize anomalies by comparing image-text similarity. However, their handcrafted generic descriptions fail to capture the diverse range of anomalies that may emerge in different objects, and simple patch-level image-text matching often struggles to localize anomalous regions of varying shapes and sizes. To address these issues, this paper proposes the FiLo++ method, which consists of two key components. The first component, Fused Fine-Grained Descriptions (FusDes), utilizes large language models to generate anomaly descriptions for each object category, combines both fixed and learnable prompt templates and applies a runtime prompt filtering method, producing more accurate and task-specific textual descriptions. The second component, Deformable Localization (DefLoc), integrates the vision foundation model Grounding DINO with position-enhanced text descriptions and a Multi-scale Deformable Cross-modal Interaction (MDCI) module, enabling accurate localization of anomalies with various shapes and sizes. In addition, we design a position-enhanced patch matching approach to improve few-shot anomaly detection performance. Experiments on multiple datasets demonstrate that FiLo++ achieves significant performance improvements compared with existing methods. Code will be available at https://github.com/CASIA-IVA-Lab/FiLo.
💡 Research Summary
FiLo++ tackles the practical yet challenging problem of zero‑shot and few‑shot anomaly detection (ZSAD/FSAD), where only a handful or no normal samples from the target class are available. Existing ZSAD/FSAD approaches typically rely on large multimodal models such as CLIP and use handcrafted, generic text prompts like “normal” vs. “abnormal”. This leads to two major shortcomings: (1) the prompts are too coarse to capture the wide variety of defect types that can appear on different objects, and (2) simple patch‑level image‑text similarity matching fails to localize anomalies that span multiple patches or have irregular shapes.
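The patch-level image-text matching that FiLo++ improves upon can be sketched as follows. This is a minimal illustration, not the paper's implementation: the random unit vectors stand in for real CLIP patch and text embeddings, and the temperature of 100 mirrors CLIP's logit scale.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def patch_anomaly_map(patch_feats, text_feats):
    """CLIP-style zero-shot anomaly localization sketch.

    patch_feats: (H*W, D) L2-normalized patch embeddings.
    text_feats:  (2, D) L2-normalized embeddings of the
                 "normal" and "abnormal" prompts.
    Returns per-patch anomaly probability, shape (H*W,).
    """
    sims = patch_feats @ text_feats.T       # (H*W, 2) cosine similarities
    probs = softmax(100.0 * sims, axis=-1)  # CLIP-like temperature scaling
    return probs[:, 1]                      # probability of "abnormal"

# Toy example with random unit vectors standing in for encoder outputs.
rng = np.random.default_rng(0)
patches = rng.normal(size=(49, 512))
patches /= np.linalg.norm(patches, axis=1, keepdims=True)
texts = rng.normal(size=(2, 512))
texts /= np.linalg.norm(texts, axis=1, keepdims=True)
amap = patch_anomaly_map(patches, texts)
```

Because each patch is scored independently against the two generic prompts, an anomaly spanning several patches or with an irregular shape gets no spatially coherent treatment, which is exactly the weakness DefLoc targets.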
FiLo++ introduces two complementary modules to overcome these limitations: Fused Fine‑Grained Descriptions (FusDes) and Deformable Localization (DefLoc).
FusDes leverages Large Language Models (LLMs) (e.g., GPT‑4) to generate detailed, category‑specific anomaly descriptions for each test sample. These fine‑grained texts replace the generic prompts. FusDes then combines three elements: (a) fixed, human‑crafted prompt templates (e.g., “A photo of …”), (b) learnable prompt templates tuned during training, and (c) a runtime prompt filtering step that keeps only the descriptions best suited to the current image, producing more accurate and task‑specific text.
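A minimal sketch of how fixed templates and LLM-generated, category-specific anomaly phrases might be combined into a prompt set; the template strings here are illustrative placeholders, not the paper's exact wording, and the learnable templates and runtime filtering are omitted:

```python
def build_prompts(category, anomaly_phrases):
    """Combine fixed templates with fine-grained anomaly phrases.

    category: object class name, e.g. "bottle".
    anomaly_phrases: LLM-generated defect descriptions for this class.
    Returns (normal_prompts, abnormal_prompts).
    """
    normal_templates = [
        "a photo of a flawless {c}.",
        "a photo of a perfect {c}.",
    ]
    abnormal_templates = [
        "a photo of a {c} with {a}.",
        "a photo of a damaged {c} showing {a}.",
    ]
    normal = [t.format(c=category) for t in normal_templates]
    abnormal = [t.format(c=category, a=a)
                for t in abnormal_templates for a in anomaly_phrases]
    return normal, abnormal

# Example: fine-grained phrases replace a single generic "abnormal" prompt.
normal, abnormal = build_prompts("bottle", ["a crack", "a broken rim"])
```

Each abnormal prompt now names a concrete defect type for the specific object category, so text-image similarity can discriminate far finer-grained failure modes than a bare "abnormal" label.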