FSOD-VFM: Few-Shot Object Detection with Vision Foundation Models and Graph Diffusion


In this paper, we present FSOD-VFM: Few-Shot Object Detectors with Vision Foundation Models, a framework that leverages vision foundation models to tackle the challenge of few-shot object detection. FSOD-VFM integrates three key components: a universal proposal network (UPN) for category-agnostic bounding box generation, SAM2 for accurate mask extraction, and DINOv2 features for efficient adaptation to new object categories. Despite the strong generalization capabilities of foundation models, the bounding boxes generated by UPN often suffer from overfragmentation, covering only partial object regions and leading to numerous small, false-positive proposals rather than accurate, complete object detections. To address this issue, we introduce a novel graph-based confidence reweighting method. In our approach, predicted bounding boxes are modeled as nodes in a directed graph, with graph diffusion operations applied to propagate confidence scores across the network. This reweighting process refines the scores of proposals, assigning higher confidence to whole objects and lower confidence to local, fragmented parts. This strategy improves detection granularity and effectively reduces the occurrence of false-positive bounding box proposals. Through extensive experiments on Pascal-5$^i$, COCO-20$^i$, and CD-FSOD datasets, we demonstrate that our method substantially outperforms existing approaches, achieving superior performance without requiring additional training. Notably, on the challenging CD-FSOD dataset, which spans multiple datasets and domains, our FSOD-VFM achieves 31.6 AP in the 10-shot setting, substantially outperforming previous training-free methods that reach only 21.4 AP. Code is available at: https://intellindust-ai-lab.github.io/projects/FSOD-VFM.


💡 Research Summary

FSOD‑VFM introduces a training‑free few‑shot object detection framework that directly leverages three state‑of‑the‑art vision foundation models (VFMs): a Universal Proposal Network (UPN) for class‑agnostic bounding‑box generation, SAM2 for high‑quality mask extraction, and DINOv2 for rich feature representation. The pipeline works as follows. In a K‑shot setting, each support annotation is processed by SAM2 to obtain a binary mask, and DINOv2 extracts a dense feature map from the whole image. RoI‑based pooling, guided by the mask, yields an object‑specific feature vector. All support features belonging to the same class are averaged and ℓ₂‑normalized to form class prototypes. For a query image, UPN proposes a large set of boxes together with a class‑agnostic confidence score (s_upn). Each proposal is also masked by SAM2 and encoded by DINOv2, producing a feature vector that is compared to the class prototypes via cosine similarity to obtain a class prediction and a matching score. At this stage the system already works without any fine‑tuning, but UPN’s proposals suffer from “over‑fragmentation”: many small boxes cover only parts of objects, inflating false positives.
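The prototype-and-matching stage described above can be sketched in NumPy. The helper names (`masked_pool`, `build_prototypes`, `classify_proposal`) are illustrative, and plain arrays stand in for real SAM2 masks and DINOv2 feature maps; this is a minimal sketch of the idea, not the authors' implementation:

```python
import numpy as np

def masked_pool(feature_map, mask):
    """Pool a dense (H, W, D) feature map over a binary (H, W) mask —
    a stand-in for mask-guided RoI pooling on DINOv2 features."""
    if mask.sum() == 0:                     # empty mask: fall back to global mean
        return feature_map.mean(axis=(0, 1))
    return feature_map[mask].mean(axis=0)   # average features inside the mask

def build_prototypes(support_features):
    """Average the K support vectors of each class, then l2-normalize."""
    prototypes = {}
    for cls, feats in support_features.items():
        proto = np.mean(feats, axis=0)
        prototypes[cls] = proto / (np.linalg.norm(proto) + 1e-8)
    return prototypes

def classify_proposal(feature, prototypes):
    """Match one proposal feature against every class prototype by cosine
    similarity; return the best class and its matching score."""
    f = feature / (np.linalg.norm(feature) + 1e-8)
    scores = {cls: float(f @ p) for cls, p in prototypes.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]
```

The matching score returned here is the cosine similarity that, combined with UPN's class-agnostic score s_upn, drives the reweighting stage.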

To mitigate this, the authors construct a directed graph whose nodes correspond to the proposals of a single class. An edge from node i to node j exists only when s_upn_i ≤ s_upn_j; its weight is the fraction of i’s mask that is overlapped by j’s mask (Area(M_i ∩ M_j) / Area(M_i)). This design treats low‑confidence, fragmented boxes as sources that diffuse their confidence toward the larger, more reliable boxes that contain them, which act as sinks. Graph diffusion is then performed iteratively (the paper reports 30 steps as sufficient for convergence). After diffusion, the confidence scores of fragmented proposals drop dramatically, while the scores of boxes that tightly overlap ground‑truth objects (IoU > 0.75) remain stable. Visualizations in Figure 1 illustrate the progressive suppression of translucent, low‑quality boxes.
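Under the edge rule above, the reweighting can be sketched as follows. The summary does not give the exact diffusion update, so the transfer rule below — at each step, a node with outgoing edges yields a fraction `alpha` of its score to the boxes that cover it — is an illustrative assumption, and `build_diffusion_graph` / `diffuse_confidence` are hypothetical names:

```python
import numpy as np

def build_diffusion_graph(masks, scores):
    """Directed edge i -> j exists only when scores[i] <= scores[j];
    its weight is the fraction of mask i covered by mask j:
    Area(M_i ∩ M_j) / Area(M_i). Masks are boolean arrays (from SAM2)."""
    n = len(masks)
    areas = [float(m.sum()) for m in masks]
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j and scores[i] <= scores[j] and areas[i] > 0:
                W[i, j] = np.logical_and(masks[i], masks[j]).sum() / areas[i]
    return W

def diffuse_confidence(scores, W, steps=30, alpha=0.5):
    """Assumed update rule: each node with outgoing edges sends a fraction
    alpha of its score along those edges (split in proportion to their
    weights), so fragments nested inside stronger boxes bleed confidence
    into them while top-scoring whole boxes only accumulate."""
    s = np.asarray(scores, dtype=float).copy()
    row_sum = W.sum(axis=1)
    has_out = row_sum > 0
    # Row-normalize outgoing weights into a transition matrix P.
    P = np.divide(W, row_sum[:, None], out=np.zeros_like(W),
                  where=row_sum[:, None] > 0)
    for _ in range(steps):
        sent = alpha * s * has_out      # confidence leaving each fragment
        s = s - sent + P.T @ sent       # redistributed to covering boxes
    return s
```

For example, with a small fragment mask fully inside a whole-object mask (say scores 0.6 and 0.9), the fragment's score decays geometrically toward zero over the iterations while the whole box absorbs it — matching the suppression behavior described above.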

The method is evaluated on three benchmarks: Pascal‑5ⁱ, COCO‑20ⁱ, and the cross‑domain CD‑FSOD dataset (which aggregates six diverse datasets). Across all splits, FSOD‑VFM outperforms prior training‑free approaches and narrows the gap to fully supervised few‑shot detectors. Notably, on CD‑FSOD in the 10‑shot setting, the model achieves 31.6 AP, a ten‑point gain over the previous best training‑free result of 21.4 AP. Ablation studies confirm that (1) the baseline UPN + SAM2 + DINOv2 pipeline already provides competitive performance, (2) adding graph diffusion yields an additional 4–6 points of absolute AP, and (3) the diffusion hyper‑parameters (edge weighting scheme, number of steps) are robust within a reasonable range.

Key contributions are: (1) a fully training‑free few‑shot detector that unifies three complementary VFMs, (2) a novel graph‑diffusion confidence reweighting mechanism that explicitly addresses over‑fragmentation in class‑agnostic proposals, and (3) extensive cross‑domain validation demonstrating the approach’s generality. By eliminating the need for any task‑specific fine‑tuning, FSOD‑VFM opens the door to rapid deployment of few‑shot detection in resource‑constrained or rapidly changing environments such as autonomous driving, robotics, and medical imaging, where annotated data are scarce and time‑critical. Future work may explore richer graph constructions (e.g., incorporating semantic embeddings or temporal consistency) and extending the diffusion idea to other VFM‑based tasks like instance segmentation or video object tracking.
