A Survey on Class-Agnostic Counting: Advancements from Reference-Based to Open-World Text-Guided Approaches
Visual object counting has recently shifted towards class-agnostic counting (CAC), which addresses the challenge of counting objects across arbitrary categories, a crucial capability for flexible and generalizable counting systems. Unlike humans, who effortlessly identify and count objects from diverse categories without prior knowledge, most existing counting methods are restricted to enumerating instances of known classes, requiring extensive labeled datasets for training and struggling in open-vocabulary settings. In contrast, CAC aims to count objects belonging to classes never seen during training, operating in a few-shot setting. In this paper, we present the first comprehensive review of CAC methodologies. We propose a taxonomy to categorize CAC approaches into three paradigms based on how target object classes can be specified: reference-based, reference-less, and open-world text-guided. Reference-based approaches achieve state-of-the-art performance by relying on exemplar-guided mechanisms. Reference-less methods eliminate exemplar dependency by leveraging inherent image patterns. Finally, open-world text-guided methods use vision-language models, enabling object class descriptions via textual prompts, offering a flexible and promising solution. Based on this taxonomy, we provide an overview of 30 CAC architectures and report their performance on gold-standard benchmarks, discussing key strengths and limitations. Specifically, we present results on the FSC-147 dataset, setting a leaderboard using gold-standard metrics, and on the CARPK dataset to assess generalization capabilities. Finally, we offer a critical discussion of persistent challenges, such as annotation dependency and generalization, alongside future directions.
💡 Research Summary
This paper presents the first comprehensive survey of class‑agnostic counting (CAC), a rapidly emerging subfield of visual object counting that seeks to estimate the number of objects belonging to arbitrary, previously unseen categories. The authors begin by contrasting CAC with traditional class‑specific counting, highlighting that conventional methods require a dedicated model and large annotated datasets for each target class, whereas CAC aims to operate in a few‑shot or zero‑shot regime, dramatically reducing annotation burden and improving flexibility.
A central contribution of the work is a taxonomy that classifies CAC approaches according to how the target class is specified: (1) Reference‑based (exemplar‑guided) methods, which receive a few bounding‑box exemplars of the target objects at both training and inference; (2) Reference‑less (exemplar‑free) methods, which automatically infer the dominant repeatable object class from image self‑similarity; and (3) Open‑world text‑guided (prompt‑based) methods, which use natural‑language descriptions together with large vision‑language models (e.g., CLIP, BLIP) to define the counting target. The taxonomy reflects the evolution of CAC from heavily supervised, human‑dependent pipelines toward increasingly autonomous and semantically controllable systems.
The survey systematically reviews 30 representative CAC architectures, organizing them under the three paradigms. For reference‑based approaches, seminal works such as FamNet, RepRPN, R‑C‑C, and the recent CAC‑ViT are discussed; these models typically employ a CNN or Vision Transformer backbone, a prototype‑matching module, and a density‑map regression head. Reference‑less techniques like GCNet, ZSC, VL‑Counter, and GroundingREC rely on self‑similarity cues, multi‑scale feature aggregation, and clustering to discover the most repetitive visual pattern, thereby eliminating the need for exemplars. Text‑guided methods—including CLIP‑Count, V‑Count, and other CLIP‑based pipelines—leverage cross‑modal embeddings to align textual prompts with visual features, enabling zero‑shot counting of arbitrary categories.
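The exemplar-guided pipeline shared by most reference-based models can be illustrated with a toy sketch: pool an exemplar crop into a prototype vector, compare it against every spatial location of the image feature map, and read the count off the integral of a density map. The shapes, names, and the thresholding stand-in for the regression head below are purely illustrative and do not come from any specific published model.

```python
import numpy as np

def prototype_similarity(feat_map, prototype):
    """Cosine similarity between each spatial feature and an exemplar prototype.

    feat_map : (H, W, C) image feature map (hypothetical backbone output)
    prototype: (C,) pooled exemplar feature
    Returns an (H, W) similarity map, the core of exemplar-guided matching.
    """
    norms = np.linalg.norm(feat_map, axis=-1) * np.linalg.norm(prototype) + 1e-8
    return feat_map @ prototype / norms

def count_from_density(density_map):
    """In density-map regression, the predicted count is the map's integral."""
    return float(density_map.sum())

# Toy demo: plant three feature vectors that match the prototype exactly.
rng = np.random.default_rng(0)
proto = np.zeros(64)
proto[0] = 1.0
feat = rng.normal(scale=0.05, size=(8, 8, 64))
for r, c in [(1, 1), (4, 5), (6, 2)]:
    feat[r, c] = proto  # three "objects" in the scene

sim = prototype_similarity(feat, proto)
# A real model regresses a smooth density map from sim; thresholding here
# is only a stand-in for that learned regression head.
density = (sim > 0.9).astype(float)
print(count_from_density(density))  # -> 3.0
```

Real architectures differ mainly in how they build `sim` (learned correlation, cross-attention) and in the regression head, but the count-as-integral readout is common to all density-map methods.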
Performance evaluation is conducted on two benchmark datasets. FSC‑147, introduced alongside the first few‑shot CAC formulation, contains over 6,000 images spanning 147 categories and provides point‑level annotations as well as exemplar boxes. The authors report mean absolute error (MAE), root‑mean‑square error (RMSE), and normalized absolute error (NAE) for each method, establishing a leaderboard. Reference‑based models achieve the best scores, with CAC‑ViT attaining the lowest MAE (≈6 % improvement over earlier baselines). CARPK, a vehicle‑counting dataset, is used to assess cross‑domain generalization. While reference‑based methods still perform best, text‑guided approaches demonstrate promising zero‑shot transfer, especially when prompts are carefully crafted.
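The three leaderboard metrics are straightforward to compute from per-image counts; a minimal NumPy version (the function name and toy inputs are illustrative, not from the surveyed benchmarks) is:

```python
import numpy as np

def counting_metrics(pred, gt):
    """MAE, RMSE, and NAE as reported on FSC-147-style leaderboards.

    pred, gt: arrays of per-image predicted / ground-truth counts.
    NAE divides each absolute error by the true count, so a miss of 2
    on a sparse image weighs more than a miss of 2 on a crowded one.
    """
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    err = np.abs(pred - gt)
    mae = err.mean()
    rmse = np.sqrt(((pred - gt) ** 2).mean())
    nae = (err / np.maximum(gt, 1e-8)).mean()
    return mae, rmse, nae

mae, rmse, nae = counting_metrics([10, 52, 7], [12, 50, 7])
print(round(mae, 3), round(rmse, 3), round(nae, 3))  # -> 1.333 1.633 0.069
```

MAE and RMSE dominate most leaderboards; NAE is the fairer cross-density comparison because raw MAE is inflated by a handful of very crowded images.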
The survey identifies several persistent challenges. First, CAC still depends on dual annotation types (point locations and exemplar boxes), which is costly and limits dataset diversity. Second, many methods over‑fit to the visual distribution of FSC‑147, leading to degraded performance on out‑of‑distribution scenes such as aerial or crowded surveillance footage. Third, uncertainty estimation is largely absent; most models produce deterministic density maps without confidence intervals, hindering deployment in safety‑critical applications. Fourth, current systems typically count a single class per image; extending to simultaneous multi‑class counting remains an open problem.
To address these issues, the authors outline future research directions: (i) Label‑efficient learning, including semi‑supervised, self‑training, and synthetic data generation (e.g., diffusion‑based image‑point pair synthesis); (ii) Hybrid prompting, combining textual descriptions with a few visual exemplars to improve robustness of text‑guided models; (iii) Probabilistic modeling, integrating Bayesian deep learning or Monte‑Carlo dropout to provide uncertainty estimates; (iv) Domain adaptation and continual learning, leveraging meta‑learning or domain randomization to enhance cross‑domain robustness; and (v) Multi‑class and temporal counting, extending CAC to video streams where objects of several categories appear and evolve over time.
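Direction (iii) above is the easiest to prototype: keeping dropout active at test time and averaging several stochastic passes yields a predictive mean and a spread that can serve as a confidence estimate. The linear "counter" below is a toy stand-in for a real density-map network; all names, weights, and shapes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

def count_with_dropout(image_feat, weights, drop_p=0.2):
    """One stochastic forward pass with dropout left active at test time.

    A toy linear model standing in for a density-map regressor; the
    inverted-dropout scaling keeps the expected output unchanged.
    """
    mask = rng.random(weights.shape) >= drop_p
    kept = weights * mask / (1.0 - drop_p)
    return float(image_feat @ kept)

# Monte-Carlo dropout: T stochastic passes -> predictive mean and std.
feat = np.array([0.5, 1.5, 2.0, 1.0])   # hypothetical pooled features
w = np.array([2.0, 1.0, 0.5, 1.5])      # hypothetical learned weights
samples = [count_with_dropout(feat, w) for _ in range(200)]
mean, std = float(np.mean(samples)), float(np.std(samples))
print(f"count ~ {mean:.1f} +/- {std:.1f}")
```

A large `std` relative to `mean` flags images where the count should not be trusted, which is exactly the signal safety-critical deployments currently lack.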
In conclusion, this survey offers a detailed taxonomy, a thorough performance comparison, and a critical discussion of strengths and limitations across CAC paradigms. By highlighting the rapid shift toward open‑world, prompt‑driven counting, the paper not only maps the current state of the art but also charts a clear roadmap for future advances in truly universal object counting.