SITUATE -- Synthetic Object Counting Dataset for VLM training
We present SITUATE, a novel dataset for training and evaluating Vision Language Models on counting tasks with spatial constraints. The dataset bridges the gap between simple 2D datasets such as VLMCountBench and often ambiguous real-world datasets such as TallyQA, which lack control over occlusions and spatial composition. Experiments show that our dataset improves generalization to out-of-distribution images: fine-tuning Qwen VL 2.5 7B on SITUATE improves accuracy on the Pixmo Count test data, but not vice versa. We cross-validate this by comparing model performance across other established counting benchmarks and against an equally sized fine-tuning set derived from Pixmo Count.
💡 Research Summary
The paper introduces SITUATE, a synthetic object‑counting dataset specifically designed to improve Vision‑Language Models (VLMs) on tasks that require precise quantitative reasoning together with spatial constraints. Existing counting benchmarks fall into two problematic categories: simple 2‑D synthetic sets such as VLMCountBench, which lack realistic visual complexity, and real‑world datasets like TallyQA or Pixmo Count, which suffer from uncontrolled occlusions, ambiguous scene composition, and biased visual cues (e.g., recurring patterns for certain numbers). To bridge this gap, the authors generate a large collection of high‑quality 3D renderings using Blender and the BlenderProc pipeline.
Dataset creation
- Four primitive shapes (cube, sphere, cone, cylinder) are rendered with varied colors and materials.
- Each scene contains a table placed in a room; objects are positioned on, under, in front of, or to the left/right of the table.
- To avoid overlap, the table’s horizontal extent is divided into equally sized bins; each bin receives at most one object, guaranteeing non‑overlapping placements while preserving spatial relationships.
- Lighting, room dimensions, and background textures are sampled from configurable ranges.
- A contrast check based on ΔE in Lab color space ensures that objects stand out from the background (ΔE < 12.5 triggers re‑rendering).
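The bin-based placement and the ΔE contrast check described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline code: the function names (`place_objects`, `delta_e_76`) and the example Lab colors are assumptions; only the CIE76 distance formula and the ΔE < 12.5 re-render threshold come from the paper.

```python
import math
import random

def delta_e_76(lab1, lab2):
    """CIE76 color difference: Euclidean distance in Lab space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(lab1, lab2)))

def place_objects(num_objects, table_width, num_bins):
    """Split the table's horizontal extent into equal bins and assign
    at most one object per bin, guaranteeing non-overlapping placement."""
    assert num_objects <= num_bins
    bin_width = table_width / num_bins
    chosen = random.sample(range(num_bins), num_objects)  # distinct bins
    # Place each object at the center of its bin.
    return [(b + 0.5) * bin_width for b in chosen]

# Hypothetical colors: an object too close to the background
# (ΔE < 12.5) would trigger a re-render.
background = (50.0, 0.0, 0.0)   # mid-grey in Lab
object_col = (55.0, 3.0, 4.0)   # ΔE ≈ 7.07, below the threshold
needs_rerender = delta_e_76(background, object_col) < 12.5

positions = place_objects(5, table_width=2.0, num_bins=8)
```

Because each object occupies its own bin, overlap is impossible by construction, while the bin index still encodes a usable left-to-right spatial relationship.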
Metadata and QA generation
For every rendered image, the authors record the exact count per color, per shape, and per location, as well as a full scene description. Six question types are defined: color‑based, shape‑based, location‑based, object‑wide count, composite (color + shape + location), and adversarial (asking about absent objects). Each question is instantiated with three answer styles—numeric, short textual, and verbose descriptive—using templated natural language. The final corpus comprises 23,252 image‑question‑answer triples (≈6,875 rendered images from five camera viewpoints, filtered down to roughly 2,332 unique scenes). Object counts range from 0 to 15, with a balanced distribution across this interval.
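The templated QA generation might look like the sketch below. The template wording and answer phrasing are invented for illustration; only the six question types and three answer styles are taken from the paper.

```python
# Hypothetical templates for the six question types; the paper's
# exact phrasings are not reproduced here.
TEMPLATES = {
    "color":       "How many {color} objects are in the image?",
    "shape":       "How many {shape}s are in the image?",
    "location":    "How many objects are {location} the table?",
    "overall":     "How many objects are in the image?",
    "composite":   "How many {color} {shape}s are {location} the table?",
    "adversarial": "How many {color} {shape}s are in the image?",  # absent class
}

def make_qa(qtype, count, **slots):
    """Instantiate one question with the three answer styles
    (numeric, short textual, verbose descriptive)."""
    question = TEMPLATES[qtype].format(**slots)
    answers = {
        "numeric": str(count),
        "short":   f"There are {count}.",
        "verbose": f"Counting carefully, the image contains "
                   f"{count} matching objects.",
    }
    return question, answers

q, a = make_qa("composite", 3, color="red", shape="cube", location="on")
```

Adversarial questions reuse the same templates but are filled with attribute combinations recorded as absent from the scene metadata, so the correct answer is always 0.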
Experimental protocol
The authors fine‑tune Qwen VL 2.5 7B (Instruct variant) using LoRA (rank 16, α = 32) on four data configurations: Verbose (all verbose answers), Non‑verbose (short answers), Pixmo‑Sub (a Pixmo‑Count subset matched in count distribution), and Mixed (SITUATE + Pixmo). Training runs for one epoch on a single Nvidia RTX 6000 GPU (batch 4, gradient accumulation 4, lr 1e‑4). Baselines include the larger Qwen VL 3 32B and Molmo 7B‑D0924 (the only model known to have been pre‑trained on Pixmo Count). Evaluation is performed on: (1) the held‑out SITUATE test split (496 items, 31 questions per count class), (2) Pixmo Count test set, (3) CountBench, and (4) a filtered TallyQA subset focusing on higher‑range answers (11‑15) and a mix of simple/complex queries (508 questions total).
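The LoRA settings used here (rank 16, α = 32) can be illustrated with a toy low-rank update in NumPy. This is not the training code; the matrix sizes and initialization are illustrative, but the scaling α/r = 2 follows directly from the stated hyperparameters.

```python
import numpy as np

# Toy LoRA update with the paper's hyperparameters:
# rank r = 16, alpha = 32, so the low-rank delta is scaled by alpha/r = 2.
rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 64, 64, 16, 32

W = rng.standard_normal((d_out, d_in))      # frozen base weight
A = rng.standard_normal((r, d_in)) * 0.01   # trainable down-projection
B = np.zeros((d_out, r))                    # trainable up-projection (init 0)

# Effective weight during/after fine-tuning:
W_eff = W + (alpha / r) * (B @ A)
```

With `B` initialized to zero, the adapted model starts out identical to the base model; only the small `A` and `B` matrices (rather than the full 7B parameters) are updated during the single training epoch.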
Results
- Fine‑tuning on SITUATE yields a noticeable boost on the Pixmo Count test (≈+7 percentage points in overall accuracy), especially for counts 6‑15, indicating that learning controlled spatial and color constraints transfers to real‑world images.
- The reverse direction (Pixmo → SITUATE) shows negligible improvement or even degradation, suggesting that Pixmo’s uncontrolled scenes do not teach the spatial reasoning required by SITUATE.
- The Mixed dataset (SITUATE + Pixmo) performs comparably to pure SITUATE on both benchmarks, confirming that adding noisy real images does not harm the learned counting ability.
- Larger models (Qwen VL 3 32B) achieve higher absolute accuracy across all benchmarks, yet the SITUATE‑fine‑tuned 7B model surpasses them in the hardest count range (10‑15 objects), highlighting the value of targeted synthetic data.
- Molmo 7B‑D0924, despite being pre‑trained on Pixmo, lags behind SITUATE‑fine‑tuned Qwen 2.5 on both synthetic and real tests, reinforcing the claim that dataset composition matters more than sheer model size.
Key insights
- Controlled synthetic data can teach VLMs spatial‑numeric reasoning more effectively than raw real‑world images, provided the data includes explicit positional cues and balanced count distributions.
- Contrast validation (ΔE) and bin‑based placement are simple yet powerful mechanisms to ensure that objects are both visually separable and spatially well‑defined, reducing ambiguity for the model.
- Multi‑style answer generation (numeric, short, verbose) encourages the model to adapt to varied output formats, which is crucial for downstream VQA applications.
- Cross‑benchmark evaluation demonstrates genuine generalization: improvements on Pixmo Count after SITUATE fine‑tuning are not merely overfitting to synthetic patterns.
Limitations and future work
- The dataset is limited to four primitive shapes; extending to more complex, non‑rigid objects would increase ecological validity.
- Question templates are rule‑based; leveraging large LLMs to generate more diverse, natural language queries could improve robustness to linguistic variation.
- Current experiments use a single epoch and modest compute; exploring longer training schedules or curriculum learning (progressing from low to high counts) may yield further gains.
- Combining synthetic data with carefully curated real images (e.g., semi‑synthetic augmentation) could blend the benefits of realism and control.
Conclusion
SITUATE fills a critical gap in VLM research by providing a publicly available, high‑quality synthetic dataset that explicitly encodes spatial relationships, color attributes, and balanced object counts. Fine‑tuning state‑of‑the‑art VLMs on SITUATE leads to measurable improvements on both synthetic and real‑world counting benchmarks, confirming that targeted synthetic data can substantially enhance VLM quantitative reasoning. The dataset, generation pipeline, and experimental code are released for the community, inviting further exploration of synthetic‑driven VLM training for counting, spatial reasoning, and beyond.