SPARC: Separating Perception And Reasoning Circuits for Test-time Scaling of VLMs

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Despite recent successes, test-time scaling - i.e., dynamically expanding the token budget during inference as needed - remains brittle for vision-language models (VLMs): unstructured chains-of-thought about images entangle perception and reasoning, leading to long, disorganized contexts where small perceptual mistakes may cascade into completely wrong answers. Moreover, expensive reinforcement learning with hand-crafted rewards is required to achieve good performance. Here, we introduce SPARC (Separating Perception And Reasoning Circuits), a modular framework that explicitly decouples visual perception from reasoning. Inspired by sequential sensory-to-cognitive processing in the brain, SPARC implements a two-stage pipeline where the model first performs explicit visual search to localize question-relevant regions, then conditions its reasoning on those regions to produce the final answer. This separation enables independent test-time scaling with asymmetric compute allocation (e.g., prioritizing perceptual processing under distribution shift), supports selective optimization (e.g., improving the perceptual stage alone when it is the bottleneck for end-to-end performance), and accommodates compressed contexts by running global search at lower image resolutions and allocating high-resolution processing only to selected regions, thereby reducing the total visual token count and compute. Across challenging visual reasoning benchmarks, SPARC outperforms monolithic baselines and strong visual-grounding approaches. For instance, SPARC improves the accuracy of Qwen3VL-4B on the $V^*$ VQA benchmark by 6.7 percentage points, and it surpasses “thinking with images” by 4.6 points on a challenging OOD task despite requiring a 200$\times$ lower token budget.


💡 Research Summary

The paper addresses a fundamental weakness in current vision‑language models (VLMs) when they employ test‑time scaling: the entanglement of visual perception and textual chain‑of‑thought (CoT) reasoning leads to long, unstructured contexts where a small perceptual error can cascade into a completely wrong answer. Moreover, achieving strong performance with this “thinking with images” paradigm typically requires expensive reinforcement‑learning fine‑tuning with hand‑crafted rewards.

SPARC (Separating Perception And Reasoning Circuits) proposes a clean, two‑stage architecture that mirrors the brain’s division between early visual pathways (“what/where”) and the prefrontal cortex. In Stage 1 (Perception), the model receives a low‑resolution version of the whole image together with the question and is asked to perform Implicit Relevance Detection (IRD). IRD outputs only the coordinates of the region(s) most relevant to the query, using a minimal number of text tokens. This stage can be run with self‑consistency across multiple roll‑outs, sharing a KV‑cache, which adds virtually no extra token budget while improving robustness.
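Since IRD emits only coordinates as text, the host application has to parse them and map them from the low-resolution search image back to full-resolution pixel space before cropping. A minimal sketch of that step, assuming the model emits boxes in a `[x1, y1, x2, y2]` pixel format (the exact output format is not specified in the summary):

```python
import re

def parse_ird_box(text):
    """Extract a bounding box written as "[x1, y1, x2, y2]" from IRD output text.

    Returns a 4-tuple of ints, or None if no box is found. The bracketed
    integer format is an assumption about the model's output convention.
    """
    m = re.search(r"\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]", text)
    if m is None:
        return None
    return tuple(int(g) for g in m.groups())

def rescale_box(box, low_res, full_res):
    """Map a box predicted on the low-res search image into full-res pixel space."""
    lw, lh = low_res
    fw, fh = full_res
    x1, y1, x2, y2 = box
    return (x1 * fw // lw, y1 * fh // lh, x2 * fw // lw, y2 * fh // lh)

box = parse_ird_box("The relevant region is [100, 50, 200, 150].")
full_box = rescale_box(box, low_res=(448, 448), full_res=(1792, 1792))
```

The rescaled box can then be used to crop the original high-resolution image for Stage 2.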

Stage 2 (Reasoning) takes the high‑resolution crops identified by IRD and feeds them back to the same VLM (or a dedicated reasoning module). The model now generates a standard CoT and the final answer, but the context is dramatically compressed: it contains only the necessary visual details rather than the entire image plus a long chain of interleaved visual actions. Consequently, the reasoning circuit is less prone to hallucinations and does not need to manage tool‑calling logic.
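The context compression is easy to see with back-of-the-envelope arithmetic. Assuming a ViT-style encoder that produces roughly one visual token per 28×28-pixel patch (a common choice in the Qwen-VL family; both this figure and the resolutions below are illustrative assumptions, not numbers from the paper):

```python
def visual_tokens(width, height, px_per_token=28):
    """Rough visual-token count for a ViT-style encoder.

    Assumes one token per 28x28-pixel patch; real encoders differ in
    patch size, merging, and padding, so treat this as an estimate.
    """
    return (width // px_per_token) * (height // px_per_token)

# Monolithic baseline: the full image at high resolution.
full = visual_tokens(1792, 1792)

# SPARC: one low-res global pass plus one high-res crop of similar size.
sparc = visual_tokens(448, 448) + visual_tokens(448, 448)
```

Under these assumptions the monolithic pass costs 4096 visual tokens versus 512 for SPARC, an 8× reduction before any savings from dropped tool-calling traces.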

Key empirical findings:

  • On the V* VQA benchmark, SPARC raises Qwen3‑VL‑4B’s accuracy by 6.7 percentage points.
  • On a challenging out‑of‑distribution (OOD) task, SPARC outperforms the “thinking with images” baseline by 4.6 points while using a 200× lower token budget.
  • A self‑consistency ensemble of eight IRD roll‑outs yields up to a 9.3% boost with negligible extra cost.
  • Experiments varying crop overlap and resolution show that a modest 20 % overlap at 256 px resolution already surpasses a 512 px full‑image model, confirming that precise localization can compensate for lower global detail.
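The summary does not say how the eight IRD roll-outs are aggregated, but a plausible sketch is IoU-based voting: keep the predicted box that agrees (overlaps above a threshold) with the most other roll-outs. Everything below is a hypothetical aggregation scheme, not the paper's stated method:

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def consensus_box(boxes, thresh=0.5):
    """Return the box with the most IoU-agreement across roll-outs.

    Each box votes for every box it overlaps with IoU >= thresh
    (including itself); ties break toward the earliest roll-out.
    """
    votes = [sum(iou(b, other) >= thresh for other in boxes) for b in boxes]
    return boxes[votes.index(max(votes))]

rollouts = [(0, 0, 100, 100), (5, 5, 105, 105), (200, 200, 300, 300), (2, 2, 102, 102)]
best = consensus_box(rollouts)  # the outlier (200, 200, 300, 300) is voted out
```

Because all roll-outs condition on the same image prefix, they can share a KV-cache, which is why the ensemble adds little cost beyond the few coordinate tokens each roll-out emits.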

The modular design enables independent test‑time scaling: perception can be allocated more compute (e.g., higher resolution, multiple samples) when the visual domain shifts, while reasoning remains fixed. It also permits selective fine‑tuning—one can improve the perception circuit alone (e.g., domain‑specific object detectors) without degrading the reasoning circuit’s pre‑trained language capabilities, avoiding catastrophic forgetting.

From a systems perspective, SPARC reduces the context engineering burden. By structuring the prompt into two concise phases, it eliminates the need for long, unstructured multimodal CoT traces, aligning with recent principles that advocate modular, hierarchical context composition. This leads to faster inference, lower memory consumption, and more predictable latency, which are critical for real‑time applications such as robotics, mobile assistants, and edge devices.

In summary, SPARC demonstrates that decoupling visual perception from logical reasoning in VLMs yields a more robust, efficient, and scalable test‑time scaling strategy. It offers a biologically inspired blueprint for future multimodal models, showing that careful architectural separation can overcome the brittleness of monolithic designs while delivering state‑of‑the‑art performance with dramatically reduced computational overhead.

