CoTZero: Annotation-Free Human-Like Vision Reasoning via Hierarchical Synthetic CoT
Recent advances in vision-language models (VLMs) have markedly improved image-text alignment, yet they still fall short of human-like visual reasoning. A key limitation is that many VLMs rely on surface correlations rather than building logically coherent, structured representations, which often causes them to miss higher-level semantic structure and causal relations, hindering compositional and verifiable reasoning. To address these limitations, we introduce human cognitive models into the reasoning process and propose CoTZero, an annotation-free paradigm with two components: (i) a dual-stage data synthesis approach and (ii) a cognition-aligned training method. The data synthesis component draws inspiration from neurocognitive accounts of compositional productivity and global-to-local analysis. In the bottom-up stage, CoTZero extracts atomic visual primitives and incrementally composes them into diverse, structured question-reasoning forms. In the top-down stage, it enforces hierarchical reasoning by using coarse global structure to guide the interpretation of local details and causal relations. In the cognition-aligned training component, built on the synthesized CoT data, we introduce Cognitively Coherent Verifiable Rewards (CCVR) into Reinforcement Fine-Tuning (RFT) to further strengthen VLMs’ hierarchical reasoning and generalization, providing stepwise feedback on reasoning coherence and factual correctness. Experiments show that CoTZero achieves an F1 score of 83.33% on our multi-level semantic inconsistency benchmark with lexical-perturbation negatives, across both in-domain and out-of-domain settings. Ablations confirm that each component contributes to more interpretable and human-aligned visual reasoning.
💡 Research Summary
The paper “CoTZero: Annotation‑Free Human‑Like Vision Reasoning via Hierarchical Synthetic Chain‑of‑Thought” addresses a fundamental shortcoming of current vision‑language models (VLMs): while they excel at image‑text alignment, they lack the ability to perform deep, hierarchical visual reasoning comparable to human cognition. The authors attribute this gap to VLMs’ reliance on surface statistical correlations rather than constructing structured, causal representations of visual scenes. To bridge this gap, they introduce CoTZero, an annotation‑free framework composed of two tightly coupled components: (1) a dual‑stage synthetic data generation pipeline and (2) a cognition‑aligned training regime.
In the first component, the pipeline operates in a bottom‑up and top‑down fashion. The bottom‑up stage uses a pre‑trained VLM to generate rich captions from raw images, which are then parsed by a large language model (LLM) into atomic (entity, relation, entity) triples. These triples capture the most elementary visual relationships, mirroring the “visual primitives” that humans first perceive. From each triple, atomic yes/no questions are generated. A similarity‑based merging process (using cosine similarity of sentence embeddings) progressively groups atomic questions into intermediate‑level and finally high‑level questions, forming a hierarchical question tree. This mirrors the human global‑to‑local analysis where an overall scene layout guides the interpretation of finer details. The top‑down stage then reverses the process: complex questions are systematically decomposed back into their constituent sub‑questions, yielding a rich set of training examples that contain multi‑granularity supervision (from atomic facts to global conclusions).
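The similarity-based merging step can be illustrated with a minimal sketch. The paper presumably uses learned sentence embeddings; here a toy bag-of-words vector stands in for them, and `merge_questions`, the greedy grouping helper, and the `threshold` value are all hypothetical names and settings chosen for this example, not the authors' implementation.

```python
from collections import Counter
import math

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; the actual pipeline would use
    # learned sentence embeddings instead.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def merge_questions(questions, threshold=0.6):
    """Greedily group atomic questions whose similarity to a group's
    first member exceeds a threshold; each group would then be
    summarized into one intermediate-level question."""
    groups = []
    for q in questions:
        for g in groups:
            if cosine(embed(q), embed(g[0])) >= threshold:
                g.append(q)
                break
        else:
            groups.append([q])
    return groups

atomic = [
    "Is the dog on the sofa?",   # from triple (dog, on, sofa)
    "Is the dog brown?",         # from triple (dog, has-color, brown)
    "Is the lamp turned on?",    # from triple (lamp, state, on)
]
groups = merge_questions(atomic)
# The two dog-related questions cluster together; the lamp question
# remains its own group, forming the next level of the question tree.
```

Repeating the merge over group summaries would yield the intermediate and high-level tiers of the hierarchical question tree described above.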
The second component, cognition‑aligned training, first fine‑tunes the VLM on the synthetic CoT data via supervised fine‑tuning (SFT). Afterwards, the model undergoes reinforcement fine‑tuning using Group Relative Policy Optimization (GRPO) together with Cognitively Coherent Verifiable Rewards (CCVR). CCVR is a novel reward function that simultaneously penalizes edit‑distance deviations and rewards semantic similarity between the model‑generated reasoning chain and a reference chain. By providing step‑wise feedback on logical coherence and factual correctness, CCVR mitigates the credit‑assignment problem that plagues traditional outcome‑only rewards and encourages the model to construct and verify intermediate reasoning steps, much like a human does.
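The structure of such a reward can be sketched as follows. The paper does not publish the exact CCVR formula, so this is only a plausible shape under stated assumptions: `difflib.SequenceMatcher` stands in for a real semantic-similarity model, and `ccvr_reward` with its `alpha` weighting is a hypothetical helper, not the authors' implementation.

```python
import difflib

def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def ccvr_reward(generated: str, reference: str, alpha: float = 0.5) -> float:
    """Hypothetical CCVR-style step reward: reward similarity to the
    reference reasoning step, penalize normalized edit distance.
    `alpha` balances the two terms."""
    # SequenceMatcher ratio is a crude stand-in for semantic similarity.
    sem = difflib.SequenceMatcher(None, generated, reference).ratio()
    dist = levenshtein(generated, reference) / max(len(generated), len(reference), 1)
    return alpha * sem - (1 - alpha) * dist
```

Applied per reasoning step rather than only to the final answer, a reward of this shape gives GRPO dense feedback, which is what lets it mitigate the credit-assignment problem mentioned above.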
Empirical evaluation is conducted on a newly constructed multi‑level semantic inconsistency benchmark that includes lexical‑perturbation negative examples. CoTZero achieves an F1 score of 83.33 % across both in‑domain and out‑of‑domain settings, substantially outperforming baseline VLMs and prior linear CoT approaches. Ablation studies demonstrate that each element—bottom‑up/top‑down data synthesis, CCVR‑guided reinforcement, and GRPO optimization—contributes meaningfully to the overall gain. Qualitative analysis also shows that the generated reasoning chains are more interpretable and align better with human judgments of logical consistency.
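For reference, the F1 metric reported here is the harmonic mean of precision and recall over the binary consistent/inconsistent decisions; the counts in the example below are illustrative, not taken from the paper.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 from raw true-positive, false-positive, false-negative counts."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Illustrative counts only: perfect recall with a few false positives.
print(round(f1_score(tp=5, fp=2, fn=0), 4))
```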
In summary, CoTZero leverages two core cognitive principles—compositional productivity and global‑to‑local processing—to endow VLMs with structured, verifiable, and human‑like visual reasoning capabilities without any human‑annotated data. The work opens a promising direction for building trustworthy multimodal AI systems that can reason about complex visual scenes in a transparent, step‑by‑step manner.