📝 Original Info
- Title: CoRe3D: Collaborative Reasoning as a Foundation for 3D Intelligence
- ArXiv ID: 2512.12768
- Date: 2025-12-14
- Authors: Tianjiao Yu, Xinzhuo Li, Yifan Shen, Yuanzhe Liu, Ismini Lourentzou
📝 Abstract
Recent advances in large multimodal models suggest that explicit reasoning mechanisms play a critical role in improving model reliability, interpretability, and cross-modal alignment. While such reasoning-centric approaches have been proven effective in language and vision tasks, their extension to 3D remains underdeveloped. CoRe3D introduces a unified 3D understanding and generation reasoning framework that jointly operates over semantic and spatial abstractions, enabling high-level intent inferred from language to directly guide low-level 3D content formation. Central to this design is a spatially grounded reasoning representation that decomposes 3D latent space into localized regions, allowing the model to reason over geometry in a compositional and procedural manner. By tightly coupling semantic chain-of-thought inference with structured spatial reasoning, CoRe3D produces 3D outputs that exhibit strong local consistency and faithful alignment with linguistic descriptions.
💡 Deep Analysis
📄 Full Content
plan-lab.github.io
CoRe3D: Collaborative Reasoning as a
Foundation for 3D Intelligence
Tianjiao Yu, Xinzhuo Li, Yifan Shen, Yuanzhe Liu, Ismini Lourentzou
{ty41, lourent2}@illinois.edu
University of Illinois Urbana-Champaign
1. Introduction
Despite rapid progress in 3D generation, most existing methods remain imitation-based, reproducing shapes rather than reasoning about objects [52, 105]. As a result, they struggle with prompts that implicitly describe relations, counts, geometry, or physical contacts, concepts that recent unified language–vision models have begun to handle effectively in 2D settings [69, 98]. This progress is largely attributed to the integration of Chain-of-Thought (CoT) reasoning [81], which, when extended to multimodal LLMs [7, 50, 109, 112], improves interpretability and consistency across visual reasoning tasks [34, 48]. However, unified reasoning in the 3D domain remains under-explored; few models are capable of jointly interpreting and constructing 3D objects [80, 104].
To advance this frontier, we propose CoRe3D, a framework for collaborative reasoning that unifies semantic understanding and geometric generation within a single 3D-LLM. As illustrated in Fig. 1, CoRe3D integrates a unified 3D language model with an octant-based 3D VQ-VAE, enabling the model to reason in both language and 3D token space.
At its core, our approach couples a Semantic CoT for high-level textual planning with a novel Geometric CoT for spatial synthesis. The Geometric CoT operates autoregressively across octant blocks, addressing the limitations of existing “flat” voxel representations that waste computation on empty space and fail to capture structured spatial dependencies. Unlike part-level representations [9], which require fixed ontologies and suffer from poor generalization across categories, or voxel-level representations [53, 90], which remain unstructured and semantically agnostic, our octant-based representation remains ontology-free yet structure-aware.
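The idea of grouping geometry into octant blocks and visiting only the non-empty ones can be illustrated with a minimal sketch. This is not the paper's octant-based VQ-VAE tokenizer: the function names, the octant ordering, and the sparse set-of-voxels input are all assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple

Voxel = Tuple[int, int, int]


def octant_index(v: Voxel, half: int) -> int:
    """Map a voxel to one of 8 octants of a cube with side 2 * half.

    The bit layout (x -> 4, y -> 2, z -> 1) is a hypothetical ordering.
    """
    x, y, z = v
    return (x >= half) * 4 + (y >= half) * 2 + (z >= half) * 1


def split_into_octants(occupied: Set[Voxel], size: int) -> Dict[int, List[Voxel]]:
    """Group occupied voxels by octant, storing octant-local coordinates."""
    half = size // 2
    blocks: Dict[int, List[Voxel]] = {}
    for v in sorted(occupied):
        idx = octant_index(v, half)
        local = (v[0] % half, v[1] % half, v[2] % half)
        blocks.setdefault(idx, []).append(local)
    return blocks


def octant_sequence(occupied: Set[Voxel], size: int) -> List[Tuple[int, List[Voxel]]]:
    """Return non-empty octant blocks in a fixed order.

    Empty octants never appear, so no work is spent on empty space.
    """
    blocks = split_into_octants(occupied, size)
    return [(idx, blocks[idx]) for idx in sorted(blocks)]


if __name__ == "__main__":
    shape = {(0, 0, 0), (3, 3, 3), (3, 0, 1)}
    for idx, voxels in octant_sequence(shape, size=4):
        print(idx, voxels)
```

A sequence model could then consume these blocks one at a time, which is the sense in which a block-wise ordering is "structure-aware" without needing part labels.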
To jointly refine both reasoning streams, we further employ Group-Relative Policy Optimization (GRPO) [60], allowing CoRe3D to learn from multi-critic feedback that balances semantic intent, visual quality, and physical coherence. This reasoning-aware framework produces high-fidelity 3D construction with enhanced spatial understanding while maintaining strong general language abilities. This GRPO-based refinement is essential for three reasons: (1) it elicits plans where no “gold” supervision exists; (2) it allows for granular process credit assignment using dense 3D-specific rewards; and (3) it prevents reward hacking and overfitting by leveraging an ensemble of different critics. By rewarding both linguistic reasoning and 3D synthesis, our approach lays the groundwork for general 3D intelligence, unifying understanding and generation.
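The group-relative part of GRPO can be sketched in a few lines: several outputs are sampled for the same prompt, each is scored, and each score is normalized against the group's mean and standard deviation. The weighted-sum combination of critics, the critic names, the weights, and the use of population standard deviation are assumptions for illustration; the text above does not specify how CoRe3D aggregates its critics.

```python
from statistics import mean, pstdev
from typing import Dict, List


def combine_critics(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Collapse several critic scores into one scalar reward.

    A weighted sum is an assumption, not the paper's stated rule.
    """
    return sum(weights[name] * scores[name] for name in weights)


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against the other samples in its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical critic names and weights.
    weights = {"semantic": 0.4, "visual": 0.3, "physical": 0.3}
    group = [
        {"semantic": 0.9, "visual": 0.6, "physical": 0.8},
        {"semantic": 0.5, "visual": 0.7, "physical": 0.4},
        {"semantic": 0.2, "visual": 0.3, "physical": 0.9},
    ]
    rewards = [combine_critics(s, weights) for s in group]
    print(group_relative_advantages(rewards))
```

Because advantages are computed relative to the group, no separate learned value function is needed, and a sample is only reinforced if it beats its siblings on the combined reward.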
∗Preprint. Work in progress.
arXiv:2512.12768v1 [cs.CV] 14 Dec 2025
[Figure 1 content, recovered from extracted panel text]
3D Prompt: "A cozy wooden cottage with a red door and leafy vines"
Panel labels: Semantic-level CoT; Geometric-level CoT; High-Level Guidance; Local Details; Collaborative Reasoning
Example reasoning steps shown in the figure:
- First, recognize the main structural parts of the cottage, including the sloped roof, wooden walls, chimney, windows, and front door.
- Next, place these components in the correct spatial arrangement, with the roof on top, the chimney offset to one side ... and the door centered beneath the upper windows.
- Then, assign appropriate materials and styles, such as warm wooden textures for the walls ... along with white shutters and green vines.
- Refine the scene by adding small decorative details like shingles, flowers ... and rounded edges to capture the cozy, handcrafted look.
Figure 1: We introduce CoRe3D, a framework that unifies Semantic CoT and octant-based Geometric
CoT through collaborative reasoning. By coupling language-grounded reasoning with sh