CG-MLLM: Captioning and Generating 3D content via Multi-modal Large Language Models
Large Language Models (LLMs) have revolutionized text generation and multimodal perception, but their capabilities in 3D content generation remain underexplored. Existing methods compromise by producing either low-resolution meshes or coarse structural proxies, failing to capture fine-grained geometry natively. In this paper, we propose CG-MLLM, a novel Multi-modal Large Language Model (MLLM) capable of both 3D captioning and high-resolution 3D generation in a single framework. Leveraging a Mixture-of-Transformers architecture, CG-MLLM decouples disparate modeling needs: a Token-level Autoregressive (TokenAR) Transformer handles sequential text and image tokens, while a Block-level Autoregressive (BlockAR) Transformer handles blocks of 3D latent tokens. By integrating a pre-trained vision-language backbone with a specialized 3D VAE latent space, CG-MLLM facilitates long-context interactions between standard tokens and spatial blocks within a single integrated architecture. Experimental results show that CG-MLLM significantly outperforms existing MLLMs in generating high-fidelity 3D objects, effectively bringing high-resolution 3D content creation into the mainstream LLM paradigm.
💡 Research Summary
CG-MLLM introduces a unified multimodal large language model that can both caption and generate high-resolution 3D content within a single end-to-end framework. The authors identify a gap in current 3D generation research: existing methods either produce low-resolution meshes or coarse voxel/LEGO-style proxies, and they rely on separate diffusion stages for fine geometry. To close this gap, CG-MLLM combines a pretrained vision-language backbone (Qwen3-VL) with a specialized 3D variational auto-encoder (Hunyuan3D-2.1-VAE) and adopts a Mixture-of-Transformers (MoT) architecture. Two transformers are employed: a Token-level Autoregressive (TokenAR) transformer handles sequential token generation for text and image captions, while a Block-level Autoregressive (BlockAR) transformer processes parallel blocks of 3D latent tokens, enabling efficient handling of thousands of spatial tokens. A hybrid masking scheme dynamically switches between causal and parallel masks, so that each modality attends to the appropriate context. The 3D VAE extracts point clouds, downsamples them by a factor of 20, and encodes them into a 64-dimensional latent space that remains frozen during fine-tuning, preserving geometric priors. Token and block embeddings are aligned via a connector layer and an MLP-Merger, ensuring a shared representation space across modalities. The model inherits advanced transformer components from Qwen3-VL, such as Grouped Query Attention, SwiGLU activation, RMSNorm, and Interleaved Multi-modal Rotary Positional Embeddings, which together improve stability and expressiveness. During decoding, textual outputs are produced via the original tokenizer, while 3D outputs are reconstructed through the VAE decoder and a material generator, yielding high-fidelity meshes with realistic textures.
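The hybrid masking idea can be illustrated with a small sketch. This is an illustrative reconstruction rather than the paper's implementation: the function name, the block layout, and the exact rule (bidirectional attention within each 3D block, causal attention everywhere else) are assumptions based on the description above.

```python
import numpy as np

def hybrid_attention_mask(n_text: int, n_blocks: int, block_size: int) -> np.ndarray:
    """Hybrid mask: causal over text tokens, block-parallel over 3D latent blocks.

    Returns a boolean matrix where True at (i, j) means token i may attend to j.
    Text tokens attend causally; each 3D block attends bidirectionally within
    itself and causally to all earlier tokens and blocks.
    """
    total = n_text + n_blocks * block_size
    # Start from a standard causal (lower-triangular) mask.
    mask = np.tril(np.ones((total, total), dtype=bool))
    # Open up full bidirectional attention inside each 3D block.
    for b in range(n_blocks):
        start = n_text + b * block_size
        end = start + block_size
        mask[start:end, start:end] = True
    return mask

m = hybrid_attention_mask(n_text=4, n_blocks=2, block_size=3)
assert not m[0, 1]          # a text token cannot see the future
assert m[4, 6] and m[6, 4]  # tokens within one 3D block see each other
assert not m[4, 7]          # a block cannot see a later block
```

In practice such a mask would be passed to the attention layers as an additive bias (0 where True, -inf where False); the sketch only captures the visibility pattern.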
Empirical results demonstrate that CG-MLLM outperforms prior 3D-LLMs (e.g., SAR3D, Emu3) and diffusion-based pipelines on both understanding tasks (object identification, point-cloud classification) and generation metrics (Chamfer distance, F1 score). Notably, BlockAR achieves a three-fold speedup over token-level processing when generating 4096-token 3D sequences, bringing inference close to real time. The paper concludes that integrating language, vision, and 3D geometry in a single LLM not only bridges the 2D-to-3D modality gap but also opens pathways for downstream applications in AR/VR, CAD, and gaming, while suggesting future work on larger 3D latent spaces and richer texture synthesis.
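The source of the speedup is that block-level decoding emits many 3D tokens per autoregressive step instead of one. A minimal back-of-the-envelope sketch follows; the 64-token block size is a hypothetical choice (the summary does not state the actual block size), and wall-clock speedup is smaller than the raw step-count reduction because each block step does more work per pass.

```python
def decoding_steps(total_tokens: int, tokens_per_step: int = 1) -> int:
    """Sequential forward passes needed to emit `total_tokens` when each
    autoregressive step produces `tokens_per_step` tokens at once."""
    return -(-total_tokens // tokens_per_step)  # ceiling division

# Token-level decoding: one token per forward pass.
assert decoding_steps(4096, 1) == 4096
# Hypothetical block-level decoding with 64-token blocks: 64 passes.
assert decoding_steps(4096, 64) == 64
```

Reducing the number of sequential passes is the mechanism behind BlockAR's reported three-fold wall-clock speedup; the exact ratio depends on per-step cost, which grows with block size.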