RefAny3D: 3D Asset-Referenced Diffusion Models for Image Generation

Notice: This research summary and analysis were generated automatically using AI. For authoritative details, please refer to the original arXiv source.

In this paper, we propose a 3D asset-referenced diffusion model for image generation, exploring how to integrate 3D assets into image diffusion models. Existing reference-based image generation methods leverage large-scale pretrained diffusion models and demonstrate strong capability in generating diverse images conditioned on a single reference image. However, these methods are limited to single-image references and cannot leverage 3D assets, constraining their practical versatility. To address this gap, we present a cross-domain diffusion model with dual-branch perception that leverages multi-view RGB images and point maps of 3D assets to jointly model their colors and canonical-space coordinates, achieving precise consistency between generated images and the 3D references. Our spatially aligned dual-branch generation architecture and domain-decoupled generation mechanism ensure the simultaneous generation of two spatially aligned but content-disentangled outputs, RGB images and point maps, linking 2D image attributes with 3D asset attributes. Experiments show that our approach effectively uses 3D assets as references to produce images consistent with the given assets, opening new possibilities for combining diffusion models with 3D content creation.


💡 Research Summary

RefAny3D introduces a novel diffusion‑based framework that conditions image generation on 3D assets, addressing a gap in existing reference‑guided generation methods which are limited to single 2D images. The core contribution is a spatially aligned dual‑branch architecture that simultaneously generates an RGB image and a point‑map (a rasterized representation of normalized 3D object coordinates) from multi‑view inputs of the target object. By treating the RGB and point‑map modalities as separate but tightly coupled streams, the model can enforce precise geometric consistency while preserving rich texture and lighting details.
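To make the point-map modality concrete: a point map stores, at each foreground pixel, the object's canonical-space coordinates normalized over its bounding box, so the three channels read like RGB. The sketch below is illustrative only (the function name and the simple axis-aligned bounding-box normalization are assumptions, not the paper's exact procedure):

```python
import numpy as np

def coords_to_point_map(xyz, valid_mask):
    """Normalize per-pixel canonical-space coordinates to [0, 1] over the
    object's axis-aligned bounding box, yielding a 3-channel point map."""
    pts = xyz[valid_mask]                      # (N, 3) foreground points
    lo, hi = pts.min(axis=0), pts.max(axis=0)  # bounding box of the object
    pm = np.zeros_like(xyz)                    # background stays zero
    pm[valid_mask] = (pts - lo) / np.maximum(hi - lo, 1e-8)
    return pm

# Toy 2x2 "render": per-pixel 3D coordinates with one background pixel.
xyz = np.array([[[0., 0., 0.], [1., 0., 0.]],
                [[0., 2., 0.], [1., 2., 4.]]])
mask = np.array([[True, True], [True, False]])
pm = coords_to_point_map(xyz, mask)
```

Because the point map depends only on geometry, it carries no texture or lighting, which is exactly the asymmetry the dual-branch design exploits.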

The pipeline begins by rendering a set of multi‑view RGB‑point‑map pairs from the 3D asset. Each pair is encoded with a pretrained VAE into latent tokens; the RGB tokens and point‑map tokens are concatenated with a noisy target latent and fed into a Diffusion Transformer (DiT). A shared positional encoding scheme is applied across both domains, ensuring that tokens representing the same pixel location in the RGB image and the point‑map receive identical positional embeddings. A positional shift (i‑w, j) guarantees that conditional tokens and target tokens remain spatially disjoint, preserving alignment without interference.
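The positional scheme described above can be sketched as follows. This is a minimal illustration, assuming simple integer (row, col) position ids for a w-wide latent grid; per the summary, condition tokens are shifted to (i − w, j) so their positions never collide with target positions, while the RGB and point-map condition streams share identical positions:

```python
import numpy as np

def position_ids(h, w):
    """Flattened (row, col) position ids for an h x w token grid."""
    i, j = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    return np.stack([i, j], axis=-1).reshape(-1, 2)

h, w = 4, 4
target_pos = position_ids(h, w)   # target latent sits at (i, j)
cond_pos = position_ids(h, w)
cond_pos[:, 0] -= w               # shift condition tokens to (i - w, j)

# RGB and point-map condition tokens get the SAME shifted positions, so
# tokens at the same pixel receive identical positional embeddings.
rgb_cond_pos = cond_pos
pmap_cond_pos = cond_pos.copy()

target_set = {tuple(p) for p in target_pos}
cond_set = {tuple(p) for p in cond_pos}
```

Disjoint position sets keep the conditional and target streams from aliasing in attention, while the shared embedding across modalities enforces pixel-level alignment.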

To mitigate the inherent information asymmetry—point‑maps contain only geometry while RGB images contain full scene appearance—the authors introduce domain‑specific LoRA adapters and a text‑agnostic attention mechanism. The domain switcher tags each token with its modality, directing it through the appropriate LoRA branch. This decoupling prevents background or lighting information from contaminating the geometry‑only point‑map stream, while still allowing the text prompt to influence both outputs in a controlled manner.
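A minimal sketch of the domain switcher, under the assumption that each attention projection is a frozen shared weight plus a per-domain low-rank (LoRA) update selected by the token's modality tag (the names `project`, `lora`, and the random weights are illustrative, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2                                   # model dim, LoRA rank

W = rng.normal(size=(d, d))                   # frozen shared projection
lora = {                                      # per-domain low-rank adapters
    "rgb":  (rng.normal(size=(d, r)), rng.normal(size=(r, d))),
    "pmap": (rng.normal(size=(d, r)), rng.normal(size=(r, d))),
}

def project(tokens, domains):
    """Route each token through the shared weight plus the LoRA branch
    chosen by its domain tag (the 'domain switcher')."""
    out = tokens @ W
    for name, (A, B) in lora.items():
        m = domains == name
        out[m] += tokens[m] @ A @ B           # low-rank domain update
    return out

tokens = rng.normal(size=(6, d))
domains = np.array(["rgb", "rgb", "pmap", "pmap", "rgb", "pmap"])
y = project(tokens, domains)
```

Because the two LoRA branches never see each other's tokens, appearance statistics learned by the RGB branch cannot leak into the geometry-only point-map branch.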

Training uses a pose‑aligned dataset where each object instance is captured from multiple viewpoints, providing perfectly aligned RGB and point‑map data. Conditional tokens are fixed at diffusion timestep 0, preserving their information throughout denoising. The model learns the joint conditional distribution p(x_I, x_P | y, c) where x_I is the target RGB image, x_P its point‑map, y the 3D asset, and c the textual prompt.
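The "conditional tokens fixed at timestep 0" detail can be sketched as below. This is an assumption-laden illustration: it uses a simple rectified-flow-style linear interpolation to noise only the target latents, while the clean condition latents are concatenated with timestep 0, so the denoiser always sees the reference uncorrupted:

```python
import numpy as np

rng = np.random.default_rng(1)

def make_denoiser_input(target_lat, cond_lat, t):
    """Noise only the target latents at timestep t; condition latents
    stay clean at t = 0, preserving reference information throughout
    denoising. Linear interpolation here is an illustrative choice."""
    noise = rng.normal(size=target_lat.shape)
    noisy_target = (1 - t) * target_lat + t * noise
    timesteps = np.concatenate([np.full(len(target_lat), t),
                                np.zeros(len(cond_lat))])
    tokens = np.concatenate([noisy_target, cond_lat])
    return tokens, timesteps, noise

tgt = rng.normal(size=(16, 4))   # target RGB + point-map latents
cnd = rng.normal(size=(32, 4))   # multi-view condition latents
tokens, ts, eps = make_denoiser_input(tgt, cnd, t=0.7)
```

The DiT is then trained to recover the target latents from `tokens`, i.e. to model p(x_I, x_P | y, c) with the condition stream acting as y.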

Extensive experiments demonstrate that RefAny3D outperforms state‑of‑the‑art 2D reference methods on several metrics: FID, CLIP‑Score, and a newly proposed geometry‑consistency score. Qualitative results show that generated images faithfully reproduce the object’s shape, texture, and pose across diverse viewpoints, while background and lighting adapt naturally to the prompt. The approach also scales to complex assets with intricate geometry and fine‑grained textures, which previous 3D‑aware methods struggled to handle.

Limitations include reliance on rasterized point‑maps (which may not capture high‑frequency surface details or advanced material properties such as translucency or specular highlights) and the need for pre‑rendered multi‑view inputs, which could hinder real‑time applications. Future work may integrate neural rendering techniques for continuous 3D representations, extend the modality set to include normals, depth, or material maps, and explore more efficient conditioning strategies for interactive use.

Overall, RefAny3D provides a practical solution for creators who wish to visualize 3D assets in varied scenes without manual rendering pipelines. By bridging 3D geometry and 2D diffusion generation, it opens new avenues for content creation, advertising, game design, and AR/VR workflows, establishing a foundation for further research at the intersection of diffusion models and 3D content.

