BoostDream: Efficient Refining for High-Quality Text-to-3D Generation from Multi-View Diffusion
Alongside the rapid evolution of text-to-image diffusion models, significant strides have been made in text-to-3D generation. Currently, two primary paradigms dominate the field: feed-forward generation solutions, capable of swiftly producing 3D assets but often yielding coarse results, and Score Distillation Sampling (SDS) based solutions, known for generating high-fidelity 3D assets albeit at a slower pace. The synergistic integration of these methods holds substantial promise for advancing 3D generation techniques. In this paper, we present BoostDream, a highly efficient plug-and-play 3D refining method designed to transform coarse 3D assets into high-quality ones. The BoostDream framework comprises three distinct processes: (1) we introduce 3D model distillation, which fits a differentiable representation to the 3D asset obtained through feed-forward generation; (2) we design a novel multi-view SDS loss that uses a multi-view-aware 2D diffusion model to refine the 3D asset; (3) we propose using the text prompt together with multi-view-consistent normal maps as guidance during refinement. Extensive experiments on different differentiable 3D representations reveal that BoostDream generates high-quality 3D assets rapidly and overcomes the Janus problem that afflicts conventional SDS-based methods. This signifies a substantial advancement in both the efficiency and quality of 3D generation.
💡 Research Summary
BoostDream introduces a highly efficient, plug‑and‑play refinement pipeline that bridges the speed of feed‑forward text‑to‑3D generators with the high fidelity of Score Distillation Sampling (SDS)‑based optimization. The method consists of three stages. First, a rapid “3D model distillation” step converts a coarse explicit 3D asset (e.g., a mesh or point cloud generated by a feed‑forward model such as Shap‑E) into a differentiable 3D representation (NeRF, SDF, voxel, etc.). By rendering both the coarse asset and the randomly initialized differentiable model under the same camera parameters and minimizing an L1 loss, the conversion is achieved within seconds, eliminating the costly initialization phase typical of SDS methods.
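The distillation step above can be illustrated with a minimal, self-contained sketch. Everything here is a toy stand-in, not the paper's implementation: `render_coarse` is a hypothetical renderer for the coarse feed-forward asset, and the "differentiable representation" is reduced to a per-pixel grid whose render is the identity, so the L1 gradient is just the sign of the residual. The point is the structure: render both models under the same camera parameters and descend on the L1 loss.

```python
import random

def render_coarse(cam_seed):
    # Hypothetical stand-in for rendering the coarse feed-forward asset
    # from one camera pose; returns a flat list of 64 "pixel" values.
    rng = random.Random(cam_seed)
    return [rng.random() for _ in range(64)]

def l1_loss(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

# Toy differentiable representation: a per-pixel grid whose "render" is itself,
# so the gradient of the L1 loss is sign(theta - target) per pixel.
theta = [0.0] * 64
lr = 0.02

init = sum(l1_loss(theta, render_coarse(c)) for c in range(16)) / 16
for step in range(600):
    cam = step % 16                       # cycle over shared camera poses
    target = render_coarse(cam)           # same pose drives both renders
    theta = [t - lr * (1.0 if t > x else -1.0) for t, x in zip(theta, target)]
final = sum(l1_loss(theta, render_coarse(c)) for c in range(16)) / 16
```

In a real pipeline both renders would come from a differentiable rasterizer or volume renderer and the update from autodiff, but the fitting loop has exactly this shape, which is why the conversion completes in seconds rather than requiring SDS-style optimization from scratch.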
Second, BoostDream employs a novel multi‑view rendering system to combat the Janus (multi‑head) problem. At each iteration, a camera position is sampled in spherical coordinates; a random rotation axis is defined, and the camera is rotated by 90° increments around this axis to obtain four distinct viewpoints. The four rendered RGB images are stitched into a 2×2 composite image G, and the corresponding normal maps are similarly combined into a composite normal map N. These multi‑view images serve as control conditions for a newly designed multi‑view SDS loss. Unlike conventional SDS, which only conditions on the text prompt y, the multi‑view SDS incorporates the normal map N, providing detailed geometric guidance. The noise estimator is defined as
ε̂(x_t; t, y, N) = ε_φ(x_t; t, y, λN) + s·(ε_φ(x_t; t, y, λN) − ε_φ(x_t; t)),
where λ scales the strength of the normal-map condition and s is the classifier-free guidance scale. The gradient with respect to the differentiable parameters θ is then computed analogously to standard SDS, but now driven by both textual and multi-view geometric cues. Orientation and opacity losses from DreamFusion are also incorporated when using NeRF, ensuring correct surface normals and transparency.
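The multi-view machinery described above can be sketched in a few lines. This is an illustrative reconstruction, not the paper's code: `render` is a hypothetical per-view renderer, `stitch_2x2` tiles four views into the composite G, and `guided_eps` evaluates the noise-estimator formula on scalars to show the arithmetic (in practice ε_φ is a diffusion U-Net operating on image tensors).

```python
import random

def render(view_deg):
    # Hypothetical renderer: returns a 2x2 "image" (nested lists) for one view angle.
    rng = random.Random(view_deg)
    return [[rng.random() for _ in range(2)] for _ in range(2)]

def stitch_2x2(views):
    # Tile four HxW views into one 2Hx2W composite:
    # [top-left, top-right, bottom-left, bottom-right].
    tl, tr, bl, br = views
    top = [ra + rb for ra, rb in zip(tl, tr)]
    bottom = [ra + rb for ra, rb in zip(bl, br)]
    return top + bottom

# Sample a camera azimuth, then take four views at 90-degree increments.
base = random.Random(0).uniform(0.0, 360.0)
views = [render((base + k * 90.0) % 360.0) for k in range(4)]
G = stitch_2x2(views)   # 4x4 composite image (normals N would be stitched the same way)

def guided_eps(eps_cond, eps_uncond, s):
    # Noise estimator from the summary's formula:
    # eps_hat = eps_cond + s * (eps_cond - eps_uncond),
    # where eps_cond is conditioned on (y, lambda*N).
    return eps_cond + s * (eps_cond - eps_uncond)
```

The same stitching applied to the four normal maps yields the composite N that conditions the diffusion model, so a single denoising pass "sees" all four viewpoints at once, which is what suppresses Janus artifacts.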
The third stage, “Self‑Boost,” replaces the initial normal maps with those generated by the model itself as training progresses. This self‑supervised refinement encourages the network to capture increasingly fine details while preserving consistency across views.
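One simple way to realize the Self-Boost swap is a step-based schedule; the function below is an assumed minimal variant (the paper's exact switching criterion may differ, e.g. a gradual blend rather than a hard cutover):

```python
def guidance_normals(step, warmup, coarse_normals, self_normals):
    # Self-Boost schedule (assumed simple variant): condition on normals
    # distilled from the coarse asset during warmup, then switch to normal
    # maps rendered from the model's own current state.
    return coarse_normals if step < warmup else self_normals
```

Because the guidance normals eventually come from the refined model itself, the geometric detail available as a condition grows with the model, rather than being capped at the fidelity of the coarse input.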
Experiments demonstrate that BoostDream works across multiple differentiable representations, achieving 3–5× faster convergence than state‑of‑the‑art SDS methods such as Magic3D, while delivering higher PSNR and lower LPIPS scores. The multi‑view strategy dramatically reduces Janus artifacts, producing coherent 3D objects without duplicated frontal features appearing on multiple sides. User studies confirm that the refined models are judged superior in prompt alignment, detail richness, and overall visual quality.
In summary, BoostDream’s contributions are threefold: (1) a fast initialization technique that distills coarse feed‑forward outputs into trainable differentiable forms; (2) a multi‑view SDS loss that leverages normal‑map guidance to enforce view consistency and mitigate Janus issues; and (3) a generic framework applicable to various 3D representations. By uniting speed, fidelity, and multi‑view consistency, BoostDream represents a significant step forward for practical text‑to‑3D generation.