Feedforward 3D Editing via Text-Steerable Image-to-3D

Reading time: 5 minutes
...

📝 Original Info

  • Title: Feedforward 3D Editing via Text-Steerable Image-to-3D
  • ArXiv ID: 2512.13678
  • Date: 2025-12-15
  • Authors: Ziqi Ma, Hongqiao Chen, Yisong Yue, Georgia Gkioxari

📝 Abstract

Recent progress in image-to-3D has opened up immense possibilities for design, AR/VR, and robotics. However, to use AI-generated 3D assets in real applications, a critical requirement is the capability to edit them easily. We present a feedforward method, Steer3D, to add text steerability to image-to-3D models, which enables editing of generated 3D assets with language. Our approach is inspired by ControlNet, which we adapt to image-to-3D generation to enable text steering directly in a forward pass. We build a scalable data engine for automatic data generation, and develop a two-stage training recipe based on flow-matching training and Direct Preference Optimization (DPO). Compared to competing methods, Steer3D more faithfully follows the language instruction and maintains better consistency with the original 3D asset, while being 2.4× to 28.5× faster. Steer3D demonstrates that it is possible to add a new modality (text) to steer the generation of pretrained image-to-3D generative models with 100k training examples. Project website: https://glab-caltech.github.io/steer3d/

💡 Deep Analysis

[Figure 1: overview of Steer3D, showing the automated data engine, the ControlNet-style steering architecture, and example edits; full caption in the Full Content section below.]

📄 Full Content

Feedforward 3D Editing via Text-Steerable Image-to-3D

Ziqi Ma, Hongqiao Chen, Yisong Yue, Georgia Gkioxari
California Institute of Technology

Figure 1. We present Steer3D, which enables feedforward editing of generated 3D assets by adding text steerability to image-to-3D models. Steer3D uses an architecture inspired by ControlNet to leverage image-to-3D pretraining and achieve data efficiency. We build an automated data engine that generates 96k synthetic editing pairs for scalable training. Steer3D shows strong editing capabilities while staying consistent with the original 3D asset, as shown on the right side.

1. Introduction

Generative image-to-3D models [18, 40, 45] enable users to create 3D assets from a single image, unlocking new applications in design, AR/VR, and robotic simulation. However, integrating AI generations into real applications requires editing and customizing the generated 3D assets, a capability these models do not support.

Existing solutions to 3D editing include 2D-3D pipelines, which use image editing to change the object views and then lift the result to 3D [6, 34, 37]. However, these pipelines suffer from inconsistent image edits, and the inconsistencies are reflected in the reconstructed 3D shape. They are also slow, taking several minutes per edit.

To address these inherent limitations of pipeline methods, we want to train a feedforward model that performs editing directly in 3D. The most intuitive approach is to design a generative model that takes the original 3D asset as input and generates the edited asset conditioned on the edit instruction. However, such a model relies on paired 3D editing data with text instructions, which is very difficult to obtain. Due to this challenge, training such a model from scratch is impractical.

We propose a solution from a different perspective: instead of training an editing model from scratch, we can turn any image-to-3D model into an editing model by augmenting it with text steerability. Fig. 1 shows an overview of our method, Steer3D. We start from a pretrained image-to-3D generator and augment it with ControlNet [43]. ControlNet ingests the steering prompt and guides the base model toward the desired 3D edit. To edit any asset generated by the base model, we run Steer3D on the input image plus the editing text and directly obtain an edited asset that follows the instruction while remaining consistent with the original asset. By leveraging shape and object priors from pretrained image-to-3D models, this design is more data-efficient than training a standalone editing model from scratch; a sketch of the conditioning scheme is shown below.
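To make the conditioning scheme concrete, here is a minimal PyTorch sketch of a ControlNet-style text-steering branch attached to a frozen flow backbone. All names (FlowBlock, Steer3DSketch, the token layout) are hypothetical illustrations, not the authors' released code; only the zero-initialized output projections follow the standard ControlNet recipe.

```python
# Hedged sketch: ControlNet-style text steering on a frozen image-to-3D
# flow backbone. Class and parameter names are illustrative assumptions.
import copy
import torch
import torch.nn as nn

def zero_linear(dim):
    # Zero-initialized projection (the ControlNet trick): at init the
    # control branch contributes nothing, so we start from the base model.
    proj = nn.Linear(dim, dim)
    nn.init.zeros_(proj.weight)
    nn.init.zeros_(proj.bias)
    return proj

class FlowBlock(nn.Module):
    """Stand-in for one transformer block of the pretrained backbone."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, h):
        x = self.norm(h)
        h = h + self.attn(x, x, x)[0]
        return h + self.mlp(self.norm(h))

class Steer3DSketch(nn.Module):
    def __init__(self, base_blocks, dim, text_dim):
        super().__init__()
        # Trainable copies of the pretrained blocks ingest the instruction
        # (deep-copied before freezing, so the copies keep gradients).
        self.ctrl_blocks = nn.ModuleList(copy.deepcopy(b) for b in base_blocks)
        self.base_blocks = base_blocks  # frozen image-to-3D backbone
        for p in self.base_blocks.parameters():
            p.requires_grad_(False)
        self.text_in = nn.Linear(text_dim, dim)
        self.ctrl_out = nn.ModuleList(zero_linear(dim) for _ in base_blocks)

    def forward(self, latents, text_tokens):
        # latents: noisy 3D latents (+ image conditioning), (B, N, D)
        # text_tokens: embedded edit instruction, (B, T, text_dim)
        c = torch.cat([latents, self.text_in(text_tokens)], dim=1)
        h, n = latents, latents.shape[1]
        for base, ctrl, out in zip(self.base_blocks, self.ctrl_blocks,
                                   self.ctrl_out):
            c = ctrl(c)
            h = base(h) + out(c[:, :n])  # zero-init residual steering
        return h  # predicted flow velocity over the 3D latents
```

Because the ctrl_out projections start at zero, the first forward pass reproduces the frozen backbone exactly; text steerability is then learned as a residual on top of the pretrained image-to-3D prior, which is what makes this design data-efficient.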
A data-efficient architecture makes feedforward 3D editing tractable, yet we still need to collect paired 3D edit data for training. We develop an automated data engine that combines image editing models, large vision-language models, and image-to-3D generators to synthesize diverse geometry- and texture-edit pairs. In total, we produce 96k training pairs covering a wide range of shape and appearance changes. We then train Steer3D with flow matching [25], and apply Direct Preference Optimization (DPO; [35]) to avoid the trivial "no-edit" solution; a sketch of both training objectives follows below. Our analysis shows that both the data engine and the training recipe scale
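As a concrete reference, below is a hedged sketch of the two training stages under common formulations: conditional flow matching with a linear interpolation path, and a Diffusion-DPO-style preference objective adapted to flow matching, where the preferred sample applies the edit and the rejected sample is the unedited asset. The generic velocity-predictor signature model(xt, t, cond), the pairing convention, and beta are assumptions for illustration; the paper's exact objectives may differ.

```python
# Hedged sketch of the two-stage training recipe: (1) conditional flow
# matching, (2) a Diffusion-DPO-style preference loss that penalizes the
# trivial "no-edit" solution. Names and hyperparameters are assumptions.
import torch
import torch.nn.functional as F

def flow_matching_loss(model, x1, cond):
    """Stage 1: regress the velocity of a straight path from noise to data.
    x1: clean edited-asset latents, (B, N, D); cond: image + text tokens."""
    x0 = torch.randn_like(x1)                              # noise endpoint
    t = torch.rand(x1.shape[0], device=x1.device).view(-1, 1, 1)
    xt = (1 - t) * x0 + t * x1                             # linear interpolant
    v_target = x1 - x0                                     # path velocity
    return F.mse_loss(model(xt, t, cond), v_target)

def dpo_flow_loss(model, ref_model, x_w, x_l, cond, beta=500.0):
    """Stage 2: prefer the edit-following latents x_w over the no-edit
    latents x_l, relative to a frozen reference model (DPO)."""
    def err(net, x1, x0, t):
        xt = (1 - t) * x0 + t * x1
        v = net(xt, t, cond)
        return ((v - (x1 - x0)) ** 2).mean(dim=(1, 2))     # per-sample error

    # Share noise and timestep across policy/reference for each sample.
    t = torch.rand(x_w.shape[0], device=x_w.device).view(-1, 1, 1)
    n_w, n_l = torch.randn_like(x_w), torch.randn_like(x_l)
    err_w, err_l = err(model, x_w, n_w, t), err(model, x_l, n_l, t)
    with torch.no_grad():
        ref_w, ref_l = err(ref_model, x_w, n_w, t), err(ref_model, x_l, n_l, t)
    margin = (err_w - ref_w) - (err_l - ref_l)
    return -F.logsigmoid(-beta * margin).mean()
```

The sign convention rewards the model for fitting the edit-following sample better than the frozen reference does, relative to the unedited sample, which directly counteracts the trivial no-edit solution.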

📸 Image Gallery

humaneval_ui.png

Reference

This content is AI-processed based on open access ArXiv data.
