Feedforward 3D Editing via Text-Steerable Image-to-3D
Ziqi Ma
Hongqiao Chen
Yisong Yue
Georgia Gkioxari
California Institute of Technology
[Figure 1 graphic: an input image and an edit instruction are fed to the image-to-3D model together with ControlNet to produce the Steer3D output; an automated data engine yields 96k (image, instruction, edited 3D) training pairs; example instructions include "Replace legs with sleek robotic limbs colored silver", "Add a hanging flower basket on the upper side of the telephone booth", and "Add a roof rack on top of the car".]
Figure 1. We present Steer3D, which enables feedforward editing of generated 3D assets by adding text steerability to image-to-3D models. Steer3D uses an architecture inspired by ControlNet to leverage image-to-3D pretraining and achieve data efficiency. We build an automated data engine that generates 96k synthetic editing pairs for scalable training. Steer3D shows strong editing capabilities while staying consistent with the original 3D asset, as shown on the right side.
Abstract
Recent progress in image-to-3D has opened up immense possibilities for design, AR/VR, and robotics. However, to use AI-generated 3D assets in real applications, a critical requirement is the capability to edit them easily. We present a feedforward method, Steer3D, to add text steerability to image-to-3D models, which enables editing of generated 3D assets with language. Our approach is inspired by ControlNet, which we adapt to image-to-3D generation to enable text steering directly in a forward pass. We build a scalable data engine for automatic data generation, and develop a two-stage training recipe based on flow-matching training and Direct Preference Optimization (DPO). Compared to competing methods, Steer3D more faithfully follows the language instruction and maintains better consistency with the original 3D asset, while being 2.4× to 28.5× faster. Steer3D demonstrates that it is possible to add a new modality (text) to steer the generation of pretrained image-to-3D generative models with only 100k data pairs. Project website: https://glab-caltech.github.io/steer3d/
1. Introduction
Generative image-to-3D models [18, 40, 45] enable users to create 3D assets from a single image, unlocking new applications in design, AR/VR, and robotic simulation. However, integrating AI generations into real applications requires editing and customizing the generated 3D assets, a capability not supported by these models.
Existing solutions to 3D editing include 2D-3D pipelines which use image editing to change the object views and then lift to 3D [6, 34, 37]. However, they suffer from inconsistent image edit outputs, which are reflected in the reconstructed 3D shape. These methods are also slow, taking several minutes per edit. To address these inherent limitations of pipeline methods, we want to train a feedforward model that directly performs editing in 3D. The most intuitive approach is to design a generative model which takes the original 3D asset as input, and generates the edited asset conditioned on the edit instruction. However, such models rely on paired 3D editing data with text instructions, which is very difficult
to obtain. Due to this challenge, training such a model from scratch is impractical.
We propose a solution from a different perspective: instead of training an editing model from scratch, we can turn any image-to-3D model into an editing model by augmenting it with text steerability. Fig. 1 shows an overview of our method, Steer3D. We start from a pretrained image-to-3D generator and augment it with ControlNet [43]. ControlNet ingests the steering prompt and guides the base model toward the desired 3D edit. To edit any asset generated by the base model, we can run Steer3D with the input image plus the editing text, and directly obtain an edited asset that follows the instruction and remains consistent with the original asset. By leveraging shape and object priors from pretrained image-to-3D models, this design is more data-efficient than training a standalone editing model from scratch.
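To make the design concrete, the following is a minimal PyTorch sketch of how a ControlNet-style text-steering branch could be attached to a frozen image-to-3D backbone. This section does not specify the backbone architecture, so every name below (Block, TextSteeringBranch, Steer3DSketch), the token shapes, and the per-layer zero-initialized residual wiring are illustrative assumptions rather than the actual Steer3D implementation.

# Minimal, hypothetical sketch of ControlNet-style text steering over a frozen backbone.
import copy
import torch
import torch.nn as nn

class Block(nn.Module):
    """One transformer block over latent 3D tokens (stand-in for a backbone layer)."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        h = self.norm1(x)
        a, _ = self.attn(h, h, h)
        x = x + a
        return x + self.mlp(self.norm2(x))

class TextSteeringBranch(nn.Module):
    """ControlNet-style branch: a trainable copy of the backbone blocks that also
    cross-attends to the edit-text tokens; zero-initialized projections keep the
    frozen backbone's behavior unchanged at initialization."""
    def __init__(self, backbone_blocks: nn.ModuleList, dim: int, heads: int = 8):
        super().__init__()
        self.blocks = copy.deepcopy(backbone_blocks)
        for p in self.blocks.parameters():
            p.requires_grad_(True)  # the copy is trainable even though the backbone is frozen
        self.text_attn = nn.ModuleList(
            [nn.MultiheadAttention(dim, heads, batch_first=True) for _ in backbone_blocks])
        self.zero_proj = nn.ModuleList([nn.Linear(dim, dim) for _ in backbone_blocks])
        for proj in self.zero_proj:  # zero-init: the branch contributes nothing before training
            nn.init.zeros_(proj.weight)
            nn.init.zeros_(proj.bias)

    def forward(self, latents, text_tokens):
        residuals, h = [], latents
        for block, xattn, proj in zip(self.blocks, self.text_attn, self.zero_proj):
            h = block(h)
            t, _ = xattn(h, text_tokens, text_tokens)  # inject the editing instruction
            h = h + t
            residuals.append(proj(h))                  # per-layer steering residual (zero at init)
        return residuals

class Steer3DSketch(nn.Module):
    """Frozen pretrained backbone plus the trainable steering branch (hypothetical wiring)."""
    def __init__(self, num_blocks: int = 4, dim: int = 256):
        super().__init__()
        self.backbone = nn.ModuleList([Block(dim) for _ in range(num_blocks)])
        self.branch = TextSteeringBranch(self.backbone, dim)
        for p in self.backbone.parameters():
            p.requires_grad_(False)  # preserve the pretrained image-to-3D prior

    def forward(self, latents, text_tokens):
        residuals = self.branch(latents, text_tokens)
        h = latents
        for block, r in zip(self.backbone, residuals):
            h = block(h) + r  # backbone output nudged by the text-steering residual
        return h              # e.g., the predicted flow/velocity over latent 3D tokens

# Toy usage: 512 latent tokens steered by 16 text tokens.
model = Steer3DSketch()
out = model(torch.randn(2, 512, 256), torch.randn(2, 16, 256))
print(out.shape)  # torch.Size([2, 512, 256])

In this sketch only the branch receives gradients, which mirrors the data-efficiency argument above: the frozen backbone supplies the shape and object priors, while the lightweight branch learns to map the editing text into per-layer corrections.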
A data-efficient architecture makes feedforward 3D editing tractable, yet we still need to collect paired 3D edit data for training. We develop an automated data engine that combines image editing models, large vision–language models, and image-to-3D generators to synthesize diverse geometry- and texture-edit pairs. In total, we produce 96k training pairs covering a wide range of shape and appearance changes. We then train Steer3D with flow matching [25], and apply Direct Preference Optimization (DPO; [35]) to avoid the trivial "no-edit" solution. Our analysis shows that both the data engine and the training recipe scale