Prompt-to-Parts: Generative AI for Physical Assembly and Scalable Instructions
Reading time: 5 minutes
...
📝 Original Info
Title: Prompt-to-Parts: Generative AI for Physical Assembly and Scalable Instructions
ArXiv ID: 2512.15743
Date: 2025-12-10
Authors: David Noever
📝 Abstract
We present a framework for generating physically realizable assembly instructions from natural language descriptions. Unlike unconstrained text-to-3D approaches, our method operates within a discrete parts vocabulary, enforcing geometric validity, connection constraints, and buildability ordering. Using LDraw as a text-rich intermediate representation, we demonstrate that large language models can be guided with tools to produce valid step-by-step construction sequences and assembly instructions for brick-based prototypes of more than 3,000 assembly parts. We introduce a Python library for programmatic model generation and evaluate buildable outputs on complex satellite, aircraft, and architectural domains. The approach aims for demonstrable scalability, modularity, and fidelity, bridging the gap between semantic design intent and manufacturable output. Physical prototyping follows from natural language specifications. The work proposes a novel elemental lingua franca as a key piece missing from previous pixel-based diffusion methods and computer-aided design (CAD) models, which fail to support complex assembly instructions or component exchange. Across four original designs, this novel "bag of bricks" method thus functions as a physical API: a constrained vocabulary connecting precisely oriented brick locations to a "bag of words" through which arbitrary functional requirements compile into material reality. Such a consistent and repeatable AI representation opens new design options while guiding natural language implementations in manufacturing and engineering prototyping.
📄 Full Content
Prompt-to-Parts: Generative AI for Physical Assembly and Scalable Instructions
ABSTRACT
We present a framework for generating physically realizable assembly instructions from natural language descriptions. Unlike unconstrained text-to-3D approaches, our method operates within a discrete parts vocabulary, enforcing geometric validity, connection constraints, and buildability ordering. Using LDraw as a text-rich intermediate representation, we demonstrate that large language models can be guided with tools to produce valid step-by-step construction sequences and assembly instructions for brick-based prototypes of more than 3,000 assembly parts. We introduce a Python library for programmatic model generation and evaluate buildable outputs on complex satellite, aircraft, and architectural domains. The approach aims for demonstrable scalability, modularity, and fidelity, bridging the gap between semantic design intent and manufacturable output. Physical prototyping follows from natural language specifications. The work proposes a novel elemental lingua franca as a key piece missing from previous pixel-based diffusion methods and computer-aided design (CAD) models, which fail to support complex assembly instructions or component exchange. Across four original designs, this novel "bag of bricks" method thus functions as a physical API: a constrained vocabulary connecting precisely oriented brick locations to a "bag of words" through which arbitrary functional requirements compile into material reality. Such a consistent and repeatable AI representation opens new design options while guiding natural language implementations in manufacturing and engineering prototyping.
Keywords: large language models, VLM, 3D model, image generation, spatial reasoning, part assembly
INTRODUCTION
The conversion of natural language into functional physical prototypes represents an emerging frontier in computational design. While text-to-image and text-to-3D systems have advanced significantly [1-14], the challenge of generating physically buildable, materially constrained structures remains open [15-25]. LEGO bricks [1,26-32], with their standardized geometry, tolerances, and global availability, offer a uniquely well-bounded medium for exploring this problem. LEGO systems are already widely adopted in engineering education [30-31], robotics experiments [19,33-34], and low-cost scientific instrumentation [35-36]. The largest LEGO set with instructions (the Eiffel Tower) comprises around 10,000 bricks, stands 58.5 inches tall, and requires nearly 1,000 assembly steps (or manual pages). As illustrated in Figure 1, these composable examples highlight the potential for LEGO to serve as an accessible platform for rapid, low-cost experimentation in mechanical design, instrumentation, and the physical sciences. The underlying insight is that discretized construction systems—whether LEGO, modular satellites, or flat-pack furniture—share a common structure amenable to language-driven synthesis: a finite vocabulary of parts, a grammar of valid connections, and functional constraints that map to inventive principles (Figure 2). A minimal sketch of this shared structure appears below.

Figure 1. Complex medieval castle automatically generated as an 860-part instruction kit and bill of materials.
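To make the "finite vocabulary plus connection grammar" idea concrete, the following Python sketch models it under simplifying assumptions: the part identifiers are real LDraw file names, but the footprint table, function names, and single-stud overlap rule are illustrative inventions, not the paper's actual library.

```python
# A minimal sketch (not the paper's library) of the shared structure the
# text describes: a finite parts vocabulary plus a grammar of valid
# connections. The footprint table and the single-stud overlap rule are
# deliberately simplified, illustrative assumptions.

PARTS = {                  # stud footprint: (width, depth)
    "3001.dat": (4, 2),    # 2x4 brick
    "3003.dat": (2, 2),    # 2x2 brick
    "3024.dat": (1, 1),    # 1x1 plate
}

def studs(part: str, x: int, z: int) -> set:
    """Grid cells (stud positions) a part covers when placed at (x, z)."""
    w, d = PARTS[part]
    return {(x + i, z + j) for i in range(w) for j in range(d)}

def can_stack(top, tx, tz, base, bx, bz) -> bool:
    """Toy connection-grammar rule: the upper part must share at least
    one stud position with the part directly beneath it."""
    return bool(studs(top, tx, tz) & studs(base, bx, bz))

assert can_stack("3003.dat", 3, 0, "3001.dat", 0, 0)       # overlaps one stud column
assert not can_stack("3024.dat", 9, 9, "3001.dat", 0, 0)   # would float in midair
```

A real connection grammar would also track brick heights, anti-studs, and rotation, but even this toy version shows how buildability becomes a checkable property rather than a hope.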
A key question arises: Can large language models (LLMs) reliably generate accurate LEGO models and step-by-step assembly instructions from arbitrary text prompts? If successful, this capability would operationalize a form of text-to-prototype, allowing designers, students, and researchers to translate conceptual descriptions directly into testable physical artifacts. The problem is nontrivial. It requires spatial reasoning [17-20], long-horizon planning [34], constraint satisfaction [10,14], and an implicit understanding of real-world forces [33-35] and geometric compatibility [1,3,10]. It also requires the model to produce instructions that a human can follow and that result in a stable assembly [23,29] with no ambiguous or impossible steps.

We hypothesize that the adoption of a compact, human-readable component language can substantially improve the reliability and scale of LLM-generated physical assemblies. The hypothesis draws from LLM methods that encode chess games with Forsyth-Edwards Notation (FEN) [37] or use natural language to produce structured query language (SQL) for databases [38].
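For intuition, the sketch below juxtaposes the two encodings this analogy rests on. Both strings are standard textbook examples rather than outputs of the paper's system, and the parsing code is only illustrative.

```python
# Illustration of the analogy: FEN and LDraw each pack a complete state
# into one unambiguous, whitespace-delimited line of text.

fen = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"  # chess starting position
ldraw = "1 4 0 -24 0 1 0 0 0 1 0 0 0 1 3001.dat"  # one red 2x4 brick, one plate-stack up

ranks = fen.split()[0].split("/")  # piece placement: 8 ranks, from rank 8 down to rank 1
fields = ldraw.split()             # line type, colour, x y z, 3x3 rotation, part id
assert len(ranks) == 8
assert fields[0] == "1" and fields[-1] == "3001.dat"
```

In both cases a plain string is simultaneously human-auditable and machine-parseable, which is what makes it a natural interface for an LLM.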
Just as FEN encodes complete board states in a single line of text that both humans and machines can parse unambiguously, the LDraw format [39] encodes LEGO assemblies through standardized part identifiers, precise coordinates, and rotation matrices in a syntax that predates and is independent of any particular AI system.
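As a concrete illustration, the sketch below emits LDraw "type 1" part-reference lines from Python. The line grammar (1, colour code, x y z, nine rotation-matrix terms, part file) and the "0 STEP" meta-command follow the public LDraw specification; the Placement class and its method names are hypothetical stand-ins, not the paper's actual library API.

```python
# Emitting LDraw "type 1" (part reference) lines: a compact sketch.
# The output format is defined by the public LDraw spec; the class and
# helper names here are hypothetical, not the paper's library API.

from dataclasses import dataclass

IDENTITY = (1, 0, 0, 0, 1, 0, 0, 0, 1)  # row-major 3x3 rotation matrix

@dataclass
class Placement:
    part: str       # LDraw part file, e.g. "3001.dat" is a 2x4 brick
    colour: int     # LDraw colour code, e.g. 4 is red
    x: float        # position in LDraw units; one brick height = 24 LDU
    y: float        # note: in LDraw, -y points up
    z: float
    rot: tuple = IDENTITY

    def to_ldraw(self) -> str:
        """Render as: 1 <colour> x y z <9 rotation terms> <part>."""
        nums = (self.x, self.y, self.z) + self.rot
        return f"1 {self.colour} " + " ".join(f"{n:g}" for n in nums) + f" {self.part}"

# Two bricks, one per step; "0 STEP" marks assembly-step boundaries.
lines = [
    Placement("3001.dat", colour=4, x=0, y=0, z=0).to_ldraw(),
    "0 STEP",
    Placement("3001.dat", colour=4, x=0, y=-24, z=0).to_ldraw(),  # stacked on top
    "0 STEP",
]
print("\n".join(lines))  # saved as a .ldr file, this opens in any LDraw viewer
```

Because each placed part is one plain-text line and each build step is one meta-command, an LLM that can emit well-formed strings can, in principle, emit complete, ordered assembly instructions.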
We extend the success of the natural language choice—so-called “Bag of Words (BOW)” approaches–