Figure 1: MotionPhysics automatically estimates plausible material parameters to support dynamic 3D simulation of diverse materials and object types. Compared to prior work (e.g., PhysFlow (Liu et al. 2025a)), it more accurately adheres to the user's input prompt (Left), particularly for AI-generated objects (Top: elastic simulation, "An object like a rubber ball hits the ground and bounces back."), human-designed objects (Bottom: water simulation, "An object made of water is hitting the ground"), and real-world scans. Image backgrounds are from Li et al. (2023).
The specification of plausible physical parameters is essential for realistic 3D simulation (Gottstein 2004). For instance, Young's modulus controls a material's stiffness, while yield stress marks the onset of irreversible plastic deformation. Traditional methods for identifying such parameters often rely on expert intuition or laborious trial-and-error (Stomakhin et al. 2013), making simulation pipelines time-consuming and inaccessible to non-experts.
This has motivated a wide body of work focused on automatic physical parameter estimation. Early approaches relied on direct observations (Asenov et al. 2019; Wu et al. 2016; Jaques, Burke, and Hospedales 2020; Ma et al. 2023). More recent work in novel view synthesis, such as Neural Radiance Fields (NeRF) (Mildenhall et al. 2021) and Gaussian Splatting (GS) (Kerbl et al. 2023), offers alternative strategies. GS represents scene geometry explicitly via Gaussian kernels, which enables straightforward integration with existing simulators such as PBD for fluids (Feng et al. 2025), XPBD for elastic bodies (Jiang et al. 2024), and MPM for general materials (Xie et al. 2024). This integration has revived interest in system identification, which recovers physical parameters from multi-view videos of synthetic objects (Jatavallabhula et al. 2021; Cai et al. 2024; Li et al. 2023). When ground-truth dynamics are unavailable, recent methods guide video diffusion models with text or image prompts to infer plausible parameter values (Liu et al. 2025a; Zhang et al. 2024; Lin et al. 2025; Huang et al. 2025).
While video diffusion models offer the potential for zero-shot parameter estimation, their performance remains limited. Recent studies (Kang et al. 2025; Bansal et al. 2024) report that both open-source (Yang et al. 2025; Chen et al. 2024b; Wang et al. 2025b; Zheng et al. 2024) and closed-source (Pika 2024; Bar-Tal et al. 2024; Luma AI 2024) video generation systems fail to produce videos that obey even basic physical common sense. Consequently, for many objects and scenes such as AI-generated shapes (Fig. 1, Top), human-designed models (Fig. 1, Bottom), or real-world objects viewed from novel viewpoints or subjected to different force conditions (Fig. 7), current video diffusion-based methods often infer incorrect simulation parameters.
Building fully-fledged video diffusion models capable of generating physically plausible outputs is impractical due to their massive computational demands and the need for diverse ground-truth (GT) motion data across a wide range of novel objects. Instead, we leverage existing pretrained video diffusion models to assist with out-of-distribution scene and object simulations. We aim to distill plausible motion cues from pre-trained models guided by high-level, user-provided language instructions, while mitigating the models’ inductive shape and appearance biases. To achieve this, we introduce a novel Learnable Motion Distillation (LMD) loss that extracts pure motion signals from a pretrained video diffusion model to steer our differentiable simulations. Concretely, LMD minimizes appearance and geometry discrepancies between the simulation and the diffusion model’s predictions by combining a lightweight, trainable motion extractor with augmented perturbations in both geometry and appearance during training.
Accurate initialization of simulation parameters is critical as poor starting values waste computation and hinder convergence. We extend PhysFlow’s multimodal initialization with constraint-aware prompts that embed domain-specific parameter bounds (e.g., typical Young’s modulus or density ranges for metal, foam, plasticine). By forcing the LLM to select values within these limits, we leverage its internal knowledge of real-world materials. This approach both anchors simulations in physically plausible ranges and suppresses LLM hallucinations and fabrications (Farquhar et al. 2024;Hao et al. 2024;Walters and Wilder 2023).
Our ultimate goal is to ensure that “what you describe is exactly what you simulate”. We validate our framework on over 30 simulation scenarios, spanning elastic materials, plasticine, metals, foams, sands, Newtonian, and non-Newtonian fluids. Our main contributions are: (i) the introduction of a learnable motion distillation loss to isolate and leverage true motion signals, with LLM initialization triggered by plausible material range values, and (ii) a fully automatic, text-guided system that achieves state-of-the-art (SOTA) simulation performance, surpassing existing methods seamlessly on human-designed, AI-generated, and real-world objects.
3D Dynamics Generation. Diffusion (Liu et al. 2022) and flow-matching (Jin et al. 2024) models have enabled medium-duration, high-quality video synthesis from text or image prompts, exemplified by SORA (Liu et al. 2024b) and GOKU (Chen et al. 2025). To extend these dynamic priors into 4D (3D + time), several works fuse diffusion with dual shape-texture representations, using either NeRF-style encoders or Gaussian Splatting (GS) to generate view-consistent, dynamic scenes without explicit 4D supervision (Singer et al. 2023; Yuan et al. 2024; Ling et al. 2024; Zhao et al. 2023). While implicit methods such as D-NeRF (Pumarola et al. 2021) and HyperNeRF (Park et al. 2021) often suffer from slow rendering and limited user control, explicit splatting pipelines (e.g., 4DGS (Wu et al. 2024), HYBRID3D-4DGS (Oh et al. 2025), and other real-time systems (Yang et al. 2024; Duan et al. 2024)) offer fast, editable alternatives. These splatting frameworks have been further generalized across input modalities: multi-view static reconstruction (Chen et al. 2024a; Huang et al. 2024a), dynamic captures (Xu et al. 2024), single uncalibrated images (Yi et al. 2024; Smart et al. 2024), and mesh-to-Gaussian-field conversion (Waczyńska et al. 2024). Despite these advances, most methods lack physics grounding, as they do not model deformations or motion driven by forces and material properties, limiting realism (Bansal et al. 2025; Zhang et al. 2025; Cao et al. 2024; Wang et al. 2025c).

Physics-grounded Dynamic Generation. Embedding physical laws into generative models (Zhong et al. 2024) yields more realistic interactions, and using differentiable simulators enables gradient-based motion synthesis while simultaneously tuning material parameters. Simple spring-mass systems, such as SPRINGGAUS (Zhong et al. 2024) and PHYSTWIN (Jiang et al. 2025), effectively capture elastic deformation, while the differentiable Material Point Method (MPM) (Jiang et al. 2016) excels at modeling diverse material behaviors. For example, PHYSMOTION (Tan et al. 2024) leverages MPM for single-image dynamics, whereas PAC-NERF (Li et al. 2023) fuses NeRF and MPM via particles, trading off efficiency and fidelity in complex scenes. PHYSGAUSSIAN (Xie et al. 2024) further boosts visual quality by combining 3DGS with MPM, yet still requires hand-tuned material settings. System identification can recover these settings but requires ground-truth (GT) videos (Cai et al. 2024) or markers (Ma et al. 2023), limiting scalability. Thus, automatic estimation of material properties without any GT dynamics remains an open challenge for real-world applications.

Physical Parameter Estimation. To inject physical realism, recent methods (Zhang et al. 2024; Liu et al. 2024a; Huang et al. 2025; Lin et al. 2025; Liu et al. 2025a) leverage video-diffusion priors (Blattmann et al. 2023; Meng et al. 2024) to infer material properties such as elasticity and plasticity. PHYSDREAMER (Zhang et al. 2024) models elastic behavior in real scenes, and PHYS3D (Liu et al. 2024a) extends this approach to plastic deformations. DREAMPHYSICS and PHYSFLOW handle a broader range of materials, while OMNIPHYSGS integrates constitutive models into each particle to support heterogeneous interactions. These methods optimize material parameters by backpropagating through differentiable simulators
using either (i) direct perceptual objectives, such as image similarity (Zhang et al. 2024) or optical-flow divergence (Liu et al. 2025a) between simulated and generated frames, or (ii) score-distillation losses derived from a diffusion model (Huang et al. 2025; Liu et al. 2024a). However, these methods rely on real-world footage, where objects are often anchored, occluded, or subject to noise (Wang et al. 2025a), limiting their applicability compared to the increasingly prevalent human-designed and AI-generated assets. Human-designed objects often lack consistent textures (Fig. 1, Bottom), while AI-generated meshes (Zhao et al. 2025; Tochilkin et al. 2024) can exhibit atypical geometry or appearance (Fig. 1, Top), confusing appearance-driven supervision. To address these limitations, we introduce a learnable motion distillation loss that extracts motion cues from pretrained diffusion models guided by text prompts, while suppressing appearance and geometry biases. While some works (Jeong, Park, and Ye 2024; Zhai et al. 2024) extract inter-frame motion for video-based transfer, our approach focuses on text-guided estimation of physical parameters for physics-based simulation. Text prompts often implicitly convey physical properties by indicating material types and object categories. We enable more effective automated estimation of these material properties by using pretrained LLMs with plausible material value ranges, a crucial aspect overlooked by prior works (Huang et al. 2025; Liu et al. 2025a; Lin et al. 2024).
We consider objects and scenes that are real, synthetic, human-designed (often lacking high-quality textures), or AI-generated (often with uncommon geometry or appearance). These inputs may be provided as multi-view static images, dynamic videos, single images, or meshes. Compared to traditional mesh representations, 3DGS is more suitable for reproducing real-world scenes and supports high-quality, real-time rendering. Methods such as PGSR, GIC (Cai et al. 2024), Splatt3R, and GaMeS can convert inputs into collections of 3D Gaussians, G := {x_g, σ_g, Σ_g, S_g}, where each splat g is defined by its center x_g, opacity σ_g, covariance Σ_g, and color coefficients S_g.
To simulate dynamic behavior over discrete time t, we denote time-varying splats as G^t := {x_g^t, σ_g^t, Σ_g^t, S_g^t}. We adopt a differentiable GS-adapted MLS-MPM simulator (Xie et al. 2024), which evolves splat states using a Markovian (Richey 2010) update T, consisting of particle-to-grid and grid-to-particle mappings (velocity omitted for clarity):
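The corresponding display equation did not survive extraction; a form consistent with the description above (our notation, with $\mathcal{T}_{\mathrm{P2G}}$ and $\mathcal{T}_{\mathrm{G2P}}$ denoting the particle-to-grid and grid-to-particle transfers) is:

$\mathcal{G}^{t+1} = \mathcal{T}\big(\mathcal{G}^{t};\, \theta,\, f_{\mathrm{ext}}\big) = \big(\mathcal{T}_{\mathrm{G2P}} \circ \mathcal{T}_{\mathrm{P2G}}\big)\big(\mathcal{G}^{t};\, \theta,\, f_{\mathrm{ext}}\big),$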
where Σ_g^t and S_g^t evolve through the deformation gradient F^{t+1}, which captures stretching, rotation, and shear (Xie et al. 2024), and f_ext denotes external forces. Following PHYSFLOW (Liu et al. 2025a), all splats share the same material parameters defined as θ := {ρ, c, θ_c}, where ρ is density, c is a material class (e.g., elastic, plasticine, metal, foam, sand, Newtonian or non-Newtonian fluid) corresponding to different material constitutive models, and θ_c contains associated class-specific coefficients (see Suppl. for details).

Objective: Our goal is to automatically infer the full material parameter set θ from a natural language prompt P_text, without supervision from ground-truth dynamics, motion capture markers, or videos. Once obtained, the parameters enable high-fidelity and diverse 3D dynamic simulations under various force fields and physical conditions (see Suppl.).
Score Distillation Sampling (SDS) (Poole et al. 2022) was initially proposed to distill 3D priors from large-scale 2D diffusion models for text-to-3D generation (Lin et al. 2023). Recent extensions (Lin et al. 2025; Huang et al. 2025; Liu et al. 2024a) adapt SDS to optimize physical parameters θ_c using a diffusion model φ. Let z_0 = E_φ({I_l}) be the latent encoding of frames {I_l} via the video encoder E_φ. At video diffusion step k, z_0 is perturbed by noise ε ~ N(0, I):
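The noising step itself was lost in extraction; assuming the standard DDPM parameterization with cumulative schedule $\bar{\alpha}_k$, it takes the form:

$z_k = \sqrt{\bar{\alpha}_k}\, z_0 + \sqrt{1-\bar{\alpha}_k}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I).$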
The SDS update is computed as:
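The equation is missing from the extracted text; in its common form (Poole et al. 2022), conditioned here on the text prompt $P_{\mathrm{text}}$, the SDS objective reads:

$\mathcal{L}_{\mathrm{SDS}} = \mathbb{E}_{k,\epsilon}\big[\, w_k\, \|\epsilon_{\phi}(z_k; k, P_{\mathrm{text}}) - \epsilon\|_2^2 \,\big],$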
where ε_φ is the predicted noise from φ and w_k is a step-dependent weight. The gradient with respect to θ_c is:
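Reconstructing the lost display equation in its standard SDS form, with the chain rule through the renderer and simulator made explicit (consistent with the sentence that follows):

$\nabla_{\theta_c} \mathcal{L}_{\mathrm{SDS}} \approx \mathbb{E}_{k,\epsilon}\Big[\, w_k \big(\epsilon_{\phi}(z_k; k, P_{\mathrm{text}}) - \epsilon\big)\, \frac{\partial z_0}{\partial \{I_l\}}\, \frac{\partial \{I_l\}}{\partial \{\mathcal{G}^t\}}\, \frac{\partial \{\mathcal{G}^t\}}{\partial \theta_c} \Big].$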
Here, {I_l} is differentiably rendered from {G^t} by 3DGS, and {G^t} depends on θ_c via GS-adapted MLS-MPM simulation. This objective aligns simulated videos with the data distribution of the diffusion model, thereby distilling physical priors from the latter to optimize the parameters θ_c.
As shown in Fig. 2, we first reconstruct the static object or scene into an initial GS representation G_0. Simultaneously, we prompt a multimodal LLM (e.g., GPT-4 (Achiam et al. 2023)) with prescribed parameter ranges defined for each possible material type, conditioning primarily on the user's text prompt P_text and secondarily on a reference rendered image, to obtain an initial material parameter estimate θ_ini. Using θ_ini and external forces f_ext, our differentiable GS-adapted MLS-MPM solver simulates splat dynamics over time, yielding a sequence {G^t}. We alpha-blend a sparse subset of these into frames {I_l}. To refine θ_ini, we introduce a learnable motion distillation loss (Eq. 5), which extracts dynamic priors from a pretrained video diffusion model φ conditioned on the same P_text. By iterating simulation, rendering, and gradient-based optimization, we converge to a physically plausible parameter set θ_c that faithfully reproduces the motion specified by the user.
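A minimal sketch of this optimization loop follows. The function names llm_initialize, simulate, render, and lmd_loss are placeholders standing in for the range-constrained LLM query, the GS-adapted MLS-MPM solver, the 3DGS renderer, and the LMD loss; they are not part of any released API, and the learning rate is an assumption (the ~40-iteration budget is from the implementation details).

import torch

def optimize_material(G0, text_prompt, f_ext, llm_initialize, simulate, render, lmd_loss,
                      n_iters=40, lr=1e-2):
    theta = llm_initialize(text_prompt)        # initial estimate theta_ini (density, class, coefficients)
    params = [p for p in theta.values() if torch.is_tensor(p) and p.requires_grad]
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(n_iters):
        G_seq = simulate(G0, theta, f_ext)     # differentiable MLS-MPM rollout {G^t}
        frames = render(G_seq)                 # alpha-blended sparse frames {I_l}
        loss = lmd_loss(frames, text_prompt)   # motion distillation against the video diffusion prior
        opt.zero_grad()
        loss.backward()                        # gradients flow through renderer and simulator
        opt.step()
    return theta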
Accurately identifying the material type c is crucial for selecting the appropriate constitutive model in MLS-MPM. Given a user prompt, e.g., "A rubber ball-like object hits the ground and bounces back.", we aim to infer the material type c, density ρ, and a corresponding set of material-specific parameters θ_c. If the prompt changes, the inferred material and parameters should adapt accordingly, enabling flexible, user-driven simulation (see Figure A4 in Suppl.).
To achieve this, we leverage GPT-4 (Achiam et al. 2023) to estimate initial material parameters θ_ini primarily from text and secondarily from a reference image. However, naïvely querying GPT-4 as done by PhysFlow can lead to hallucinated (Farquhar et al. 2024; Hao et al. 2024) or fabricated (Walters and Wilder 2023) numerical values (see Fig. 6). LLMs inherently encode extensive real-world material knowledge, so to reduce hallucinations, we provide GPT-4 with prompt templates that contain value-range constraints grounded in standard material-property handbooks (Callister and Rethwisch 2020; Ashby and Jones 2012; Mitchell and Soga 2005; Gibson and Ashby 1999; Rumble 2024). This grounding steers the LLM toward realistic values, preventing implausible predictions and simulation failures in certain cases (Fig. 6). The Supplement provides the complete prompt template and the per-material value ranges (Table A2).
Density ρ and material class c can be reliably inferred using our LLM-based approach above. However, the class-specific coefficients θ_c are coarse approximations that are insufficient for precise simulation and require additional supervision, since LLMs lack dynamic modelling and simulation capabilities. Following prior work (Liu et al. 2024a; Huang et al. 2025), we employ a video diffusion model φ to supervise motion optimization. However, diffusion-based predictions inherently entangle motion with appearance and geometric biases from their training data, making it challenging to extract pure motion signals, especially for human-designed or AI-generated objects whose appearance or structure falls outside the training distribution (see Fig. 5).
A key observation is that, despite changes in appearance or shape, identical simulated motions under the same initial physical parameters yield globally consistent latent codes from the video encoder E_φ. As Fig. 3 (Left) shows, whether the bird's color shifts from red to blue (Top) or its shape loses tail and beak (Bottom), the latents (visualized by PCA projection) share the same overall structure and vary only in local details. This holds across different diffusion models (see Suppl.) and motivates extracting motion signals directly from "clean" latents rather than noise via a learnable motion extractor M, by dynamically smoothing local latent-space disparities, i.e., the appearance and geometry gaps between the model's pre-learned distribution and the target scene.
Concretely, to enforce motion learning, we augment the initial representation G_0 (both kernel centers and color coefficients) with additive Gaussian noise ε' (see Suppl.), yielding G̃_0. This produces augmented renderings {Ĩ_l} and latents z̃_0 = E_φ({Ĩ_l}). After perturbing z̃_0 with noise as above, we compute the one-step denoised latent, geometrically projecting z̃_k back onto the clean-latent manifold.
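The exact expression did not survive extraction; a standard one-step estimate consistent with this description, using the same schedule $\bar{\alpha}_k$ as above, is:

$\hat{\tilde{z}}_0 = \frac{\tilde{z}_k - \sqrt{1-\bar{\alpha}_k}\,\epsilon_{\phi}(\tilde{z}_k; k, P_{\mathrm{text}})}{\sqrt{\bar{\alpha}_k}}.$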
We denote the distilled motion targets and predictions as:
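The defining equation is missing from the extraction; one reading consistent with the surrounding text (our notation, treating the one-step denoised augmented latent $\hat{\tilde{z}}_0$ as the target, the simulation latent $z_0$ as the prediction, and $\bar{M}$ as the EMA copy of $M$ referenced below) is:

$m^{\mathrm{tgt}} = \bar{M}\big(\hat{\tilde{z}}_0\big), \qquad m^{\mathrm{pred}} = M\big(z_0\big),$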
where the learnable motion extractor M is a lightweight two-layer convolutional network initialized to the identity mapping and trained with a small learning rate of 2 × 10^-5. Our resulting learnable motion distillation (LMD) loss uses a Charbonnier variant for numerical stability (Barron 2019):
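Eq. 5 itself was lost in extraction; a standard Charbonnier form matching the description is:

$\mathcal{L}_{\mathrm{LMD}} = \sqrt{\big\| m^{\mathrm{pred}} - m^{\mathrm{tgt}} \big\|_2^2 + \beta^2},$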
with small constant β = 10^-3, where the target extractor M̄ is kept synchronized with M via exponential moving averaging (Tarvainen and Valpola 2017). Finally, gradients with respect to the coefficients θ_c are estimated:
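The gradient expression is likewise missing; by analogy with the SDS gradient above, it propagates through the differentiable renderer and the MLS-MPM simulator:

$\nabla_{\theta_c} \mathcal{L}_{\mathrm{LMD}} = \frac{\partial \mathcal{L}_{\mathrm{LMD}}}{\partial \{I_l\}}\, \frac{\partial \{I_l\}}{\partial \{\mathcal{G}^t\}}\, \frac{\partial \{\mathcal{G}^t\}}{\partial \theta_c}.$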
As shown in Fig. 3 (Right), our L_LMD loss captures consistent motion patterns across appearance and geometry, while L_SDS (Eq. 3) leads to inconsistent results after optimization.
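A minimal sketch of the extractor and loss just described follows. Channel count, kernel size, and EMA decay are illustrative assumptions; the paper specifies only a two-layer convolutional extractor initialized to the identity, a 2 × 10^-5 learning rate, EMA synchronization, and a Charbonnier penalty with β = 10^-3.

import copy
import torch
import torch.nn as nn

class MotionExtractor(nn.Module):
    def __init__(self, channels=16):          # latent channel count is an assumption
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        for conv in (self.conv1, self.conv2):  # identity initialization
            nn.init.dirac_(conv.weight)
            nn.init.zeros_(conv.bias)

    def forward(self, z):                      # no nonlinearity, so the net starts as an exact identity
        return self.conv2(self.conv1(z))

M = MotionExtractor()
M_bar = copy.deepcopy(M)                       # EMA target copy
for p in M_bar.parameters():
    p.requires_grad_(False)
optimizer = torch.optim.Adam(M.parameters(), lr=2e-5)

def ema_update(target, online, decay=0.999):   # decay value is an assumption
    with torch.no_grad():
        for pt, po in zip(target.parameters(), online.parameters()):
            pt.mul_(decay).add_(po, alpha=1 - decay)

def lmd_loss(z_sim, z_denoised, beta=1e-3):    # Charbonnier penalty (Barron 2019)
    diff = M(z_sim) - M_bar(z_denoised)
    return torch.sqrt(diff.pow(2) + beta**2).mean()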
We build our differentiable simulator on NVIDIA's WARP implementation (Macklin 2022). Following (Xie et al. 2024), we mitigate artifacts from overly skinny Gaussian kernels via anisotropic regularization and fill solid objects' internal volumes to enhance simulation realism. To stabilize gradient propagation and improve training speed over long MPM rollouts, we leverage a frame-boosting scheme (Huang et al. 2025; Zhang et al. 2024). Given M × L total frames (with M = 8), we split them into M interleaved subsequences V_i = {I_i, I_{i+M}, ..., I_{i+M(L-1)}} for i = 1, ..., M, and alternate supervision across these M groups. Each simulation spans 5 seconds, generating 150 rendered frames in total, with 256 internal substeps per frame. With frame boosting, for each subsequence we perform 256 × M intermediate updates between adjacent frames, computing gradients only at the final step. Motion distillation uses the CogVideoX model (Yang et al. 2025) with classifier-free guidance (CFG = 100), following PhysFlow. Training converges in approximately 40 iterations, with each forward-backward pass taking about 28 seconds on an NVIDIA A100 80 GB GPU. Our framework supports diverse manual specifications of boundary conditions (e.g., see Fig. 1) and force applications (see Fig. 7), enabling precise spatiotemporal control of material response.
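For concreteness, a small sketch of the interleaved frame-boosting index sets described above (1-based indices to match the notation V_i = {I_i, I_{i+M}, ..., I_{i+M(L-1)}}; the helper name is ours):

def frame_boost_groups(M, L):
    # Split M*L frame indices (1-based) into M interleaved subsequences V_i.
    return [[i + M * l for l in range(L)] for i in range(1, M + 1)]

# Example with M = 8: supervision alternates across V_1, ..., V_8 between iterations.
groups = frame_boost_groups(M=8, L=4)
assert groups[0] == [1, 9, 17, 25]   # V_1 = {I_1, I_9, I_17, I_25}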
Datasets. We conduct experiments on three dataset types. 1) Human Designed: We evaluate eight PAC-NeRF models (Torus, Bird, Playdoh, Cat, Trophy, Droplet, Letter Cream, and Toothpaste), which exhibit uniform colors rather than detailed textures. We use the static 3DGS reconstructions from GIC (Cai et al. 2024). Since our focus is text-guided physical simulation using various material prompts (see Suppl.), rather than system identification, we do not use their rendered dynamic frames as ground-truth labels. Those frames rely on manually specified parameters and cannot capture the full diversity of realistic distributions.
To assess generalization, we varied input forces across diverse scenes and settings, resulting in notable deformations and dynamic responses (see Suppl.). Qualitative examples are shown in Fig. A1, and average user preferences over each baseline are reported in Tab. 1 (detailed votes in Suppl.). Our method outperforms all baselines in physical realism and prompt adherence, with preferences exceeding 50% across all datasets and over 80% on human-designed and AI-generated scenes, demonstrating strong generalization to novel geometries and textures. For example, in the Jam (Fig. A1, Middle), our method captures Newtonian viscosity with a smooth, cavity-free spread, unlike the baselines (red circle). Similarly, in Toothpaste (Fig. A1, Left), it reproduces non-Newtonian behavior, initially flowing and spreading before settling into a stable mound, while others fail.
In Tab. 2, our method achieves the highest average scores across all objective quantitative metrics while maintaining competitive optimization speed (tested on Bird scene from Fig. 5, Bottom), thanks to its lightweight motion extractor. However, these metrics do not fully align with human perception. For example, the Droplet scene (see Suppl.) scores slightly lower in CLIPSIM and ECMS than some baselines, despite producing the expected water-splashing behavior.
We attribute this gap to two factors: 1) current video-text consistency and motion metrics cannot distinguish dynamic differences across identical scenes, forces, and prompts, and 2) pretrained metric models focus on static appearance and lack the ability to capture diverse materials and motion patterns specified by text (see Suppl. for detailed analysis). One remedy is to compare estimated physical parameters directly, as in system identification. However, obtaining GT distributions of physically plausible outcomes is challenging. Similar limitations have been noted in 3D-generation tasks (Yu et al. 2025; Tang et al. 2024). Therefore, we emphasize visual comparisons and present qualitative evaluations in subsequent experiments.
Impact of LMD. To demonstrate the advantage of our learnable motion distillation (LMD) objective, we compare three losses: 1) the optical flow loss L_Flow from PhysFlow, which extracts flow from generated videos; 2) the SDS loss L_SDS (Eq. 3); and 3) our LMD loss L_LMD (Eq. 5). In Fig. 5, because the prompts specify the same materials as in the raw simulation data (Playdoh: Top, plasticine; Bird: Bottom, elastic), we use PAC-NeRF's manually tuned outputs (Li et al. 2023) as a coarse reference for material behavior, even though it was originally developed for system identification.
The models optimized with L_LMD align most closely with this reference, confirming its superior material-parameter optimization (see quantitative results in Tab. 3 and additional ablations of L_LMD in Suppl.). Note that the results on PhysFlow's project page (Liu et al. 2025b) closely resemble the manual references, since they perform system identification on human-designed objects using paired GT videos.

LLM reasoning. To validate this, in Fig. 6 we simulate the Hat scene under six conditions: PhysFlow initialization with (using LMD) and without optimization, our initialization with (using LMD) and without optimization, and median-value initialization with (using LMD) and without optimization (using the median of the constrained value ranges). PhysFlow's initialization results in exaggerated early deformations and, even after optimization, still produces unrealistic artifacts, as highlighted in the orange rectangles. Median-value initialization is highly sensitive to the upper bound.
In this case, a large Young's modulus of 2 × 10^11 Pa leads to overly rigid behavior with minimal deformation, failing to adhere to the provided force and textual description, even after optimization. In contrast, our approach provides a stable and plausible starting point (see Tab. 2). When combined with LMD, it yields material dynamics that are both accurate and visually convincing. These results demonstrate that our range prompts guide the LLM to select appropriate parameter values rather than hallucinating spurious guesses.
Robustness to Varying Simulation Conditions. We evaluate each scene under varied external forces and slight camera perturbations. In Fig. 7 (Top), the telephone is subjected to a different force direction and a shifted viewpoint. Our method still produces the correct elastic deformation guided by the prompt: "The telephone cord is gently vibrating", whereas PhysFlow remains nearly static, as the generated videos themselves fail to provide the correct motion that serves as the GT labels for its loss (L_Flow) (see Figure A6 in Supplement). In Fig. 7 (Bottom), we quintuple the rotational speed of the plane's propeller. Thanks to the high Young's modulus and yield strength of metals from our ranges, our simulation preserves structural integrity, staying below the yield stress. In contrast, PhysFlow detaches the propeller (red circle) even with the same prompt: "The plane propeller is spinning".

Extension to Heterogeneous Materials. Our approach extends seamlessly to multi-object scenes with different materials. By combining GS segmentation (Wang et al. 2025a) with SAM2 (Ravi et al. 2024), a multimodal LLM can infer each object's material properties from a single text prompt. As shown in Fig. 8, the axe is treated as metal and the toy as elastic rubber. In our MPM framework, every particle carries its own material parameters, yielding high-fidelity heterogeneous dynamics. For example, the rubber toy deforms under load while the axe remains rigid. Elastic artifacts in PhysDreamer are highlighted in red (see our attached video). Note that, at the time of writing, multi-object simulation code for OmniPhysGS was not available. Adapting our pipeline to complex applications like cinematics or video games involves further engineering beyond this work's scope.
Figure 8: "An axe hits a rubber toy, causing it to deform" — comparison of PhysDreamer, DreamPhysics, PhysFlow, and Ours.
MotionPhysics is a novel framework that uses video diffusion models and multimodal LLMs to drive 3D dynamic scene simulations guided by simple text prompts. A learnable motion distillation module extracts clean motion cues, while an LLM-based embedding initializes material-specific parameter priors, enabling high-fidelity, physically grounded animations. In the future, we aim to support fully automatic configuration of fine-grained simulations from text and extend our motion distillation loss to other animation tasks, such as character rigging and deformation.

Limitations. Our method does not model shadow effects, which could improve visual realism. Additionally, while our estimated parameters enable plausible simulations, they are not intended for accurate real-world material measurement.
Table A1: Representative constitutive models and corresponding plastic integration schemes (return mapping), following the implementation in PhysFlow (Liu et al. 2025a).
The elastic law governs the recoverable stress response, while the return mapping algorithm projects the trial state back onto the plastic yield surface.
• Plastic flow and integration (return mapping):
When the stress state exceeds a material-dependent yield criterion, permanent (irreversible) deformation occurs. Yield conditions, flow rules, and hardening laws specify the evolution of the plastic state. In practice, a return mapping algorithm is used at each time step to enforce these constraints numerically.
The generic forms are:
• Elastic law: the stress P is determined by the elastic part of the deformation gradient through the material's strain-energy density.
• Yield condition: f_Y(P) < 0. (10)
• Plastic update (return mapping): when the yield condition is violated, the trial state is projected back onto the yield surface.
Table A1 summarizes classical material models and the return mapping (plastic integration) schemes commonly implemented in computational mechanics and in this work. Below we provide a concise but precise overview of each constitutive model and the plastic integration scheme used:
• Purely elastic models: no plastic yield, f_Y(P) ≡ -1 < 0; the return mapping is the identity, F_e ← F (no evolution). (14)
• Elastoplastic models: the yield function f_Y(τ) depends on the specific plastic model below. (16)
• von Mises Plasticity: von Mises yield criterion with radial return mapping.
• von Mises Plasticity with Damage: yield stress decreases with accumulated plastic strain.
• Viscoplastic Return Mapping (Foam): viscous regularized von Mises yield law.
• Drucker-Prager Plasticity (Sand): frictional cone yield criterion.
• Compressible Neo-Hookean Elasticity (Newtonian Fluid): weakly compressible elasticity with no distinct plastic step (no plastic yield, f_Y(P) ≡ -1 < 0).
• Herschel-Bulkley Plasticity (Non-Newtonian Fluid): rate-dependent viscoplastic fluid.
This comprehensive set of constitutive models allows our model to capture rich elastoplastic behaviors for various real-world material types, supporting high-fidelity simulations.
Following PhysFlow (Liu et al. 2025a), we consider seven representative material types, each characterized by typical parameter sets to cover a broad range of commonly encountered materials; see Table A2.
We utilize OpenAI’s GPT-4 (Achiam et al. 2023), a multimodal large language model (LLM), to infer material types and physical parameters from a textual simulation prompt with an optional image reference. Below is our complete inquiry prompt used in our pipeline:
Inputs:
• Textual simulation prompt:
“An axe is hitting the ground.”
• Reference Image:
Q: What is this object? Based primarily on the textual simulation prompt, determine the most appropriate material type. Then estimate its density (kg/m³) and relevant physical parameters. Use the image as secondary visual reference only.
Warning: Material types and required parameters (with SI units):
• Elastic: E (Pa), ν (unitless)
• Plasticine: E, ν, τ_Y (Pa)
• Sand: θ_fric (°)
• Newtonian fluid: µ (Pa·s), κ (kPa)
• Non-Newtonian fluid: µ, κ, τ_Y (Pa), η (unitless)
Warning: All inferred physical parameters must strictly fall within the following valid ranges:
• Elastic:
– ν: 0.1 – 0.5
• Plasticine:
– E: 1 × 10^6 – 5 × 10^6 Pa

(Table A2: Material parameters, their SI units, and value ranges for the seven simulated material types (with sources).)

Respond in this exact format:
{
  "material_type": "…",
  "density": …,
  "E": …,
  "nu": …,
  "tau_Y": …,
  "mu": …,
  "kappa": …,
  "eta": …,
  "theta_fric": …
}
Only include fields relevant to the inferred material type.
{
  "material_type": "Metal",
  "density": 7850,
  "E": 2.1e11,
  "nu": 0.30,
  "tau_Y": 2.5e8
}
Moreover, following PhysFlow, we apply the same linear or logarithmic scaling to GPT-4's predictions, followed by the same clamping to ensure numerical stability in our differentiable MLS-MPM simulations.
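As a minimal sketch of this post-processing (the concrete range below and the choice of log scaling are illustrative assumptions; the paper states only that linear or logarithmic scaling and clamping follow PhysFlow):

import math

def scale_and_clamp(value, lo, hi, log_scale=False):
    # Map a predicted parameter to a normalized coordinate inside its valid range.
    if log_scale:                              # for parameters spanning orders of magnitude
        value, lo, hi = math.log(value), math.log(lo), math.log(hi)
    t = (value - lo) / (hi - lo)               # linear or logarithmic scaling
    return min(max(t, 0.0), 1.0)               # clamping for numerical stability

# Example with the GPT-4 output above and a hypothetical range for a metal's E:
t_E = scale_and_clamp(2.1e11, lo=1e10, hi=3e11, log_scale=True)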
Our proposed approach allows large language models (LLMs) to directly infer simulation parameters from natural language prompts and visual cues, providing a plausible initialization for subsequent parameter optimization aimed at reproducing the scenario described by the user.
To promote motion distillation during training, we add controlled noise to both the Gaussian centers x g and their spherical harmonics coefficients S g :
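The perturbation equation did not survive extraction; a form consistent with the description (our notation) is:

$\tilde{x}_g = x_g + \epsilon\, n_x, \qquad \tilde{S}_g = S_g + \epsilon\, n_S, \qquad n_x, n_S \sim \mathcal{N}(0, I),$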
where ϵ = 0.1 sets the perturbation scale. These noisy Gaussians are used only for computing the motion-distillation loss during optimization and are discarded during simulation and evaluation. By mildly varying geometry and appearance, this targeted augmentation drives the model to learn motion features that remain stable under small GS perturbations, resulting in more robust and semantically meaningful motion representations.
We adopt a diverse set of boundary-condition modules and force-application schemes to support complex material behavior and scene interaction. Our framework includes six types of boundary conditions and five distinct external force modules, as shown in Table A3, allowing precise spatiotemporal control over particle and grid behavior. We summarize our simulation settings for the real-world dataset cases in Table A4; these differ from those provided in the supplementary materials of PhysFlow.
We provide additional qualitative comparisons in Figure A1, along with further ablation studies presented below.

Additional Ablation of L_LMD. We present the simulated sequence corresponding to Fig. 3 (Right) in the main paper. As shown in Figure A3, our learnable motion distillation loss L_LMD consistently distills motion patterns across both geometry and texture variations, whereas optimization with L_SDS produces divergent simulations that fail to adhere to the user-provided inputs.
Robustness to Varying Text Prompts. We evaluate how faithfully each method can follow different simulation textual descriptions for the same object. As illustrated in Figure A4, our approach reliably produces the intended effects across diverse prompts, confirming that it minimizes biases from geometry or texture and instead prioritizes the motion specified in the user-provided text prompt.
Interactive Evaluation. Following prior work (Zhang et al. 2024), we train our model only once on a single object.
During evaluation, we introduce new simulation conditions to generate diverse interaction scenarios (Figure A5). After the initial training (cyan arrow, Left), the object is able to deform under novel applied forces (white arrows, Right).
Failure Analysis of PhysFlow. We further analyze the failure cases of PhysFlow. Although PhysFlow is also based on the intuition of extracting motion signals, its method relies on optical flow estimated from generated videos. However, these generated videos often fail to reflect the actual motion dictated by simulation conditions (e.g., applied forces), even in real-world scenarios, as illustrated in Figure A6.
In our user study, the left/right ordering of videos in each pair was randomized, and all videos were anonymized to prevent participants from knowing which video was ours or one of the baselines based on the ordering. Participants were also unaware of the number of baseline methods involved in the study. We evaluated 15 different scenes, each compared against 4 baseline methods, resulting in 60 video pairs. These were randomly divided into four distinct questionnaires, each consisting of 15 pairs, one per scene. Upon accessing the survey via the main link, participants were randomly assigned to one of the four questionnaires. A total of 79 users participated in the study, resulting in approximately 20 user evaluations per method per scene on average. The raw voting statistics for all test cases are presented in Table A5.
We present screenshots from our questionnaire interface in the figures that follow. Figure A7 shows the initial instruction page, which includes approved ethics statements, study guidance, and the estimated completion time. Figure A8 illustrates the textual prompt, force visualization, and video comparison interface. Figure A9 displays the two evaluation questions asked to collect subjective ratings. We attach the raw answers and the video assignments to the left/right slots in the respective forms in the submission files (see the user study folder for details).
Here, we provide a detailed analysis of the three metrics, Overall Consistency (OC), CLIPSIM and ECMS, presented in the main paper.
Consider a generated video I = {I 1 , I 2 , . . . , I L } for a user prompt P text , where I l denotes the frame at time step l. To evaluate how well the video adheres to the prompt, we adopt two metrics: Overall Consistency (OC) from VBench (Huang et al. 2024b) and CLIPSIM (Wu et al. 2021)
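The definitions were lost in extraction; forms consistent with the descriptions (cosine similarity in the respective embedding spaces, our notation) are:

$\mathrm{OC}(\mathcal{I}, P_{\mathrm{text}}) = \cos\big(\mathrm{ViCLIP}(\mathcal{I}),\, \mathrm{ViCLIP}(P_{\mathrm{text}})\big), \qquad \mathrm{CLIPSIM}(\mathcal{I}, P_{\mathrm{text}}) = \frac{1}{L}\sum_{l=1}^{L} \cos\big(\mathrm{CLIP}(I_l),\, \mathrm{CLIP}(P_{\mathrm{text}})\big),$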
where ViCLIP (Wang et al. 2023b) denotes the joint embedding model for text and video.
Following OmniPhysGS (Lin et al. 2025), we use the ViT-L/14 variant of CLIP (Radford et al. 2021) for computing CLIPSIM.
To quantify motion realism, we use the Energy-Constrained Motion Score (ECMS) following PhysFlow (Liu et al. 2025a), which is computed from the optical flow F_{l,l+1} between consecutive frames l and l+1. We estimate F_{l,l+1} directly from the generated video I using RAFT (Teed and Deng 2020) with the raft-sintel set of pretrained weights. We acknowledge that, although OC, CLIPSIM, and ECMS provide useful quantitative insights, they remain imperfect and can diverge from human judgments.
OC relies on ViCLIP, a learned video-text alignment model trained on caption-video paired datasets that typically lack diverse dynamic behaviors across different materials. Consequently, it assigns close scores to videos of distinct materials when appearance, scene layout, geometry, lighting, environment, and force conditions are identical (see Tab. 2 in the main paper).
CLIPSIM evaluates each frame independently against the prompt and therefore cannot assess whether the temporal dynamics reflect the specified material properties. For example, in the Droplet scene (Figure A10, Top), PhysDreamer (Zhang et al. 2024) achieves the highest CLIPSIM score despite modeling the droplet as an elastic object rather than liquid: because each frame visually resembles a water droplet, CLIPSIM overlooks the incorrect temporal behavior.
ECMS, in contrast, ignores prompt adherence entirely. By focusing solely on trajectory matching, it may reward unrealistically smooth motions that contradict the intended material response. For instance, OmniPhysGS (Lin et al. 2025) attains the best ECMS score in Figure A10 (Bottom), even though the toothpaste neither “collapses” nor the droplet fully “splashes,” resulting in motion that conflicts with the prompt.
Developing improved evaluation metrics is an important direction for future work. One key advancement would be a prompt-adherence metric that captures temporal dynamics under fixed scene and force conditions. Another would be a holistic physical-realism metric that jointly evaluates material properties and dynamic behavior as specified by the text prompt. These enhancements, however, lie beyond the scope of the present study.
2) Real World: We include four PhysDreamer scenes (Alocasia, Carnation, Hat, and Telephone) (Zhang et al. 2024) and additional scenes: Fox from InstantNGP (Müller et al. 2022), Plane from NeRFStudio (Tancik et al. 2023), Kitchen from Mip-NeRF 360 (Barron et al. 2022), and Jam and Sandcastle from PhysFlow. We use the text prompts provided by PhysFlow for all real-world scenes. 3) AI Generated: We use the meshes Urchin, Alien, Gentleman, and Axe from Hunyuan3D (Zhao et al. 2025), which are further processed with GaMeS (Waczyńska et al. 2024) to obtain their corresponding 3DGS representations.

Baselines. We compare with four baselines: 1) PhysDreamer (Zhang et al. 2024), 2) DreamPhysics (Huang et al. 2025), 3) OmniPhysGS (Lin et al. 2025), and 4) PhysFlow (Liu et al. 2025a), with all methods rendered at consistent frame rates. All baselines share identical simulation settings (boundary conditions, external forces, and text prompts). For material initialization, we follow PhysFlow using GPT-4 predictions, except OmniPhysGS, which directly optimizes its constitutive models.
Table 1: User Study Results (columns: Comparison, Human Designed ↑, Real World ↑, AI Generated ↑). Mean percentage values over 80% (highlighted in bold) show strong preference for physical realism (Top) and prompt adherence (Bottom).