We present Animus3D, a text-driven 3D animation framework that generates a motion field given a static 3D asset and a text prompt. Previous methods mostly leverage the vanilla Score Distillation Sampling (SDS) objective to distill motion from pretrained text-to-video diffusion models, leading to animations with minimal movement or noticeable jitter. To address this, our approach introduces a novel SDS alternative, Motion Score Distillation (MSD). Specifically, we introduce a LoRA-enhanced video diffusion model that defines a static source distribution, rather than pure noise as in SDS, while an inversion-based noise estimation technique ensures appearance preservation when guiding motion. To further improve motion fidelity, we incorporate explicit temporal and spatial regularization terms that mitigate geometric distortions across time and space. Additionally, we propose a motion refinement module to upscale the temporal resolution and enhance fine-grained details, overcoming the fixed-resolution constraints of the underlying video model. Extensive experiments demonstrate that Animus3D successfully animates static 3D assets from diverse text prompts, generating significantly more substantial and detailed motion than state-of-the-art baselines while maintaining high visual integrity. Code will be released at https://qiisun.github.io/animus3d_page.
Text-to-3D animation has long been a foundational component of visual storytelling, entertainment, and simulation. Recent advancements demonstrate that large-scale text-to-image and text-to-video diffusion models can effectively learn valuable priors for generating 3D animations [Jiang et al. 2024;Li et al. 2024b;Liang et al. 2024;Sun et al. 2024;Zhang et al. 2024].
A representative class of techniques that leverage such priors are score distillation sampling (SDS)-based methods [Brooks et al. 2024;Chen et al. 2023, 2024;HaCohen et al. 2024;He et al. 2022;Hong et al. 2022;Kong et al. 2024;Wang et al. 2025;Xing et al. 2023;Yang et al. 2024]. The core idea of SDS is to render an image of a 3D scene, add noise to the rendered image, and then use a pre-trained diffusion model to denoise it. The denoising process enables the estimation of gradients, which are then used to update the underlying 3D representation, such as neural radiance fields [Mildenhall et al. 2020] or Gaussian splatting [Kerbl et al. 2023]. Recent studies have explored the theoretical foundations of SDS and formulated it as a domain transportation problem [McAllister et al. 2024;Yang et al. 2023], where the goal is to shift the current data distribution (i.e., rendered outputs from the 3D representation) toward a target distribution. The estimated gradient guides this transformation, as illustrated in Fig. 2-(a). We observe that existing SDS-based motion generation methods all adopt the original formulation of SDS without modification, inherently following this transportation framework. However, this framework reveals several key limitations for motion distillation: 1) The current distribution lacks a well-defined static source as its starting point. In motion generation, a clear static initialization is typical; the absence of its explicit modeling can obscure the starting point of the optimization trajectory, potentially resulting in limited distilled motion. 2) Motion and appearance are inherently entangled. SDS, however, does not account for this interdependency, and its estimated gradient can degrade the appearance as the distribution evolves toward the motion target.
To address these challenges, we propose Animus3D, a text-driven 3D animation method. At the core of our approach is a novel Motion Score Distillation strategy, as depicted in Fig. 2-(b), which consists of two key components. First, we define a source static distribution as a canonical space, modeled using a video diffusion model enhanced with Low-Rank Adaptation (LoRA) [Hu et al. 2022] that is capable of generating static video frames. Second, we introduce a noise inversion technique that estimates deterministic noise for gradient computation, thereby enabling effective control of the motion direction while maintaining the integrity of the appearance. We demonstrate that our Motion Score Distillation can predict more accurate transportation directions, enabling more reasonable and substantial motions while preserving the object’s appearance. Beyond the distillation process, we find motion field regularization to be crucial: we introduce temporal and spatial regularization terms into our method, which help mitigate geometric distortions across time and space. Additionally, because of the fixed temporal resolution of video diffusion, the motion details of animated objects are constrained. To address this, we propose a motion detailization module that extends the temporal length and enhances motion detail.
To conclude, we summarize our main contributions as follows:
• Animus3D Framework: We propose Animus3D, a text-driven 3D animation framework capable of generating high-quality motion for static 3D assets from diverse text prompts.
A growing body of work [Jiang et al. 2024;Liang et al. 2024;Pan et al. 2024;Ren et al. 2024;Sun et al. 2024;Wu et al. 2024a;Xie et al. 2024;Zeng et al. 2024] explores data-driven approaches for 3D motion generation by leveraging diffusion models to synthesize temporally consistent multi-view images, followed by pixel-wise optimization to recover coherent 3D representations with motion fields. For instance, SV4D [Xie et al. 2024] introduces temporal layers into multi-view diffusion [Voleti et al. 2024], enabling spatio-temporal modeling from monocular video inputs and supporting orbital-view synthesis of dynamic objects. Animate3D [Jiang et al. 2024] extends AnimateDiff [Guo et al. 2024] by incorporating multi-view images, generating temporally synchronized video sequences through 3D object rendering. Although these methods achieve impressive results on general object motion and are typically fast, they depend on large-scale training datasets, such as multi-view captures or densely sampled videos of dynamic scenes, which are often costly and difficult to obtain. In contrast, our method focuses on enhancing SDS to bridge the gap between 2D generative priors and 3D animation. By distilling motion knowledge from powerful 2D diffusion models, our approach enables motion generation without requiring extensive multi-view or temporal supervision.
SDS [Poole et al. 2022;Wang et al. 2023a] was introduced for 3D content generation and image/video editing [Hertz et al. 2023;Jeong et al. 2024] by distilling supervision from pre-trained 2D diffusion models [Ho et al. 2022;Rombach et al. 2022]. A common issue with early SDS-based methods [Lin et al. 2023;Poole et al. 2022;Wang et al. 2023a] is over-smoothing and a lack of fine geometric or textural details. These methods often rely on high classifier-free guidance (CFG ∼ 100) [Ho and Salimans 2021] to reduce output variance, which tends to cause over-saturation and unnatural results. ProlificDreamer [Wang et al. 2023b] significantly improved fidelity by introducing a second diffusion model that is overfit to the current 3D estimate, allowing high-quality outputs with standard CFG values (e.g., 7.5). LucidDreamer [Liang et al. 2023] and SDI [Lukoianov et al. 2024] mitigate SDS’s over-smoothing by replacing the random noise term with one obtained via DDIM inversion and applying multi-step denoising. Recent analyses such as SDS-Bridge [McAllister et al. 2024] and LODS [Yang et al. 2023] further explore theoretical foundations and architectural optimizations for SDS-based 3D generation. Distinct from these works, our approach distills motion from a pre-trained video diffusion model by optimizing a motion field for a given static 3D object.
Recent advances in video diffusion models [Brooks et al. 2024;Chen et al. 2023, 2024;HaCohen et al. 2024;He et al. 2022;Hong et al. 2022;Kong et al. 2024;Wang et al. 2025;Xing et al. 2023;Yang et al. 2024] have inspired a growing line of research that distills dynamic 3D scenes evolving over time from pre-trained video diffusion models. MAV3D [Singer et al. 2023] is one of the earliest works in text-to-dynamic object generation, introducing a hexplane representation to model scene dynamics. Some approaches [Bahmani et al. 2024b;Zhao et al. 2023;Zheng et al. 2024] use a hybrid SDS pipeline that alternates multi-stage optimization between supervision from text-to-image and multi-view diffusion models, improving both geometric consistency and motion fidelity. Other methods [Li et al. 2024a;Ling et al. 2024;Wimmer et al. 2025] explore novel 3D representations to better capture motion. AYG [Ling et al. 2024] employs 3D Gaussian Splatting [Cui et al. 2025;Huang et al. 2024;Kerbl et al. 2023] for efficient and high-fidelity motion representation, while Text2Life [Wimmer et al. 2025] introduces a training-free autoregressive approach to generate consistent video guidance across viewpoints, enhancing the quality of the distilled dynamics. Several approaches also incorporate explicit motion priors to constrain or regularize the motion fields. TC4D [Bahmani et al. 2024a] uses parameterized object trajectories (e.g., translation and rotation) as motion priors. AKD [Li et al. 2025] further extends this idea by incorporating articulated skeletal structures into score distillation, guided by rigid-body physics simulators. However, these methods largely adopt the original SDS formulation without modification and do not explicitly address its limitations in motion generation. In contrast, we propose a novel Motion Score Distillation strategy tailored for motion optimization, and further introduce a motion refinement module to reduce distortion caused by score distillation, resulting in more stable training and improved motion fidelity.
In this section, we first introduce the parametric 3D representation with motion fields. We then provide an overview of SDS. 3D Gaussian Splatting (3D-GS) [Kerbl et al. 2023] uses millions of learnable 3D Gaussians to explicitly represent a scene. Each Gaussian is defined by its center, rotation, scale, opacity, and view-dependent color encoded via spherical harmonics. The scene is rendered through a differentiable splatting-based renderer R_cam given camera parameters: x = R_cam(G). 4D Gaussian Splatting (4D-GS) [Wu et al. 2024b] extends 3D-GS by introducing a motion field on top of a canonical 3D representation. In our approach, we first reconstruct the static 3D object using 3D-GS, denoted as the canonical space G_c. The motion field is modeled using a multi-resolution HexPlane with MLP-based decoders [Cao and Johnson 2023]. During training, we keep G_c fixed and optimize only the motion field. At each timestamp, the model queries the HexPlane using a 4D coordinate (x, y, z, τ) and decodes the resulting feature into deformation values for position and rotation. By querying the motion field at each timestamp τ ∈ {0, . . . , T−1}, we generate a sequence of deformed Gaussians G_{0:T−1}. Given camera parameters, we render the resulting T-frame video as x_d = R_cam(G_{0:T−1}).
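To make the deformation pipeline concrete, the following PyTorch-style sketch deforms canonical Gaussian centers with a queried motion field. It is a minimal illustration rather than the released implementation: the MotionField module (a plain MLP standing in for the HexPlane with decoders), the tensor shapes, and the frame count are illustrative assumptions.

```python
# Minimal sketch (not the paper's code) of deforming canonical 3D Gaussians with a
# queried motion field. A plain MLP stands in for the HexPlane + MLP decoders.
import torch
import torch.nn as nn

class MotionField(nn.Module):
    def __init__(self, hidden: int = 64):
        super().__init__()
        # Toy stand-in for the multi-resolution HexPlane: an MLP over (x, y, z, tau).
        self.mlp = nn.Sequential(
            nn.Linear(4, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3 + 4),  # position offset (3) + rotation offset as a quaternion (4), assumed layout
        )

    def forward(self, xyz: torch.Tensor, tau: torch.Tensor):
        feat = torch.cat([xyz, tau.expand(xyz.shape[0], 1)], dim=-1)
        out = self.mlp(feat)
        return out[:, :3], out[:, 3:]  # (delta_position, delta_rotation)

# Canonical Gaussians G_c: only the centers are shown; rotation/scale/opacity/SH omitted.
num_gaussians, T = 10_000, 16
canonical_xyz = torch.randn(num_gaussians, 3)
field = MotionField()

deformed_frames = []
for t in range(T):
    tau = torch.full((1, 1), t / (T - 1))
    d_xyz, _d_rot = field(canonical_xyz, tau)
    deformed_frames.append(canonical_xyz + d_xyz)  # G_tau = canonical centers + deformation

# Each frame would then be splatted by a differentiable renderer: x_tau = R_cam(G_tau).
```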
Score Distillation Sampling (SDS) leverages the knowledge from pretrained text-to-image diffusion models to optimize a parametric representation like 3D-GS. Given an output sample x_0 (e.g., a rendered image from a 3D-GS), SDS operates as follows: stochastic Gaussian noise ε is added to x_0 at a randomly sampled timestep t:
where ᾱ_t is a noise schedule coefficient. After that, a pretrained denoising model ε_φ(x_t, t, y) predicts the noise in x_t, conditioned on the timestep t and a text prompt y. SDS uses the difference between the predicted noise and the sampled stochastic noise as the gradient to update the parameterized representation:
where w(t) is the weighting function. Recent works [McAllister et al. 2024;Yang et al. 2023] formulate SDS as a domain transportation problem, aiming to find the optimal transport from the current data distribution D_c to the target distribution D_t. Here, the rendered sample x_0 is drawn from D_c as x_0 ∼ D_c, while the text condition y describes the target distribution D_t. SDS approximates the optimal transport step δ* between D_c and D_t at a given timestep t by:
Here, ε_φ(x_t, t, y) is a projection of the noised image x_t onto the target distribution, and ε is random Gaussian noise, ε ∼ N(0, I).
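As a reference for the discussion above, the sketch below assembles the vanilla SDS gradient in PyTorch under stated assumptions: denoiser stands for an ε-prediction network ε_φ(x_t, t, y) and alphas_cumprod for its noise schedule; neither corresponds to a specific library API.

```python
# Hedged sketch of the vanilla SDS update described above; `denoiser` is an assumed
# epsilon-prediction interface, not a specific library call.
import torch

def sds_grad(x0, denoiser, text_emb, alphas_cumprod, w=lambda t: 1.0):
    """Return the SDS gradient w.r.t. the rendered sample x0."""
    t = torch.randint(20, 980, (1,))                      # random timestep
    a_bar = alphas_cumprod[t].view(-1, *([1] * (x0.dim() - 1)))
    eps = torch.randn_like(x0)                            # stochastic Gaussian noise
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps    # forward diffusion (Eq. 1)
    with torch.no_grad():
        eps_pred = denoiser(x_t, t, text_emb)             # predicted noise eps_phi(x_t, t, y)
    return w(t) * (eps_pred - eps)                        # gradient pushed back to x0
```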
Given the canonical 3D-GS G_c of a 3D object and a text prompt y describing the desired motion, our method aims to automatically predict a motion field for G_c. This produces a Gaussian sequence G_{0:T−1} that exhibits substantial, photorealistic motion while preserving the object’s appearance. To achieve this, as illustrated in Fig. 3, we first introduce Motion Score Distillation (§4.1), an enhanced SDS framework tailored for motion learning. It incorporates dual distribution modeling (§4.1.1) and an appearance-preserving noise estimation (§4.1.2) to better guide motion generation. We further propose temporal and spatial regularization terms (§4.2) to constrain the deformation fields, and a motion refinement method that extends the temporal length and enhances motion detail (§4.3), as shown in Fig. 5. Given a dynamic text prompt y and a static text prompt y′, the loss gradient is computed from two predicted noises, which guide the optimization of the motion field.
Building on the explanation of SDS in the previous section, we propose a novel approach called Motion Score Distillation (MSD). MSD aims to estimate the optimal transport from a static source distribution to a dynamic target distribution. Different from SDS, MSD approximates the optimal motion step δ*_motion between the dual distributions at a given timestep t as follows:
Thus, our MSD is formulated as follows:
We demonstrate that ε_dynamic − ε_static serves as an effective gradient when both the source static and target dynamic distributions are well expressed. Next, we detail the definitions of ε_dynamic and ε_static.
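Before detailing the two terms, the following sketch shows how the MSD gradient could be assembled once both noise predictions are available; denoiser, denoiser_lora, and the prompt embeddings are assumed interfaces rather than the paper's released code.

```python
# Minimal sketch of the MSD gradient: the transport direction is the difference between
# a dynamic-prompt prediction from the base video model and a static-prompt prediction
# from the LoRA-adapted "static" model. Both denoisers are assumed interfaces.
import torch

def msd_grad(x_d_t, x_s_t, t, denoiser, denoiser_lora, y_dyn, y_static, w=1.0):
    with torch.no_grad():
        eps_dynamic = denoiser(x_d_t, t, y_dyn)         # target: dynamic distribution
        eps_static = denoiser_lora(x_s_t, t, y_static)  # source: static distribution
    return w * (eps_dynamic - eps_static)               # gradient for the motion field
```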
4.1.1 Dual distribution modeling. Given a time sequence {0 : T−1}, we define the static video rendered from the static 3D-GS G_c as x^s_{0:T−1}, and the dynamic video rendered from the dynamic 3D-GS G_{0:T−1} as x^d_{0:T−1}. Similar to SDS, the target dynamic distribution can be approximated using a pretrained latent video diffusion model:
Here, the text prompt y describes the motion of the object, such as “A walking
where y′ is a static text description. However, we observe that even when conditioned on a static description, the video diffusion model does not consistently generate videos of truly static objects, thus violating the second requirement. To address this, we propose to efficiently derive a static denoiser by applying Low-Rank Adaptation (LoRA) [Hu et al. 2022]:
where ε_lora is the LoRA-finetuned denoiser. The LoRA parameters are trained using the static video x^s_{0:T−1}, with the loss function:
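A minimal sketch of one such LoRA fine-tuning step is given below, assuming a standard ε-prediction objective; denoiser_lora and the LoRA parameter handling (e.g., built with a library such as peft, rank=4 and alpha=4 as in our implementation details) are assumptions for illustration.

```python
# Sketch of fitting the LoRA "static" denoiser on the rendered static video x^s_{0:T-1}
# with a standard epsilon-prediction loss. Assumes the base weights are frozen and `opt`
# only contains the LoRA parameters.
import torch
import torch.nn.functional as F

def lora_static_step(x_static_latent, denoiser_lora, y_static, alphas_cumprod, opt):
    t = torch.randint(0, alphas_cumprod.numel(), (1,))
    a_bar = alphas_cumprod[t].view(-1, *([1] * (x_static_latent.dim() - 1)))
    eps = torch.randn_like(x_static_latent)
    x_t = a_bar.sqrt() * x_static_latent + (1 - a_bar).sqrt() * eps
    loss = F.mse_loss(denoiser_lora(x_t, t, y_static), eps)  # only LoRA weights receive updates
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```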
4.1.2 Appearance-preserving faithful noise estimation. In SDS and its variants [Bahmani et al. 2024a;Ling et al. 2024], we observe that it is difficult to preserve the original appearance of the static object; in some cases, the object even drifts into the background. SDS’s noise estimation entangles motion and appearance, yet we only optimize the motion field. As a result, any appearance change must be compensated by geometric distortion, causing artifacts, an issue we refer to as motion-appearance entanglement. We find that this issue is strongly correlated with the stochastic noise ε added during the diffusion process in Eq. 1; we assess this observation in Fig. 4. Therefore, instead of adding stochastic noise, we adopt DDIM inversion [Lukoianov et al. 2024;Song et al. 2022] to obtain deterministic and faithful noise.

Fig. 4. Given an input image, we add noise using the noise from SDS and from our MSD, and then denoise it to obtain an estimated image. The image denoised with SDS differs significantly from the original, exhibiting large appearance changes and background noise. In contrast, our MSD better preserves the appearance and maintains a clearer background. All latents are decoded into pixel space for visualization. We use t=600 in this case.

Given a noised input x_t, we first
predict the noise ε_φ(x_t, t, y) using the pretrained diffusion model.
We then estimate the corresponding denoised image x̂_0 as:
Subsequently, we apply deterministic forward noising steps to obtain x_{t+1} iteratively, continuing until the predefined timestep t:
DDIM inversion provides a deterministic noise estimate, producing a denoised output x̂_0 that is faithfully consistent with the input video. This facilitates appearance preservation during the optimization process. We then apply this method to x^d_t in Eq. 6 and to x^s_t in Eq. 8.
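A hedged sketch of this inversion procedure is shown below; the timestep spacing, the denoiser interface, and starting the inversion from the clean latent are illustrative choices rather than the exact implementation.

```python
# Sketch of DDIM inversion for faithful noise estimation: starting from the clean latent,
# repeatedly predict epsilon, form the x0 estimate, and re-noise deterministically up to
# the target timestep.
import torch

@torch.no_grad()
def ddim_invert(x0, denoiser, y, alphas_cumprod, t_target, step=20):
    x = x0
    timesteps = list(range(0, t_target, step))            # illustrative spacing
    for t_cur, t_next in zip(timesteps[:-1], timesteps[1:]):
        a_cur = alphas_cumprod[t_cur]
        a_next = alphas_cumprod[t_next]
        eps = denoiser(x, torch.tensor([t_cur]), y)        # predicted noise
        x0_hat = (x - (1 - a_cur).sqrt() * eps) / a_cur.sqrt()  # estimated clean sample
        x = a_next.sqrt() * x0_hat + (1 - a_next).sqrt() * eps  # deterministic forward step
    return x  # inverted latent at t_target: deterministic, appearance-faithful noise state
```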
To further improve performance, our method incorporates both temporal and spatial regularization terms.
Total variation in 3D (TV-3D) for temporal regularization. Inspired by traditional 2D total variation (TV) losses applied in pixel space, we propose a TV-3D loss to encourage temporal smoothness in motion. This loss directly penalizes abrupt changes in the 3D positions of Gaussians across consecutive timesteps. Specifically, it computes the L1 norm of the positional differences for each Gaussian between adjacent frames:
Here, x_{i,τ} denotes the position of the i-th 3D Gaussian at timestep τ.
By operating in the 3D Gaussian space, this constraint effectively enforces temporal consistency in the underlying geometric motion.
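A direct PyTorch rendering of this loss could look as follows, assuming the per-frame Gaussian positions are stacked into a single (T, N, 3) tensor.

```python
# Sketch of the TV-3D temporal smoothness loss: L1 penalty on per-Gaussian position
# changes between adjacent frames. `positions` has shape (T, N, 3); the mean reduction
# is an assumption.
import torch

def tv3d_loss(positions: torch.Tensor) -> torch.Tensor:
    diffs = positions[1:] - positions[:-1]   # per-Gaussian displacement between frames
    return diffs.abs().mean()                # L1 norm, averaged over Gaussians and time
```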
As-rigid-as-possible (ARAP) for spatial regularization. To facilitate the learning of rigid motion dynamics while preserving the high-fidelity appearance of the static reference model, we employ an ARAP [Sorkine and Alexa 2007] regularization term:
where p^τ_i, p^τ_j ∈ R^3 are the spatial positions of point i and its neighbor j at the current frame τ, and p^c_i, p^c_j ∈ R^3 are their corresponding positions in the canonical 3D-GS. N_i denotes the set of neighboring point indices for p^c_i, defined in the reference configuration (e.g., points within a fixed radius of p^c_i). R^τ_i ∈ SO(3) is the optimal local rigid rotation for point p_i at frame τ. This ARAP loss enforces spatial consistency by encouraging locally rigid deformations, thereby promoting realistic motion while preserving geometric fidelity.
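The sketch below illustrates one way to realize this ARAP term, fitting the local rotations with the Kabsch/Procrustes solution over precomputed canonical neighbors; the fixed-K neighbor layout, the uniform weighting, and the mean reduction are assumptions.

```python
# Sketch of the ARAP spatial regularizer: for each Gaussian, fit the best local rotation
# aligning canonical neighbor edges to current-frame edges (via SVD), then penalize the
# residual. Neighbor indices are assumed to be precomputed in the canonical configuration.
import torch

def arap_loss(p_cur, p_can, neighbors):
    """p_cur, p_can: (N, 3); neighbors: (N, K) long tensor of canonical neighbor indices."""
    e_can = p_can[neighbors] - p_can[:, None, :]       # canonical edges (N, K, 3)
    e_cur = p_cur[neighbors] - p_cur[:, None, :]       # current-frame edges (N, K, 3)
    # Best-fit local rotations via the Kabsch/Procrustes solution.
    cov = torch.einsum("nki,nkj->nij", e_can, e_cur)   # (N, 3, 3) cross-covariance
    U, _, Vh = torch.linalg.svd(cov)
    R = Vh.transpose(-1, -2) @ U.transpose(-1, -2)     # candidate rotations (N, 3, 3)
    # Reflection fix so that det(R) = +1 for every Gaussian.
    det = torch.linalg.det(R)
    Vh_fixed = Vh.clone()
    Vh_fixed[det < 0, -1] *= -1
    R = Vh_fixed.transpose(-1, -2) @ U.transpose(-1, -2)
    residual = e_cur - torch.einsum("nij,nkj->nki", R, e_can)
    return residual.square().sum(-1).mean()
```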
The fixed frame length in video diffusion models limits the ability of SDS-generated 3D animations to capture fine-grained motion details. To address this, we introduce a motion refinement module that leverages a high-capacity, pre-trained rectified-flow-based text-to-video model to generate long, detailed animation sequences, as illustrated in Fig. 5. Given a time sequence {0 : T−1}, we interpolate it to produce T′ = 2T−1 frames. We then render the 3D-GS with the motion field, resulting in a higher-resolution video x^d_{0:T′−1} with dimensions H′ × W′ × T′. Following the SDEdit [Meng et al. 2022] framework, we add noise to x^d_{0:T′−1} and apply iterative denoising to generate a refined video x̂^d_{0:T′−1}. This process preserves the original motion while enhancing temporal consistency and motion details. Finally, we optimize the motion field using an L1 loss between the refined video x̂^d_{0:T′−1} and the initial input x^d_{0:T′−1}. This results in a motion field capable of producing longer and more detailed animations than the original.
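The following sketch summarizes this refinement loop under stated assumptions; render_fn and refine_denoise are placeholder interfaces for the motion-field renderer and the larger pretrained video model, and the edit timestep is illustrative.

```python
# Hedged sketch of the motion refinement stage: render an interpolated 2T-1 frame video,
# perturb it to an intermediate noise level (SDEdit-style), denoise it with a larger
# pretrained video model, and fit the motion field to the refined video with an L1 loss.
import torch
import torch.nn.functional as F

def refine_step(render_fn, refine_denoise, alphas_cumprod, T, t_edit=400):
    taus = torch.linspace(0.0, 1.0, 2 * T - 1)        # interpolated time sequence (T' = 2T - 1)
    video = render_fn(taus)                            # (T', C, H', W'), differentiable w.r.t. the motion field
    with torch.no_grad():
        a_bar = alphas_cumprod[t_edit]
        noisy = a_bar.sqrt() * video + (1 - a_bar).sqrt() * torch.randn_like(video)
        refined = refine_denoise(noisy, t_edit)        # iterative denoising from t_edit back to 0
    return F.l1_loss(video, refined)                   # L1 loss that supervises the motion field
```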
Implementation details. We implement our method using threestudio [Guo et al. 2023] and use ModelScopeT2V [Wang et al. 2023c] as the base video model. The image resolution is set to 256, with 16 frames per video. All experiments are conducted on a single 24 GB GPU, with approximately 3k iterations for LoRA (rank=4, alpha=4) fine-tuning with a learning rate of 1e-5, 5k iterations for motion distillation, and 100 iterations for motion detailization.
Fig. 6. Qualitative comparison with state-of-the-art methods on the prompts “A wandering lion.”, “A swimming clown fish.”, and “A giraffe is walking.”, shown from multiple views. Our method demonstrates more substantial motion and higher visual fidelity in 3D animation. We recommend watching the demo video for better visualization.
Fig. 7. Comparison with the concurrent work [Li et al. 2025]. AKD utilizes skeleton-based motion, which tends to result in perceptible stiffness in its animations.
Evaluation metrics. Following previous works [Bahmani et al. 2024a;Ling et al. 2024], we utilize CLIP-image to evaluate the semantic similarity between the rendered canonical 3D-GS and the 3D-GS with the motion field. For this, we render 8 views evenly spaced around the azimuth for calculation. Additionally, we use CLIP-text to assess the alignment of the rendered object with the given text prompt. To evaluate the overall quality of the rendered videos, we compute FID [Heusel et al. 2017] and FVD [Unterthiner et al. 2018].
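For reference, a possible realization of these CLIP metrics with the Hugging Face transformers CLIP wrappers is sketched below; the specific checkpoint, view handling, and averaging are assumptions rather than the exact evaluation code.

```python
# Sketch of the CLIP-based metrics: CLIP-Image compares renders of the canonical vs.
# animated object across views, CLIP-Text compares animated renders against the prompt.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_image_score(canonical_renders, animated_renders):
    """Both arguments: lists of PIL images rendered from 8 evenly spaced azimuth views."""
    inp = processor(images=canonical_renders + animated_renders, return_tensors="pt")
    feats = model.get_image_features(**inp)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    n = len(canonical_renders)
    return (feats[:n] * feats[n:]).sum(-1).mean().item()   # mean per-view cosine similarity

@torch.no_grad()
def clip_text_score(animated_renders, prompt):
    inp = processor(text=[prompt], images=animated_renders, return_tensors="pt", padding=True)
    img = model.get_image_features(pixel_values=inp["pixel_values"])
    txt = model.get_text_features(input_ids=inp["input_ids"], attention_mask=inp["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return (img @ txt.T).mean().item()                     # mean image-text cosine similarity
```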
Comparison setting. We compare our method with two state-of-the-art SDS-based 3D motion generation methods: AYG [Ling et al. 2024] and TC4D [Bahmani et al. 2024a]. For TC4D, since the authors did not release the animated 3D objects, we were unable to use the same 3D assets for animation. Instead, we extracted screenshots from their results and then used Trellis [Xiang et al. 2024] to generate comparable 3D objects for evaluation. Since AYG did not release their code, we opted to use self-reproduced results for a fairer comparison. To ensure consistency across evaluations, we used the same 3D object generation approach as with AYG. We also include comparisons with 4dfy [Bahmani et al. 2024b] and Dream-in-4D [Zheng et al. 2024] in the supplementary materials. For all comparisons, we used the same motion description. For a fair comparison, we do not employ motion refinement (§4.3).
We report the quantitative results in Table 1. Our method achieves a better CLIP-Image score, as it incorporates an appearance-preserving faithful noise estimation that prevents appearance degradation during distillation. Moreover, our method outperforms others in the CLIP-Text score, demonstrating that our MSD better generates motions that align with the user’s intention. Furthermore, the improved FID and FVD values confirm that our method produces high-quality results. Fig. 6 presents a visual comparison. AYG struggles to generate semantic-level and large motions due to its use of a simple SDS variant, as seen in the giraffe example. Furthermore, the motion generated by AYG is not smooth and exhibits noticeable flickering, particularly evident in the fish case. TC4D shows simple translations along the trajectory with minimal skeleton movement. It also suffers from significant distortions and appearance changes, as demonstrated in the lion example. While both AYG and TC4D can animate the 3D object, they fail to produce natural and realistic motions. In contrast, our framework, leveraging MSD along with regularizations, generates more substantial and realistic 3D motions.
We conduct a user preference study to evaluate performance in Table 2. Users are asked to evaluate five key aspects of the generated dynamic object. First, overall quality provides a general evaluation of the rendered object. Appearance preservation focuses on detecting any undesirable appearance deformations. Motion dynamism assesses the extent of the object’s movement, with a preference for larger motions. Motion-text alignment measures how well the generated motion corresponds to the text prompt. Finally, motion realism evaluates the naturalness of the generated motion. For each aspect, users are asked to select their preferred option from AYG, TC4D, and our method. We received 17 valid responses and present the user preference rates for each aspect. Our method achieves the highest preference rate across all aspects, demonstrating that it generates more natural and realistic motions.

In our ablation, we observe that generating substantial and large motion is difficult without explicitly modeling the static distribution. While variant (d) produces sufficient motion, the lack of faithful noise significantly compromises the original appearance fidelity and introduces notable background artifacts. Without our MSD, the approach degrades to conventional SDS, which either fails to generate noticeable motion with a small CFG (a) or results in meaningless motion and severe appearance distortion with a large CFG (b). In contrast, our full MSD method is uniquely effective in generating realistic motion while robustly preserving the original appearance fidelity.
5.3.2 Denoising with approximated static distribution. As discussed in §4.1.1, an alternative approach to defining the source static distribution is to use a static text prompt, such as “low motion, static statue, not moving, no motion,
The effectiveness of motion regularization. As illustrated in Fig. 11 (a), the TV-3D loss encourages temporal smoothness in Gaussian points, and removing this loss results in large displacements between consecutive frames. In panel (b), we show that the ARAP loss provides a crucial spatial constraint for maintaining rigid motion; without this loss, the object experiences significant distortion or may even break apart. Therefore, we incorporate both temporal and spatial regularization terms in our method to further enhance the results of MSD.
We illustrate the motion distillation process in Fig. 13, where our framework progressively optimizes the motion field of the 3D object (goat) over multiple optimization steps. This allows the 3D object to appear “moving” while preserving its original appearance. Due to the frame length limitation of the underlying video diffusion, SDS can only generate fixed-length animations, resulting in non-continuous motion (i.e., large motion changes between consecutive frames) in the original 3D animation. In contrast, our motion refinement approach generates more fine-grained motion (longer 3D animations) by leveraging a larger video diffusion model. As shown in Fig. 12, our motion refinement transforms non-continuous motion, caused by the fixed-length video diffusion, into more natural and smooth animation sequences. We present additional results in Fig. 8. Our method supports a variety of motion descriptions, such as “playing,” “walking,” “flying,” and “swimming.” Furthermore, our approach can animate not only animals but also general objects, as demonstrated with the red flag of Hong Kong.
In this paper, we introduce Motion Score Distillation (MSD) for text-driven 3D animation. Our method formulates score distillation as distribution transportation, enhancing conventional techniques through dual distribution modeling and faithful noise. Specifically, we tackle the challenge of static video distribution modeling by using LoRA-enhanced video diffusion, and we perform appearance-preserving faithful noise estimation to mitigate the appearance changes often encountered in SDS. Additionally, we integrate spatial-temporal geometric motion regularizations and apply motion detailization using large video models to ensure scalability. Experimental results on text-driven 3D animation, along with comprehensive ablation studies, demonstrate that our method outperforms current state-of-the-art approaches and validates its effectiveness. In Fig. 14, we illustrate a limitation of our method: it struggles to model new content appearing in the scene, such as ejected fluid. Instead of generating this new content, the model tends to introduce distortions as a form of compensation. This issue could potentially be addressed in the future by designing new particle generation and modeling strategies. Another challenge is the optimization time, a common drawback of score distillation methods, which typically requires several hours. This inefficiency could be mitigated through techniques such as amortized training [Lorraine et al. 2023] or the adoption of more efficient data structures [Müller et al. 2022].
Table 1. Quantitative comparison of AYG [Ling et al. 2024], TC4D [Bahmani et al. 2024a], and our method in terms of CLIP-Image ↑, CLIP-Text ↑, FID ↓, and FVD ↓.