We present Animus3D, a text-driven 3D animation framework that generates a motion field given a static 3D asset and a text prompt. Previous methods mostly leverage the vanilla Score Distillation Sampling (SDS) objective to distill motion from pretrained text-to-video diffusion models, leading to animations with minimal movement or noticeable jitter. To address this, our approach introduces a novel SDS alternative, Motion Score Distillation (MSD). Specifically, we introduce a LoRA-enhanced video diffusion model that defines a static source distribution, rather than pure noise as in SDS, while an inversion-based noise estimation technique ensures appearance preservation when guiding motion. To further improve motion fidelity, we incorporate explicit temporal and spatial regularization terms that mitigate geometric distortions across time and space. Additionally, we propose a motion refinement module to upscale the temporal resolution and enhance fine-grained details, overcoming the fixed-resolution constraints of the underlying video model. Extensive experiments demonstrate that Animus3D successfully animates static 3D assets from diverse text prompts, generating significantly more substantial and detailed motion than state-of-the-art baselines while maintaining high visual integrity. Code will be released at https://qiisun.github.io/animus3d_page.
Text-to-3D animation has long been a foundational component of visual storytelling, entertainment, and simulation. Recent advancements demonstrate that large-scale text-to-image and text-to-video diffusion models can effectively learn valuable priors for generating 3D animations [Jiang et al. 2024;Li et al. 2024b;Liang et al. 2024;Sun et al. 2024;Zhang et al. 2024].
A representative class of techniques that leverage such priors are score distillation sampling (SDS)-based methods [Brooks et al. 2024;Chen et al. 2023, 2024;HaCohen et al. 2024;He et al. 2022;Hong et al. 2022;Kong et al. 2024;Wang et al. 2025;Xing et al. 2023;Yang et al. 2024]. The core idea of SDS is to render an image of a 3D scene, add noise to the rendered image, and then use a pre-trained diffusion model to denoise it. The denoising process enables the estimation of gradients, which are then used to update the underlying 3D representation, such as neural radiance fields [Mildenhall et al. 2020] or Gaussian splatting [Kerbl et al. 2023]. Recent studies have explored the theoretical foundations of SDS and formulated it as a domain transportation problem [McAllister et al. 2024;Yang et al. 2023], where the goal is to shift the current data distribution (i.e., rendered outputs from the 3D representation) toward a target distribution. The estimated gradient guides this transformation, as illustrated in Fig. 2-(a). We observe that existing SDS-based motion generation methods all adopt the original formulation of SDS without modification, inherently following this transportation framework. However, this framework reveals several key limitations for motion distillation: 1) The current distribution lacks a well-defined static source as its starting point. In motion generation, a clear static initialization is typical; the absence of its explicit modeling can obscure the starting point of the optimization trajectory, potentially resulting in limited distilled motion. 2) Motion and appearance are inherently entangled. SDS, however, does not account for this interdependency, and its estimated gradient can degrade the appearance as the distribution evolves toward the motion target.
To address these challenges, we propose Animus3D, a text-driven 3D animation method. At the core of our approach is a novel Motion Score Distillation strategy, as depicted in Fig. 2-(b), which consists of two key components. First, we define a source static distribution as a canonical space, modeled using a video diffusion model enhanced with Low-Rank Adaptation (LoRA) [Hu et al. 2022] that is capable of generating static video frames. Second, we introduce a noise inversion technique that estimates deterministic noise for gradient computation, thereby enabling effective control of the motion direction while maintaining the integrity of the appearance. We demonstrate that our Motion Score Distillation can predict more accurate transportation directions, enabling more reasonable and substantial motions while preserving the object’s appearance. Beyond the distillation process, we find motion field regularization to be crucial: we introduce temporal and spatial regularization terms into our method, which help mitigate geometric distortions across time and space. Additionally, because of the fixed temporal resolution of video diffusion, the motion details of animated objects are constrained. To address this, we propose a motion detailization module that extends the temporal length and enhances motion detail.
To conclude, we summarize our main contributions as follows:
• Animus3D Framework: We propose Animus3D, a text-driven 3D animation framework capable of generating high-quality motion for static 3D assets from diverse text prompts.
A growing body of work [Jiang et al. 2024;Liang et al. 2024;Pan et al. 2024;Ren et al. 2024;Sun et al. 2024;Wu et al. 2024a;Xie et al. 2024;Zeng et al. 2024] explores data-driven approaches for 3D motion generation by leveraging diffusion models to synthesize temporally consistent multi-view images, followed by pixel-wise optimization to recover coherent 3D representations with motion fields. For instance, SV4D [Xie et al. 2024] introduces temporal layers into multi-view diffusion [Voleti et al. 2024], enabling spatio-temporal modeling from monocular video inputs and supporting orbital-view synthesis of dynamic objects. Animate3D [Jiang et al. 2024] extends AnimateDiff [Guo et al. 2024] by incorporating multi-view images, generating temporally synchronized video sequences through 3D object rendering. Although these methods achieve impressive results on general object motion and are typically fast, they depend on large-scale training datasets, such as multi-view captures or densely sampled videos of dynamic scenes, which are often costly and difficult to obtain. In contrast, our method focuses on enhancing SDS to bridge the gap between 2D generative priors and 3D animation. By distilling motion knowledge from powerful 2D diffusion models, our approach enables motion generation without requiring extensive multi-view or temporal supervision.
SDS [Poole et al. 2022;Wang et al. 2023a] was introduced for 3D content generation and image/video editing [Hertz et al. 2023;Jeong et al. 2024] by distilling supervision from pre-trained 2D diffusion models [Ho et al. 2022;Rombach et al. 2022]. A common issue with early SDS-based methods [Lin et al. 2023;Poole et al. 2022;Wang et al. 2023a] is over-smoothing and a lack of fine geometric or textural details. These methods often rely on high classifier-free guidance (CFG ∼ 100) [Ho and Salimans 2021] to reduce output variance, which tends to cause over-saturation and unnatural results. ProlificDreamer [Wang et al. 2023b] significantly improved fidelity by introducing a second diffusion model that is overfit to the current 3D estimate, allowing high-quality outputs with standard CFG values (e.g., 7.5). LucidDreamer [Liang et al. 2023] and SDI [Lukoianov et al. 2024] mitigate SDS’s over-smoothing by replacing the random noise term with one obtained via DDIM inversion and applying multi-step denoising. Recent analyses such as SDS-Bridge [McAllister et al. 2024] and LODS [Yang et al. 2023] further explore theoretical foundations and architectural optimizations for SDS-based 3D generation. Distinct from these works, our approach distills motion from a pre-trained video diffusion model by optimizing a motion field for a given static 3D object.
Recent advances in video diffusion models [Brooks et al. 2024;Chen et al. 2023, 2024;HaCohen et al. 2024;He et al. 2022;Hong et al. 2022;Kong et al. 2024;Wang et al. 2025;Xing et al. 2023;Yang et al. 2024] have inspired a growing line of research that distills dynamic 3D scenes evolving over time from pre-trained video diffusion models. MAV3D [Singer et al. 2023] is one of the earliest works in text-to-dynamic object generation, introducing a hexplane representation to model scene dynamics. Some approaches [Bahmani et al. 2024b;Zhao et al. 2023;Zheng et al. 2024] use a hybrid SDS pipeline that alternates multi-stage optimization between supervision from text-to-image and multi-view diffusion models, improving both geometric consistency and motion fidelity. Other methods [Li et al. 2024a;Ling et al. 2024;Wimmer et al. 2025] explore novel 3D representations to better capture motion. AYG [Ling et al. 2024] employs 3D Gaussian Splatting [Cui et al. 2025;Huang et al. 2024;Kerbl et al. 2023] for efficient and high-fidelity motion representation, while Text2Life [Wimmer et al. 2025] introduces a training-free autoregressive approach to generate consistent video guidance across viewpoints, enhancing the quality of the distilled dynamics. Several approaches also incorporate explicit motion priors to constrain or regularize the motion fields. TC4D [Bahmani et al. 2024a] uses parameterized object trajectories (e.g., translation and rotation) as motion priors. AKD [Li et al. 2025] further extends this idea by incorporating articulated skeletal structures into score distillation, guided by rigid-body physics simulators. However, these methods largely adopt the original SDS formulation without modification and do not explicitly address its limitations in motion generation. In contrast, we propose a novel Motion Score Distillation strategy tailored for motion optimization, and further introduce a motion refinement module to reduce distortion caused by score distillation, resulting in more stable training and improved motion fidelity.
In this section, we first introduce the parametric 3D representation with motion fields. We then provide an overview of SDS. 3D Gaussian Splatting (3D-GS) [Kerbl et al. 2023] uses millions of learnable 3D Gaussians to explicitly represent a scene. Each Gaussian is defined by its center, rotation, scale, opacity, and view-dependent color encoded via spherical harmonics. The scene is rendered through a differentiable splatting-based renderer R_cam given camera parameters: x = R_cam(G). 4D Gaussian Splatting (4D-GS) [Wu et al. 2024b] extends 3D-GS by introducing a motion field on top of a canonical 3D representation. In our approach, we first reconstruct the static 3D object using 3D-GS, denoted as the canonical space G_c. The motion field is modeled using a multi-resolution HexPlane with MLP-based decoders [Cao and Johnson 2023]. During training, we keep G_c fixed and optimize only the motion field. At each timestamp, the model queries the HexPlane using a 4D coordinate (x, y, z, τ) and decodes the resulting feature into deformation values for position and rotation. By querying the motion field at each timestamp τ ∈ {0, . . . , T−1}, we generate a sequence of deformed Gaussians G_{0:T−1}. Given camera parameters, we render the resulting T-frame video as x_d = R_cam(G_{0:T−1}).
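To make the deformation pipeline concrete, the following PyTorch-style sketch deforms canonical Gaussian centers with a queried motion field. It is a minimal illustration rather than the released implementation: the MotionField module (a plain MLP standing in for the HexPlane with decoders), the tensor shapes, and the frame count are illustrative assumptions.

```python
# Minimal sketch (not the paper's code) of deforming canonical 3D Gaussians with a
# queried motion field. A plain MLP stands in for the HexPlane + MLP decoders.
import torch
import torch.nn as nn

class MotionField(nn.Module):
    def __init__(self, hidden: int = 64):
        super().__init__()
        # Toy stand-in for the multi-resolution HexPlane: an MLP over (x, y, z, tau).
        self.mlp = nn.Sequential(
            nn.Linear(4, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3 + 4),  # position offset (3) + rotation offset as a quaternion (4), assumed layout
        )

    def forward(self, xyz: torch.Tensor, tau: torch.Tensor):
        feat = torch.cat([xyz, tau.expand(xyz.shape[0], 1)], dim=-1)
        out = self.mlp(feat)
        return out[:, :3], out[:, 3:]  # (delta_position, delta_rotation)

# Canonical Gaussians G_c: only the centers are shown; rotation/scale/opacity/SH omitted.
num_gaussians, T = 10_000, 16
canonical_xyz = torch.randn(num_gaussians, 3)
field = MotionField()

deformed_frames = []
for t in range(T):
    tau = torch.full((1, 1), t / (T - 1))
    d_xyz, _d_rot = field(canonical_xyz, tau)
    deformed_frames.append(canonical_xyz + d_xyz)  # G_tau = canonical centers + deformation

# Each frame would then be splatted by a differentiable renderer: x_tau = R_cam(G_tau).
```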
Score Distillation Sampling (SDS) leverages the knowledge from pretrained text-to-image diffusion models to optimize a parametric representation like 3D-GS. Given an output sample x_0 (e.g., a rendered image from a 3D-GS), SDS operates as follows: stochastic Gaussian noise ε is added to x_0 at a randomly sampled timestep t:
where ᾱ_t is a noise schedule coefficient. After that, a pretrained denoising model ε_φ(x_t, t, y) predicts the noise in x_t, conditioned on the timestep t and a text prompt y. SDS uses the difference between the predicted noise and the sampled stochastic noise as the gradient to update the parameterized representation:
where w(t) is the weighting function. Recent works [McAllister et al. 2024;Yang et al. 2023] formulate SDS as a domain transportation problem, aiming to find the optimal transport from the current data distribution D_c to the target distribution D_t. Here, the rendered sample x_0 is drawn from D_c as x_0 ∼ D_c, while the text condition y describes the target distribution D_t. SDS approximates the optimal transport step δ* between D_c and D_t at a given timestep t by:
Here, ε_φ(x_t, t, y) is a projection of the noised image x_t onto the target distribution, and ε is random Gaussian noise, ε ∼ N(0, I).
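As a reference for the discussion above, the sketch below assembles the vanilla SDS gradient in PyTorch under stated assumptions: denoiser stands for an ε-prediction network ε_φ(x_t, t, y) and alphas_cumprod for its noise schedule; neither corresponds to a specific library API.

```python
# Hedged sketch of the vanilla SDS update described above; `denoiser` is an assumed
# epsilon-prediction interface, not a specific library call.
import torch

def sds_grad(x0, denoiser, text_emb, alphas_cumprod, w=lambda t: 1.0):
    """Return the SDS gradient w.r.t. the rendered sample x0."""
    t = torch.randint(20, 980, (1,))                      # random timestep
    a_bar = alphas_cumprod[t].view(-1, *([1] * (x0.dim() - 1)))
    eps = torch.randn_like(x0)                            # stochastic Gaussian noise
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps    # forward diffusion (Eq. 1)
    with torch.no_grad():
        eps_pred = denoiser(x_t, t, text_emb)             # predicted noise eps_phi(x_t, t, y)
    return w(t) * (eps_pred - eps)                        # gradient pushed back to x0
```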
Given the canonical 3D-GS G_c of a 3D object and a text prompt y describing the desired motion, our method aims to automatically predict a motion field for G_c. This produces a Gaussian sequence G_{0:T−1} that exhibits substantial, photorealistic motion while preserving the object’s appearance. To achieve this, as illustrated in Fig. 3, we first introduce Motion Score Distillation (§4.1), an enhanced SDS framework tailored for motion learning. It incorporates dual distribution modeling (§4.1.1) and an appearance-preserving noise estimation (§4.1.2) to better guide motion generation. We further propose temporal and spatial regularization terms (§4.2) to constrain the deformation fields, and a motion refinement method that extends the temporal length and enhances motion detail (§4.3), as shown in Fig. 5. Given a dynamic text prompt y and a static text prompt y′, the loss gradient is computed from two predicted noises, which guide the optimization of the motion field.
Building on the explanation of SDS in the previous section, we propose a novel approach called Motion Score Distillation (MSD). MSD aims to estimate the optimal transport from a static source distribution to a dynamic target distribution. Different from SDS, MSD approximates the optimal motion step δ*_motion between the dual distributions at a given timestep t as follows:
Thus, our MSD is formulated as follows:
We demonstrate that ε_dynamic − ε_static serves as an effective gradient when both the source static and target dynamic distributions are well expressed. Next, we detail the definitions of ε_dynamic and ε_static.
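Before detailing the two terms, the following sketch shows how the MSD gradient could be assembled once both noise predictions are available; denoiser, denoiser_lora, and the prompt embeddings are assumed interfaces rather than the paper's released code.

```python
# Minimal sketch of the MSD gradient: the transport direction is the difference between
# a dynamic-prompt prediction from the base video model and a static-prompt prediction
# from the LoRA-adapted "static" model. Both denoisers are assumed interfaces.
import torch

def msd_grad(x_d_t, x_s_t, t, denoiser, denoiser_lora, y_dyn, y_static, w=1.0):
    with torch.no_grad():
        eps_dynamic = denoiser(x_d_t, t, y_dyn)         # target: dynamic distribution
        eps_static = denoiser_lora(x_s_t, t, y_static)  # source: static distribution
    return w * (eps_dynamic - eps_static)               # gradient for the motion field
```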
4.1.1 Dual distribution modeling. Given a time sequence {0 : T−1}, we define the static video rendered from the static 3D-GS G_c as x^s_{0:T−1}, and the dynamic video rendered from the dynamic 3D-GS G_{0:T−1} as x^d_{0:T−1}. Similar to SDS, the target dynamic distribution can be approximated using a pretrained latent video diffusion model:
Here, the text prompt y describes the motion of the object, such as “A walking
where y′ is a static text description. However, we observe that even when conditioned on a static description, the video diffusion model does not consistently generate videos of truly static objects, thus violating the second requirement. To address this, we propose to efficiently derive a static denoiser by applying Low-Rank Adaptation (LoRA) [Hu et al. 2022]:
where ε_lora is the LoRA-finetuned denoiser. The LoRA parameters are trained using the static video x^s_{0:T−1}, with the loss function:
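A minimal sketch of one such LoRA fine-tuning step is given below, assuming a standard ε-prediction objective; denoiser_lora and the LoRA parameter handling (e.g., built with a library such as peft, rank=4 and alpha=4 as in our implementation details) are assumptions for illustration.

```python
# Sketch of fitting the LoRA "static" denoiser on the rendered static video x^s_{0:T-1}
# with a standard epsilon-prediction loss. Assumes the base weights are frozen and `opt`
# only contains the LoRA parameters.
import torch
import torch.nn.functional as F

def lora_static_step(x_static_latent, denoiser_lora, y_static, alphas_cumprod, opt):
    t = torch.randint(0, alphas_cumprod.numel(), (1,))
    a_bar = alphas_cumprod[t].view(-1, *([1] * (x_static_latent.dim() - 1)))
    eps = torch.randn_like(x_static_latent)
    x_t = a_bar.sqrt() * x_static_latent + (1 - a_bar).sqrt() * eps
    loss = F.mse_loss(denoiser_lora(x_t, t, y_static), eps)  # only LoRA weights receive updates
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```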
4.1.2 Appearance-preserving faithful noise estimation. In SDS and its variants [Bahmani et al. 2024a;Ling et al. 2024], we observe that it is difficult to preserve the original appearance of the static object; in some cases, the object even drifts into the background. SDS’s noise estimation entangles motion and appearance, yet we only optimize the motion field. As a result, any appearance change must be compensated by geometric distortion, causing artifacts, an issue we refer to as motion-appearance entanglement. We find that this issue is strongly correlated with the stochastic noise ε added during the diffusion process in Eq. 1; we assess this observation in Fig. 4. Therefore, instead of adding stochastic noise, we adopt DDIM inversion [Lukoianov et al. 2024;Song et al. 2022] to obtain deterministic and faithful noise.

Fig. 4. Given an input image, we add noise using the noise from SDS and from our MSD, and then denoise it to obtain an estimated image. The image denoised with SDS differs significantly from the original, exhibiting large appearance changes and background noise. In contrast, our MSD better preserves the appearance and maintains a clearer background. All latents are decoded into pixel space for visualization. We use t=600 in this case.

Given a noised input x_t, we first
predict the noise ε_φ(x_t, t, y) using the pretrained diffusion model.
We then estimate the corresponding denoised image x̂_0 as:
Subsequently, we apply deterministic forward noising steps to obtain x_{t+1} iteratively, continuing until the predefined timestep t:
DDIM inversion provides a deterministic noise estimate, producing a denoised output x̂_0 that is faithfully consistent with the input video. This facilitates appearance preservation during the optimization process. We then apply this method to x^d_t in Eq. 6 and to x^s_t in Eq. 8.
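A hedged sketch of this inversion procedure is shown below; the timestep spacing, the denoiser interface, and starting the inversion from the clean latent are illustrative choices rather than the exact implementation.

```python
# Sketch of DDIM inversion for faithful noise estimation: starting from the clean latent,
# repeatedly predict epsilon, form the x0 estimate, and re-noise deterministically up to
# the target timestep.
import torch

@torch.no_grad()
def ddim_invert(x0, denoiser, y, alphas_cumprod, t_target, step=20):
    x = x0
    timesteps = list(range(0, t_target, step))            # illustrative spacing
    for t_cur, t_next in zip(timesteps[:-1], timesteps[1:]):
        a_cur = alphas_cumprod[t_cur]
        a_next = alphas_cumprod[t_next]
        eps = denoiser(x, torch.tensor([t_cur]), y)        # predicted noise
        x0_hat = (x - (1 - a_cur).sqrt() * eps) / a_cur.sqrt()  # estimated clean sample
        x = a_next.sqrt() * x0_hat + (1 - a_next).sqrt() * eps  # deterministic forward step
    return x  # inverted latent at t_target: deterministic, appearance-faithful noise state
```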
To further improve performance, our method incorporates both temporal and spatial regularization terms.
Total variation in 3D (TV-3D) for temporal regularization. Inspired by traditional 2D total variation (TV) losses applied in pixel space, we propose a TV-3D loss to encourage temporal smoothness in motion. This loss directly penalizes abrupt changes in the 3D positions of Gaussians across consecutive timesteps. Specifically, it computes the L1 norm of the positional differences for each Gaussian between adjacent frames:
Here, x_{i,τ} denotes the position of the i-th 3D Gaussian at timestep τ.
By operating in the 3D Gaussian space, this constraint effectively enforces temporal consistency in the underlying geometric motion.
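A direct PyTorch rendering of this loss could look as follows, assuming the per-frame Gaussian positions are stacked into a single (T, N, 3) tensor.

```python
# Sketch of the TV-3D temporal smoothness loss: L1 penalty on per-Gaussian position
# changes between adjacent frames. `positions` has shape (T, N, 3); the mean reduction
# is an assumption.
import torch

def tv3d_loss(positions: torch.Tensor) -> torch.Tensor:
    diffs = positions[1:] - positions[:-1]   # per-Gaussian displacement between frames
    return diffs.abs().mean()                # L1 norm, averaged over Gaussians and time
```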
As-rigid-as-possible (ARAP) for spatial regularization. To facilitate the learning of rigid motion dynamics while preserving the high-fidelity appearance of the static reference model, we employ an ARAP [Sorkine and Alexa 2007] regularization term:
where p^τ_i, p^τ_j ∈ R^3 are the spatial positions of point i and its neighbor j at the current frame τ, and p^c_i, p^c_j ∈ R^3 are their corresponding positions in the canonical 3D-GS. N_i denotes the set of neighboring point indices for p^c_i, defined in the reference configuration (e.g., points within a fixed radius of p^c_i). R^τ_i ∈ SO(3) is the optimal local rigid rotation for point p_i at frame τ. This ARAP loss enforces spatial consistency by encouraging locally rigid deformations, thereby promoting realistic motion while preserving geometric fidelity.
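The sketch below illustrates one way to realize this ARAP term, fitting the local rotations with the Kabsch/Procrustes solution over precomputed canonical neighbors; the fixed-K neighbor layout, the uniform weighting, and the mean reduction are assumptions.

```python
# Sketch of the ARAP spatial regularizer: for each Gaussian, fit the best local rotation
# aligning canonical neighbor edges to current-frame edges (via SVD), then penalize the
# residual. Neighbor indices are assumed to be precomputed in the canonical configuration.
import torch

def arap_loss(p_cur, p_can, neighbors):
    """p_cur, p_can: (N, 3); neighbors: (N, K) long tensor of canonical neighbor indices."""
    e_can = p_can[neighbors] - p_can[:, None, :]       # canonical edges (N, K, 3)
    e_cur = p_cur[neighbors] - p_cur[:, None, :]       # current-frame edges (N, K, 3)
    # Best-fit local rotations via the Kabsch/Procrustes solution.
    cov = torch.einsum("nki,nkj->nij", e_can, e_cur)   # (N, 3, 3) cross-covariance
    U, _, Vh = torch.linalg.svd(cov)
    R = Vh.transpose(-1, -2) @ U.transpose(-1, -2)     # candidate rotations (N, 3, 3)
    # Reflection fix so that det(R) = +1 for every Gaussian.
    det = torch.linalg.det(R)
    Vh_fixed = Vh.clone()
    Vh_fixed[det < 0, -1] *= -1
    R = Vh_fixed.transpose(-1, -2) @ U.transpose(-1, -2)
    residual = e_cur - torch.einsum("nij,nkj->nki", R, e_can)
    return residual.square().sum(-1).mean()
```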
The fixed frame length in video diffusion models limits the ability of SDS-generated 3D animations to capture fine-grained motion details. To address this, we introduce a motion refinement module that leverages a high-capacity, pre-trained rectified-flow-based text-to-video model to generate long, detailed animation sequences, as illustrated in Fig. 5. Given a time sequence {0 : T−1}, we interpolate it to produce T′ = 2T−1 frames. We then render the 3D-GS with the motion field, resulting in a higher-resolution video x^d_{0:T′−1} with dimensions H′ × W′ × T′. Following the SDEdit [Meng et al. 2022] framework, we add noise to x^d_{0:T′−1} and apply iterative denoising to generate a refined video x̂^d_{0:T′−1}. This process preserves the original motion while enhancing temporal consistency and motion details. Finally, we optimize the motion field using an L1 loss between the refined video x̂^d_{0:T′−1} and the initial input x^d_{0:T′−1}. This results in a motion field capable of producing longer and more detailed animations than the original.
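The following sketch summarizes this refinement loop under stated assumptions; render_fn and refine_denoise are placeholder interfaces for the motion-field renderer and the larger pretrained video model, and the edit timestep is illustrative.

```python
# Hedged sketch of the motion refinement stage: render an interpolated 2T-1 frame video,
# perturb it to an intermediate noise level (SDEdit-style), denoise it with a larger
# pretrained video model, and fit the motion field to the refined video with an L1 loss.
import torch
import torch.nn.functional as F

def refine_step(render_fn, refine_denoise, alphas_cumprod, T, t_edit=400):
    taus = torch.linspace(0.0, 1.0, 2 * T - 1)        # interpolated time sequence (T' = 2T - 1)
    video = render_fn(taus)                            # (T', C, H', W'), differentiable w.r.t. the motion field
    with torch.no_grad():
        a_bar = alphas_cumprod[t_edit]
        noisy = a_bar.sqrt() * video + (1 - a_bar).sqrt() * torch.randn_like(video)
        refined = refine_denoise(noisy, t_edit)        # iterative denoising from t_edit back to 0
    return F.l1_loss(video, refined)                   # L1 loss that supervises the motion field
```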
Implementation details. We implement our method using threestudio [Guo et al. 2023] and use ModelScopeT2V [Wang et al. 2023c] as the base video model. The image resolution is set to 256, with 16 frames per video. All experiments are conducted on a single 24 GB GPU, with approximately 3k iterations for LoRA (rank=4, alpha=4) fine-tuning with a learning rate of 1e-5, 5k iterations for motion distillation, and 100 iterations for motion detailization.
Fig. 6. Qualitative comparison with state-of-the-art methods on the prompts “A wandering lion.”, “A swimming clown fish.”, and “A giraffe is walking.”, shown from multiple views. Our method demonstrates more substantial motion and higher visual fidelity in 3D animation. We recommend watching the demo video for better visualization.
Fig. 7. Comparison with the concurrent work [Li et al. 2025]. AKD utilizes skeleton-based motion, which tends to result in perceptible stiffness in its animations.
Evaluation metrics. Following previous works [Bahmani et al. 2024a;Ling et al. 2024], we utilize CLIP-image to evaluate the semantic similarity between the rendered canonical 3D-GS and the 3D-GS with the motion field. For this, we render 8 views evenly spaced around the azimuth for calculation. Additionally, we use CLIP-text to assess the alignment of the rendered object with the given text prompt. To evaluate the overall quality of the rendered videos, we compute FID [Heusel et al. 2017] and FVD [Unterthiner et al. 2018].
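For reference, a possible realization of these CLIP metrics with the Hugging Face transformers CLIP wrappers is sketched below; the specific checkpoint, view handling, and averaging are assumptions rather than the exact evaluation code.

```python
# Sketch of the CLIP-based metrics: CLIP-Image compares renders of the canonical vs.
# animated object across views, CLIP-Text compares animated renders against the prompt.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_image_score(canonical_renders, animated_renders):
    """Both arguments: lists of PIL images rendered from 8 evenly spaced azimuth views."""
    inp = processor(images=canonical_renders + animated_renders, return_tensors="pt")
    feats = model.get_image_features(**inp)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    n = len(canonical_renders)
    return (feats[:n] * feats[n:]).sum(-1).mean().item()   # mean per-view cosine similarity

@torch.no_grad()
def clip_text_score(animated_renders, prompt):
    inp = processor(text=[prompt], images=animated_renders, return_tensors="pt", padding=True)
    img = model.get_image_features(pixel_values=inp["pixel_values"])
    txt = model.get_text_features(input_ids=inp["input_ids"], attention_mask=inp["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return (img @ txt.T).mean().item()                     # mean image-text cosine similarity
```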
Comparison setting. We compare our method with two state-of-the-art SDS-based 3D motion generation methods: AYG [Ling et al. 2024] and TC4D [Bahmani et al. 2024a]. For TC4D, since the authors did not release the animated 3D objects, we were unable to use the same 3D assets for animation. Instead, we extracted screenshots from their results and then used Trellis [Xiang et al. 2024] to generate comparable 3D objects for evaluation. Since AYG did not release their code, we opted to use self-reproduced results for a fairer comparison. To ensure consistency across evaluations, we used the same 3D object generation approach as with AYG. We also include comparisons with 4dfy [Bahmani et al. 2024b] and Dream-in-4D [Zheng et al. 2024] in the supplementary materials. For all comparisons, we used the same motion description. For a fair comparison, we do not employ motion refinement (§4.3).
We report the quantitative results in Table 1. Our method achieves a better CLIP-Image score, as it incorporates an appearance-preserving faithful noise estimation that prevents appearance degradation during distillation. Moreover, our method outperforms others in the CLIP-Text score, demonstrating that our MSD better generates motions that align with the user’s intention. Furthermore, the improved FID and FVD values confirm that our method produces high-quality results. Fig. 6 presents a visual comparison. AYG struggles to generate semantic-level and large motions due to its use of a simple SDS variant, as seen in the giraffe example. Furthermore, the motion generated by AYG is not smooth and exhibits noticeable flickering, particularly evident in the fish case. TC4D shows simple translations along the trajectory with minimal skeleton movement. It also suffers from significant distortions and appearance changes, as demonstrated in the lion example. While both AYG and TC4D can animate the 3D object, they fail to produce natural and realistic motions. In contrast, our framework, leveraging MSD along with regularizations, generates more substantial and realistic 3D motions.
We conduct a user preference study to evaluate performance in Table 2. Users are asked to evaluate five key aspects of the generated dynamic object. First, overall quality provides a general evaluation of the rendered object. Appearance preservation focuses on detecting any undesirable appearance deformations. Motion dynamism assesses the extent of the object’s movement, with a preference for larger motions. Motion-text alignment measures how well the generated motion corresponds to the text prompt. Finally, motion realism evaluates the naturalness of the generated motion. For each aspect, users are asked to select their preferred option from AYG, TC4D, and our method. We received 17 valid responses and present the user preference rates for each aspect. Our method achieves the highest preference rate across all aspects, demonstrating that it generates more natural and realistic motions.

In our ablation, we observe that generating substantial and large motion is difficult without explicitly modeling the static distribution. While variant (d) produces sufficient motion, the lack of faithful noise significantly compromises the original appearance fidelity and introduces notable background artifacts. Without our MSD, the approach degrades to conventional SDS, which either fails to generate noticeable motion with a small CFG (a) or results in meaningless motion and severe appearance distortion with a large CFG (b). In contrast, our full MSD method is uniquely effective in generating realistic motion while robustly preserving the original appearance fidelity.
5.3.2 Denoising with approximated static distribution. As discussed in §4.1.1, an alternative approach to defining the source static distribution is to use a static text prompt, such as “low motion, static statue, not moving, no motion,
The effectiveness of motion regularization. As illustrated in Fig. 11 (a), the TV-3D loss encourages temporal smoothness in Gaussian points, and removing this loss results in large displacements between consecutive frames. In panel (b), we show that the ARAP loss provides a crucial spatial constraint for maintaining rigid motion; without this loss, the object experiences significant distortion or may even break apart. Therefore, we incorporate both temporal and spatial regularization terms in our method to further enhance the results of MSD.
We illustrate the motion distillation process in Fig. 13, where our framework progressively optimizes the motion field of the 3D object (goat) over multiple optimization steps. This allows the 3D object to appear “moving” while preserving its original appearance. Due to the frame length limitation of the underlying video diffusion, SDS can only generate fixed-length animations, resulting in non-continuous motion (i.e., large motion changes between consecutive frames) in the original 3D animation. In contrast, our motion refinement approach generates more fine-grained motion (longer 3D animations) by leveraging a larger video diffusion model. As shown in Fig. 12, our motion refinement transforms non-continuous motion, caused by the fixed-length video diffusion, into more natural and smooth animation sequences. We present additional results in Fig. 8. Our method supports a variety of motion descriptions, such as “playing,” “walking,” “flying,” and “swimming.” Furthermore, our approach can animate not only animals but also general objects, as demonstrated with the red flag of Hong Kong.
In this paper, we introduce Motion Score Distillation (MSD) for text-driven 3D animation. Our method formulates score distillation as distribution transportation, enhancing conventional techniques through dual distribution modeling and faithful noise. Specifically, we tackle the challenge of static video distribution modeling by using LoRA-enhanced video diffusion, and we perform appearance-preserving faithful noise estimation to mitigate the appearance changes often encountered in SDS. Additionally, we integrate spatial-temporal geometric motion regularizations and apply motion detailization using large video models to ensure scalability. Experimental results on text-driven 3D animation, along with comprehensive ablation studies, demonstrate that our method outperforms current state-of-the-art approaches and validates its effectiveness. In Fig. 14, we illustrate a limitation of our method: it struggles to model new content appearing in the scene, such as ejected fluid. Instead of generating this new content, the model tends to introduce distortions as a form of compensation. This issue could potentially be addressed in the future by designing new particle generation and modeling strategies. Another challenge is the optimization time, a common drawback of score distillation methods, which typically requires several hours. This inefficiency could be mitigated through techniques such as amortized training [Lorraine et al. 2023] or the adoption of more efficient data structures [Müller et al. 2022].
Table 1. Quantitative comparison of AYG [Ling et al. 2024], TC4D [Bahmani et al. 2024a], and our method in terms of CLIP-Image ↑, CLIP-Text ↑, FID ↓, and FVD ↓.