3DProxyImg: Controllable 3D-Aware Animation Synthesis from Single Image via 2D-3D Aligned Proxy Embedding


3D animation is central to modern visual media, yet traditional production pipelines remain labor-intensive, expertise-demanding, and computationally expensive. Recent AIGC-based approaches partially automate asset creation and rigging, but they either inherit the heavy costs of full 3D pipelines or rely on video-synthesis paradigms that sacrifice 3D controllability and interactivity. We focus on single-image 3D animation generation and argue that progress is fundamentally constrained by a trade-off between rendering quality and 3D control. To address this limitation, we propose a lightweight 3D animation framework that decouples geometric control from appearance synthesis. The core idea is a 2D-3D aligned proxy representation that uses a coarse 3D estimate as a structural carrier, while delegating high-fidelity appearance and view synthesis to learned image-space generative priors. This proxy formulation enables 3D-aware motion control and interaction comparable to classical pipelines, without requiring accurate geometry or expensive optimization, and naturally extends to coherent background animation. Extensive experiments demonstrate that our method achieves efficient animation generation on low-power platforms and outperforms video-based 3D animation generation in identity preservation, geometric and textural consistency, and the level of precise, interactive control it offers to users.


💡 Research Summary

The paper introduces 3DProxyImg, a lightweight framework that generates controllable 3D‑aware animations from a single static image. The authors identify a fundamental trade‑off in existing approaches: traditional 3D pipelines deliver precise geometric control but are labor‑intensive and computationally heavy, while recent video‑synthesis methods produce high‑quality frames but lack explicit 3D controllability and interactivity. To bridge this gap, 3DProxyImg decouples geometric control from appearance synthesis through a 2D‑3D aligned proxy embedding.

Pipeline Overview

  1. Coarse 3D Reconstruction & Alignment – An input image is first processed by a monocular geometry estimator (VGGT) to obtain a pixel‑aligned point cloud (P_vggt) and by a single‑view 3D generation model (HunYuan3D) to produce a full mesh (P_hy). Because these two outputs differ in scale, pose, and completeness, the authors apply ICP‑based matching followed by a mask‑driven optimization (T_opt) that minimizes the discrepancy between the projected mesh mask and the object mask in the image. The result is an aligned point set P_aligned that faithfully represents the object geometry in the image coordinate system.
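The ICP-based matching in this step can be sketched as a nearest-neighbour loop around a closed-form similarity fit (Umeyama/Kabsch). This is a generic illustration in plain NumPy, not the paper's exact T_opt, which additionally optimizes a mask-overlap term; the function name and iteration count are assumptions.

```python
import numpy as np

def icp_align(src, dst, n_iters=20):
    """Align point cloud src (N, 3) onto dst (M, 3) with a similarity transform.

    Each iteration: (1) match every src point to its nearest dst point,
    (2) solve scale s, rotation R, translation t in closed form, (3) apply.
    """
    cur = src.copy()
    for _ in range(n_iters):
        # Nearest-neighbour correspondences (brute force, for clarity only).
        d2 = ((cur[:, None, :] - dst[None, :, :]) ** 2).sum(-1)
        match = dst[d2.argmin(axis=1)]
        # Umeyama fit: minimize || s * R @ cur_i + t - match_i ||^2.
        mu_s, mu_d = cur.mean(0), match.mean(0)
        cs, cd = cur - mu_s, match - mu_d
        U, S, Vt = np.linalg.svd(cd.T @ cs)
        R = U @ Vt
        if np.linalg.det(R) < 0:          # guard against reflections
            U[:, -1] *= -1
            R = U @ Vt
        s = S.sum() / (cs ** 2).sum()
        t = mu_d - s * R @ mu_s
        cur = s * cur @ R.T + t
    return cur
```

In practice the coarse Hunyuan3D mesh and the VGGT point cloud overlap only partially, so a robust variant (trimmed correspondences, mask term) would be needed; the sketch shows only the core alignment loop.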

  2. Proxy Construction – P_aligned is down‑sampled to a sparse set of proxy vertices V, each equipped with a learnable texture feature vector f_i. Positional encoding (γ) is applied to these features to capture high‑frequency details, mirroring the strategy used in NeRF. The vertices are triangulated, depth‑sorted, and projected onto the image plane (V_β) using the camera parameters from VGGT. For each pixel inside the projected region, barycentric interpolation of the three surrounding vertex features yields a per‑pixel feature f_p, which is decoded by an MLP D_θ into RGB color.
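The per-pixel shading path of this step — NeRF-style positional encoding γ followed by barycentric blending of the three covering vertex features — can be sketched as below. Function names, frequency count, and feature width are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def gamma(x, n_freqs=4):
    """NeRF-style positional encoding: sin/cos at octave-spaced frequencies."""
    bands = (2.0 ** np.arange(n_freqs)) * np.pi   # pi, 2*pi, 4*pi, ...
    ang = x[..., None] * bands                    # (..., D, n_freqs)
    enc = np.concatenate([np.sin(ang), np.cos(ang)], axis=-1)
    return enc.reshape(*x.shape[:-1], -1)         # (..., D * 2 * n_freqs)

def pixel_feature(bary, tri_feats):
    """Barycentric interpolation of a triangle's three vertex features.

    bary: (3,) barycentric weights summing to 1; tri_feats: (3, F) features
    of the triangle's vertices. Returns the per-pixel feature f_p, shape (F,).
    """
    return bary @ tri_feats
```

The resulting f_p would then be fed to the MLP decoder D_θ to produce the pixel's RGB value; that decoder is learned, so it is omitted here.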

  3. Appearance Optimization – Two complementary loss terms guide the learning of the proxy features and decoder:

    • L_MSE – a pixel‑wise mean‑squared error between the rendered view (β_ref) and the original image, ensuring fidelity for the known viewpoint.
    • L_SDS – Score Distillation Sampling loss derived from a pretrained 2D diffusion model (e.g., Stable Diffusion). Random camera poses are sampled, the diffusion model predicts noise ε̂ for the noised rendered latent z_t, and the weighted residual (ε̂ − ε) is treated as the gradient with respect to z_t and back‑propagated to update θ and the f_i. This term enforces multi‑view consistency and injects high‑frequency texture details that are missing from the coarse geometry.
      The total loss L_total = α₁·L_MSE + α₂·L_SDS balances single‑view fidelity and novel‑view realism.
  4. Animation & Interaction – Because the proxy vertices form a lightweight mesh‑like graph, standard rigging and skinning pipelines can be applied directly. Users can specify positional constraints on a subset of vertices (e.g., hand or limb anchors). The system propagates these constraints through a Position‑Based Dynamics (PBD) solver that respects edge‑length preservation and local rigidity, yielding physically plausible deformations. Background regions are handled by a separate proxy propagation module that fills in missing content while preserving overall scene coherence.
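The PBD step in item 4 can be illustrated with a minimal Gauss–Seidel edge-length projection over the proxy graph. This is a textbook PBD sketch under assumed names and parameters, not the paper's actual solver; pinned vertices model the user's positional constraints (e.g., a dragged hand anchor).

```python
import numpy as np

def pbd_solve(pos, edges, rest_len, pinned, n_iters=100):
    """Relax a proxy graph so edges return to rest length while handles stay put.

    pos: (V, 3) vertex positions, edges: list of (i, j) index pairs,
    rest_len: per-edge rest lengths, pinned: {vertex: target position}.
    """
    p = np.asarray(pos, dtype=float).copy()
    w = np.ones(len(p))                   # inverse masses
    for v, target in pinned.items():
        p[v] = target
        w[v] = 0.0                        # pinned handles never move
    for _ in range(n_iters):
        for (i, j), r in zip(edges, rest_len):
            d = p[j] - p[i]
            dist = np.linalg.norm(d)
            wsum = w[i] + w[j]
            if dist < 1e-9 or wsum == 0.0:
                continue
            # Split the length correction between endpoints by inverse mass.
            corr = (dist - r) / wsum * (d / dist)
            p[i] += w[i] * corr
            p[j] -= w[j] * corr
    return p
```

A full solver would add the local-rigidity constraints mentioned above and an outer time-integration loop; the sketch shows only the constraint-projection core.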

Key Contributions & Findings

  • Proxy‑Based Decoupling – By treating geometry as a coarse carrier and delegating fine appearance to a diffusion‑guided implicit renderer, the method avoids the need for accurate watertight meshes or expensive NeRF training.
  • Efficient Multi‑View Consistency – SDS provides a powerful, data‑free prior that aligns the rendered proxy textures across arbitrary viewpoints, eliminating the requirement for dense multi‑view supervision.
  • Interactive Control – The sparse proxy graph enables real‑time manipulation (rotation, translation, scaling) and supports artist‑friendly rigging without re‑training.
  • Low‑Resource Viability – Experiments demonstrate real‑time performance on mobile GPUs, with frame rates exceeding 30 fps, and superior metrics in identity preservation, geometric fidelity, and texture consistency compared to state‑of‑the‑art video‑based methods.
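The SDS prior referenced above can be sketched as follows. Here eps_hat_fn stands in for the frozen diffusion model's noise predictor ε̂, and the weight w_t and schedule value ᾱ_t are illustrative; this mirrors the standard SDS gradient (the residual is pushed straight through to the rendered latent, skipping the U-Net Jacobian), not necessarily the paper's exact variant.

```python
import numpy as np

def sds_grad(z, eps_hat_fn, alpha_bar_t, w_t, rng):
    """Score Distillation Sampling gradient for a rendered latent z.

    z: latent of the proxy rendering at a sampled camera pose.
    eps_hat_fn: frozen diffusion model's noise predictor (stand-in here).
    Returns w_t * (eps_hat - eps), used as dL_SDS/dz; the chain rule through
    the differentiable renderer then updates theta and the vertex features f_i.
    """
    eps = rng.standard_normal(z.shape)                       # forward-process noise
    z_t = np.sqrt(alpha_bar_t) * z + np.sqrt(1.0 - alpha_bar_t) * eps
    eps_hat = eps_hat_fn(z_t)                                # frozen denoiser
    return w_t * (eps_hat - eps)
```

In a real setup z would come from the diffusion model's VAE encoder and eps_hat_fn would be conditioned on a text prompt and the sampled camera pose.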

Implications
3DProxyImg offers a practical pathway for AI‑assisted 3D content creation where only a single reference image is available. Its blend of coarse geometry, diffusion‑driven texture synthesis, and mesh‑compatible proxy representation makes it suitable for AR/VR applications, rapid prototyping in game development, and interactive storytelling tools that demand both visual quality and precise 3D control without the overhead of full 3D pipelines.

