IM-Animation: An Implicit Motion Representation for Identity-decoupled Character Animation
Recent progress in video diffusion models has markedly advanced character animation, which synthesizes videos by animating a static identity image according to a driving video. Explicit methods represent motion using skeletons, DWPose, or other structured signals, but struggle to handle spatial mismatches and varying body scales. Implicit methods, on the other hand, capture high-level motion semantics directly from the driving video, but suffer from identity leakage and entanglement between motion and appearance. To address these challenges, we propose a novel implicit motion representation that compresses per-frame motion into compact 1D motion tokens. This design relaxes the strict spatial constraints inherent in 2D representations and effectively prevents identity information from leaking out of the driving video. Furthermore, we design a temporally consistent mask-token-based retargeting module that enforces a temporal training bottleneck, mitigating interference from the source image’s pose and improving retargeting consistency. Our methodology employs a three-stage training strategy to enhance training efficiency and ensure high fidelity. Extensive experiments demonstrate that our implicit motion representation and the proposed IM-Animation achieve superior or competitive performance compared with state-of-the-art methods.
💡 Research Summary
The paper introduces IM‑Animation, a diffusion‑based framework that generates high‑quality character animation videos by transferring motion from a driving video to a static identity image while fully decoupling identity and motion information. Existing approaches fall into two categories. Explicit methods (e.g., skeleton, DWPose, SMPL) provide precise spatial control but struggle with large shape, scale, or pose mismatches, especially in cross‑subject reenactment. Implicit methods learn motion directly from video frames, offering flexibility, yet they often leak appearance cues from the driving video into the generated output, causing identity contamination and entanglement between motion and appearance.
IM‑Animation tackles both problems with two novel components:
- Compact 1‑D Motion Tokens – Instead of representing motion on a 2‑D grid, each frame of the driving video is encoded into a sequence of learnable latent tokens (Nm = 32) that are concatenated with the patchified video tokens and fed to a Vision Transformer encoder. The encoder’s output is quantized against a learned codebook, producing a discrete 1‑D token sequence M_img of size (Nm × C_m). Because the representation is independent of spatial patches, it captures only dynamic semantics (joint trajectories, temporal patterns) and discards any spatial layout or appearance information of the source subject. Supervision is provided by a joint‑heatmap decoder during the first training stage, ensuring the tokens faithfully encode pose dynamics.
- Mask‑Token Temporal Retargeting Module – To fuse the motion tokens with the identity image, the authors introduce learnable mask tokens that are concatenated with the identity latent features (from a VAE encoder) and the motion tokens before entering a transformer‑based retargeting block. Separate positional embeddings differentiate the three token types. The mask tokens act as a bottleneck in self‑attention, preventing the identity features from directly passing motion cues and forcing the network to rely on the compact motion tokens for dynamic information. This design eliminates identity leakage from the driving video and ensures that the pose of the source image does not bias the final animation.
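The codebook quantization behind the 1‑D motion tokens can be sketched as a standard nearest‑neighbor lookup. This is an illustrative reconstruction, not the authors' code: the codebook size (512) and channel width are assumptions; only Nm = 32 comes from the summary above.

```python
import numpy as np

# Hypothetical sketch of the 1-D motion-token quantization step.
# N_M = 32 follows the summary; C_M and CODEBOOK_SIZE are assumptions.
N_M, C_M = 32, 64          # tokens per frame, token channel dim
CODEBOOK_SIZE = 512        # assumed number of codebook entries

rng = np.random.default_rng(0)
codebook = rng.standard_normal((CODEBOOK_SIZE, C_M))

def quantize_frame(encoder_out: np.ndarray):
    """Map continuous encoder outputs (N_M, C_M) to their nearest
    codebook entries, yielding a discrete token sequence like M_img."""
    # squared L2 distance from every token to every codebook entry
    d = ((encoder_out[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    indices = d.argmin(axis=1)            # (N_M,) discrete token ids
    return indices, codebook[indices]     # ids + quantized vectors

frame_feat = rng.standard_normal((N_M, C_M))  # stand-in for ViT output
ids, m_img = quantize_frame(frame_feat)
print(ids.shape, m_img.shape)  # (32,) (32, 64)
```

Because each frame reduces to 32 discrete indices, the motion stream is independent of the driving video's spatial layout, which is the property the paper relies on for identity decoupling.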
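How the retargeting input sequence might be assembled can be sketched as below: identity latents, learnable mask tokens, and motion tokens are concatenated, each offset by a type-specific positional embedding. All sizes and the per-type (rather than per-position) embedding are assumptions for illustration.

```python
import numpy as np

# Illustrative assembly of the retargeting transformer's input sequence.
# Token counts and the shared channel width C are assumed values.
rng = np.random.default_rng(1)
C = 64                      # shared channel width (assumption)
N_ID, N_MASK, N_M = 256, 32, 32

id_tokens   = rng.standard_normal((N_ID, C))    # stand-in for VAE identity latents
mask_tokens = np.zeros((N_MASK, C))             # learnable in practice; zeros here
motion_toks = rng.standard_normal((N_M, C))     # quantized motion tokens

# one positional embedding per token *type*, broadcast over its tokens
pos_id, pos_mask, pos_motion = rng.standard_normal((3, C))

seq = np.concatenate([
    id_tokens + pos_id,
    mask_tokens + pos_mask,
    motion_toks + pos_motion,
], axis=0)
print(seq.shape)  # (320, 64)
```

In the paper's design, self-attention over this sequence is constrained so the mask tokens mediate between identity and motion, acting as the bottleneck described above.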
The overall pipeline is trained in three stages:
- Stage 1 – Motion Encoder Training: The encoder and codebook are trained to reconstruct joint heatmaps from the motion tokens using an auxiliary decoder. This stage forces the 1‑D tokens to capture accurate pose dynamics while remaining agnostic to appearance.
- Stage 2 – Retargeting Module Training: Mask tokens are introduced, and the retargeting transformer is jointly optimized with the motion encoder using the same joint‑heatmap loss, encouraging temporal consistency and proper disentanglement.
- Stage 3 – End‑to‑End Video Diffusion: The retargeted motion tokens are stacked with noisy latent representations and injected into a DiT‑based video diffusion model. The diffusion process denoises the latent while preserving the motion cues. For facial expressions, an external expression encoder (X‑NeMo) provides frame‑wise cross‑attention features that are added via residual connections.
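The staged schedule above amounts to a simple module-freezing plan. The sketch below is a hedged paraphrase of the three stages; all module and loss names are illustrative placeholders, not identifiers from the authors' codebase.

```python
# Which (hypothetical) modules receive gradients in each training stage,
# following the three-stage description in the summary.
STAGES = {
    1: {"train": ["motion_encoder", "codebook", "heatmap_decoder"],
        "loss": "joint_heatmap"},
    2: {"train": ["retargeting_transformer", "mask_tokens", "motion_encoder"],
        "loss": "joint_heatmap"},
    3: {"train": ["dit_diffusion"],
        "loss": "diffusion_denoising"},
}

def trainable(stage: int, module: str) -> bool:
    """Return whether `module` is optimized during a given stage."""
    return module in STAGES[stage]["train"]

# The motion encoder trains in stages 1-2 but is frozen for diffusion.
print(trainable(1, "motion_encoder"), trainable(3, "motion_encoder"))
```

Isolating the heatmap-supervised stages from the diffusion stage is what lets the heavy motion-token learning happen without running the full video diffusion model.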
Extensive experiments on diverse datasets—including full‑body ↔ half‑body, adult ↔ child, and complex choreography—show that IM‑Animation outperforms or matches state‑of‑the‑art methods such as UniAnimate‑DiT, X‑UniMotion, and EfficientMT. Quantitative metrics demonstrate lower identity‑leakage scores, higher pose‑accuracy (lower MPJPE), and comparable or better FVD/CLIP‑Score. Qualitative results illustrate clean limb alignment and faithful preservation of the target’s clothing, hair, and facial features even when the driving subject has a drastically different body shape.
A notable contribution is the training efficiency: the three‑stage strategy isolates the heavy motion‑token learning from the full video diffusion, reducing overall GPU hours by roughly 30 % compared with end‑to‑end implicit methods that require massive video datasets. Moreover, the use of a discrete codebook enables compact storage of motion tokens, facilitating downstream applications such as motion retrieval or editing.
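A back-of-the-envelope calculation illustrates why a discrete codebook makes motion cheap to store. Only Nm = 32 comes from the summary; the 512-entry codebook and 24 fps frame rate are assumptions.

```python
import math

# Storage estimate for discrete motion tokens: each token is just an
# index into the codebook. Codebook size and fps are assumed values.
N_M, CODEBOOK_SIZE, FPS = 32, 512, 24
bits_per_token = math.log2(CODEBOOK_SIZE)        # 9 bits per index
bytes_per_second = N_M * bits_per_token * FPS / 8
print(bytes_per_second)  # 864.0 bytes of motion per second of video
```

Under these assumptions, a full motion track costs well under 1 KB/s, which is what makes retrieval and editing over stored motion tokens practical.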
Limitations include the current focus on relatively high‑resolution (512 × 512) video generation, which limits real‑time deployment, and the reliance on a pre‑trained expression encoder for facial dynamics. Future work could explore lightweight transformer variants, adaptive token lengths for variable motion complexity, and integration of audio‑driven cues.
In summary, IM‑Animation presents a compelling solution for cross‑subject character animation by (i) compressing per‑frame dynamics into spatially invariant 1‑D tokens, (ii) employing mask‑token bottlenecks to achieve clean identity‑motion disentanglement, and (iii) leveraging a staged training pipeline that balances performance with computational cost. This advances the state of the art in diffusion‑based video generation and opens new possibilities for gaming, film, and virtual‑reality content creation.