NECromancer: Breathing Life into Skeletons via BVH Animation

Notice: This research summary and analysis were generated automatically using AI. For complete accuracy, please refer to the original arXiv source.

Motion tokenization is a key component of generalizable motion models, yet most existing approaches are restricted to species-specific skeletons, limiting their applicability across diverse morphologies. We propose NECromancer (NEC), a universal motion tokenizer that operates directly on arbitrary BVH skeletons. NEC consists of three components: (1) an Ontology-aware Skeletal Graph Encoder (OwO) that encodes structural priors from BVH files, including joint semantics, rest-pose offsets, and skeletal topology, into skeletal embeddings; (2) a Topology-Agnostic Tokenizer (TAT) that compresses motion sequences into a universal, topology-invariant discrete representation; and (3) the Unified BVH Universe (UvU), a large-scale dataset aggregating BVH motions across heterogeneous skeletons. Experiments show that NEC achieves high-fidelity reconstruction under substantial compression and effectively disentangles motion from skeletal structure. The resulting token space supports cross-species motion transfer, composition, denoising, generation with token-based models, and text-motion retrieval, establishing a unified framework for motion analysis and synthesis across diverse morphologies. Demo page: https://animotionlab.github.io/NECromancer/


💡 Research Summary

NECromancer (NEC) introduces a universal motion tokenizer that works directly on arbitrary BVH skeletons, removing the long-standing limitation of motion tokenizers that are tied to a fixed human topology. The system consists of three main components. First, the Ontology-aware Skeletal Graph Encoder (OwO) converts the static rest-pose information stored in a BVH file (joint names, parent-child relationships, and offset vectors) into a graph and processes it with multiple graph-attention layers. OwO is pretrained with three self-supervised objectives: a geometric loss that forces the network to predict relative joint offsets, a topological loss that requires correct identification of the lowest common ancestor for any pair of joints (showing that the full tree structure is recoverable from the embeddings), and a semantic loss that aligns each joint's embedding with a CLIP text embedding of its name. The resulting per-joint embeddings capture geometry, topology, and semantics in a compact vector that can be reused for any skeleton.
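To make the three objectives concrete, here is a minimal NumPy sketch on a toy skeleton. The function names (`lca`, `owo_pretrain_losses`), the fixed linear probe for the geometric head, and the toy `PARENTS` array are our own illustrative assumptions, not the paper's implementation; in practice these losses would drive a trainable graph-attention encoder.

```python
import numpy as np

# Hypothetical skeleton as a parent-index array (-1 = root), mirroring a
# BVH hierarchy: root -> spine -> {l_arm, r_arm}; root -> leg.
PARENTS = [-1, 0, 1, 1, 0]

def lca(parents, i, j):
    """Lowest common ancestor of joints i and j in a parent-array tree."""
    ancestors = set()
    while i != -1:
        ancestors.add(i)
        i = parents[i]
    while j not in ancestors:
        j = parents[j]
    return j

def owo_pretrain_losses(emb, offsets, text_emb, parents, pairs):
    """Toy versions of the three self-supervised objectives (our naming).
    emb:      (J, D) per-joint embeddings from the graph encoder
    offsets:  (J, 3) rest-pose offsets read from the BVH HIERARCHY
    text_emb: (J, D) CLIP-style text embeddings of the joint names
    pairs:    list of (i, j) joint pairs used by the pairwise objectives
    """
    # Geometric: predict relative offsets from embedding differences via a
    # fixed linear probe (a stand-in for a learned prediction head).
    W = np.ones((emb.shape[1], 3)) / emb.shape[1]
    geo = np.mean([np.mean(((emb[i] - emb[j]) @ W
                            - (offsets[i] - offsets[j])) ** 2)
                   for i, j in pairs])

    # Topological: a pair's embeddings should score their true LCA highest
    # (cross-entropy over all joints as candidate ancestors).
    topo = 0.0
    for i, j in pairs:
        scores = emb @ (emb[i] + emb[j])          # (J,) logits
        logp = scores - np.log(np.exp(scores).sum())
        topo += -logp[lca(parents, i, j)]
    topo /= len(pairs)

    # Semantic: cosine-align each joint embedding with its name embedding.
    cos = np.sum(emb * text_emb, axis=1) / (
        np.linalg.norm(emb, axis=1) * np.linalg.norm(text_emb, axis=1))
    sem = np.mean(1.0 - cos)

    return geo + topo + sem
```

The LCA target is what forces the embeddings to encode topology: if every pairwise ancestor is predictable, the whole tree is determined.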

Second, the Topology‑Agnostic Tokenizer (TAT) takes a motion sequence expressed as per‑joint rotations (root translation + 6‑D rotation, child joints 6‑D rotation) and, at each (down‑sampled) timestep, augments the set of joint features with a learnable “virtual joint” token. This virtual token attends to all real joints through stacked spatio‑temporal Transformer blocks, gradually summarizing the whole‑body pose into a single vector per frame. The virtual joint vectors are then quantized using Residual Vector Quantization (RVQ) into a discrete codebook of size K, producing a token sequence that is completely independent of the number of joints or their ordering. During decoding, the frozen OwO embeddings of the target skeleton are concatenated with the quantized latent tokens, allowing the decoder to reconstruct joint rotations that respect the specific morphology while preserving the motion dynamics encoded in the token stream.
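The two mechanisms that make the token stream joint-count-independent, attention pooling into a virtual joint and residual quantization, can be sketched in a few lines. This is a simplified single-head, single-step illustration with random codebooks (names `virtual_joint_pool` and `rvq_encode` are ours); the paper uses stacked spatio-temporal Transformer blocks and learned codebooks.

```python
import numpy as np

rng = np.random.default_rng(0)

def virtual_joint_pool(joint_feats, query):
    """One attention step: a learnable virtual-joint query attends over all
    real joint features and summarizes the whole-body pose into one vector.
    joint_feats: (J, D), query: (D,). Works for any number of joints J."""
    logits = joint_feats @ query / np.sqrt(query.shape[0])
    w = np.exp(logits - logits.max())
    w /= w.sum()                                 # softmax attention weights
    return w @ joint_feats                       # (D,) pose summary

def rvq_encode(z, codebooks):
    """Residual Vector Quantization: each stage quantizes the residual left
    by the previous stage. codebooks: list of (K, D) arrays. Returns the
    per-stage token ids and the quantized reconstruction."""
    ids, quantized, residual = [], np.zeros_like(z), z.copy()
    for cb in codebooks:
        k = int(np.argmin(np.sum((cb - residual) ** 2, axis=1)))
        ids.append(k)
        quantized += cb[k]
        residual = z - quantized
    return ids, quantized

# Toy usage: 24 joints, 16-dim features, 2 RVQ stages with K=8 codes each.
feats = rng.normal(size=(24, 16))
query = rng.normal(size=16)
summary = virtual_joint_pool(feats, query)
books = [rng.normal(size=(8, 16)) for _ in range(2)]
tokens, zq = rvq_encode(summary, books)
```

Because only the pooled summary is quantized, the same codebook serves a 20-joint biped and a 120-joint dragon alike; the decoder reintroduces morphology through the frozen skeletal embeddings.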

Third, the authors curate the Unified BVH Universe (UvU), a large-scale benchmark that aggregates 47,807 high-quality BVH motions from HumanML3D, Objaverse-XL, and Truebones Zoo. All motions are converted to a common BVH representation, filtered for physical plausibility, and enriched with automatically generated text captions using Qwen2.5-VL. The dataset contains a wide variety of species, including humans, quadrupeds, and fantasy creatures, with joint counts ranging from roughly 20 to more than 120, providing a rigorous testbed for both seen and unseen topologies.
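The static inputs the skeletal encoder consumes (joint names, parents, rest-pose offsets) all come from the HIERARCHY section of a BVH file. As a rough sketch of that extraction step, here is a minimal parser for a toy hierarchy; the sample BVH text and the function name are illustrative, and End Sites and CHANNELS metadata are deliberately skipped for brevity.

```python
SAMPLE_BVH = """HIERARCHY
ROOT Hips
{
  OFFSET 0.0 0.0 0.0
  CHANNELS 6 Xposition Yposition Zposition Zrotation Xrotation Yrotation
  JOINT Spine
  {
    OFFSET 0.0 10.0 0.0
    CHANNELS 3 Zrotation Xrotation Yrotation
    End Site
    {
      OFFSET 0.0 5.0 0.0
    }
  }
}
"""

def parse_hierarchy(text):
    """Extract (name, parent index, offset) records from a BVH HIERARCHY
    block. Returns a list of dicts, one per ROOT/JOINT entry, in file order."""
    joints, stack = [], []
    pending = -1           # index of the most recently declared joint
    skip_depth = None      # brace depth inside an End Site block, else None
    for line in text.splitlines():
        tok = line.strip().split()
        if not tok:
            continue
        if skip_depth is not None:          # inside an End Site block
            if tok[0] == "{":
                skip_depth += 1
            elif tok[0] == "}":
                skip_depth -= 1
                if skip_depth == 0:
                    skip_depth = None
            continue
        if tok[0] in ("ROOT", "JOINT"):
            parent = stack[-1] if stack else -1
            joints.append({"name": tok[1], "parent": parent, "offset": None})
            pending = len(joints) - 1
        elif tok[0] == "End":
            skip_depth = 0
        elif tok[0] == "OFFSET":
            joints[pending]["offset"] = tuple(float(v) for v in tok[1:4])
        elif tok[0] == "{":
            stack.append(pending)
        elif tok[0] == "}":
            stack.pop()
    return joints
```

A real pipeline would also read CHANNELS orderings and the MOTION block, but the records above already contain everything the graph encoder's geometric and topological objectives need.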

Experiments demonstrate that NEC can reconstruct motions with high fidelity even under strong compression (temporal compression ratios of 4–8), achieving average joint position errors of 2–3 cm and outperforming RVQ-VAE baselines by over 10% in R-Precision@10. The discrete token space enables downstream tasks such as cross-species motion transfer (e.g., applying a human dance to a dragon), motion composition, denoising, and text-to-motion generation using a simple transformer-based language model. Qualitative results show that the tokenizer correctly disentangles motion dynamics from skeletal structure, allowing the same token sequence to be rendered on vastly different bodies while preserving semantic meaning (e.g., "wing flapping" vs. "arm raising").
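The two metrics quoted above are standard and easy to state precisely. Below is a small sketch of both, assuming the common definitions: mean per-joint position error as the average Euclidean distance over frames and joints, and R-Precision@k as the fraction of text queries whose paired motion ranks in the top-k by cosine similarity (the function names are ours; the paper's exact evaluation protocol may differ in detail).

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error: average Euclidean distance between
    predicted and ground-truth joint positions. pred, gt: (T, J, 3)."""
    return float(np.mean(np.linalg.norm(pred - gt, axis=-1)))

def r_precision_at_k(motion_emb, text_emb, k=10):
    """R-Precision@k for text-to-motion retrieval: row i of each matrix is
    a matched (motion, caption) pair; score the fraction of captions whose
    motion appears in the top-k most cosine-similar candidates."""
    m = motion_emb / np.linalg.norm(motion_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    sim = t @ m.T                                  # (N, N) text-to-motion
    ranks = np.argsort(-sim, axis=1)               # best match first
    hits = [i in ranks[i, :k] for i in range(len(ranks))]
    return float(np.mean(hits))
```

Under these definitions, an average joint position error of 2–3 cm corresponds to `mpjpe` returning 0.02–0.03 when positions are expressed in meters.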

In summary, NEC makes three substantive contributions: (1) a graph‑based skeletal encoder that embeds geometry, topology, and semantics; (2) a topology‑agnostic tokenization pipeline that replaces joint‑wise quantization with a virtual‑joint summary, enabling a single universal codebook; and (3) a curated BVH‑centric dataset that standardizes evaluation across heterogeneous morphologies. The work opens the door to truly universal motion models that can be trained, edited, and queried across any articulated creature, and suggests future extensions toward integrating skinning weights, physics‑based simulation, and higher‑resolution mesh generation for full 4D content creation.

