MorphAny3D: Unleashing the Power of Structured Latent in 3D Morphing
Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

3D morphing remains challenging due to the difficulty of generating semantically consistent and temporally smooth deformations, especially across categories. We present MorphAny3D, a training-free framework that leverages Structured Latent (SLAT) representations for high-quality 3D morphing. Our key insight is that intelligently blending source and target SLAT features within the attention mechanisms of 3D generators naturally produces plausible morphing sequences. To this end, we introduce Morphing Cross-Attention (MCA), which fuses source and target information for structural coherence, and Temporal-Fused Self-Attention (TFSA), which enhances temporal consistency by incorporating features from preceding frames. An orientation correction strategy further mitigates the pose ambiguity within the morphing steps. Extensive experiments show that our method generates state-of-the-art morphing sequences, even for challenging cross-category cases. MorphAny3D further supports advanced applications such as decoupled morphing and 3D style transfer, and can be generalized to other SLAT-based generative models. Project page: https://xiaokunsun.github.io/MorphAny3D.github.io/.


💡 Research Summary

MorphAny3D introduces a training‑free framework for high‑quality 3D morphing by exploiting the Structured Latent (SLAT) representation of the recent Trellis 3D generator. Traditional 3D morphing pipelines rely on a two‑stage process: (1) establishing dense correspondences between source and target shapes, often using handcrafted landmarks, functional maps, or optimal transport, and (2) interpolating these correspondences to synthesize intermediate geometry. While effective for intra‑category transformations, these methods struggle with cross‑category cases where reliable correspondences are hard to obtain, leading to implausible deformations and texture inconsistencies.

The authors observe that naïve approaches—such as generating a 2D morphing sequence and lifting each frame independently into 3D, or directly interpolating the initial noise and conditioning vectors of Trellis—fail to preserve temporal coherence and structural plausibility. They note that the key to successful morphing lies in how SLAT features are blended inside the attention mechanisms of the generator.

MorphAny3D therefore proposes two novel attention‑based modules:

  1. Morphing Cross‑Attention (MCA) – In the cross‑attention layers, the source and target image conditions are treated as keys and values. For a given frame n, the query originates from the current noisy latent, while the keys and values are linearly blended using the morphing weight αₙ (α₀ = 0 at the source frame, rising to 1 at the target frame). This fusion injects both source and target visual cues directly into the generation process, dramatically improving semantic plausibility (lowest FID in experiments).

  2. Temporal‑Fused Self‑Attention (TFSA) – In self‑attention, the query again comes from the current latent, but the keys and values now incorporate SLAT features from the previous frame. By conditioning the current generation on the immediate past, TFSA enforces smooth transitions over time, yielding the best perceptual path length (PPL) scores.
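The two fusions above can be sketched as small attention wrappers. This is a minimal illustration, not the authors' released code: tensor shapes, function names, and the linear K/V blend are assumptions inferred from the description.

```python
# Hedged sketch of MCA and TFSA as key/value-fused attention.
# Shapes and the exact blending form are illustrative assumptions.
import torch
import torch.nn.functional as F

def morphing_cross_attention(q, kv_src, kv_tgt, alpha):
    """MCA: blend source/target condition features as keys/values.

    q:      (B, Lq, D) queries from the current noisy latent
    kv_src: (B, Lk, D) source image-condition features
    kv_tgt: (B, Lk, D) target image-condition features
    alpha:  morphing weight in [0, 1] (0 = source, 1 = target)
    """
    kv = (1.0 - alpha) * kv_src + alpha * kv_tgt  # linear K/V blend
    return F.scaled_dot_product_attention(q, kv, kv)

def temporal_fused_self_attention(x, x_prev):
    """TFSA: self-attention whose keys/values also see the previous frame.

    x:      (B, L, D) current-frame latent tokens
    x_prev: (B, L, D) SLAT features from the preceding frame
    """
    kv = torch.cat([x, x_prev], dim=1)  # extend K/V with past-frame tokens
    return F.scaled_dot_product_attention(x, kv, kv)

B, L, D = 1, 16, 32
q = torch.randn(B, L, D)
mca_out = morphing_cross_attention(q, torch.randn(B, L, D), torch.randn(B, L, D), 0.5)
tfsa_out = temporal_fused_self_attention(torch.randn(B, L, D), torch.randn(B, L, D))
```

Both wrappers keep the query untouched and modify only the keys and values, which matches the paper's point that the fusion happens on the K/V side of attention.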

A naïve combination of KV‑fused cross‑attention and self‑attention was found to degrade structural quality, indicating that indiscriminate fusion harms the plausibility‑smoothness trade‑off. MCA and TFSA are therefore applied sequentially rather than simultaneously, preserving each module’s strengths while avoiding interference.

An additional orientation correction step addresses abrupt pose changes that arise because Trellis‑generated assets exhibit systematic orientation biases. By analyzing the statistical distribution of object rotations in the training set, the authors compute a corrective rotation for each intermediate frame, effectively stabilizing the viewpoint throughout the morph.
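As a rough illustration of the correction step, one could estimate a per-frame yaw offset and rotate the asset back by that amount. The statistics-based bias estimate itself is not reproduced here; `yaw_bias` is a hypothetical input standing in for the paper's computed correction.

```python
# Hedged sketch: undo a yaw (up-axis) orientation bias on a mesh's
# vertices. The bias value would come from the paper's statistical
# analysis, which is not reimplemented here.
import numpy as np

def yaw_matrix(theta):
    """Rotation by theta radians about the y (up) axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, 0.0, s],
                     [0.0, 1.0, 0.0],
                     [-s, 0.0, c]])

def correct_orientation(vertices, yaw_bias):
    """Rotate (V, 3) vertices by -yaw_bias to cancel the estimated bias."""
    return vertices @ yaw_matrix(-yaw_bias).T

verts = np.array([[1.0, 0.0, 0.0]])
corrected = correct_orientation(verts, np.pi / 2)  # quarter-turn correction
```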

The pipeline proceeds as follows: (i) obtain initial SLAT latents and image conditions for source and target (either from real assets via inversion or from Trellis generation), (ii) compute a spherical interpolation of the latents using αₙ, (iii) feed the interpolated latent through the MCA‑augmented cross‑attention, (iv) apply TFSA with the previous frame’s SLAT features, (v) decode the resulting SLAT into a mesh, NeRF, or 3DGS representation, and (vi) apply orientation correction. The authors use 50 frames (αₙ = n/49) for all experiments.
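Step (ii) above, the spherical interpolation schedule, can be sketched as follows. Treating each SLAT latent as a single flat vector is a simplifying assumption for illustration.

```python
# Sketch of the slerp schedule with 50 frames and alpha_n = n / 49.
import numpy as np

def slerp(z0, z1, alpha, eps=1e-8):
    """Spherical interpolation between two latent vectors."""
    z0n = z0 / (np.linalg.norm(z0) + eps)
    z1n = z1 / (np.linalg.norm(z1) + eps)
    omega = np.arccos(np.clip(np.dot(z0n, z1n), -1.0, 1.0))
    if omega < eps:  # nearly parallel: fall back to linear interpolation
        return (1 - alpha) * z0 + alpha * z1
    return (np.sin((1 - alpha) * omega) * z0
            + np.sin(alpha * omega) * z1) / np.sin(omega)

N = 50
alphas = [n / (N - 1) for n in range(N)]  # alpha_0 = 0, alpha_49 = 1
z_src, z_tgt = np.random.randn(128), np.random.randn(128)
frames = [slerp(z_src, z_tgt, a) for a in alphas]
```

At α = 0 the formula returns the source latent exactly, and at α = 1 the target latent, so the endpoints of the morph match the input assets by construction.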

Extensive quantitative evaluation on ShapeNet and a custom cross‑category benchmark shows that MorphAny3D outperforms three baselines: (a) matching‑based 3D morphing, (b) 2D morphing followed by independent 3D generation, and (c) direct latent interpolation. MorphAny3D achieves lower FID (indicating higher realism) and lower PPL (indicating smoother motion). Qualitative results demonstrate convincing transformations such as “bee → biplane” and “chair → car,” where structural integrity and texture continuity are preserved despite large semantic gaps.

Beyond basic morphing, the framework supports decoupled morphing (separately interpolating geometry and texture) and 3D style transfer, showcasing the flexibility of SLAT‑based manipulation. The authors also transfer MCA and TFSA to other SLAT‑based generators (e.g., SDF‑GAN) and observe comparable improvements, confirming the generality of their approach.

Limitations include dependence on the resolution of the underlying SLAT model and occasional loss of fine‑grained details for highly complex topologies (e.g., foliage, human bodies). Future work may explore higher‑resolution SLAT encodings, integration with physics‑based simulation for dynamic morphing, and learning‑based refinement of the orientation correction step.

In summary, MorphAny3D demonstrates that structured latent representations, when intelligently blended within attention layers, enable training‑free, high‑fidelity, temporally coherent 3D morphing across arbitrary object categories, opening new possibilities for animation, game asset creation, and interactive 3D content generation.
