AnchorHOI: Zero-shot Generation of 4D Human-Object Interaction via Anchor-based Prior Distillation

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Despite significant progress in text-driven 4D human-object interaction (HOI) generation with supervised methods, scalability remains limited by the scarcity of large-scale 4D HOI datasets. To overcome this, recent approaches attempt zero-shot 4D HOI generation with pre-trained image diffusion models. However, interaction cues are only minimally distilled during the generation process, restricting their applicability across diverse scenarios. In this paper, we propose AnchorHOI, a novel framework that thoroughly exploits hybrid priors by incorporating video diffusion models beyond image diffusion models, advancing 4D HOI generation. Nevertheless, directly optimizing high-dimensional 4D HOI with such priors remains challenging, particularly for human pose and compositional motion. To address this challenge, AnchorHOI introduces an anchor-based prior distillation strategy, which constructs interaction-aware anchors and then leverages them to guide generation in a tractable two-step process. Specifically, two tailored anchors are designed for 4D HOI generation: anchor Neural Radiance Fields (NeRFs) for expressive interaction composition, and anchor keypoints for realistic motion synthesis. Extensive experiments demonstrate that AnchorHOI outperforms previous methods with superior diversity and generalization.


💡 Research Summary

The paper “AnchorHOI: Zero-shot Generation of 4D Human-Object Interaction via Anchor-based Prior Distillation” presents a novel framework for generating dynamic 3D sequences of humans interacting with objects from textual descriptions, without relying on any ground-truth 4D HOI data.

The core challenge addressed is the scarcity of large-scale 4D HOI datasets, which limits supervised methods. While recent zero-shot approaches leverage pre-trained image diffusion models, they fail to sufficiently distill intricate interaction cues, leading to limited applicability. AnchorHOI advances the field by thoroughly exploiting hybrid priors from both image diffusion models (IDMs) and video diffusion models (VDMs).

The key innovation is an “anchor-based prior distillation” strategy. Instead of directly optimizing high-dimensional 4D HOI parameters under the guidance of diffusion models—a challenging task—AnchorHOI introduces intermediate “anchors” that act as bridges. The process involves a tractable two-step approach: first constructing interaction-aware anchors from the textual input using the priors, and then leveraging these anchors to guide the generation of the target 4D HOI.

The framework pipeline consists of two sequential components:

  1. Interaction Composition via Anchor NeRF: Given an interaction text prompt (e.g., “a person swinging a tennis racket”), the method first uses Score Distillation Sampling (SDS) to optimize a coarse, combined human-object NeRF under IDM guidance. A human-isolated “Anchor NeRF” is then extracted via multi-view feature alignment. Separately, a canonical 3D human avatar (SMPL-X model) is generated. To capture an interaction-specific pose, the SMPL-X avatar’s parameters (pose, scale, global transformation) are optimized to align its projected 3D joints with the 2D skeletons detected from multiple views of the Anchor NeRF. The object is initialized from the segmented part of the Anchor NeRF and refined via SDS with constraints to preserve the overall interaction layout, resulting in a static 3D HOI instance.
  2. Motion Synthesis via Anchor Keypoint: Following static composition, a video diffusion model is conditioned on the text (with optional motion description) to generate a 2D video depicting the HOI motion. From this video, “Anchor Keypoints” are extracted per frame, defined as a combination of human body keypoints and human-object contact keypoints. These keypoints provide robust motion cues even under occlusions common in VDM outputs. Starting from the static 3D HOI, a dynamic 4D sequence is generated by optimizing the motion parameters (poses and transformations for both human and object across frames) to track the temporal trajectory of these Anchor Keypoints.
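Both steps ultimately reduce to a fitting problem: adjust low-dimensional parameters so that projected 3D keypoints match 2D anchor keypoints. The toy sketch below is not the paper's implementation — it assumes a simple orthographic camera and optimizes only a hypothetical per-frame 2D translation via gradient descent, rather than full SMPL-X pose and object transforms — but it illustrates the reprojection-tracking objective that drives the anchor-keypoint optimization:

```python
import numpy as np

def project(points_3d):
    # Orthographic projection onto the image plane: a toy stand-in for a
    # camera model (the paper's actual projection is not specified here).
    return points_3d[:, :2]

def fit_frame(points_3d, target_2d, steps=200, lr=0.1):
    """Fit a per-frame 2D translation so the projected keypoints track the
    anchor keypoints extracted from the generated video. Toy version: only a
    translation is optimized, not pose/scale/object transforms."""
    t = np.zeros(2)
    for _ in range(steps):
        residual = project(points_3d) + t - target_2d   # (n_keypoints, 2)
        grad = 2.0 * residual.mean(axis=0)              # gradient of mean squared error w.r.t. t
        t -= lr * grad
    return t

# Toy data: 5 keypoints whose target 2D trajectory over 3 frames is the
# projection shifted by known per-frame offsets; tracking should recover them.
rng = np.random.default_rng(0)
kp3d = rng.normal(size=(5, 3))
offsets = [np.array([0.5, -0.2]), np.array([1.0, 0.3]), np.array([1.4, 0.9])]
tracked = [fit_frame(kp3d, project(kp3d) + off) for off in offsets]
```

In the actual framework, the analogous loss would be minimized over SMPL-X pose parameters and human/object rigid transforms per frame, with the contact keypoints tying human and object motion together.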

Extensive qualitative and quantitative evaluations demonstrate that AnchorHOI significantly outperforms previous state-of-the-art methods in both the quality/diversity of static 3D interaction composition and the realism/generalization of dynamic 4D sequences. By introducing the anchor-based distillation paradigm with tailored NeRF and keypoint anchors, the work effectively overcomes the major hurdles of adaptive human pose optimization and compositional motion extraction in zero-shot 4D HOI generation.

