From Frames to Sequences: Temporally Consistent Human-Centric Dense Prediction

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

In this work, we focus on the challenge of temporally consistent human-centric dense prediction across video sequences. Existing models achieve strong per-frame accuracy but often flicker under motion, occlusion, and lighting changes, and paired human video supervision for multiple dense tasks is rarely available. We address this gap with a scalable synthetic data pipeline that generates photorealistic human frames and motion-aligned sequences with pixel-accurate depth, normals, and masks. Unlike prior synthetic pipelines limited to static data, ours provides both frame-level labels for spatial learning and sequence-level supervision for temporal learning. Building on this, we train a unified ViT-based dense predictor that (i) injects an explicit human geometric prior via CSE embeddings and (ii) improves geometry-feature reliability with a lightweight channel reweighting module after feature fusion. Our two-stage training strategy, combining static pretraining with dynamic sequence supervision, enables the model first to acquire robust spatial representations and then to refine temporal consistency across motion-aligned sequences. Extensive experiments show that we achieve state-of-the-art performance on THuman2.1 and Hi4D and generalize effectively to in-the-wild videos.


💡 Research Summary

The paper tackles the persistent problem of temporal flickering in human‑centric dense prediction tasks such as monocular depth, surface normal, and segmentation mask estimation when applied to video streams. While prior models achieve strong per‑frame accuracy, they typically ignore temporal coherence and lack paired video‑level supervision across multiple dense tasks. To close this gap, the authors introduce two complementary contributions: a scalable synthetic data pipeline and a ViT‑based architecture that explicitly incorporates human geometric priors.

The data pipeline builds on character creation tools (DAZ 3D, MakeHuman, Character Creator) to generate roughly 200 K unique human identities by independently sampling body shape, clothing, and shoes. Texture diversity is further enhanced through hue shifts, noise injection, solid‑color replacements, and external texture datasets. Each identity is rendered in Blender both as static frames (with random camera viewpoints covering face, upper‑body, and full‑body) and as dynamic sequences driven by AMASS motion capture data. For each frame, pixel‑accurate RGB, depth, surface normal, and part segmentation masks are produced, yielding a unified dataset that supplies frame‑level labels for spatial learning and sequence‑level labels for temporal learning.

On the modeling side, the backbone is a DINOv2 ViT encoder followed by a DPT‑style decoder that produces multi‑scale features. Three lightweight task heads predict depth, normals, and foreground/background masks. To inject human‑specific knowledge, the authors use Continuous Surface Embedding (CSE) vectors as a geometric prior. CSE embeddings are projected to match the decoder’s channel dimension and spatial resolution, then added element‑wise to the decoder features, encouraging the network to respect human body topology.
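The CSE injection step reduces to a channel projection, a spatial resize, and an element-wise addition. The following is a minimal numpy sketch of that fusion; the projection matrix, embedding dimension, and nearest-neighbor resizing are illustrative assumptions (the paper likely uses a learned 1x1 convolution and bilinear interpolation):

```python
import numpy as np

def inject_cse_prior(dec_feats, cse_emb, proj):
    """Sketch: fuse CSE embeddings into decoder features.

    dec_feats : (C, H, W) decoder feature map
    cse_emb   : (E, h, w) per-pixel CSE embedding map (hypothetical shapes)
    proj      : (C, E) projection from embedding dim E to channel dim C
    """
    c, H, W = dec_feats.shape
    # Project embedding channels E -> C (a 1x1 conv in practice)
    projected = np.einsum('ce,ehw->chw', proj, cse_emb)
    # Nearest-neighbor upsample to the decoder resolution (bilinear in practice)
    e, h, w = cse_emb.shape
    ri = np.arange(H) * h // H
    ci = np.arange(W) * w // W
    resized = projected[:, ri][:, :, ci]
    # Element-wise addition injects the human-body prior
    return dec_feats + resized
```

With a zero projection matrix the decoder features pass through unchanged, which makes the additive design easy to initialize without disturbing pretrained features.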

A parallel lightweight CNN branch extracts edge and texture cues, which are concatenated with the decoder features. Because this fusion can introduce redundant appearance information, a Channel Weight Adaptation (CWA) module is applied: global average pooling creates a channel descriptor, a two‑layer MLP with sigmoid activation generates per‑channel scaling factors, and the fused features are re‑weighted accordingly. This mechanism suppresses channels dominated by lighting or texture while amplifying geometry‑relevant channels, thereby improving the reliability of the geometry predictions.
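The CWA mechanism described above follows the familiar squeeze-and-excitation pattern. A minimal numpy sketch, with hypothetical weight shapes and a ReLU bottleneck assumed for the two-layer MLP:

```python
import numpy as np

def channel_weight_adaptation(feats, w1, b1, w2, b2):
    """Sketch of the CWA module applied to fused features.

    feats  : (C, H, W) fused decoder + CNN-branch features
    w1, b1 : first MLP layer, (C_hidden, C) and (C_hidden,)
    w2, b2 : second MLP layer, (C, C_hidden) and (C,)
    """
    # 1. Global average pooling -> per-channel descriptor of shape (C,)
    desc = feats.mean(axis=(1, 2))
    # 2. Two-layer MLP with sigmoid -> per-channel scales in (0, 1)
    hidden = np.maximum(0.0, w1 @ desc + b1)             # ReLU bottleneck
    scales = 1.0 / (1.0 + np.exp(-(w2 @ hidden + b2)))   # sigmoid gate
    # 3. Re-weight: suppress appearance-dominated channels,
    #    amplify geometry-relevant ones
    return feats * scales[:, None, None]
```

Because the gate is bounded in (0, 1), the module can only attenuate channels, never amplify them beyond their fused magnitude; the relative re-weighting is what shifts emphasis toward geometry-relevant channels.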

Training proceeds in two stages. Stage 1 pre‑trains the model on synthetic static images using depth losses (scale‑shift alignment, gradient regularization) and normal losses (L1, cosine similarity, edge‑aware gradient loss, multi‑scale Laplacian loss) to obtain strong spatial representations. Stage 2 adds temporal attention blocks (four in total) to the decoder and fine‑tunes the network on synthetic video sequences, incorporating a flow‑guided stabilization term to enforce temporal consistency across frames.
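The scale-shift-aligned depth loss from Stage 1 can be illustrated with a small numpy sketch: a scalar scale and shift are fit by least squares on valid pixels before computing the error, so the loss is invariant to the global depth ambiguity of monocular prediction. The L1 error metric and least-squares alignment here are assumptions; the paper's exact formulation may differ.

```python
import numpy as np

def scale_shift_aligned_l1(pred, gt, mask):
    """Sketch of a scale- and shift-invariant depth loss.

    pred, gt : (H, W) predicted and ground-truth depth maps
    mask     : (H, W) boolean map of valid (foreground) pixels
    """
    p = pred[mask]
    g = gt[mask]
    # Solve min_{s,t} || s * p + t - g ||^2 in closed form
    A = np.stack([p, np.ones_like(p)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, g, rcond=None)
    # Mean absolute error after alignment
    return np.mean(np.abs(s * p + t - g))
```

A prediction that differs from the ground truth only by a global scale and shift incurs zero loss, which is the intended invariance for monocular depth supervision.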

Extensive evaluation on the THuman2.1 and Hi4D benchmarks demonstrates state‑of‑the‑art performance for both depth and normal estimation, surpassing prior methods such as DAViD and Sapiens. Qualitative results on in‑the‑wild videos show a marked reduction in flickering, even under fast motion, occlusions, and lighting changes. The authors also release the synthetic pipeline to the community, enabling further research on temporally consistent, multi‑task human‑centric vision. Overall, the work illustrates that combining high‑fidelity, temporally annotated synthetic data with a geometry‑aware transformer architecture can simultaneously achieve spatial accuracy and temporal stability in dense human‑centric prediction.

