Reading time: 23 minutes

📝 Original Info

  • Title: WorldWarp: Propagating 3D Geometry with Asynchronous Video Diffusion
  • ArXiv ID: 2512.19678
  • Date:
  • Authors: Unknown

📝 Abstract

Generating long-range, geometrically consistent video presents a fundamental dilemma: while consistency demands strict adherence to 3D geometry in pixel space, state-of-the-art generative models operate most effectively in a camera-conditioned latent space. This disconnect causes current methods to struggle with occluded areas and complex camera trajectories. To bridge this gap, we propose WorldWarp, a framework that couples a 3D structural anchor with a 2D generative refiner. To establish geometric grounding, WorldWarp maintains an online 3D geometric cache built via Gaussian Splatting (3DGS). By explicitly warping historical content into novel views, this cache acts as a structural scaffold, ensuring each new frame respects prior geometry. However, static warping inevitably leaves holes and artifacts due to occlusions. We address this using a Spatio-Temporal Diffusion (ST-Diff) model designed for a "fill-and-revise" objective. Our key innovation is a spatio-temporally varying noise schedule: blank regions receive full noise to trigger generation, while warped regions receive partial noise to enable refinement. By dynamically updating the 3D cache at every step, WorldWarp maintains consistency across video chunks. Consequently, it achieves state-of-the-art fidelity by ensuring that 3D logic guides structure while diffusion logic perfects texture. Project page: https://hyokong.github.io/worldwarp-page/.

📄 Full Content

Given only a single starting image (left) and a specified camera trajectory, our method generates a long and coherent video sequence. The core of our approach is to generate the video chunk-by-chunk, where each new chunk is conditioned on forward-warped "hints" from the previous one. A novel diffusion model then generates the next chunk by correcting these hints and filling in occlusions using a spatio-temporally varying noise schedule. The high geometric consistency of our 200-frame generated sequence is demonstrated by its successful reconstruction into a high-fidelity 3D Gaussian Splatting (3DGS) [25] model (right). This highlights our model's robust understanding of 3D geometry and its capability to maintain long-term consistency.


Novel View Synthesis (NVS) has emerged as a cornerstone problem in computer vision and graphics, with transformative applications in virtual reality, immersive telepresence, and generative content creation. While traditional NVS methods excel at view interpolation, which generates new views within the span of existing camera poses [2,25,40], the frontier of the field lies in view extrapolation [16,32,33,37,55,67]. This far more challenging task involves generating long, continuous camera trajectories that extend significantly beyond the original scene, effectively synthesizing substantial new content and structure [37,55]. The ultimate goal is to enable interactive exploration of dynamic, 3D-consistent worlds from only a limited set of starting images.

The central challenge in generating long-range, camera-conditioned video lies in finding an effective form of 3D conditioning. Existing works have largely followed two main strategies. The first is camera pose encoding, which embeds abstract camera parameters as a latent condition [16,29,39,54,55,67]. This approach, however, relies heavily on the diversity of the training dataset and often fails to generalize to Out-Of-Distribution (OOD) camera poses, while also providing minimal information about the underlying 3D scene content [11,20,32,41,44,77]. The second strategy, which uses an explicit 3D spatial prior, was introduced to solve this OOD issue [11,20,32,77]. While these priors provide robust geometric grounding, they are imperfect, suffering from occlusions (blank regions) and distortions caused by 3D estimation errors [55,77]. This strategy typically employs standard inpainting or video generation techniques [11,20,32], which are ill-suited to simultaneously handle the severe disocclusions and the geometric distortions present in the warped priors, leading to artifacts and inconsistent results.

To address this critical gap, we propose WorldWarp, a novel framework that generates long-range, geometrically consistent novel view sequences. Our core insight is to break the strict causal chain of AR models and the static nature of explicit 3D priors. Instead, WorldWarp operates via an autoregressive inference pipeline that generates video chunk-by-chunk (see Fig. 3). The key to our system is a Spatio-Temporal Diffusion (ST-Diff) model [49,60], which is trained with a powerful bidirectional, non-causal attention mechanism. This non-causal design is explicitly enabled by our core technical idea: using forward-warped images from future camera positions as a dense, explicit 2D spatial prior [9]. At each step, we build an "online 3D geometric cache" using 3DGS [25], which is optimized only on the most recent, high-fidelity generated history. This cache then renders high-quality warped priors for the next chunk, providing ST-Diff with a rich, geometrically grounded signal that guides the generation of new content and fills occlusions.

The primary advantage of WorldWarp is its ability to avoid the irreversible error propagation that plagues prior work [55,77]. By dynamically re-estimating a short-term 3DGS cache at each step, our method continuously grounds itself in the most recent, accurate geometry, ensuring high-fidelity consistency over extremely long camera paths. We demonstrate the effectiveness of our approach through extensive experiments on challenging real-world and synthetic datasets for long-sequence view extrapolation, achieving state-of-the-art performance in both geometric consistency and visual fidelity. In summary, our main contributions are:

  • WorldWarp, a novel framework for long-range novel view extrapolation that generates video chunk-by-chunk using an autoregressive inference pipeline.
  • Spatio-Temporal Diffusion (ST-Diff), a non-causal diffusion model that leverages bidirectional attention conditioned on forward-warped images as a dense geometric prior.
  • An online 3D geometric cache mechanism, which uses test-time optimized 3DGS [25] to provide high-fidelity warped priors while preventing the irreversible error propagation of static 3D representations.
  • State-of-the-art performance on challenging view extrapolation benchmarks, demonstrating significantly improved geometric consistency and image quality over existing methods.

Novel view synthesis. Novel view synthesis (NVS) is a challenging problem that can be divided into two categories: view interpolation [2,3,25,31,40,42,46,51,52,62,71,78,79,85] and view extrapolation [16,30,32,33,37,48,55,67,76,83]. The view interpolation task aims to generate novel views within the distribution of the training views [2,3,25,40,78], even if the training views are sparse [31,42,62,85] or captured in the wild with occlusions [46,51,52]. View extrapolation [16,30,32,33,37,48,55,67,76,83] focuses on generating novel views that extend significantly beyond the original scene, introducing substantial new content, by leveraging powerful pre-trained video diffusion models [21,35,49,60,73].

Auto-regressive video diffusion models. The field of video generation has seen a prominent trend towards either diffusion-based or autoregressive (AR) methodologies. Parallel (non-autoregressive) video diffusion systems often employ bidirectional attention to process and denoise all frames concurrently [4-6, 10, 14, 15, 18, 19, 28, 43, 59, 61, 74]. Conversely, AR-based techniques produce content in a sequential manner. This category encompasses several architectures, such as models based on pure next-token prediction [7,23,27,45,65,68,72], more recent hybrid systems integrating AR and diffusion principles [8,12,13,22,24,34,38,69,75,82], and rolling diffusion variants that employ progressive noise schedules [26,50,53,58,70,80]. However, these AR strategies are ill-suited for this work's specific task: learning an effective camera embedding for them is non-trivial, and their causal structure is incompatible with using warped images from future camera positions as conditional hints. Consequently, this work employs a non-autoregressive framework [57] to leverage this future information.

Camera pose encoding and 3D explicit spatial priors. Spatially consistent view generation relies on conditioning. One method, camera pose encoding, models camera geometry using absolute extrinsics [16,39,54,55,67] or relative representations like CaPE [29]. While useful for viewpoint control, these encodings lack 3D scene content. An alternative, explicit 3D spatial priors, builds 3D models (e.g., meshes, point clouds, 3DGS [25]) [11,20,32,41,44,77] for re-projection and inpainting. This provides geometric grounding but suffers from error propagation from the initial 3D estimation [55,77] and high computational cost. Instead, we utilize forward-warped images from future camera positions as a distinct explicit prior. These warped images serve as a dense, geometrically grounded 2D hint, bypassing the error-prone and costly 3D reconstruction pipeline while offering a richer conditional signal than mere pose encoding.

One major challenge in adding precise camera control to video diffusion models is finding a good way to represent 3D camera movement. Simply using raw camera intrinsics K and extrinsics E is often suboptimal, as their numerical values (e.g., translation t) are unconstrained and difficult for a network to correlate with visual content.

A more effective paradigm is to translate these abstract parameters into a dense, pixel-wise representation that provides a clearer geometric interpretation. For example, Plücker embeddings [56] define a 6D ray vector for each pixel. This transforms the abstract matrices into a dense tensor P ∈ R^{n×6×h×w}, which is much more informative for the diffusion model. This principle of using dense, geometrically grounded priors is a key consideration for enabling fine-grained camera control.
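As a concrete illustration, here is a minimal sketch of how such a per-pixel Plücker embedding could be computed from intrinsics K and a camera-to-world pose. The (direction, moment) ordering and the half-pixel offset are assumptions, not necessarily the convention of [56].

```python
import numpy as np

def plucker_embedding(K, cam_to_world, h, w):
    """Per-pixel 6D Plucker ray embedding (direction, moment) for one camera.

    K:            (3, 3) intrinsics.
    cam_to_world: (4, 4) camera-to-world extrinsic matrix.
    Returns:      (6, h, w) array; stacking over n frames gives the (n, 6, h, w) tensor.
    """
    # Pixel grid in homogeneous coordinates (u, v, 1), sampled at pixel centers.
    v, u = np.mgrid[0:h, 0:w].astype(np.float64)
    pix = np.stack([u + 0.5, v + 0.5, np.ones_like(u)], axis=0).reshape(3, -1)

    # Back-project to ray directions in camera space, then rotate into world space.
    dirs_cam = np.linalg.inv(K) @ pix                      # (3, h*w)
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3:4]     # rotation, camera center
    dirs_world = R @ dirs_cam
    dirs_world /= np.linalg.norm(dirs_world, axis=0, keepdims=True)

    # Plucker coordinates: direction d and moment m = o x d (o = camera center).
    origin = np.repeat(t, dirs_world.shape[1], axis=1)
    moment = np.cross(origin.T, dirs_world.T).T            # (3, h*w)

    return np.concatenate([dirs_world, moment], axis=0).reshape(6, h, w)
```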

The Diffusion Forcing Transformer (DFoT) [57] paradigm reframes the noising operation as progressive masking, where each frame x_t in a video is assigned an independent noise level k_t ∈ [0, 1]. This contrasts with conventional models that use a single noise level k for all frames. The model ε_θ is then trained on a per-frame noise prediction loss:
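A plausible form of this objective, written in the Diffusion Forcing style with independent per-frame noise levels k_t (the exact conditioning arguments are an assumption), is:

```latex
\mathcal{L}(\theta)
  = \mathbb{E}_{x_{1:T},\,\epsilon_{1:T},\,k_{1:T}}
    \Bigg[ \sum_{t=1}^{T}
      \big\lVert
        \epsilon_\theta\!\left(x_t^{k_t} \,;\, x_{1:T}^{k_{1:T}},\, k_{1:T}\right)
        - \epsilon_t
      \big\rVert_2^2
    \Bigg],
\qquad k_t \sim \mathcal{U}[0,1] \ \text{independently per frame,}
```

where x_t^{k_t} denotes frame x_t corrupted to noise level k_t.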

The critical advantage of this per-frame noise approach is that it enables a model to be trained with non-causal attention, learning to denoise a frame by conditioning on an arbitrary, partially-masked set of other frames.

This non-causal paradigm is particularly well-suited for our task. In typical video generation, a causal architecture is necessary as the future is unknown. However, in camera-conditioned novel view synthesis, we can generate a strong, geometry-consistent prior for all future frames simultaneously via forward-warping. These warped images provide a powerful non-causal conditioning signal. This insight is the foundation of our ST-Diff model, allowing us to discard restrictive causal constraints and employ a bidirectional, spatio-temporal diffusion strategy.

We address the task of novel view synthesis, where the goal is to generate a target view x_t given a source view x_s and corresponding camera poses {p_s, p_t}. To this end, we introduce Spatio-Temporal Diffusion with Warped Priors (ST-Diff), a bidirectional diffusion model designed for this task. Unlike causal, autoregressive video generation, where future frames are unknown, the camera-conditioned setting allows us to form a strong geometric prior for the target frame by projecting the source view. This key insight allows us to discard causal constraints and employ a more powerful bidirectional attention mechanism across all frames.

Our method first prepares geometric priors in the pixel space and then performs all diffusion, compositing, and noising operations in the latent space using a pre-trained VAE encoder E(•) and decoder D(•) [49,60]. We use x to denote data in pixel-space and z for latent-space data.

One-to-all pixel-space warping. Given a training video sequence X = {x_i}_{i=1}^T, we first sample a single source frame x_s from the sequence. We then create a full sequence of warped priors by warping this single source frame x_s to every other frame's viewpoint, including its own. To do this, we use pre-estimated depth maps D_i and camera parameters (extrinsics E_i and intrinsics K_i) for all frames, obtained from a 3D geometry foundation model [9]. First, the source image x_s and its depth D_s are unprojected into a 3D RGB point cloud P_s: each pixel (u, v) of x_s is lifted to the world-space point E_s^{-1}(D_s(u, v) · K_s^{-1} [u, v, 1]^T), carrying the color x_s(u, v).

This single point cloud P_s is then rendered into all T target viewpoints using a differentiable point-based renderer. This "one-to-all" warping process yields two new sequences: a warped prior sequence, X_{s→V} = {x_{s→t}}_{t=1}^T, and a corresponding validity mask sequence, M = {M_t}_{t=1}^T. Each mask M_t indicates which pixels in x_{s→t} were successfully rendered from P_s.
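To make the one-to-all warping concrete, the sketch below unprojects a source frame with its depth and re-projects the resulting colored point cloud into a target view with a simple nearest-pixel z-buffer. The paper instead uses a differentiable point-based renderer, and the world-to-camera pose convention here is an assumption.

```python
import numpy as np

def unproject(img, depth, K, w2c):
    """Lift source pixels into a colored world-space point cloud."""
    h, w, _ = img.shape
    v, u = np.mgrid[0:h, 0:w]
    pix = np.stack([u, v, np.ones_like(u)], 0).reshape(3, -1).astype(np.float64)
    pts_cam = np.linalg.inv(K) @ pix * depth.reshape(1, -1)           # camera space
    pts_world = np.linalg.inv(w2c) @ np.vstack([pts_cam, np.ones((1, pts_cam.shape[1]))])
    return pts_world[:3].T, img.reshape(-1, 3)                        # (N, 3), (N, 3)

def splat(points, colors, K, w2c, h, w):
    """Render the point cloud into a target view with a z-buffer; return image + validity mask."""
    pts_cam = (w2c @ np.hstack([points, np.ones((len(points), 1))]).T)[:3]
    z = pts_cam[2]
    uv = (K @ pts_cam)[:2] / np.clip(z, 1e-6, None)
    u, v = np.round(uv).astype(int)
    ok = (z > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    img = np.zeros((h, w, 3)); mask = np.zeros((h, w), bool); zbuf = np.full((h, w), np.inf)
    for i in np.flatnonzero(ok):                                      # nearest point wins
        if z[i] < zbuf[v[i], u[i]]:
            zbuf[v[i], u[i]] = z[i]; img[v[i], u[i]] = colors[i]; mask[v[i], u[i]] = True
    return img, mask                                                  # x_{s->t}, M_t
```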

Latent-space composite sequence. The training pipeline of our WorldWarp is illustrated in Fig. 2. With the pixel-space assets prepared, we move entirely to the latent space: we separately encode the new warped sequence X_{s→V} and the original ground-truth sequence X, yielding Z_{s→V} = {E(x_{s→t})}_{t=1}^T and Z = {E(x_t)}_{t=1}^T.

We also downsample the mask sequence M to match the latent dimensions, yielding M_latent = {M_latent,t}_{t=1}^T. A clean composite latent sequence Z_c is then created in the latent space. For each frame t, the composite z_{c,t} takes its features from the warped latent z_{s→t} in valid ("warped") regions and fills the remaining ("filled") regions with features from the ground-truth latent z_t (the t-th element of Z):

z_{c,t} = M_latent,t ⊙ z_{s→t} + (1 − M_latent,t) ⊙ z_t.   (5)

This entire sequence Z_c = {z_{c,t}}_{t=1}^T serves as the x_0-equivalent (clean signal) for the diffusion model.

Spatially- and temporally-varying noise. Our noising strategy extends the per-frame independent noise concept with a new, region-specific dimension, as shown in Fig. 2. The noise applied is varied at two levels simultaneously. First, at the temporal level, each frame t in the sequence Z_c gets a different, independently sampled noise schedule. Second, at the spatial level, we apply different noise levels within each frame, distinguishing between the "warped" and "filled" regions. For each frame t, we therefore sample a pair of noise levels, (σ_warped,t, σ_filled,t). A spatially-varying noise map Σ_t is constructed using the latent-space mask:

Σ_t = M_latent,t · σ_warped,t + (1 − M_latent,t) · σ_filled,t.   (6)

We then generate the final noisy input sequence Z_noisy = {z_noisy,t}_{t=1}^T by sampling a noise sequence E = {ε_t}_{t=1}^T ∼ N(0, I) and noising the composite latents according to Σ_t, i.e., z_noisy,t = (1 − Σ_t) ⊙ z_{c,t} + Σ_t ⊙ ε_t.
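The following sketch assembles the composite latent, the spatially-varying noise map, and the noisy input for a single frame. It assumes a rectified-flow style forward process (consistent with the velocity target ε_t − z_t used below); the exact tensor shapes and scheduler of the Wan backbone are not reproduced.

```python
import torch

def make_noisy_frame(z_warped, z_gt, m_latent, sigma_warped, sigma_filled):
    """Build z_{c,t}, Sigma_t and z_{noisy,t} for a single frame t.

    z_warped, z_gt: (C, H', W') latents of the warped prior and the ground truth.
    m_latent:       (1, H', W') validity mask (1 = warped region, 0 = filled region).
    """
    # Eq. (5): composite latent - warped features where valid, GT features elsewhere.
    z_c = m_latent * z_warped + (1.0 - m_latent) * z_gt

    # Eq. (6): spatially-varying noise map, one level per region.
    sigma = m_latent * sigma_warped + (1.0 - m_latent) * sigma_filled

    # Assumed rectified-flow noising: interpolate between clean latent and noise.
    eps = torch.randn_like(z_c)
    z_noisy = (1.0 - sigma) * z_c + sigma * eps
    return z_c, sigma, z_noisy, eps
```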

A key architectural modification is required to process this spatially- and temporally-varying noise. Standard diffusion models [60] typically accept a single timestep embedding (e.g., shape B × 1) for an entire image or video chunk. Our ST-Diff model, however, is adapted to process a unique noise level for every token. We broadcast the noise map sequence Σ_V = {Σ_t}_{t=1}^T to the full latent sequence dimensions (B × T × H′ × W′) and pass it through the time embedding network, thus generating a unique timestep embedding for each token along both the temporal and spatial axes.
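A minimal illustration of per-token noise conditioning: instead of one scalar timestep per clip, every latent token receives its own sinusoidal embedding of its noise level. The module below is schematic and not the actual Wan time-embedding network.

```python
import math
import torch

def per_token_time_embedding(sigma_map, dim=256):
    """sigma_map: (B, T, H', W') per-token noise levels.
    Returns (B, T*H'*W', dim) embeddings, one per latent token,
    instead of a single (B, 1) timestep embedding."""
    b, t, h, w = sigma_map.shape
    sigma = sigma_map.reshape(b, -1, 1)                               # one scalar per token
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    angles = (sigma * 1000.0) * freqs                                 # scale to timestep range
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)
```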

Training objective. We train our ST-Diff model G_θ, which takes the entire noisy sequence Z_noisy, the sequence of noise maps Σ_V, and other conditioning c (e.g., text, camera poses) as input. Critically, the model is trained to denoise the composite sequence Z_noisy while regressing towards a target defined by the original ground-truth latent sequence Z. The target velocity sequence is V_target = {ε_t − z_t}_{t=1}^T. Our training objective is the L2 loss, summed over the entire sequence:

L(θ) = Σ_{t=1}^{T} || v̂_t − (ε_t − z_t) ||_2^2,

where v̂_t denotes the velocity predicted by G_θ(Z_noisy, Σ_V, c) for frame t.

This loss forces the model to learn the complex relationship between the warped, GT-filled, and final target latents across the entire video.
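Combining the pieces above, a schematic training step under this velocity parameterization might look as follows; the model signature and conditioning dictionary are placeholders, not the actual interface.

```python
import torch

def st_diff_loss(model, z_noisy, sigma_maps, z_gt, eps, cond):
    """L2 velocity loss over the full composite sequence.

    z_noisy, z_gt, eps: (B, T, C, H', W'); sigma_maps: (B, T, 1, H', W').
    `model` is assumed to predict a velocity for every token.
    """
    v_pred = model(z_noisy, sigma_maps, **cond)   # per-token velocity prediction
    v_target = eps - z_gt                         # V_target = {eps_t - z_t}
    return ((v_pred - v_target) ** 2).mean()
```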

The inference process is illustrated in Fig. 3. Our method generates novel view sequences autoregressively, producing the video chunk by chunk in a loop. Unlike training, which uses a fixed-radius point cloud representation, our inference pipeline leverages a dynamic, test-time optimized 3D representation as an explicit geometric cache, integrating 3D Gaussian Splatting (3DGS) [25] for high-fidelity warping and a Vision-Language Model (VLM) [1] for semantic guidance.
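Before detailing each component, the following pseudocode sketches one plausible shape of this autoregressive loop; every helper bundled in `modules` is a placeholder for the components described in the next paragraphs, not an actual API.

```python
def worldwarp_inference(source_image, trajectory, modules, chunk_len=49):
    """Schematic autoregressive loop. `modules` bundles callable placeholders:
    a geometry estimator, a 3DGS fitter/renderer, a VLM captioner, and the
    ST-Diff sampler."""
    history = [source_image]
    video = []
    while len(video) < len(trajectory):
        # 1. Online 3D geometric cache: estimate poses/depth, fit a short-term 3DGS.
        poses, depths = modules.estimate_geometry(history)
        cache = modules.fit_3dgs(history, poses, depths)

        # 2. Render forward-warped priors and validity masks for the next chunk's cameras.
        cams = trajectory[len(video):len(video) + chunk_len]
        priors, masks = modules.render_priors(cache, cams)

        # 3. Fill-and-revise with ST-Diff: blank regions start from pure noise,
        #    warped regions from a partially noised state.
        prompt = modules.describe_scene(history)       # VLM semantic guidance
        chunk = modules.st_diff_sample(priors, masks, prompt)

        video.extend(chunk)
        history = chunk                                # previous chunk becomes the history
    return video
```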

Online 3D Geometric Cache. At the beginning of each iteration k of the generation loop, we take the available history (either the initial source views for k = 1, or the video chunk generated in the previous iteration k − 1). We first process these frames using a 3D geometry model (TTT3R) [9] to estimate their camera poses and an initial 3D point cloud. This point cloud is then used to initialize a 3D Gaussian Splatting (3DGS) representation, which we optimize for a few hundred steps (e.g., 200 steps) using the history frames and their estimated poses. The resulting online-optimized 3DGS model serves as an explicit, high-fidelity 3D representation cache. Compared to the fixed-radius point clouds used during training, this 3DGS provides significantly higher-quality features for the non-blank (warped) regions, which is critical for maintaining geometric consistency.

Chunk-based Generation with ST-Diff. With the geometric and semantic conditioning prepared, we first render the sequence of prior images, X_{s→V}, from the 3DGS cache. These are encoded into latents Z_{s→V} = {z_{s→t}}_{t=1}^T, and we also obtain the corresponding latent-space masks M_latent = {M_latent,t}_{t=1}^T. Our goal is twofold: to fill in the blank (occluded) regions and to revise the non-blank (warped) regions, which may suffer from blur or distortion.

We achieve this by initializing the reverse diffusion process from a spatially-varying noise level, analogous to image-to-video translation. Let the full reverse schedule consist of N timesteps, from T_N = 1000 down to T_1 = 1. We define a strength parameter τ ∈ [0, 1], which maps to an intermediate timestep T_start and its corresponding noise level σ_start. We set the noise level for the blank (filled) regions to σ_filled = σ_{T_N}, which corresponds to pure noise.

For each frame t, we construct a spatially-varying noise map Σ_start,t using the latent-space mask:

Σ_start,t = M_latent,t · σ_start + (1 − M_latent,t) · σ_filled.

We then generate the initial noisy latent sequence Z_start = {z_start,t}_{t=1}^T for the reverse process. This is done by applying the noise map Σ_start,t to the warped latent z_{s→t}, using a sampled Gaussian noise ε_t:

z_start,t = (1 − Σ_start,t) ⊙ z_{s→t} + Σ_start,t ⊙ ε_t.

This formulation effectively initializes the blank regions with pure noise (as σ_filled ≈ 1.0) while applying a partial, strength-controlled noising to the warped regions.

Figure 4. We visualize videos generated by our method against those by GenWarp [55], CameraCtrl [16], and VMem [32]. Our WorldWarp generalizes to diverse camera motion, showcasing its spatial and temporal consistency.

Our ST-Diff model (G_θ) then takes this spatially-mixed latent sequence Z_start, the VLM text prompt, and the corresponding spatially-varying time embeddings as input. It denoises the sequence beginning from its spatially-varying timesteps (e.g., T_start for warped regions and T_N for blank regions) down to T_1 to generate the k-th chunk of novel views. This newly generated chunk is then used as the history for the next iteration (k + 1), and the entire process repeats.
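A compact sketch of this fill-and-revise initialization for one chunk, assuming the strength τ maps linearly to the starting noise level on a flow-matching schedule:

```python
import torch

def init_chunk_latents(z_warped, m_latent, tau=0.8, sigma_max=1.0):
    """Initialize the reverse process for one chunk.

    z_warped: (T, C, H', W') latents rendered from the 3DGS cache.
    m_latent: (T, 1, H', W') validity masks (1 = warped, 0 = blank/occluded).
    """
    sigma_start = tau * sigma_max          # partial noise for warped regions (assumed mapping)
    sigma_filled = sigma_max               # pure noise for blank regions

    # Spatially-varying starting noise map per frame.
    sigma = m_latent * sigma_start + (1.0 - m_latent) * sigma_filled

    eps = torch.randn_like(z_warped)
    # Blank regions start from pure noise; warped regions keep structure from the cache.
    z_start = (1.0 - sigma) * z_warped + sigma * eps
    return z_start, sigma
```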

We fine-tune WorldWarp from the Wan2.1-T2V-1.3B [60] model at a resolution of 720×480, with a per-GPU batch size of 8, on 8 H200 GPUs for 10k iterations. We apply TTT3R [9] as the 3D reconstruction foundation model for estimating camera parameters and depth maps. Please refer to the supplementary material for more details.

Datasets and evaluation metrics. We conduct experiments on two public scene-level datasets: RealEstate10K (Re10K) [84] and DL3DV [36]. Our evaluation of novel view synthesis quality comprises three main components. 1) Perceptual quality: we measure the distributional similarity between generated views and the test set using the Fréchet Inception Distance (FID) [17]. 2) Detail preservation: following [47], we assess the model's ability to preserve image details across views by computing PSNR, SSIM [66], and LPIPS [81]. 3) Geometric alignment: we evaluate camera pose accuracy against the ground truth (R_gt, t_gt), following [67]. We use DUST3R [64] to extract poses (R_gen, t_gen) from generated views. We then compute the rotation distance R_dist and translation distance t_dist:

R_dist = arccos( (tr(R_gen R_gt^T) − 1) / 2 ),   t_dist = || t_gen − t_gt ||_2,

where tr stands for the trace of a matrix. Per [16], estimated poses are expressed relative to the first frame, and translation is normalized by the furthest frame.
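For reference, the two pose metrics as described can be computed as in the sketch below, using the standard geodesic rotation distance; averaging over frames is an assumption about how the per-sequence numbers are aggregated.

```python
import numpy as np

def pose_errors(R_gen, t_gen, R_gt, t_gt):
    """Rotation distance (radians) and translation distance between pose sets.

    R_*: (N, 3, 3) rotations, t_*: (N, 3) translations, both expressed relative
    to the first frame, with translations normalized by the furthest frame.
    """
    # Geodesic rotation distance: arccos((tr(R_gen R_gt^T) - 1) / 2), per frame.
    cos = (np.einsum('nij,nij->n', R_gen, R_gt) - 1.0) / 2.0
    r_dist = np.arccos(np.clip(cos, -1.0, 1.0)).mean()

    # Euclidean distance between normalized translations, per frame.
    t_dist = np.linalg.norm(t_gen - t_gt, axis=1).mean()
    return r_dist, t_dist
```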

We present a comprehensive quantitative evaluation on the RealEstate10K dataset in Table 1, assessing generation quality (PSNR, LPIPS) and camera pose accuracy (R_dist, T_dist) for short-term (50th frame) and long-term (200th frame) synthesis. Our method achieves state-of-the-art results, outperforming all baselines across all 12 metrics. This advantage is most pronounced in the challenging long-term setting: while most methods suffer significant quality degradation, our model maintains the highest PSNR (17.13) and the best (lowest) LPIPS (0.352), surpassing strong competitors like SEVA, VMem, and DFoT. This high fidelity is crucial, as pose estimation (using Master3R) fails on the low-quality or blurry outputs from baselines. Consequently, our model achieves the lowest long-term pose error (R_dist 0.697, T_dist 0.203). This highlights a clear distinction: camera-embedding methods (MotionCtrl, CameraCtrl) suffer severe pose drift, and while 3D-aware methods (GenWarp, VMem) are more stable, our spatio-temporal noise diffusion strategy significantly surpasses both, proving its superior ability to mitigate cumulative camera drift. Qualitative results are in Fig. 4 and the supplementary material.

Table 1. Quantitative comparison for single-view NVS on the RealEstate10K [84] dataset. We report performance for both short-term (50th frame) and long-term (200th frame) synthesis. For each metric, the best, second best, and third best results are highlighted. Our method significantly outperforms all baselines across most metrics, demonstrating superior quality and temporal consistency.

We further validate our model on the more challenging DL3DV dataset in Tab. 2. Despite the complex trajectories degrading performance for all methods, our model maintains a commanding lead in all 12 metrics, demonstrating superior robustness. In the demanding long-term (200th frame) setting, our model's PSNR (14.53) decisively outperforms the next-best competitors, DFoT (13.51) and VMem (12.28). This fidelity is again proven critical for pose accuracy. On this complex dataset, our model remains the most stable, achieving the lowest R_dist (1.007) and T_dist (0.412). The weaknesses of competing approaches are magnified here, as 3D-aware methods like GenWarp (R_dist 1.351) and VMem (R_dist 1.419) lose stability. This proves our spatio-temporal noise diffusion strategy is more effective at preserving 3D consistency and mitigating severe camera drift on complex, long-range trajectories. Visualizations are in Fig. 4 and the supplementary material.

Figure 5. Illustration of the ST-Diff’s generating process. We illustrate the GT images, the warped images which serve as the condition for ST-Diff, the corresponding validity mask, and our final generated frames. The comparisons show that our ST-Diff successfully fills in the blank areas (initialized from a full noise level) while simultaneously revising distortions and enhancing details in the non-blank regions (initialized from a partial noise level) during the diffusion process.

We conduct ablation studies on the RealEstate10K dataset in Table 3 to validate our two core design choices: the 3DGS-based cache and the spatio-temporal noise diffusion model.

Caching Mechanism. We first analyze the effect of our caching module. The "No Cache" baseline, which relies only on the initial image, fails completely in long-term generation, with PSNR dropping to 9.22. This confirms the necessity of a 3D cache for long-range synthesis. We then compare our full model, "Caching by online optimized 3DGS," against "Caching by RGB point cloud." Although our model is trained on warped point clouds (with unoptimized, uniform radii), using a simple point cloud cache at inference yields noticeably lower fidelity than the online-optimized 3DGS cache.

Table 2. Single-view NVS on the DL3DV dataset [36]. Short-term evaluation is on the 50th frame, and long-term evaluation is on the 200th frame. This dataset is significantly more challenging due to complex camera trajectories and diverse environments. All methods show a noticeable performance drop compared to RealEstate10K [84]. For each metric, the best, second best, and third best results are highlighted.

Inference efficiency. We provide a detailed breakdown of the inference latency per video chunk in Table 4. The average total time to generate one chunk (49 frames) is 54.5 seconds. The primary computational bottleneck is the iterative denoising process of our spatio-temporal diffusion model (ST-Diff), which requires 42.5 seconds for 50 steps, accounting for approximately 78% of the total time. In contrast, all 3D-related components are highly efficient: estimating the initial 3D representation with TTT3R takes 5.8 s, optimizing the 3DGS cache takes only 2.5 s, and forward warping is near-instant at 0.2 s. This analysis demonstrates that the 3D-aware caching and conditioning, while critical for quality and consistency, add only minimal computational overhead (8.5 s in total) compared to the main generative backbone.

In this work, we propose WorldWarp, a novel autoregressive framework for long-range, geometrically consistent novel view extrapolation. Our method is designed to overcome the key limitation of prior work: the inability of standard generative models to handle imperfect 3D-warped priors. We introduce the ST-Diff model, a non-causal diffusion model trained with a spatio-temporally varying noise schedule. This design explicitly trains the model to solve the fill-and-revise problem, simultaneously filling blank regions from pure noise while revising distorted content from a partially noised state. By coupling this model with an online 3D geometric cache to avoid irreversible error propagation, WorldWarp achieves state-of-the-art performance, setting a new bar for long-range, camera-controlled video generation.

WorldWarp: Propagating 3D Geometry with Asynchronous Video Diffusion
Supplementary Material

Training. We implement our WorldWarp using the Wan2.1-T2V-1.3B diffusion transformer [60] as the generative backbone. All parameters of the model are fully fine-tuned in an end-to-end manner. The model is trained for 10k steps with a global batch size of 64 (8 per GPU) on 8 NVIDIA H200 GPUs. We utilize the AdamW optimizer with a learning rate of 5 × 10^-5 and apply a 1,000-step linear warmup. The training video resolution is set to 480 × 720.

Inference. The video generation process is initiated from a single source image. Subsequent content is synthesized autoregressively in chunks of 49 frames. To ensure temporal continuity, we utilize a fixed context overlap of 5 frames for every iteration following the initial chunk. To establish the global coordinate system, we first estimate camera poses and intrinsics from the reference video using TTT3R [9]. For generation lengths exceeding the reference trajectory, we employ a velocity-based extrapolation strategy, computing linear velocity for translation and Spherical Linear Interpolation (SLERP) for rotation based on a 20-frame smoothing window. During the generation of each chunk, we optimize the online 3DGS cache for 500 iterations with a learning rate of 1.6 × 10^-3 to render the warped priors X_{s→V}. We utilize the spatio-temporally varying schedule described in Fig. 3 of the main text for the reverse diffusion process. Specifically, the latent representations of the 5 context overlap frames are enforced as hard constraints. For the target frames, we set the strength parameter τ = 0.8. Consequently, regions with valid geometric warps (M_latent = 1) are initialized with a reduced noise level σ_start corresponding to τ, preserving the structural integrity of the 3D cache. Conversely, occluded or blank regions (M_latent = 0) are initialized with standard Gaussian noise (σ_filled = σ_{T_N} ≈ 1.0) to facilitate generative inpainting. The diffusion model employs a Flow Match Euler Discrete Scheduler with 50 denoising steps.
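The velocity-based trajectory extrapolation can be sketched as follows; the constant-velocity assumption over the 20-frame window and the slerp-style fractional rotation step are our reading of the description above, not the released implementation.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def extrapolate_poses(rotations, translations, n_future, window=20):
    """Extend a camera trajectory beyond its last pose.

    rotations:    (N, 3, 3) camera rotations.
    translations: (N, 3) camera positions.
    Assumes a constant linear velocity and a constant per-frame rotation,
    both estimated over the last `window` frames.
    """
    R = Rotation.from_matrix(np.asarray(rotations[-window:]))
    t = np.asarray(translations[-window:])

    # Average per-frame motion over the smoothing window.
    lin_vel = (t[-1] - t[0]) / (len(t) - 1)
    delta = R[-1] * R[0].inv()                                         # net rotation over window
    step = Rotation.from_rotvec(delta.as_rotvec() / (len(t) - 1))      # fractional (slerp-like) step

    new_R, new_t = [], []
    cur_R, cur_t = R[-1], t[-1]
    for _ in range(n_future):
        cur_R = step * cur_R
        cur_t = cur_t + lin_vel
        new_R.append(cur_R.as_matrix())
        new_t.append(cur_t.copy())
    return np.stack(new_R), np.stack(new_t)
```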

We illustrate more results on the RealEstate10K [84] and DL3DV [36] datasets in Fig. 6 and Fig. 7. Please refer to video supplementary material for more results.

To further demonstrate the robust generalization capacity of ST-Diff, we extend our evaluation beyond standard photorealistic benchmarks to scenes rendered in diverse artistic styles in Fig. 8. By prompting the model with specific stylistic descriptors, such as “Van Gogh style” or “Studio Ghibli style,” we generate a variety of stylized video sequences.

The results illustrate that our method successfully synthesizes these highly stylized scenes while strictly preserving the underlying 3D geometric consistency. These qualitative results validate that our proposed training strategy effectively integrates fine-grained geometric control without sacrificing the rich semantic and aesthetic generalization capabilities inherent in the pre-trained model. This confirms that adapting the foundation model into an asynchronous diffusion framework does not compromise its ability to interpret open-domain text prompts.

To validate the effectiveness of our geometry-aware inference strategy, we visualize the evolution of the noise schedule matrix Σ_V throughout the reverse diffusion process. As shown in Fig. 9, the visualization is structured as a spatio-temporal grid where each row corresponds to a latent temporal token t (derived from the 49 video frames via VAE encoding) and columns progress through the denoising steps from left (T = 999) to right (T = 0).

The map explicitly corroborates our dual-schedule formulation. The top two rows, representing the 5 history context frames (corresponding to ∼2 latent tokens), remain fully constrained with zero noise (dark purple) throughout the process, ensuring seamless transitions from previous chunks. In the subsequent 11 rows (the generated tokens), we observe a distinct spatial modulation. The valid geometric regions, projected from the 3DGS cache, are maintained at the reduced noise level set by the strength τ (intermediate green/teal) to preserve high-fidelity structural details. In contrast, occluded or blank regions are initialized with high-variance noise (yellow) to facilitate the generative hallucination of new content. This confirms that our model effectively balances geometric preservation with generative inpainting during the autoregressive process.

Despite the effectiveness of ST-Diff in generating geometrically consistent long-term videos, our method is subject to certain limitations common to autoregressive video generation frameworks.

Error Accumulation in Long-horizon Generation. Although our model is trained in an asynchronous diffusion manner, where we apply varying noise strengths to different frames and spatial regions to mimic inference conditions, generating infinite-length video sequences with perfect fidelity remains an unresolved challenge. In our autoregressive pipeline, small errors can still accumulate across chunks over very long horizons.

Dependency on Geometric Priors. Our method operates on the premise that forward-warped images provide strong geometric hints for generation. Therefore, our performance is heavily dependent on the accuracy of the upstream 3D geometry foundation models, such as TTT3R [9] or VGGT [63], used for depth and camera pose estimation. In scenarios where these pre-trained estimators struggle, including complex outdoor environments with extreme lighting, transparency, or lack of texture, the estimated depth maps and poses may be inaccurate. This produces warped priors that deviate significantly from the true geometry. While ST-Diff is designed to correct artifacts, it may fail to recover high-quality frames when the geometric guidance is fundamentally flawed or contains excessive noise.




This content is AI-processed based on open access ArXiv data.
