DePT3R: Joint Dense Point Tracking and 3D Reconstruction of Dynamic Scenes in a Single Forward Pass


๐Ÿ“ Original Info

  • Title: DePT3R: Joint Dense Point Tracking and 3D Reconstruction of Dynamic Scenes in a Single Forward Pass
  • ArXiv ID: 2512.13122
  • Date: 2025-12-15
  • Authors: Vivek Alumootil, Tuan-Anh Vu, M. Khalid Jawed

๐Ÿ“ Abstract

Figure 1. DePT3R achieves robust dense point tracking and reconstruction accuracy across unposed sequences while requiring less memory usage, highlighting the effectiveness of our approach for long-range, dynamic scenes.

📄 Full Content

Understanding 3D scenes from images remains a foundational challenge in computer vision, with wide-ranging implications in autonomous navigation, augmented reality, and robotics. The ability to accurately track points and reconstruct 3D structures within complex and dynamic environments is vital for enabling intelligent systems to perceive and interact effectively with the real world [30,31]. Traditional methods typically rely on extensive processing pipelines and auxiliary inputs, including depth maps and precise camera parameters, none of which are explicitly required for human visual perception. Moreover, these conventional approaches often fall short when handling dynamic scenes, which constitute a substantial portion of real-world environments, highlighting a significant gap between computational perception capabilities and human visual competence [26].

Recent advances, notably DUSt3R [36], have significantly pushed the boundaries of pose-free 3D reconstruction. DUSt3R employs an extensively trained asymmetric transformer that implicitly aligns image features via cross-attention, enabling strong performance in static-scene reconstruction and downstream tasks such as point correspondence and scene flow estimation. While originally designed for static environments, DUSt3R also exhibits promising behavior on dynamic content, motivating further investigation into unified reconstruction and tracking. However, DUSt3R relies on pairwise image processing, which limits scalability and efficiency when applied to longer sequences and dynamic scenes. Subsequent work has therefore explored globally aggregated attention for multi-view reconstruction, achieving improved scalability and accuracy in static settings (e.g., Fast3R [40], VGGT [33]).

Among learning-based global formulations, VGGT [33] is a strong baseline that directly predicts camera parameters, pointmaps, depth, and 2D point tracks in a feed-forward manner, thereby avoiding additional optimization stages. Despite its strong reconstruction capabilities, VGGT exhibits two key limitations for dynamic scene understanding: (i) it struggles under substantial non-rigid deformations, and (ii) its memory footprint constrains dense tracking and limits efficiency when processing many frames jointly.

Motivated by these limitations, we introduce DePT3R, a feed-forward framework that jointly performs dense point tracking and 3D reconstruction of dynamic scenes from unposed monocular image sequences via global image aggregation (Figure 1). Instead of chaining pairwise correspondences across time, DePT3R adopts a frame-to-query formulation: given an observation time t and a query time q, the model predicts per-frame geometry (pointmaps/depth) and a motion field that directly maps points from t → q in a single forward pass. This design enables long-range deformation reasoning without accumulating drift from frame-to-frame composition, while retaining the simplicity and efficiency of a feed-forward pipeline.

Empirically, we evaluate DePT3R on multiple dynamic-scene benchmarks using the metrics and protocols described in Section 4. Under this evaluation, DePT3R improves 3D point tracking accuracy over prior learning-based baselines on PointOdyssey, DynamicReplica, and Panoptic Studio, and yields comparable reconstruction quality on PointOdyssey and TUM RGB-D (Table 1). We further observe that models trained on short clips generalize to longer test sequences in our setting, and we report memory measurements indicating improved feasibility for dense tracking compared to query-based tracking in the same resolution regime (Table 1).

In summary, we make the following contributions:

  • We introduce DePT3R, a framework that jointly performs dense point tracking and 3D reconstruction from unposed monocular image sequences without requiring auxiliary depth or external pose inputs.
  • We propose a frame-to-query formulation that predicts a motion field mapping points from each observation time to a specified query time, enabling tracking without explicit frame-to-frame chaining.
  • We extend a globally aggregated transformer backbone with a dedicated motion head, query conditioning, and an intrinsic embedding to incorporate camera intrinsics.
  • We validate DePT3R on PointOdyssey, DynamicReplica, and Panoptic Studio for 3D point tracking, use PointOdyssey and TUM RGB-D for reconstruction, and provide memory measurements for dense tracking, compared with a query-based tracker operating at a similar resolution.

Tabula rasa methods aim to reconstruct 3D scenes solely from raw input observations without reliance on prior scene knowledge. Early techniques, such as Structure from Motion (SfM), relied heavily on matching handcrafted features across multiple source images to estimate camera poses and reconstruct scene structure [10]. The advent of Neural Radiance Fields (NeRF) [20] initiated a paradigm shift by formulating reconstruction as an optimization-based novel-view synthesis problem, where scenes are represented as continuous neural fields. InstantNGP [21] accelerated this optimization by coupling a compact neural network with a multi-resolution hash table. Further improvements in rendering speed were achieved by 3D Gaussian Splatting [16], which explicitly represents scenes as collections of anisotropic Gaussian functions, efficiently rasterized for real-time rendering. Subsequently, Mip-Splatting [42] reduced rendering aliasing by constraining Gaussian sizes. Despite impressive results in view synthesis, these approaches remain per-scene, optimization-based pipelines.

Deep learning has driven rapid progress in 3D reconstruction. Early methods employed multi-stage pipelines with learned feature extractors or end-to-end models, focusing on cost volumes and correlation operations [32]. More recently, transformer-based architectures have gained prominence, mitigating the limitations of CNNs such as restricted receptive fields and weak long-range dependency modeling [47].

One notable breakthrough was DUSt3R [36], which enabled 3D reconstruction from unposed image pairs using asymmetric transformers linked through cross-attention mechanisms. The predictions from individual image pairs were subsequently globally optimized in a standard coordinate system. Subsequent developments included symmetric architectures and global attention layers, which significantly improved memory efficiency for processing larger image sets [40].

MASt3R [18] built on DUSt3R’s architecture by adding additional point-map predictions, allowing for more accurate pixel matching. Furthermore, recent methods have combined DUSt3R-inspired reconstruction strategies with efficient Gaussian splatting-based rendering, facilitating rapid, feed-forward novel-view synthesis [2,3,6,25].

Adapting DUSt3R for dynamic scenes, MonST3R [44] demonstrated its effectiveness through targeted fine-tuning. CUT3R [35] further developed this concept by introducing an online, recurrent reconstruction framework that incrementally processes images and updates scene reconstructions. Recent advances in learning-based 3D reconstruction have culminated in VGGT [33], which serves as a robust foundation for 3D reconstruction. This method employs a large transformer-based architecture to predict essential 3D attributes, including camera intrinsics, point maps, depth maps, and point tracks. By directly predicting these attributes, VGGT eliminates the need for post-processing, achieving state-of-the-art results in 3D point and camera pose reconstruction. Despite these advancements, challenges remain due to substantial camera and object motion, underscoring the need for further exploration in this area.

Optical flow estimation, traditionally framed as an energy minimization problem, was transformed by FlowNet [5], a pioneering deep learning-based solution that leverages convolutional neural networks. RAFT [28] introduced an iterative refinement strategy, querying a 4D cost volume for flow updates. Flowformer [11] improved upon this by transforming the 4D cost volume into tokenized representations processed by transformer layers.

Beyond pairwise flow, recent research has advanced pixel tracking across multiple frames. PIPs [9] revisited the particle video formulation of multi-frame point trajectory estimation, leveraging temporal priors to improve occlusion handling, but its eight-frame context window limited its applicability. TAP-Vid [4] reformulated the tracking problem, established a benchmark, and introduced TAP-Net as a simple baseline. CoTracker [15] recognized statistical interdependencies among trajectories, utilizing transformers to estimate large sets of point tracks jointly.

Estimation of the 3D motion of points has also advanced significantly. RAFT-3D [29] employed rigid motion embeddings to estimate scene flow between RGB-D pairs. Omnimotion [34] unified 2D and 3D tracking with a quasi-3D canonical representation but still required per-scene optimization. SpatialTracker [38], introduced by Xiao et al., enables feed-forward tracking by creating a triplane scene encoding with depth estimation; its core mechanism is a transformer that iteratively refines query point trajectories, producing updated paths from the prior trajectory, the query point's features, and the features of nearby points. DELTA [22] efficiently computed dense 3D trajectories, while TAPIP3D [43] leveraged depth to lift image features into a global coordinate system for enhanced performance. Building on the DUSt3R architecture, St4RTrack [7] adapted it for 3D point tracking by incorporating an additional prediction head and chaining motion predictions for longer trajectories. However, its pairwise processing limits the exploitation of temporal attention, thereby restricting its ability to handle significant camera motion effectively. POMATO [45] addressed these limitations by introducing a temporal attention module, eliminating scale normalization, and enhancing inter-frame interactions. Nonetheless, its fundamentally pairwise processing strategy continued to restrict fully global interactions, indicating room for further advancement. Stereo4D [12] attached a temporal motion head to the DUSt3R architecture to estimate the positions of scene points at any query time between the two input frames, and introduced a large dataset of internet videos with noisy point trajectories to train the motion head.

DePT3R jointly produces dense point tracks and reconstructs dynamic scenes from a sequence of RGB inputs with a single forward pass. An overview is shown in Figure 2.

Pointmap Representation. We adopt the time-dependent pointmap representation introduced by St4RTrack [7]. This representation assumes that every pixel in an image corresponds to a 3D point and maps each pixel to a specific 3D position. A pointmap ${}^{c}X^{a}_{t} \in \mathbb{R}^{H \times W \times 3}$ encodes the 3D positions of the scene points visible in frame $a$, at the time of frame $t$, expressed in frame $c$'s coordinate system. Specifically, ${}^{c}X^{a}_{t}(i, j)$ denotes the 3D position, at the time of frame $t$, of the scene point corresponding to pixel $(i, j)$ in frame $a$, with this position given in frame $c$'s coordinate system. To illustrate, ${}^{c}X^{a}_{1}$ captures the 3D positions of the scene points visible in frame $a$ at the time of frame 1, while ${}^{c}X^{a}_{2}$ captures the positions of these same points at the time of frame 2.

In a static scene, the time subscript can be omitted due to the absence of point motion. As noted by DUSt3R [36], we can obtain ${}^{a}X^{a}$ by unprojecting the pixels of frame $a$ into 3D space with the intrinsic matrix $K_a$ and depth map $D$:

$${}^{a}X^{a}(i, j) = K_a^{-1}\,[\,i\,D(i, j),\; j\,D(i, j),\; D(i, j)\,]^{T}.$$

Subsequently, ${}^{c}X^{a}$ can be readily obtained by transforming these 3D positions into frame $c$'s coordinate system using the camera poses of frames $a$ and $c$.
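For concreteness, here is a minimal NumPy sketch of this unprojection for the static-scene case; the function name and the pinhole-intrinsics layout are assumptions for illustration, not the paper's implementation:

```python
import numpy as np

def unproject_pointmap(depth: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Unproject a depth map into a pointmap in the camera's own frame.

    depth: (H, W) depth values D(i, j)
    K:     (3, 3) pinhole intrinsics
    returns: (H, W, 3) pointmap aX^a, i.e. K^{-1} [i*D, j*D, D]^T per pixel
    """
    H, W = depth.shape
    # Pixel coordinate grid: i indexes columns (x), j indexes rows (y).
    i, j = np.meshgrid(np.arange(W), np.arange(H))
    homog = np.stack([i * depth, j * depth, depth], axis=-1)  # (H, W, 3)
    return homog @ np.linalg.inv(K).T  # apply K^{-1} to each pixel vector

# Transforming into frame c's coordinates then only needs the relative pose:
# X_c = X_a @ R_ca.T + t_ca
```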

Conversely, in a dynamic scene, most pointmaps cannot be described by simple geometric transforms alone, as they inherently encode scene motion. Note that by leveraging ${}^{c}X^{a}_{t_1}$ and ${}^{c}X^{a}_{t_2}$, we can effectively analyze the motion of the scene points visible in frame $a$ from the time of frame $t_1$ to the time of frame $t_2$.

Feature Backbone. We leverage the Visual Geometry Grounded Transformer (VGGT) [33], a robust architecture for unposed 3D reconstruction trained on extensive static and dynamic datasets, as our core backbone.

In the VGGT method, each input image is first decomposed into image tokens via DINOv2 [23]. These tokens, along with a learnable camera token and four learnable register tokens per image, are then input to a global aggregator module. Since predictions are made in the initial frame’s coordinate system, the camera and register tokens for the first image differ from those for subsequent frames, enabling the network to discriminate among frames. The aggregator module consists of alternating local and global attention layers: local layers facilitate interactions among tokens within the same image, while global layers promote interactions across images.
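A minimal PyTorch-style sketch of the alternating local/global attention pattern described above; the layer sizes, the `AggregatorBlock` name, and the use of plain multi-head self-attention are illustrative assumptions, not the exact VGGT implementation:

```python
import torch
import torch.nn as nn

class AggregatorBlock(nn.Module):
    """One local + one global attention layer over per-frame token sets."""

    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        self.local_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, T, C) = batch, frames, tokens per frame
        # (image + camera + register tokens), channels.
        B, N, T, C = tokens.shape

        # Local attention: tokens only attend within their own frame.
        x = tokens.reshape(B * N, T, C)
        x = x + self.local_attn(x, x, x, need_weights=False)[0]

        # Global attention: tokens attend across all frames jointly.
        x = x.reshape(B, N * T, C)
        x = x + self.global_attn(x, x, x, need_weights=False)[0]
        return x.reshape(B, N, T, C)

frames = torch.randn(1, 4, 1 + 4 + 196, 768)  # camera + registers + patch tokens
out = AggregatorBlock()(frames)
```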

Output tokens from the aggregator are processed by dedicated Dense Prediction Transformer (DPT) heads [24], specifically a pointmap DPT head and a depth DPT head, to regress the corresponding pointmaps and depth maps. These heads also output aleatoric uncertainty maps. Camera tokens are further processed by four self-attention layers followed by a linear projection to predict normalized camera extrinsics and intrinsics.

Our method takes a sequence of $N$ frames, $(I_t)_{t=1}^{N}$, and a query time $q$, and performs joint reconstruction of the scene and 3D tracking of observable points. To this end, for each frame $I_t$ we predict two pixel-wise maps expressed in the coordinate system of the first frame $I_1$:

  • ${}^{1}X^{t}_{t}$, capturing point positions at the observation time, and
  • ${}^{1}M^{t}_{q} = {}^{1}X^{t}_{q} - {}^{1}X^{t}_{t}$, capturing the motion of points from the time of frame $t$ to the time of the query frame $q$.

By jointly estimating these maps, our method concurrently reconstructs the 3D structure and tracks point motion to the query frame. Notably, our tracking approach establishes direct correspondences between each frame and the query frame, rather than using pairwise, frame-to-frame tracking, which enables the efficient capture of overall motion and deformation over extended intervals.
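A minimal sketch of how a dense track can be read off from these two maps at inference time; the tensor names and the treatment of the motion map as an additive offset in the first frame's coordinates follow the formulation above and are illustrative only:

```python
import torch

def track_to_query(pointmaps: torch.Tensor, motion_maps: torch.Tensor) -> torch.Tensor:
    """Dense 3D tracks from per-frame predictions.

    pointmaps:   (N, H, W, 3) -- 1X^t_t, positions of frame t's pixels at time t
    motion_maps: (N, H, W, 3) -- 1M^t_q, displacement of those points to query time q
    returns:     (N, H, W, 3) -- positions of every pixel of every frame at time q,
                 all expressed in the first frame's coordinate system
    """
    return pointmaps + motion_maps

# Example: pixel (i, j) of frame t, tracked to the query frame.
N, H, W = 10, 518, 518
X = torch.randn(N, H, W, 3)
M = torch.randn(N, H, W, 3)
t, i, j = 3, 100, 200
p_at_q = track_to_query(X, M)[t, i, j]  # 3D location at time q
```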

We integrate an additional DPT head, the motion head, into the VGGT network architecture to regress the motion map ${}^{1}M^{t}_{q}$. This structural integration preserves VGGT's pretrained capabilities while enabling motion estimation. Despite sharing identical inputs, the reconstruction and motion heads produce temporally distinct pointmaps. Prior works [7,19] showed that shared image tokens can effectively encode both reconstruction and motion information, thus permitting accurate motion prediction without degrading reconstruction quality.

Previous work [41] has identified the importance of providing intrinsic information for accurate 3D reconstruction. Since camera intrinsic parameters are often readily available through calibration, we propose incorporating an intrinsic embedding that lets the model leverage these crucial parameters directly. To implement the intrinsic embedding, we concatenate a subset of the normalized camera intrinsic parameters, specifically the focal lengths $f_x$ and $f_y$ and the principal point's $y$-coordinate $p_y$, and pass this vector through a linear layer. Note that the principal point's $x$-coordinate $p_x$ is excluded from this embedding because all input images are scaled to a fixed width of 518 pixels [33]. The resulting feature is then added to all image tokens, providing the model with a direct representation of the camera's intrinsic properties.
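A minimal PyTorch sketch of such an intrinsic embedding under the description above; the normalization by image size, the module name, and the hidden dimension are assumptions for illustration:

```python
import torch
import torch.nn as nn

class IntrinsicEmbedding(nn.Module):
    """Embed (f_x, f_y, p_y) and add the result to every image token."""

    def __init__(self, dim: int = 768):
        super().__init__()
        self.proj = nn.Linear(3, dim)

    def forward(self, tokens: torch.Tensor, K: torch.Tensor,
                image_hw: tuple[int, int]) -> torch.Tensor:
        # tokens: (B, T, C) image tokens of one frame; K: (B, 3, 3) intrinsics.
        H, W = image_hw
        fx, fy = K[:, 0, 0] / W, K[:, 1, 1] / H   # normalized focal lengths
        py = K[:, 1, 2] / H                       # normalized principal point y
        feat = self.proj(torch.stack([fx, fy, py], dim=-1))  # (B, C)
        return tokens + feat[:, None, :]          # broadcast over all tokens

tokens = torch.randn(2, 196, 768)
K = torch.eye(3).repeat(2, 1, 1) * 500.0
K[:, 2, 2] = 1.0
out = IntrinsicEmbedding()(tokens, K, (518, 518))
```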

Following VGGT, our multi-task loss is defined as:

The camera loss $\mathcal{L}_{\text{camera}}$ is computed using the Huber loss:

where $\hat{g}_i$ denotes the predicted camera parameters and $g_i$ the ground truth. The depth loss ($\mathcal{L}_{\text{depth}}$) is a combination of three key components: a regression loss ($\mathcal{L}_{\text{reg}}$), a confidence loss ($\mathcal{L}_{\text{conf}}$), and a gradient loss ($\mathcal{L}_{\text{grad}}$).

The regression loss ($\mathcal{L}_{\text{reg}}$) quantifies the average error between the predicted depths ($\hat{D}_i$) and the ground-truth depths ($D_i$) over all supervised pixels:

The confidence loss ($\mathcal{L}_{\text{conf}}$) weights these errors using a predicted uncertainty map ($\Sigma^{D}_{i}$):

Finally, the gradient loss ($\mathcal{L}_{\text{grad}}$) promotes smoothness in the predicted depth map by comparing the gradients of the predicted and ground-truth depths, weighted by uncertainty:

We employ a similar approach for both the point loss and the motion loss; however, the motion loss does not include the gradient-based term.
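For reference, a possible LaTeX rendering of these loss terms, consistent with the description above; the loss weights, the exact confidence parameterization (the $\alpha \log$ term), and the per-pixel norms are assumptions rather than the paper's exact formulas:

```latex
% Multi-task loss (weights are assumptions):
\mathcal{L} = \mathcal{L}_{\text{camera}} + \lambda_{d}\,\mathcal{L}_{\text{depth}}
            + \lambda_{p}\,\mathcal{L}_{\text{pmap}} + \lambda_{m}\,\mathcal{L}_{\text{motion}}

% Camera loss: Huber distance between predicted and ground-truth parameters.
\mathcal{L}_{\text{camera}} = \sum_{i=1}^{N} \left\lVert \hat{g}_i - g_i \right\rVert_{\epsilon}

% Depth loss components over the set of supervised pixels \mathcal{P}:
\mathcal{L}_{\text{reg}}  = \frac{1}{|\mathcal{P}|}\sum_{i \in \mathcal{P}}
  \left\lVert \hat{D}_i - D_i \right\rVert
\qquad
\mathcal{L}_{\text{conf}} = \frac{1}{|\mathcal{P}|}\sum_{i \in \mathcal{P}}
  \left( \Sigma^{D}_{i}\left\lVert \hat{D}_i - D_i \right\rVert - \alpha \log \Sigma^{D}_{i} \right)
\qquad
\mathcal{L}_{\text{grad}} = \frac{1}{|\mathcal{P}|}\sum_{i \in \mathcal{P}}
  \Sigma^{D}_{i}\left\lVert \nabla \hat{D}_i - \nabla D_i \right\rVert
```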

We designed our experimental setup to evaluate the scalability and versatility of DePT3R across diverse environments, utilizing large datasets and comprehensive metrics that assess both tracking accuracy and reconstruction quality. Datasets. Our training was conducted on five synthetic datasets. PointOdyssey (PO) [46], DynamicReplica (DR) [14], and Kubric Movi-F [8] offer camera and scene motion, as well as ground truth mesh vertex trajectories, which we use for sparse point tracking supervision during training. Virtual KITTI 2 [1] and TartanAir [37] do not include point trajectories; therefore, we only use them to supervise camera poses, depth maps, and point maps.

We evaluate our method on PointOdyssey and DynamicReplica, as well as Panoptic Studio (PS) [13] and the TUM RGB-D SLAM Benchmark [27]. We also include a qualitative evaluation on the Stereo4D dataset [12]. We do not evaluate on the Aria Digital Twin Benchmark [17] because the VGGT method was trained on it. Baselines. We selected recent state-of-the-art (SOTA) methods for comparison, including three point tracking methods (SpatialTracker [38], MonST3R [44], and St4RTrack [7]) and five 3D reconstruction methods (DUSt3R [36], MASt3R [18], MonST3R [44], St4RTrack [7], and VGGT [33]). Metrics. We use two metrics to evaluate our method: APD and EPE.

The APD metric was originally proposed by TAPVid-3D [17] for tracking in 3D. Following the methodology of St4RTrack [7], we calculate the APD as the average percentage of predicted points whose error falls below a set of thresholds. Specifically, let $\hat{P}^{i}_{t}$ denote the prediction for the $i$-th point at time $t$, and let $P^{i}_{t}$ denote its ground-truth position. The APD metric is calculated as below:

where $\delta_{3D} = \{0.1\,\text{m}, 0.3\,\text{m}, 0.5\,\text{m}, 1.0\,\text{m}\}$. Since predictions are made in the coordinate system of the first frame's camera, we align the estimated 3D positions with the ground truth by scaling them by the ratio of the medians of the norms of the estimated and ground-truth positions. That is, we multiply the predicted coordinates by the scale factor $s$:

The EPE metric is defined as the average Euclidean distance between the ground truth and scaled predicted positions for all points:
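For reference, a possible LaTeX rendering of the APD, scale-alignment, and EPE definitions described above; the indicator-based APD form and the direction of the median ratio are assumptions consistent with TAPVid-3D-style evaluation rather than the paper's exact formulas:

```latex
% APD: fraction of points within each threshold, averaged over thresholds.
\text{APD} = \frac{1}{|\delta_{3D}|} \sum_{\delta \in \delta_{3D}}
  \frac{1}{NT} \sum_{t=1}^{T} \sum_{i=1}^{N}
  \mathbb{1}\!\left[ \left\lVert s\,\hat{P}^{i}_{t} - P^{i}_{t} \right\rVert_{2} < \delta \right]

% Median-based scale alignment applied to the predictions.
s = \frac{\operatorname{median}_{i,t} \lVert P^{i}_{t} \rVert_{2}}
         {\operatorname{median}_{i,t} \lVert \hat{P}^{i}_{t} \rVert_{2}}

% EPE: mean Euclidean error of the scaled predictions.
\text{EPE} = \frac{1}{NT} \sum_{t=1}^{T} \sum_{i=1}^{N}
  \left\lVert s\,\hat{P}^{i}_{t} - P^{i}_{t} \right\rVert_{2}
```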

Implementation Details. To generate the ground-truth motion maps, we project the visible 3D mesh vertices for each frame onto the image plane using the known camera extrinsics and intrinsics, then round to the nearest integer pixel coordinate. Mesh vertices are excluded if they are not visible in the query frame.
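A minimal NumPy sketch of this ground-truth generation step under the description above; the function name, the pinhole projection convention, and the visibility mask input are assumptions for illustration:

```python
import numpy as np

def project_vertices(verts_w, K, R, t, visible, hw):
    """Project visible 3D mesh vertices to integer pixel coordinates.

    verts_w: (V, 3) vertices in world coordinates at one frame
    K:       (3, 3) intrinsics; R, t: world-to-camera rotation/translation
    visible: (V,) boolean visibility mask for this frame
    hw:      (H, W) image size
    returns: (V, 2) integer (row, col) pixels, or -1 for vertices that are
             occluded or fall outside the image
    """
    H, W = hw
    cam = verts_w @ R.T + t                        # world -> camera
    uv = cam @ K.T                                 # pinhole projection
    uv = uv[:, :2] / np.clip(uv[:, 2:3], 1e-6, None)
    px = np.rint(uv).astype(np.int64)              # round to nearest pixel
    ok = (visible & (cam[:, 2] > 0)
          & (px[:, 0] >= 0) & (px[:, 0] < W)
          & (px[:, 1] >= 0) & (px[:, 1] < H))
    out = np.full((len(verts_w), 2), -1, dtype=np.int64)
    out[ok] = px[ok][:, [1, 0]]                    # store as (row, col)
    return out
```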

To effectively utilize VGGT’s extensive pretraining knowledge, we initialize the weights of the alternating attention transformer, point head, and depth head with those of VGGT, and the motion head with those of the VGGT pointmap head.

During training, sequences of 2 to 10 frames are randomly selected from all available sequences. For Kubric Movi-F, we use a stride of 1; for all other datasets, we use a stride varying between 1 and 4. TartanAir and Virtual KITTI 2 are sampled with half the frequency of the other datasets.
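A minimal sketch of this clip sampling, as described above; the function name is hypothetical, and the half-frequency sampling of TartanAir and Virtual KITTI 2 is handled elsewhere (e.g., by dataset weighting) and not shown:

```python
import random

def sample_clip(seq_len: int, dataset: str) -> list[int]:
    """Pick 2-10 frame indices with a random stride from one sequence."""
    n_frames = random.randint(2, 10)
    stride = 1 if dataset == "kubric_movif" else random.randint(1, 4)
    max_start = seq_len - (n_frames - 1) * stride
    if max_start <= 0:                      # sequence too short for this stride
        stride, max_start = 1, seq_len - n_frames + 1
    start = random.randint(0, max_start - 1)
    return [start + k * stride for k in range(n_frames)]

print(sample_clip(seq_len=120, dataset="pointodyssey"))
```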

We observe that training with the final tracking objective immediately, using model weights initialized with VGGT’s weights, degrades performance. Thus, we perform training in two phases. In the first phase, we train DePT3R using only the camera, depthmap, and pointmap losses. The intrinsic embedding is used in the first phase, but the query embedding is not. In the second phase, we integrate the query embedding and motion head, along with the tracking loss. To improve training stability and speed, we do not use a confidence-based loss in the second phase.
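A minimal sketch of this two-phase schedule expressed as a configuration dictionary; the key names are hypothetical and only mirror the description above:

```python
# Hypothetical phase configuration mirroring the two-stage training described above.
PHASES = {
    "phase1": {
        "losses": ["camera", "depth", "pointmap"],          # no tracking loss yet
        "intrinsic_embedding": True,
        "query_embedding": False,
        "confidence_loss": True,
    },
    "phase2": {
        "losses": ["camera", "depth", "pointmap", "motion"],  # adds the tracking loss
        "intrinsic_embedding": True,
        "query_embedding": True,                              # query conditioning enabled
        "confidence_loss": False,                             # dropped for stability and speed
    },
}
```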

We employ common data augmentation strategies, such as color jitter, random aspect ratios, and random center crops [36], applied uniformly over each image of the sequence. We use the Adam optimizer with a warmup phase, followed by cosine scheduling. Gradient checkpointing and bfloat16 precision are utilized to enhance efficiency and reduce GPU memory usage. In the first phase, we train on all datasets, using a learning rate of $5 \times 10^{-5}$ for the intrinsic embedding and $5 \times 10^{-6}$ for the other weights. In the second phase, we train only on PointOdyssey, DynamicReplica, and Kubric, using a learning rate of $1 \times 10^{-5}$ for the motion head and the query embedding, and $1 \times 10^{-6}$ for all other weights.
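A minimal PyTorch sketch of per-module learning rates with linear warmup followed by cosine decay, as described above; the module attribute name, warmup length, and total step count are assumptions:

```python
import math
import torch

def build_optimizer(model, base_lr=5e-6, embed_lr=5e-5,
                    warmup_steps=1000, total_steps=100_000):
    """Adam with separate learning rates plus linear warmup and cosine decay."""
    embed_params = list(model.intrinsic_embedding.parameters())   # assumed attribute
    embed_ids = {id(p) for p in embed_params}
    other_params = [p for p in model.parameters() if id(p) not in embed_ids]

    opt = torch.optim.Adam([
        {"params": other_params, "lr": base_lr},
        {"params": embed_params, "lr": embed_lr},
    ])

    def lr_lambda(step):
        if step < warmup_steps:                                   # linear warmup
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))         # cosine decay

    sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
    return opt, sched
```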

3D Point Tracking. We first evaluate the 3D point tracking task across three datasets: the test sets of the synthetic datasets PointOdyssey and DynamicReplica, which feature significant camera and scene motion, and Panoptic Studio, a real-world dataset with no camera motion. For evaluation, we follow the same setup as St4RTrack [7] and select 64 consecutive frames from 50 randomly selected sequences in each dataset. To compare with previous work, we restrict our evaluation to the trajectories of the points from the first frame, even though our method can track all points. Following St4RTrack, we exclude sequences from PointOdyssey that contain fog or are generated in the Kubric style.

As shown in Table 2, our method significantly outperforms all baselines across all datasets. Remarkably, even though our model was trained only on sequences of at most 10 images, it achieves impressive performance on much longer sequences. Moreover, unlike other approaches that require a windowing strategy or global optimization to handle extended sequences due to memory or architecture constraints, our method can generate predictions for a large number of frames in a single forward pass.

3D Reconstruction. Following St4RTrack [7], we diverge from previous work and directly evaluate the accuracy of our estimated 3D reconstruction with the APD and EPE metrics. We use the TUM RGB-D Benchmark, a real-world dataset with significant camera motion, scene motion, and motion blur, as well as the test set of PointOdyssey. We report metrics on 50 randomly chosen sequences of 64 consecutive frames. For the TUM RGB-D Benchmark, we retain only points with depths between 0.1 and 5 meters, as the depth camera's accuracy degrades at longer ranges. On the TUM RGB-D Benchmark (Table 3), DePT3R achieves an APD of 92.22 and an EPE of 0.0968, competitive with VGGT and significantly better than the other baselines. On PointOdyssey, DePT3R achieves an EPE of 0.0406 and an APD of 98.01, noticeably better than all of the baselines. These results demonstrate that DePT3R achieves state-of-the-art 3D reconstruction accuracy on both synthetic and real-world data, including out-of-distribution scenarios, using only synthetic training data.

We show visualizations of our method for both 3D point tracking and 3D reconstruction in Figure 3. Our method demonstrates strong generalization to large, realistic datasets with significant scene motion, despite being trained on a small collection of unrealistic datasets with minimal scene motion.

Ablation Studies. Table 4 evaluates the roles of the intrinsic embedding and random center-crop augmentation on the Panoptic Studio dataset.

  • No intrinsic embedding: without the intrinsic embedding, the model struggles to deal with the inherent scale ambiguity in unposed 3D reconstruction and dynamic objects (see Figure 4).
  • No center-crop augmentation: the effect of including the random center-crop augmentation is likewise isolated in Table 4.

Computational Comparison. We do not compare with SpatialTrackerV2 [39], a recent 3D point-tracking method, because its training data and computational requirements are significantly larger than ours. However, we note that it, along with other state-of-the-art point-tracking methods, struggles to scale to large numbers of query points due to substantial memory demands. To highlight this limitation, we evaluate the GPU memory usage of SpatialTrackerV2, VGGT's 2D point tracker, and our DePT3R for varying numbers of query points. We use a 10-frame video with a resolution of 518 × 518 per frame. SpatialTrackerV2 exceeds 48 GB of memory at 40k query points and ultimately fails due to out-of-memory (OOM) errors, and VGGT exhausts GPU memory at just 22.5k query points. In contrast, our method successfully performs dense point tracking, generating over 268k point tracks (518 × 518 pixels) while consuming only 12 GB of memory.

Figure. Memory usage comparison between SpatialTrackerV2, VGGT, and our DePT3R method across varying numbers of query points. SpatialTrackerV2 and VGGT exhibit a rapid increase in GPU memory consumption, exhausting the 48 GB memory limit at just 40k and 22.5k query points, respectively. In contrast, DePT3R efficiently handles 268k query points, requiring only 12 GB of memory. All experiments were performed on an RTX A6000 GPU (48 GB).

Summary. In this paper, we introduce DePT3R, a simple, novel framework for simultaneously performing dense 3D reconstruction and point tracking of dynamic scenes from unposed monocular video. Despite training exclusively on synthetic datasets, our approach achieves strong performance on challenging, real-world benchmarks, highlighting the inherent synergy between the reconstruction and point tracking tasks. Unlike prior methods, DePT3R leverages global temporal attention to enhance both accuracy and computational efficiency, enabling dense tracking of all visible points in a single forward pass. Additionally, we effectively leverage intrinsic information to significantly improve reconstruction and motion estimation.

Limitations. First, while DePT3R uses global attention over multiple frames, it does not explicitly model the sequential/causal structure of video (e.g., temporal recurrence or explicit temporal priors), which may limit robustness under heavy occlusion, motion blur, or very long videos where temporal continuity constraints are beneficial. Second, the current formulation produces motion estimates relative to the first-frame coordinate system and is limited in temporal scope (i.e., it does not yet provide a general mechanism for temporally continuous trajectories parameterized by arbitrary query times). Third, the approach relies on point-tracking annotations (obtained from synthetic sources in training), which are expensive and difficult to scale in real-world settings; reducing supervision (e.g., via self-/weak-supervised objectives or distillation) remains an important direction. Finally, the current evaluation protocol emphasizes APD/EPE (with global median scaling) and restricts quantitative tracking evaluation to trajectories originating from the first frame; more exhaustive evaluation of dense-any-point tracking and absolute-scale behavior would strengthen conclusions about real deployment.

Future Works. Promising extensions include incorporating time-conditioned query mechanisms (to predict tracks at arbitrary timestamps), explicit temporal modeling (e.g., causal attention or recurrence), and training strategies that reduce dependence on dense trajectory labels while preserving the feed-forward efficiency of the current framework.


