Joint Learning of Depth, Pose, and Local Radiance Field for Large Scale Monocular 3D Reconstruction

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Photorealistic 3-D reconstruction from monocular video collapses in large-scale scenes when depth, pose, and radiance are solved in isolation: scale-ambiguous depth yields ghost geometry, long-horizon pose drift corrupts alignment, and a single global NeRF cannot model hundreds of metres of content. We introduce a joint learning framework that couples all three factors and demonstrably overcomes each failure case. Our system begins with a Vision-Transformer (ViT) depth network trained with metric-scale supervision, giving globally consistent depths despite wide field-of-view variations. A multi-scale feature bundle-adjustment (BA) layer refines camera poses directly in feature space, leveraging learned pyramidal descriptors instead of brittle keypoints to suppress drift on unconstrained trajectories. For scene representation, we deploy an incremental local-radiance-field hierarchy: new hash-grid NeRFs are allocated and frozen on the fly when view overlap falls below a threshold, enabling city-block-scale coverage on a single GPU. Evaluated on the Tanks and Temples benchmark, our method reduces Absolute Trajectory Error to 0.001–0.021 m across eight indoor-outdoor sequences (up to 18× lower than BARF and 2× lower than NoPe-NeRF) while maintaining sub-pixel Relative Pose Error. These results demonstrate that metric-scale, drift-free 3-D reconstruction and high-fidelity novel-view synthesis are achievable from a single uncalibrated RGB camera.


💡 Research Summary

The paper presents a unified framework that simultaneously learns metric‑scale depth, drift‑free camera poses, and an incremental hierarchy of local neural radiance fields (NeRFs) from a single uncalibrated monocular video. The authors identify three failure modes that cripple existing large‑scale monocular reconstruction pipelines: (1) depth estimated by self‑supervised methods is ambiguous up to an unknown scale, leading to “ghost” geometry; (2) pose estimation drifts over long trajectories when only photometric losses are used; (3) a single global NeRF cannot represent scenes that span hundreds of metres because of memory and representation limits.

To address these issues, the system consists of three tightly coupled modules.

  1. Metric‑Scale Depth Estimation – A Vision‑Transformer (ViT) backbone processes 16×16 patches, followed by a lightweight CNN decoder that predicts dense depth maps. Depth supervision combines a photometric reprojection term, an edge‑aware smoothness term, and a novel metric regularizer that anchors the median depth of detected “standing‑person” pixels to a known human height (≈1.7 m). This forces the network to output absolute, metre‑scale depth even under wide‑angle lenses.

  2. Feature‑Based Bundle Adjustment (FBA) – Two consecutive frames are passed through a shared U‑Net to obtain multi‑scale feature maps and learned confidence masks. For each 3‑D point visible in both frames, a feature residual is computed and minimized using a robust Huber loss within a Levenberg‑Marquardt optimizer. The optimization updates the full SE(3) pose as well as intrinsic parameters, propagating gradients through the Jacobian. A coarse‑to‑fine schedule refines large motions at low resolutions and fine details at higher resolutions. Temporal consistency is reinforced by forward and backward optical‑flow losses that compare the flow induced by the current depth‑pose estimate with a state‑of‑the‑art RAFT flow field.

  3. Incremental Local Radiance Fields – Each local field is a tiny hash‑grid MLP (as in Instant‑NGP) that maps a 3‑D location and view direction to density and color. When the camera exits the contracted unit cube of the current field, that field is frozen and a new one is spawned. Frozen fields supply an L2 colour prior to guarantee seamless hand‑over, while inactive fields stop receiving gradients, keeping GPU memory below 7 GB. This hierarchical spawning enables city‑block‑scale coverage on a single GPU without the prohibitive memory cost of a monolithic NeRF.
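The metric regularizer in module 1 anchors predicted depth to a known human height via the pinhole model. The paper gives no code, so the following is a minimal illustrative sketch; the function name, the bounding-box representation, and the use of a median over the person's pixels are our own assumptions.

```python
import numpy as np

def metric_height_loss(depth_map, person_box, focal_px, known_height_m=1.7):
    """Hypothetical metric regularizer: penalize the gap between the
    metric height implied by predicted depth and a known human height.
    person_box = (top, bottom, left, right) in pixels."""
    top, bottom, left, right = person_box
    # Median predicted depth over the person's pixels (robust to outliers).
    d = np.median(depth_map[top:bottom, left:right])
    # Pinhole model: metric height = depth * pixel height / focal length.
    implied_height = d * (bottom - top) / focal_px
    return (implied_height - known_height_m) ** 2
```

Because the loss is zero only at one absolute depth for a given pixel height, it pins down the global scale that photometric losses alone leave free.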
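The robust optimization at the core of module 2 — Huber-weighted residuals minimized with Levenberg-Marquardt — can be sketched in a few lines. This is a generic damped-normal-equations step on an abstract residual vector, not the paper's full feature-space BA; the function names and the IRLS weighting scheme are our own assumptions.

```python
import numpy as np

def huber_weights(residuals, delta=1.0):
    """IRLS weights for the Huber loss: quadratic inside |r| <= delta,
    linear outside, so large (outlier) residuals are down-weighted."""
    r = np.abs(residuals)
    w = np.ones_like(r)
    mask = r > delta
    w[mask] = delta / r[mask]
    return w

def lm_step(J, r, lam=1e-3, delta=1.0):
    """One Levenberg-Marquardt step on robust residuals r with Jacobian J.
    Solves the damped normal equations (J^T W J + lam*I) dx = -J^T W r."""
    W = np.diag(huber_weights(r, delta))
    H = J.T @ W @ J + lam * np.eye(J.shape[1])
    g = J.T @ W @ r
    return np.linalg.solve(H, -g)
```

In the paper's setting, `r` would stack feature residuals across pyramid levels and `J` their derivatives with respect to the SE(3) pose and intrinsics; the coarse-to-fine schedule amounts to running such steps first on low-resolution feature maps, then on finer ones.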
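The spawn-and-freeze policy of module 3 reduces to a small bookkeeping loop: track the active field's center, and when the camera exits its (contracted) unit cube, freeze the field and allocate a new one. A minimal sketch, with `make_field` standing in for construction of a hash-grid NeRF and all names our own assumptions:

```python
import numpy as np

class LocalFieldManager:
    """Sketch of incremental local-field allocation: when the camera
    leaves the current field's unit cube, freeze the field and spawn
    a new one centred on the camera position."""

    def __init__(self, make_field):
        self.make_field = make_field   # builds one hash-grid NeRF
        self.frozen = []               # frozen fields: no more gradients
        self.center = np.zeros(3)
        self.active = make_field()

    def inside_unit_cube(self, cam_pos):
        return bool(np.all(np.abs(cam_pos - self.center) <= 1.0))

    def step(self, cam_pos):
        if not self.inside_unit_cube(cam_pos):
            self.frozen.append(self.active)          # stop its gradients
            self.center = np.asarray(cam_pos, dtype=float)
            self.active = self.make_field()          # new field at camera
        return self.active
```

Only the active field is optimized at any time, which is what keeps memory bounded regardless of trajectory length; frozen fields are kept for rendering and for the L2 color prior at hand-over.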

All three modules are trained end‑to‑end in a progressive windowed schedule: an initial bootstrap on the first five frames, followed by a sliding window of 32 frames where depth warm‑up, FBA pose refinement, and radiance fine‑tuning are performed sequentially. The overall loss is a weighted sum of photometric, depth, FBA, and flow terms, ensuring that each component continuously regularizes the others.
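The progressive windowed schedule can be sketched as a generator of training phases: a bootstrap on the first frames, then a sliding window within which the three stages run in order. The phase tuple layout and parameter names are our own assumptions; the paper specifies only the bootstrap size (5), the window size (32), and the stage order.

```python
def progressive_schedule(num_frames, bootstrap=5, window=32):
    """Yield (phase, start_frame, end_frame, stages) tuples: one bootstrap
    phase, then one sliding-window phase per newly added frame. Within
    each phase the three stages run sequentially."""
    stages = ("depth_warmup", "fba_pose_refinement", "radiance_finetune")
    yield ("bootstrap", 0, min(bootstrap, num_frames), stages)
    for end in range(bootstrap + 1, num_frames + 1):
        start = max(0, end - window)
        yield ("window", start, end, stages)
```

Each phase would then minimize the weighted sum of the photometric, depth, FBA, and flow terms over its frame range, so every module keeps regularizing the others as the window advances.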

Experimental Evaluation – On the eight Tanks & Temples sequences, the method achieves Absolute Trajectory Error (ATE) between 0.001 m and 0.021 m, which is up to 18× lower than BARF and 2× lower than NoPe‑NeRF, while maintaining sub‑pixel Relative Pose Error. For view synthesis, on the Static Hikes indoor set the approach reaches PSNR = 20.19 dB, SSIM = 0.704, and LPIPS = 0.62, outperforming LocalRF, DS‑NeRF, and NoPe‑NeRF. Ablation studies confirm that the ViT depth backbone, feature‑based BA, and incremental field spawning are each critical: replacing ViT with ResNet triples ATE, swapping to pixel‑level BA degrades rotation error by 40 %, and disabling field spawning reduces PSNR by 1.5 dB.

Limitations and Future Work – The hash‑grid resolution limits reconstruction of very thin structures (e.g., power lines). Dynamic objects can leave ghost artifacts, and the current implementation requires a desktop‑class GPU, precluding real‑time mobile deployment. The authors propose integrating large‑scale depth foundation models (e.g., Depth Anything V2), adding transient‑slot NeRFs for dynamic scenes, converting frozen fields to explicit meshes for downstream tasks, and fusing lightweight IMU/GPS priors to further curb drift.

Conclusion – By jointly optimizing metric depth, feature‑space bundle adjustment, and a scalable hierarchy of local NeRFs, the paper delivers the first monocular‑RGB pipeline that reconstructs hundred‑metre trajectories with centimetre‑level accuracy and photorealistic novel‑view synthesis. This unified approach eliminates the need for external SfM bootstraps or calibrated intrinsics, paving the way for practical AR/VR mapping, autonomous robot perception, and large‑scale digital‑twin creation in unstructured environments.

