TransformerFusion: Monocular RGB Scene Reconstruction using Transformers

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

We introduce TransformerFusion, a transformer-based 3D scene reconstruction approach. From an input monocular RGB video, the video frames are processed by a transformer network that fuses the observations into a volumetric feature grid representing the scene; this feature grid is then decoded into an implicit 3D scene representation. Key to our approach is the transformer architecture that enables the network to learn to attend to the most relevant image frames for each 3D location in the scene, supervised only by the scene reconstruction task. Features are fused in a coarse-to-fine fashion, storing fine-level features only where needed, requiring lower memory storage and enabling fusion at interactive rates. The feature grid is then decoded to a higher-resolution scene reconstruction, using an MLP-based surface occupancy prediction from interpolated coarse-to-fine 3D features. Our approach results in an accurate surface reconstruction, outperforming state-of-the-art multi-view stereo depth estimation methods, fully-convolutional 3D reconstruction approaches, and approaches using LSTM- or GRU-based recurrent networks for video sequence fusion.


💡 Research Summary

TransformerFusion presents a novel end‑to‑end framework for dense 3D scene reconstruction from a single monocular RGB video stream. The method departs from traditional multi‑view stereo pipelines that rely on cost‑volume construction or simple feature averaging, which treat every frame equally and suffer when some frames are blurred, partially occluded, or otherwise low‑quality. Instead, the authors harness the self‑attention mechanism of transformers to learn, supervised only by the reconstruction objective, which views contribute the most informative evidence for each spatial location in the scene.

The pipeline consists of four main stages. First, each input frame is processed by a 2‑D convolutional encoder Θ that extracts two levels of image features: a coarse map Φ_c and a fine map Φ_f. Second, these 2‑D features are unprojected into a volumetric grid defined in world coordinates. The grid is sampled at a coarse resolution of 30 cm and a fine resolution of 10 cm. For every grid point p, the corresponding pixel location in each of the N frames is obtained via full perspective projection using known intrinsics K_i and extrinsics (R_i, t_i). The pixel’s RGB feature φ_i, its depth d_i, a binary validity flag v_i, and the viewing direction r_i are concatenated and linearly embedded into a D‑dimensional token θ_i (D = 256).
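The unprojection and token-assembly step described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the learned linear embedding is replaced by a fixed random matrix, and the feature lookup uses nearest-neighbor sampling for brevity.

```python
import numpy as np

def make_view_token(p_world, K, R, t, feat_map, D=256, rng=None):
    """Project a 3-D grid point into one frame and assemble its input token.

    p_world: (3,) point in world coordinates
    K: (3,3) camera intrinsics; R: (3,3), t: (3,) world-to-camera extrinsics
    feat_map: (H, W, C) 2-D feature map from the image encoder
    """
    p_cam = R @ p_world + t                      # world -> camera coordinates
    valid = p_cam[2] > 0                         # point must lie in front of camera
    if valid:
        uvw = K @ p_cam
        u, v = uvw[0] / uvw[2], uvw[1] / uvw[2]  # perspective division
        H, W = feat_map.shape[:2]
        valid = (0 <= u < W) and (0 <= v < H)    # projection inside the image?
    if valid:
        phi = feat_map[int(v), int(u)]           # nearest-neighbor feature lookup
        d = p_cam[2]                             # depth of the point in this view
        r = p_cam / np.linalg.norm(p_cam)        # viewing direction
    else:
        phi, d, r = np.zeros(feat_map.shape[2]), 0.0, np.zeros(3)
    raw = np.concatenate([phi, [d], [float(valid)], r])
    # Linear embedding into a D-dimensional token; the embedding weights are
    # learned in the real network, a fixed random matrix stands in here.
    rng = rng or np.random.default_rng(0)
    W_emb = rng.standard_normal((D, raw.size)) / np.sqrt(raw.size)
    return W_emb @ raw, valid
```

Running this once per frame for a given grid point yields the N tokens θ_1, …, θ_N consumed by the transformers in the next stage.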

Third, two independent transformer networks—one for coarse tokens (T_c) and one for fine tokens (T_f)—receive the N token sequences for each grid point. Each transformer comprises eight blocks of multi‑head self‑attention (four heads) followed by feed‑forward layers, layer‑norm, and residual connections. The output of the transformer is a fused 3‑D feature ψ and the attention weights w from the first attention layer. The attention weights serve a dual purpose: they guide a view‑selection mechanism that retains only the K = 16 most relevant frames per point, thereby keeping computational cost bounded even for long video sequences, and they provide interpretability by indicating which frames the network relied on.
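The fusion and view-selection mechanism can be sketched with single-head attention (the paper uses four heads over eight blocks). The projection matrices below are random stand-ins for learned weights, and the exact pooling of per-view outputs into one fused feature is a simplifying assumption.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse_views(tokens, K_keep=16, rng=None):
    """Fuse N per-view tokens for one grid point via self-attention, then
    keep only the K_keep most-attended views for future fusion steps.

    tokens: (N, D) embedded view tokens
    Returns the fused feature (D,) and the indices of retained views.
    """
    N, D = tokens.shape
    rng = rng or np.random.default_rng(0)
    # Stand-in learned query/key/value projections (random here).
    Wq, Wk, Wv = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3))
    Q, Kmat, V = tokens @ Wq.T, tokens @ Wk.T, tokens @ Wv.T
    attn = softmax(Q @ Kmat.T / np.sqrt(D), axis=-1)  # (N, N) attention weights
    fused = (attn @ V).mean(axis=0)                   # pooled fused feature psi
    # View importance = average attention each view receives; the paper ranks
    # views using the first attention layer's weights in the same spirit.
    importance = attn.mean(axis=0)                    # (N,)
    keep = np.argsort(importance)[::-1][:K_keep]      # top-K most relevant views
    return fused, keep
```

Because only the top-K views per point survive each fusion step, the cost of incorporating a new frame stays bounded regardless of video length.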

After temporal fusion, spatial refinement is performed with 3‑D convolutional networks. The coarse fused features ψ_c are refined by a shallow 3‑D CNN C_c (three residual blocks) that preserves spatial resolution, then up‑sampled via nearest‑neighbor interpolation to the fine grid. The up‑sampled coarse features are concatenated with the fine fused features ψ_f and processed by another 3‑D CNN C_f, yielding refined fine features ψ̃_f. Simultaneously, two auxiliary 3‑D CNNs (M_c, M_f) predict near‑surface occupancy masks m_c and m_f. These masks identify voxels that lie within a small distance of the ground‑truth surface; voxels with low mask values are treated as free space and excluded from subsequent fine‑level processing, dramatically reducing the amount of computation required for high‑resolution occupancy prediction.
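The coarse-to-fine hand-off and mask-based sparsification can be sketched as below. The 3x upsampling factor follows from the 30 cm to 10 cm grid ratio stated earlier; the mask threshold `tau` is an illustrative assumption, and the 3-D CNNs themselves are omitted.

```python
import numpy as np

def refine_coarse_to_fine(psi_c, psi_f, m_f, tau=0.5):
    """Upsample coarse features to the fine grid, concatenate with the fine
    features, and drop voxels the occupancy mask marks as free space.

    psi_c: (Dc, X, Y, Z)    coarse fused features (30 cm grid)
    psi_f: (Df, 3X, 3Y, 3Z) fine fused features (10 cm grid)
    m_f:   (3X, 3Y, 3Z)     predicted near-surface mask probabilities
    """
    # Nearest-neighbor upsampling by the 3x resolution ratio (30 cm -> 10 cm).
    up = psi_c.repeat(3, axis=1).repeat(3, axis=2).repeat(3, axis=3)
    feats = np.concatenate([up, psi_f], axis=0)   # (Dc+Df, 3X, 3Y, 3Z)
    near_surface = m_f > tau                      # keep only near-surface voxels
    sparse_idx = np.argwhere(near_surface)        # voxels kept for fine processing
    return feats, sparse_idx
```

In a typical indoor scene most fine voxels are free space, so this sparsification is what keeps high-resolution processing tractable.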

The final occupancy field is obtained by trilinearly interpolating both the refined coarse and fine feature volumes at any query point p, concatenating the two interpolated vectors, and feeding them into a lightweight multi‑layer perceptron S (three feed‑forward blocks with ReLU, residual connections, and layer‑norm). The MLP outputs a scalar occupancy value o ∈ [0, 1], from which the final surface mesh can be extracted.
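The decoding step can be sketched as follows. The trilinear interpolation is standard; the MLP S is reduced here to a single linear layer with sigmoid and random stand-in weights, rather than the paper's three residual feed-forward blocks.

```python
import numpy as np

def trilinear(vol, p):
    """Trilinearly interpolate a feature volume vol (D, X, Y, Z) at a
    continuous point p given in voxel coordinates."""
    p0 = np.floor(p).astype(int)
    f = p - p0
    out = np.zeros(vol.shape[0])
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                w = ((1 - f[0] if dx == 0 else f[0]) *
                     (1 - f[1] if dy == 0 else f[1]) *
                     (1 - f[2] if dz == 0 else f[2]))
                out += w * vol[:, p0[0] + dx, p0[1] + dy, p0[2] + dz]
    return out

def query_occupancy(psi_c, psi_f, p_c, p_f, rng=None):
    """Concatenate interpolated coarse and fine features at a query point and
    run a stand-in for the MLP S (one linear layer + sigmoid here)."""
    x = np.concatenate([trilinear(psi_c, p_c), trilinear(psi_f, p_f)])
    rng = rng or np.random.default_rng(0)
    w = rng.standard_normal(x.size) / np.sqrt(x.size)  # stand-in learned weights
    return 1.0 / (1.0 + np.exp(-(w @ x)))              # occupancy o in [0, 1]
```

Because the decoder queries continuous points rather than voxel centers, the reconstruction can be extracted at a higher resolution than the fine feature grid itself.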

