Feat2GS: Probing Visual Foundation Models with Gaussian Splatting
Given that visual foundation models (VFMs) are trained on extensive datasets yet often limited to 2D images, a natural question arises: how well do they understand the 3D world? Because their architectures and training protocols (e.g., objectives, proxy tasks) differ, a unified framework to fairly and comprehensively probe their 3D awareness is urgently needed. Existing works on 3D probing rely on single-view 2.5D estimation (e.g., depth and normals) or two-view sparse 2D correspondence (e.g., matching and tracking). Unfortunately, these tasks ignore texture awareness and require 3D data as ground truth, which limits the scale and diversity of their evaluation sets. To address these issues, we introduce Feat2GS, which reads out 3D Gaussian attributes from VFM features extracted from unposed images. This allows us to probe 3D awareness of both geometry and texture via novel view synthesis, without requiring 3D data. Additionally, the disentanglement of 3DGS parameters into geometry ($\boldsymbol{x}$, $\alpha$, $\Sigma$) and texture ($\boldsymbol{c}$) enables separate analysis of texture and geometry awareness. Under Feat2GS, we conduct extensive experiments to probe the 3D awareness of several VFMs and investigate the ingredients that lead to a 3D-aware VFM. Building on these findings, we develop several variants that achieve state-of-the-art performance across diverse datasets. This makes Feat2GS useful both for probing VFMs and as a simple-yet-effective baseline for novel view synthesis. Code and data are available at https://fanegg.github.io/Feat2GS/.
💡 Research Summary
Feat2GS introduces a unified framework for probing the 3D awareness of visual foundation models (VFMs) without requiring any 3D ground-truth data. The authors observe that while VFMs are trained on massive 2D image collections, it remains unclear how much geometric and texture information they implicitly capture. Existing probing methods rely on 2.5D tasks (depth, normal estimation) or sparse two-view correspondences, which ignore texture awareness and need labeled 3D data, limiting the scale and diversity of their evaluation.
To overcome these limitations, Feat2GS extracts frozen feature maps from a variety of pretrained VFMs (e.g., DINOv2, MAE, CLIP, SAM, RADIO) and feeds each pixel-level feature into a lightweight two-layer MLP readout. This readout regresses the parameters of a 3D Gaussian Splatting (3DGS) primitive: 3D position x, opacity α, covariance Σ, and spherical-harmonic (SH) texture coefficients c. The readout is deliberately small (256 units per layer, ReLU) to act as a pure information conduit rather than a memorizing network.
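The readout described above can be sketched in plain Python. This is a minimal illustration, not the paper's implementation: the feature dimension, the output-head activations (sigmoid for opacity, exp for scales, normalized quaternion for rotation), and the attribute ordering are assumptions chosen to match common 3DGS parameterizations.

```python
import math
import random

random.seed(0)

FEAT_DIM = 64   # per-pixel VFM feature size (illustrative; real VFMs use e.g. 384-1024)
HIDDEN = 256    # hidden width, matching the 256-unit layers described above
# Gaussian attributes: 3 position components + 1 opacity + 3 log-scales
# + 4 quaternion components + 3 SH DC color coefficients = 14
OUT_DIM = 14

def linear(x, w, b):
    """Dense layer: returns w @ x + b, with w of shape (out, in)."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi
            for row, bi in zip(w, b)]

def init(out_dim, in_dim):
    """Uniform initialization scaled by 1/sqrt(fan_in)."""
    s = 1.0 / math.sqrt(in_dim)
    return ([[random.uniform(-s, s) for _ in range(in_dim)] for _ in range(out_dim)],
            [0.0] * out_dim)

W1, b1 = init(HIDDEN, FEAT_DIM)
W2, b2 = init(OUT_DIM, HIDDEN)

def readout(feat):
    """Map one pixel-aligned feature vector to 3DGS attributes."""
    h = [max(0.0, v) for v in linear(feat, W1, b1)]    # ReLU hidden layer
    raw = linear(h, W2, b2)
    xyz = raw[0:3]                                      # 3D position
    opacity = 1.0 / (1.0 + math.exp(-raw[3]))           # sigmoid -> (0, 1)
    scale = [math.exp(v) for v in raw[4:7]]             # strictly positive scales
    n = math.sqrt(sum(v * v for v in raw[7:11])) or 1.0
    quat = [v / n for v in raw[7:11]]                   # unit rotation quaternion
    sh_dc = raw[11:14]                                  # DC spherical-harmonic color
    return {"xyz": xyz, "opacity": opacity, "scale": scale,
            "rot": quat, "sh_dc": sh_dc}

g = readout([0.1] * FEAT_DIM)
```

In the actual method this readout would be applied to every pixel of every input view, and its weights trained jointly with the camera poses via the photometric loss.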
The 3DGS parameters are split into geometry (x, α, Σ) and texture (c), enabling three probing modes:
- Geometry mode – geometry parameters are read out from features, texture is freely optimized.
- Texture mode – texture coefficients are read out, geometry is freely optimized.
- All mode – both geometry and texture are read out directly.
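The three modes above amount to partitioning the 3DGS parameter groups into those predicted by the readout and those left as free optimization variables. A minimal sketch of that partitioning (the group names `xyz`, `opacity`, `cov`, `color` are illustrative labels, not identifiers from the paper's code):

```python
# Which 3DGS parameter groups come from the feature readout vs. free optimization,
# per probing mode. "cov" stands in for the covariance parameters (scale + rotation).
MODES = {
    "geometry": {"readout": {"xyz", "opacity", "cov"}, "free": {"color"}},
    "texture":  {"readout": {"color"}, "free": {"xyz", "opacity", "cov"}},
    "all":      {"readout": {"xyz", "opacity", "cov", "color"}, "free": set()},
}

def param_source(mode, param):
    """Return whether a parameter group is read out from features or freely optimized."""
    cfg = MODES[mode]
    return "readout" if param in cfg["readout"] else "free"
```

For example, `param_source("geometry", "color")` is `"free"`: in geometry mode only the geometric attributes are constrained by the VFM features, so any failure in novel view synthesis can be attributed to the features' geometry awareness.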
Camera poses for the unposed, sparse, casual image sets are initialized with the unconstrained stereo reconstructor DUSt3R and then jointly refined together with the readout and Gaussian parameters using a photometric loss between rendered views and the input images. A warm-start step regresses the readout toward a point cloud generated by DUSt3R to avoid poor local minima.
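The two objectives in this step can be sketched as simple loss functions. These are hedged illustrations: the warm-start is shown as a mean squared distance to pixel-aligned DUSt3R points, and the photometric term as a plain L1 error, whereas the actual method may weight or combine these terms differently (e.g., with SSIM).

```python
def warmstart_loss(pred_xyz, dust3r_xyz):
    """Mean squared distance between readout Gaussian positions and
    pixel-aligned DUSt3R points (the warm-start regression target)."""
    assert len(pred_xyz) == len(dust3r_xyz)
    total = 0.0
    for p, q in zip(pred_xyz, dust3r_xyz):
        total += sum((a - b) ** 2 for a, b in zip(p, q))
    return total / len(pred_xyz)

def photometric_loss(rendered, target):
    """Mean absolute pixel error between a rendered view and the input image
    (both given as flat lists of intensities)."""
    assert len(rendered) == len(target)
    return sum(abs(a - b) for a, b in zip(rendered, target)) / len(rendered)
```

During optimization, gradients of the photometric loss flow back through the differentiable 3DGS renderer into the readout weights and the camera poses simultaneously.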
Evaluation is performed on seven diverse multi-view datasets (LLFF, DTU, DL3DV, Mip-NeRF 360, MVImgNet, Tanks and Temples, etc.) covering indoor, outdoor, object-centric, and unbounded scenes, with view counts ranging from 2 to 7. The authors use standard 2D image quality metrics (PSNR, SSIM, LPIPS) for novel view synthesis (NVS) and demonstrate a strong correlation with traditional 3D reconstruction metrics (accuracy, completeness, distance) on DTU, justifying the use of NVS as a proxy for 3D awareness.
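Of the three metrics, PSNR is simple enough to sketch directly; a minimal pure-Python version for flat pixel lists in [0, max_val] (SSIM and LPIPS require windowed statistics and a pretrained network, respectively, and are omitted here):

```python
import math

def psnr(img_a, img_b, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two images given as
    flat lists of pixel intensities in [0, max_val]."""
    mse = sum((a - b) ** 2 for a, b in zip(img_a, img_b)) / len(img_a)
    if mse == 0.0:
        return float("inf")   # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)
```

For instance, a uniform pixel error of 0.1 on a [0, 1] scale yields a PSNR of about 20 dB; higher is better.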
Across ten VFMs, self‑supervised ViT‑based models (DINOv2, MAE) achieve the best balance of geometry and texture reconstruction, while texture‑focused models (SAM, RADIO) excel at color fidelity but lag in geometric accuracy. The experiments also reveal that concatenating features from multiple VFMs yields a surprisingly strong baseline, leading the authors to propose three variants (All‑Concat, Geometry‑Focused, Texture‑Focused) that surpass the current state‑of‑the‑art InstantSplat in all reported metrics.
Key contributions are: (1) Feat2GS as a VFM probing tool that disentangles geometry and texture awareness without 3D labels; (2) an extensive analysis of mainstream VFMs across diverse datasets, providing insights into the training objectives and data that foster 3D awareness; (3) a simple yet effective NVS baseline that sets new performance records. The work opens avenues for designing VFMs with intrinsic 3D understanding, evaluating them at scale, and leveraging the probing framework for downstream tasks such as 3D generation, reconstruction, and robotics.