ASSR-NeRF: Arbitrary-Scale Super-Resolution on Voxel Grid for High-Quality Radiance Fields Reconstruction
NeRF-based methods reconstruct 3D scenes by building a radiance field with implicit or explicit representations. While NeRF-based methods can perform novel view synthesis (NVS) at arbitrary scale, high-resolution novel view synthesis (HRNVS) with low-resolution (LR) optimization often results in oversmoothed renderings. On the other hand, single-image super-resolution (SR) aims to enhance LR images to HR counterparts but lacks multi-view consistency. To address these challenges, we propose Arbitrary-Scale Super-Resolution NeRF (ASSR-NeRF), a novel framework for super-resolution novel view synthesis (SRNVS). We propose an attention-based VoxelGridSR model to directly perform 3D super-resolution on the optimized volume. Our model is trained on diverse scenes to ensure generalizability. For unseen scenes trained with LR views, we can then directly apply our VoxelGridSR to further refine the volume and achieve multi-view consistent SR. We demonstrate quantitatively and qualitatively that the proposed method achieves significant performance gains in SRNVS.
💡 Research Summary
The paper tackles a fundamental limitation of Neural Radiance Fields (NeRF) when attempting high‑resolution novel view synthesis (HRNVS) from low‑resolution (LR) training images: the reconstructed radiance field lacks fine details, leading to blurry or noisy high‑resolution renders. While single‑image super‑resolution (SISR) methods can enrich textures, applying them per view breaks multi‑view consistency. Existing NeRF‑SR approaches either require a high‑resolution reference view for each scene or are constrained to a fixed up‑sampling factor, limiting practicality.
ASSR‑NeRF (Arbitrary‑Scale Super‑Resolution NeRF) proposes a unified pipeline that directly performs 3D super‑resolution on the voxel‑grid representation of a radiance field, guaranteeing multi‑view consistency and allowing arbitrary scaling factors without any HR reference. The method consists of two main components:
- Voxel‑Based Distilled Feature Field – A teacher‑student distillation scheme transfers low‑level texture priors from a pre‑trained 2‑D image super‑resolution network (a Residual Dense Network, RDN) into a 3‑D voxel grid. Two explicit grids are maintained: a density grid V_d and a feature grid V_f (C‑dimensional). For each query point, trilinear interpolation yields density σ_q and raw feature f′_q; a lightweight FeatureNet converts these into view‑dependent distilled features f_q. The loss combines the standard photometric term (L_photo) with a feature distance term (L_feat) that aligns rendered features with those extracted by the teacher network, weighted by λ=0.5. This distillation aligns the latent spaces of all scenes, enabling a single VoxelGridSR model to operate across diverse scenes.
- VoxelGridSR Module – Inspired by recent arbitrary‑scale SR works (e.g., LIIF, CiaoSR), VoxelGridSR treats super‑resolution as a local implicit function. For a query coordinate x_q, it gathers the eight neighboring voxel positions {x_i}, each providing distilled feature f_i, density σ_i, and offset s_i = x_i – x_q. Query, key, and value vectors are produced via small MLPs: Q = MLP_q(f_q), K_i = MLP_k(
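To make the feature-field query concrete, here is a minimal NumPy sketch of the two ingredients the first bullet describes: trilinearly interpolating the explicit grids V_d and V_f at a continuous query point, and combining the photometric and feature-distillation losses with λ = 0.5. The grid sizes, feature dimension, and function names are illustrative assumptions, not the paper's implementation; the FeatureNet refinement step is omitted.

```python
import numpy as np

def trilinear_interp(grid, q):
    """Trilinearly interpolate a (D, H, W, C) grid at continuous coords q."""
    i0 = np.floor(q).astype(int)
    i0 = np.clip(i0, 0, np.array(grid.shape[:3]) - 2)  # keep the 2x2x2 cell in bounds
    t = q - i0  # fractional offsets in [0, 1] within the cell
    out = np.zeros(grid.shape[3])
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                w = ((t[0] if dx else 1 - t[0]) *
                     (t[1] if dy else 1 - t[1]) *
                     (t[2] if dz else 1 - t[2]))
                out += w * grid[i0[0] + dx, i0[1] + dy, i0[2] + dz]
    return out

# Toy grids standing in for the density grid V_d and feature grid V_f (C = 4 here).
rng = np.random.default_rng(0)
V_d = rng.random((8, 8, 8, 1))
V_f = rng.random((8, 8, 8, 4))

q = np.array([2.3, 4.7, 1.5])        # a query point in grid coordinates
sigma_q = trilinear_interp(V_d, q)   # density sigma_q
f_raw_q = trilinear_interp(V_f, q)   # raw feature f'_q (FeatureNet would refine this)

# Total loss as described: photometric term plus feature-distance term, lambda = 0.5.
def total_loss(l_photo, l_feat, lam=0.5):
    return l_photo + lam * l_feat
```

At integer coordinates the interpolation reduces to a direct grid lookup, which is a quick sanity check for the weighting logic.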
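The attention step of the VoxelGridSR bullet can be sketched as scaled dot-product attention over the eight neighboring voxels. The summary's definition of K_i is cut off mid-formula, so concatenating [f_i, σ_i, s_i] as the key/value input is an assumption inferred from the inputs the text lists; the single linear layers standing in for MLP_q, MLP_k, and MLP_v, and the dimensions C and Dk, are likewise illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
C, Dk = 4, 8  # feature and attention dims (illustrative, not the paper's values)

# Stand-ins for the small MLPs (single random linear layers for brevity).
W_q = rng.standard_normal((C, Dk))
W_k = rng.standard_normal((C + 1 + 3, Dk))  # input: [f_i, sigma_i, s_i] (assumed)
W_v = rng.standard_normal((C + 1 + 3, Dk))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def voxelgrid_sr_attention(f_q, neighbors):
    """neighbors: list of (f_i, sigma_i, s_i) for the 8 voxels around x_q."""
    Q = f_q @ W_q
    K = np.stack([np.concatenate([f, [s], off]) @ W_k for f, s, off in neighbors])
    V = np.stack([np.concatenate([f, [s], off]) @ W_v for f, s, off in neighbors])
    attn = softmax(K @ Q / np.sqrt(Dk))  # weights over the 8 neighbors, sum to 1
    return attn @ V                      # refined feature at x_q

f_q = rng.standard_normal(C)
neighbors = [(rng.standard_normal(C), rng.random(), rng.standard_normal(3))
             for _ in range(8)]
refined = voxelgrid_sr_attention(f_q, neighbors)
```

Treating SR as a local implicit function this way means the output resolution is decoupled from the grid resolution: any continuous x_q can be queried, which is what enables arbitrary scale factors.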