A Portable Multiscopic Camera for Novel View and Time Synthesis in Dynamic Scenes
We present a portable multiscopic camera system with a dedicated model for novel view and time synthesis in dynamic scenes. Our goal is to render high-quality images for a dynamic scene from any viewpoint at any time using our portable multiscopic camera. To achieve such novel view and time synthesis, we develop a physical multiscopic camera equipped with five cameras to train a neural radiance field (NeRF) in both time and spatial domains for dynamic scenes. Our model maps a 6D coordinate (3D spatial position, 1D temporal coordinate, and 2D viewing direction) to view-dependent and time-varying emitted radiance and volume density. Volume rendering is applied to render a photo-realistic image at a specified camera pose and time. To improve the robustness of our physical camera, we propose a camera parameter optimization module and a temporal frame interpolation module to promote information propagation across time. We conduct experiments on both real-world and synthetic datasets to evaluate our system, and the results show that our approach outperforms alternative solutions qualitatively and quantitatively. Our code and dataset are available at https://yuenfuilau.github.io/.
💡 Research Summary
The paper introduces a portable multiscopic camera system and a dedicated neural radiance field (NeRF) model for simultaneous novel view synthesis (NVS) and novel time synthesis (NTS) in dynamic scenes. The hardware consists of five synchronized RGB cameras arranged in a compact 30 cm × 30 cm layout (top‑left‑center‑right‑bottom), enabling capture of multiview video at 30 fps. This lightweight setup is designed for easy deployment by general users, contrasting with bulky fixed camera arrays used in prior work.
On the algorithmic side, the authors extend the classic NeRF representation to a six‑dimensional input: a 3‑D spatial coordinate (x, y, z), a 1‑D temporal coordinate t, and a 2‑D viewing direction d. To handle the temporal dimension more effectively, a separate time‑encoding network W(t) maps t into a latent vector t₀, which is concatenated with the spatial coordinates before being fed to a two‑stage multilayer perceptron (MLP). The first MLP (8 layers) predicts density σ and an intermediate feature vector ℓ_c from (x, y, z, t₀). The second MLP combines ℓ_c with the viewing direction d to output RGB radiance c. Volume rendering integrates σ and c along sampled points on each camera ray using standard quadrature, producing pixel colors.
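The "standard quadrature" used to composite density and radiance along a ray can be made concrete with a short sketch. This is a minimal NumPy implementation of the usual NeRF alpha-compositing rule, not the authors' code; array shapes and variable names are illustrative assumptions.

```python
import numpy as np

def volume_render(sigmas, colors, t_vals):
    """Composite per-sample density and colour along one camera ray
    using the standard NeRF quadrature.
    sigmas: (N,) volume densities at the sampled depths
    colors: (N, 3) RGB radiance at the sampled depths
    t_vals: (N,) sample depths along the ray (increasing)
    Returns the (3,) rendered pixel colour."""
    # spacing between adjacent samples; the last interval is treated as unbounded
    deltas = np.diff(t_vals, append=t_vals[-1] + 1e10)
    # per-sample opacity alpha_i = 1 - exp(-sigma_i * delta_i)
    alphas = 1.0 - np.exp(-sigmas * deltas)
    # transmittance T_i: probability the ray reaches sample i unoccluded
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1] + 1e-10]))
    weights = alphas * trans
    return (weights[:, None] * colors).sum(axis=0)
```

A fully transparent ray (all densities zero) composites to black, while an opaque first sample returns that sample's colour, matching the intuition behind the transmittance term.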
A key novelty is the joint optimization of camera intrinsics and extrinsics together with the NeRF parameters. Instead of relying on pre‑calibrated poses from structure‑from‑motion or external sensors, the method treats intrinsic parameters as unconstrained real variables and extrinsics as elements of the SE(3) Lie group, updating them via gradient descent during training. This makes the system robust to calibration errors and simplifies deployment.
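Treating extrinsics as elements of SE(3) means pose updates are parameterized in the Lie algebra se(3) and mapped back to valid rigid transforms with the exponential map, so gradient steps never leave the manifold of rotations and translations. The sketch below shows that map in NumPy; it illustrates the parameterization only, not the paper's training loop.

```python
import numpy as np

def se3_exp(xi):
    """Exponential map from a 6-vector xi = (omega, v) in se(3) to a 4x4
    rigid transform in SE(3). omega is the rotation axis-angle part,
    v the translation part; Rodrigues' formula handles the rotation."""
    omega, v = xi[:3], xi[3:]
    theta = np.linalg.norm(omega)
    # skew-symmetric "hat" matrix of omega
    K = np.array([[0.0, -omega[2], omega[1]],
                  [omega[2], 0.0, -omega[0]],
                  [-omega[1], omega[0], 0.0]])
    if theta < 1e-8:
        R, V = np.eye(3) + K, np.eye(3)  # first-order approximation near zero
    else:
        A = np.sin(theta) / theta
        B = (1.0 - np.cos(theta)) / theta**2
        C = (theta - np.sin(theta)) / theta**3
        R = np.eye(3) + A * K + B * (K @ K)          # Rodrigues rotation
        V = np.eye(3) + B * K + C * (K @ K)          # left Jacobian for translation
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, V @ v
    return T
```

During training, a small 6-vector per camera can be optimized by gradient descent and composed with the current pose estimate, which is what makes the system tolerant of imperfect initial calibration.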
To enforce temporal consistency, the authors incorporate the Super‑SloMo video interpolation model. For each pair of adjacent frames I_t and I_{t+1} from a given camera, Super‑SloMo generates an intermediate frame I_{t+δ} (0 < δ < 1). The NeRF is then queried at the same intermediate time t+δ from the same viewpoint, and a photometric loss between the interpolated frame and the NeRF output encourages the network to produce smooth, coherent predictions across time, mitigating the entanglement that would otherwise arise from directly feeding raw (x, y, z, t) into the MLP.
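The structure of this consistency term can be sketched as follows. Note the linear blend used here as the interpolation target is a deliberate stand‑in for the Super‑SloMo output, and `render_fn` is a hypothetical callable standing in for a NeRF render at time t+δ; neither reflects the paper's actual components.

```python
import numpy as np

def interpolation_loss(render_fn, frame_t, frame_t1, delta=0.5):
    """Photometric consistency at an intermediate time step (sketch only).
    render_fn(delta): hypothetical NeRF render at time t + delta, (H, W, 3)
    frame_t, frame_t1: captured frames at times t and t+1, (H, W, 3)
    The linear blend below stands in for the Super-SloMo interpolated frame."""
    target = (1.0 - delta) * frame_t + delta * frame_t1  # stand-in target
    rendered = render_fn(delta)
    return np.mean((rendered - target) ** 2)             # L2 photometric loss
```

The key idea is that the interpolation network supplies supervision at times where no frame was captured, propagating appearance information between adjacent timestamps.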
The authors evaluate the approach on two datasets. The real‑world dataset comprises indoor and outdoor scenes captured with the five‑camera rig, featuring moving people, objects, and varying illumination. The synthetic dataset is generated using Habitat‑Sim, allowing ground‑truth rendering from arbitrary viewpoints and timestamps. Quantitative metrics (PSNR, SSIM, LPIPS) show that the proposed method outperforms static NeRF, Hyper‑NeRF, D‑NeRF, and other recent dynamic‑scene baselines by 2–3 dB in PSNR on average. Qualitative results demonstrate sharper motion boundaries, more accurate specular highlights, and faithful reconstruction of occlusion dynamics.
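To put the reported 2–3 dB PSNR margin in context, each +3 dB corresponds to roughly halving the mean squared reconstruction error. A minimal PSNR implementation (assuming images normalized to [0, 1]):

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape,
    assumed to lie in [0, max_val]."""
    mse = np.mean((pred - target) ** 2)
    if mse == 0.0:
        return np.inf  # identical images
    return 10.0 * np.log10(max_val**2 / mse)
```

SSIM and LPIPS complement PSNR by measuring structural and perceptual similarity respectively, which is why all three are typically reported together for view-synthesis results.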
Contributions are summarized as: (1) a portable, lightweight multiscopic camera prototype; (2) a time‑aware NeRF that jointly learns scene radiance, density, and camera parameters; (3) the integration of frame interpolation to strengthen temporal continuity; (4) a publicly released codebase and both real‑world and synthetic datasets. Limitations include the relatively small number of views (five), which may be insufficient for highly complex or large‑scale environments, and the computational cost associated with dense ray sampling and deep MLPs. Future work could explore scaling the camera array, employing hybrid grid‑MLP representations, or designing more efficient training pipelines to approach real‑time rendering.