Efficient-LVSM: Faster, Cheaper, and Better Large View Synthesis Model via Decoupled Co-Refinement Attention

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

Feedforward models for novel view synthesis (NVS) have recently been advanced by transformer-based methods such as LVSM, which apply attention across all input and target views. In this work, we argue that this full self-attention design is suboptimal: it suffers from quadratic complexity with respect to the number of input views and from rigid parameter sharing among heterogeneous tokens. We propose Efficient-LVSM, a dual-stream architecture that avoids these issues with a decoupled co-refinement mechanism: it applies intra-view self-attention to input views and self-then-cross attention to target views, eliminating unnecessary computation. Efficient-LVSM achieves 29.86 dB PSNR on RealEstate10K with 2 input views, surpassing LVSM by 0.2 dB, with 2x faster training convergence and 4.4x faster inference. It achieves state-of-the-art performance on multiple benchmarks, exhibits strong zero-shot generalization to unseen view counts, and, thanks to its decoupled design, enables incremental inference with a KV-cache.


💡 Research Summary

The paper introduces Efficient‑LVSM, a transformer‑based large view synthesis model that addresses the inefficiencies of the prior LVSM architecture. LVSM concatenates all input‑view tokens (image patches + pose embeddings) and target‑view pose tokens into a single sequence and applies full self‑attention. This design suffers from two major drawbacks: (1) quadratic computational and memory cost with respect to the number of input views (O(N²)), which becomes prohibitive when many views are available, and (2) a single set of attention parameters is forced to process heterogeneous tokens—rich image‑plus‑geometry tokens for inputs and pose‑only tokens for targets—limiting the model’s ability to learn specialized representations for each role.

Efficient‑LVSM replaces this monolithic design with a dual‑stream architecture and a decoupled co‑refinement attention mechanism. The system consists of an Input Encoder and a Target Decoder:

  • Input Encoder processes each input view independently. It applies intra‑view self‑attention only within the patches of a single view, thereby reducing the per‑layer complexity from O(N²·P²) to O(N·P²) (where P is the number of patches per view). Because each view is handled separately, the encoder’s key/value pairs can be cached (KV‑cache) and reused across multiple target generations.

  • Target Decoder generates each novel view. Each decoder layer first runs self‑attention on the target tokens (the self‑then‑cross pattern), letting the target representation refine its own structure, and then performs cross‑attention in which the target tokens attend to the cached input‑view representations. This separation allows distinct parameter sets for the input and target streams, avoiding the "one‑size‑fits‑all" limitation of LVSM.
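The two streams above can be sketched in a few lines. This is a hypothetical, single-head NumPy illustration of the attention pattern only (the shapes, and the absence of projections, norms, and MLPs, are simplifications; the real model uses full multi-head transformer blocks):

```python
# Decoupled co-refinement sketch: encoder attends within each view,
# decoder attends to itself and then to the cached encoder outputs.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v):
    # Scaled dot-product attention: (Tq, d) x (Tk, d) -> (Tq, d)
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

rng = np.random.default_rng(0)
N, P, T, d = 3, 16, 8, 32          # input views, patches/view, target tokens, dim

views = rng.standard_normal((N, P, d))
target = rng.standard_normal((T, d))

# Input Encoder: intra-view self-attention only (each view processed alone,
# so its outputs can be cached and reused across targets).
encoded = np.stack([attend(v, v, v) for v in views])   # (N, P, d)

# Target Decoder: self-attention on target tokens, then cross-attention
# to the concatenated encoder outputs.
target = attend(target, target, target)                # self
memory = encoded.reshape(N * P, d)
target = attend(target, memory, memory)                # cross

print(encoded.shape, target.shape)
```

Note that the encoder loop never mixes tokens across views, which is exactly what removes the O(N²) term from the per-layer cost.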

The authors further enhance the architecture with layer‑wise co‑refinement: each decoder layer queries the corresponding encoder layer rather than only the final encoder output. This enables the decoder to combine fine‑grained low‑level details (from early encoder layers) with high‑level semantic context (from deeper layers), leading to richer feature synthesis.
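The layer-wise pairing can be sketched as follows. This is an illustrative NumPy toy (single view, single head, no learned weights), showing only the wiring: the decoder's l-th layer cross-attends to the encoder's l-th layer output rather than to the final one:

```python
# Layer-wise co-refinement: record every encoder layer's output, then let
# decoder layer l query encoder layer l (early layers carry low-level detail,
# deep layers carry semantic context).
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v):
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

rng = np.random.default_rng(0)
L, P, T, d = 4, 16, 8, 32          # layers, input patches, target tokens, dim

x = rng.standard_normal((P, d))
enc_feats = []
for _ in range(L):                 # encoder: keep all intermediate outputs
    x = attend(x, x, x)
    enc_feats.append(x)

y = rng.standard_normal((T, d))
for l in range(L):                 # decoder: self-attn, then layer-matched cross-attn
    y = attend(y, y, y)
    y = attend(y, enc_feats[l], enc_feats[l])

print(len(enc_feats), y.shape)
```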

To inject strong visual priors without sacrificing inference speed, the paper adopts REPA (representation alignment), which aligns intermediate token projections with features from a pretrained DINOv3 vision encoder. Experiments show that this alignment yields modest gains for Efficient‑LVSM but little improvement for LVSM, suggesting that the decoupled design better leverages external semantics.
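A minimal sketch of such an alignment loss, assuming a learnable linear projection head and using random stand-ins for both the model's intermediate tokens and the frozen DINOv3 features (the projection shape and the cosine form are assumptions, not the paper's exact recipe):

```python
# REPA-style alignment sketch: project intermediate tokens into the teacher's
# feature space and penalize low cosine similarity with frozen teacher features.
import numpy as np

rng = np.random.default_rng(0)
T, d_model, d_teacher = 16, 32, 24

tokens = rng.standard_normal((T, d_model))          # intermediate model tokens
teacher = rng.standard_normal((T, d_teacher))       # frozen DINOv3-like features
W = rng.standard_normal((d_model, d_teacher)) * 0.1 # learnable projection head

proj = tokens @ W
cos = np.sum(proj * teacher, axis=1) / (
    np.linalg.norm(proj, axis=1) * np.linalg.norm(teacher, axis=1))
loss = float(np.mean(1.0 - cos))                    # minimized alongside the NVS loss
print(round(loss, 3))
```

At inference time the projection head and teacher are dropped entirely, which is why the prior comes at no speed cost.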

Complexity analysis (Table 1) shows that Efficient‑LVSM’s overall cost is O(N·M) for N input views and M target views, compared with LVSM’s O(N²·M) (decoder‑only variant) or O(N²) (encoder‑decoder variant). Empirically, on RealEstate10K with two input views, Efficient‑LVSM achieves 29.86 dB PSNR (0.2 dB over LVSM), 0.895 SSIM, and 0.102 LPIPS, while training converges twice as fast and inference is 4.4× faster. The model also attains state‑of‑the‑art results on several other benchmarks, outperforming prior feedforward methods such as DepthSplat, MVSplat, and GS‑LRM, and shows strong zero‑shot generalization when the number of input views at test time differs from training.
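The asymptotic argument is easy to check with a back-of-the-envelope count of attended token pairs per forward pass (P patch tokens per view, T target tokens, one target view for simplicity; the concrete numbers below are illustrative, not from the paper):

```python
# Token-pair counts: full self-attention over one joint sequence (LVSM-style)
# vs. the decoupled intra-view + self-then-cross pattern.
def full_self_attention_pairs(N, P, T):
    tokens = N * P + T                 # all views and targets in one sequence
    return tokens * tokens             # grows as O(N^2)

def decoupled_pairs(N, P, T):
    encoder = N * P * P                # intra-view self-attention only
    decoder = T * T + T * (N * P)      # target self-attn + cross-attn
    return encoder + decoder           # grows as O(N)

N, P, T = 8, 256, 256
print(full_self_attention_pairs(N, P, T), decoupled_pairs(N, P, T))
```

With 8 input views the decoupled count is already several times smaller, and the gap widens linearly as views are added; the encoder term is also computed once and cached, so it is amortized across all M targets.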

A key practical advantage is incremental inference via KV‑caching. Once the input encoder processes a set of views, their keys and values are stored. Adding a new target view only requires a forward pass through the decoder using the cached inputs. Adding a new input view merely involves encoding that view and appending its keys/values to the cache. This property makes Efficient‑LVSM suitable for interactive applications such as live view synthesis in AR/VR or robotics where cameras may be added or removed on the fly.
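The incremental workflow described above can be sketched as follows. The class name and API here are hypothetical, and the single-head NumPy "encoder" stands in for the real transformer; the point is the caching pattern, in which views are encoded once and both new targets and new cameras reuse the cache:

```python
# Incremental inference sketch: encode each input view once, cache the result,
# and decode any number of targets against the cache.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v):
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

class IncrementalSynthesizer:
    def __init__(self):
        self.kv_cache = []                       # cached per-view encoder outputs

    def add_view(self, patches):
        # Encode one view independently and append it to the cache.
        self.kv_cache.append(attend(patches, patches, patches))

    def render(self, target_tokens):
        # Decode a target against all cached views; nothing is re-encoded.
        memory = np.concatenate(self.kv_cache, axis=0)
        y = attend(target_tokens, target_tokens, target_tokens)
        return attend(y, memory, memory)

rng = np.random.default_rng(0)
synth = IncrementalSynthesizer()
for _ in range(2):
    synth.add_view(rng.standard_normal((16, 32)))
img_a = synth.render(rng.standard_normal((8, 32)))   # 2-view render
synth.add_view(rng.standard_normal((16, 32)))        # camera added on the fly
img_b = synth.render(rng.standard_normal((8, 32)))   # 3-view render, cache reused
print(img_a.shape, img_b.shape, len(synth.kv_cache))
```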

In summary, Efficient‑LVSM presents a well‑motivated redesign of large‑scale view synthesis transformers: it reduces computational complexity, separates the learning of input‑scene understanding and target‑view rendering, leverages multi‑layer co‑refinement, and enables fast, incremental inference. The paper’s extensive ablations and benchmark comparisons substantiate the claimed improvements, positioning Efficient‑LVSM as a strong baseline for future research in scalable, real‑time novel view synthesis.

