Less is More: Skim Transformer for Light Field Image Super-resolution


A light field image captures scenes through a micro-lens array, providing a rich representation that encompasses spatial and angular information. However, this richness comes at the cost of significant data redundancy, and most existing methods tend to indiscriminately utilize all the information from sub-aperture images (SAIs) in an attempt to harness every visual cue, regardless of disparity significance. This paradigm inevitably leads to disparity entanglement, a fundamental cause of inefficiency in light field image processing. To address this limitation, we introduce the Skim Transformer, a novel architecture inspired by the “less is more” philosophy. It features a multi-branch structure where each branch is dedicated to a specific disparity range by constructing its attention score matrix over a skimmed subset of SAIs rather than all of them. Building upon it, we present SkimLFSR, an efficient yet powerful network for light field image super-resolution. Requiring only 67% of the prior leading method’s parameters, SkimLFSR achieves state-of-the-art results, surpassing the best existing method by 0.63 dB and 0.35 dB PSNR at the 2x and 4x tasks, respectively. Through in-depth analyses, we reveal that SkimLFSR, guided by the predefined skimmed SAI sets as prior knowledge, demonstrates distinct disparity-aware behaviors in attending to visual cues. Last but not least, we conduct an experiment to validate SkimLFSR’s generalizability across different angular resolutions, where it achieves competitive performance on a larger angular resolution without any retraining or major network modifications. These findings highlight its effectiveness and adaptability as a promising paradigm for light field image processing.


💡 Research Summary

Light‑field (LF) imaging captures a scene from many viewpoints, producing a 4‑D tensor of sub‑aperture images (SAIs) that encodes both spatial and angular information. While this rich representation is valuable for downstream tasks, it also brings massive redundancy. Existing LF super‑resolution (LFSR) methods—especially those based on Vision Transformers—treat all SAIs uniformly, feeding the entire angular stack into a single self‑attention module. The authors identify this as “disparity entanglement”: heterogeneous disparity cues (large‑parallax foreground vs. small‑parallax background) are processed homogeneously, which both wastes computation and hampers the network’s ability to model depth‑related cues.

The paper proposes the Skim Transformer, a novel transformer architecture built on the “less is more” principle. Instead of attending over the full set of SAIs, each branch selects a skimmed subset of SAIs that is pre‑defined to correspond to a specific disparity range. The architecture contains N Disparity Self‑Attention (DSA) branches, each receiving a disjoint channel slice of the input LF tensor. Within a branch, the DSA module builds query, key, and value vectors only from its skimmed SAIs, thereby constructing an attention score matrix focused on a particular disparity scale (e.g., outer SAIs for large disparity, inner SAIs for small disparity). This design yields two major benefits:

  1. Computational efficiency – the quadratic cost of self‑attention is applied to a much smaller set of tokens, reducing FLOPs and memory usage.
  2. Disparity disentanglement – each branch learns to specialize on a specific depth range, allowing the network to separate foreground‑background cues rather than mixing them.
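Both benefits follow directly from restricting the token set. As a toy illustration, the sketch below uses a ring-based selection rule (an assumption for illustration; the paper predefines its own SAI subsets) and shows how the quadratic attention cost shrinks when a branch attends only over its skimmed SAIs:

```python
# Illustrative sketch of per-branch SAI skimming. The ring-based rule and
# all names here are assumptions for illustration, not the paper's code:
# e.g. an outer ring of views for large disparity, the center for small.

def ring_subset(angular, ring):
    """(u, v) indices of SAIs on a given ring of an angular x angular grid.

    ring = 0 is the outermost ring; larger values move toward the center.
    """
    return [(u, v) for u in range(angular) for v in range(angular)
            if min(u, v, angular - 1 - u, angular - 1 - v) == ring]

def attention_cost(num_tokens):
    # Self-attention's score matrix grows quadratically with token count.
    return num_tokens * num_tokens

full = attention_cost(9 * 9)                   # all 81 SAIs of a 9x9 grid
skim = attention_cost(len(ring_subset(9, 0)))  # outermost ring: 32 SAIs
print(full, skim)
```

With a 9×9 grid, the outer-ring branch builds a 32×32 score matrix instead of 81×81, roughly a 6× reduction for that branch before any spatial tokens are even counted.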

Based on the Skim Transformer, the authors build SkimLFSR, a three‑stage LF super‑resolution pipeline:

  1. Initial feature extraction – four 3×3 convolutions operate on the spatial subspace to produce low‑level features.
  2. Deep feature extraction – a stack of N Correlation Blocks is applied. Each block contains (a) a Skim Transformer (spatial‑disparity attention) and (b) an Angular Transformer (standard self‑attention across SAIs). This combination captures long‑range dependencies in both spatial‑disparity and angular dimensions.
  3. Image generation – deep features are fused via 1×1 convolutions and upsampled with a PixelShuffle layer to obtain the high‑resolution LF.
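The three stages can be sketched at the shape level. The channel counts and fusion layout below are illustrative assumptions; only the PixelShuffle arithmetic (moving an r×r factor from channels into spatial dimensions) is standard:

```python
# Shape-level sketch of the three-stage SkimLFSR pipeline. Stage names
# follow the summary; the concrete channel numbers are assumptions.

def pixel_shuffle_shape(c, h, w, r):
    # PixelShuffle rearranges (C*r^2, H, W) -> (C, H*r, W*r).
    assert c % (r * r) == 0
    return c // (r * r), h * r, w * r

def skimlfsr_shapes(h, w, channels=64, scale=2):
    feat = (channels, h, w)                 # 1) initial 3x3 convs
    deep = feat                             # 2) N correlation blocks keep shape
    _, fh, fw = deep
    fused = (scale * scale * 3, fh, fw)     # 3) 1x1 fusion to r^2 * RGB channels
    return pixel_shuffle_shape(*fused, scale)

print(skimlfsr_shapes(32, 32, scale=2))
```

A 32×32 low-resolution view thus comes out as a 3-channel 64×64 image at 2×, and the same arithmetic covers the 4× task.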

Two lightweight connection tricks further boost performance with negligible overhead: (i) a raw image connection that concatenates the original LF tensor to the output of the final correlation block, and (ii) a learnable skip connection that scales each block’s input by channel‑wise learnable coefficients α. These improve PSNR by roughly 0.1 dB.
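The learnable skip connection amounts to y = F(x) + α ⊙ x with one coefficient per channel. A minimal sketch (names and the list-of-channels layout are illustrative, not the paper's implementation):

```python
# Minimal sketch of a per-channel learnable skip connection: the block's
# input is scaled by learnable coefficients alpha and added to its output.

def learnable_skip(block, x, alpha):
    """x: one value per channel; alpha: one learnable scale per channel."""
    y = block(x)
    return [a * xc + yc for a, xc, yc in zip(alpha, x, y)]

# With an identity "block" and alpha = 1, the connection doubles the input;
# alpha = 0 recovers a plain (skip-free) block.
print(learnable_skip(lambda v: v, [1.0, 2.0], [1.0, 1.0]))
```

Because α is per-channel, each correlation block can learn how much of its raw input to preserve, at the cost of only C extra parameters per block.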

Experimental results demonstrate that SkimLFSR outperforms the previous state‑of‑the‑art Many‑to‑Many Transformer (M2MTNet) while using only 67 % of its parameters. On the standard 2× LFSR benchmark it gains +0.63 dB PSNR, and on 4× it gains +0.35 dB. A lightweight variant (37 % of parameters, 35 % FLOPs, 28 % faster inference) still surpasses almost all prior methods. Qualitative examples show sharper edges and better texture reconstruction, especially in regions with large disparity.

A particularly compelling finding is the angular‑resolution agnosticism of the Skim Transformer. Because attention is built only on the skimmed SAI set, the model does not depend on the total angular resolution used during training. The authors evaluate the network on a larger angular grid (e.g., 13×13 instead of 9×9) without any retraining or architectural changes and observe competitive PSNR, confirming that the learned disparity representations are transferable across angular configurations.
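The transferability has a simple structural explanation: a skimmed subset is defined by the position of views within the grid, not by the grid's total size. Under the same illustrative ring rule (an assumption; the paper's subsets may differ), the branch definition applies unchanged to any angular resolution:

```python
# The skimming rule depends only on ring position, not on the total angular
# resolution, so the same branch applies to a 9x9 or a 13x13 grid without
# retraining. ring_subset restates the illustrative selection rule.

def ring_subset(angular, ring):
    return [(u, v) for u in range(angular) for v in range(angular)
            if min(u, v, angular - 1 - u, angular - 1 - v) == ring]

for angular in (9, 13):
    # Same rule, different grid: only the number of selected SAIs changes.
    print(angular, len(ring_subset(angular, 0)))
```

The per-token query/key/value projections are shared across SAIs, so a larger grid only changes how many tokens enter each branch, not the learned weights.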

In-depth analysis of the learned attention maps reveals that each DSA branch indeed focuses on depth‑specific cues: the “large‑disparity” branch attends strongly to foreground objects with high parallax, while the “small‑disparity” branch emphasizes background structures. Remarkably, this depth‑aware behavior emerges despite training only on a regression LFSR loss, indicating that the Skim Transformer implicitly captures underlying geometry.

Overall, the paper makes the following contributions:

  1. Identification of disparity entanglement as a key inefficiency in existing LF‑Transformer methods.
  2. Introduction of the Skim Transformer, which selectively samples SAIs and employs a multi‑branch disparity‑aware attention mechanism.
  3. Construction of SkimLFSR, achieving state‑of‑the‑art LF super‑resolution with substantially fewer parameters and faster inference.
  4. Demonstration of implicit depth discrimination and angular‑resolution independence, broadening the applicability of LF models.

The work opens new avenues for efficient, geometry‑aware processing of high‑dimensional visual data, and its principles could be extended to other LF tasks such as depth estimation, view synthesis, or even to other multi‑view domains like multi‑camera video and light‑field microscopy.

