FLOL: Fast Baselines for Real-World Low-Light Enhancement
Low-Light Image Enhancement (LLIE) is a key task in computational photography and imaging. The problem of enhancing images captured at night or in dark environments has been well studied in the computer vision literature. However, current deep learning-based solutions struggle with efficiency and robustness in real-world scenarios (e.g., scenes with noise or saturated pixels). We propose a lightweight neural network that combines image processing in the frequency and spatial domains. Our baseline method, FLOL, is one of the fastest models for this task, achieving results comparable to the state of the art on popular real-world benchmarks such as LOLv2, LSRW, MIT-5K and UHD-LL. Moreover, it processes 1080p images in real time, in under 12 ms. Code and models are available at https://github.com/cidautai/FLOL
💡 Research Summary
The paper introduces FLOL, a lightweight yet high‑performance model for Low‑Light Image Enhancement (LLIE) that targets real‑world deployment where both speed and robustness are essential. The authors first formalize the low‑light imaging process as y = γ(x) + n, where γ captures the sensor response (including ISO gain and clipping) and n denotes read and shot noise. They argue that most recent deep‑learning solutions either have excessive parameters and FLOPs or fail to generalize from synthetic training data to real‑world scenes with complex noise and saturation patterns.
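The degradation model y = γ(x) + n can be made concrete with a toy simulation. The sketch below is illustrative only; the gain, gamma curve, and noise levels are assumptions, not values from the paper:

```python
import numpy as np

def degrade(x, gain=0.15, gamma=2.2, read_std=0.01, rng=None):
    """Toy low-light degradation y = gamma(x) + n.

    gamma(): under-exposure gain, a power-law sensor response, and clipping;
    n: Poisson shot noise plus Gaussian read noise.
    All parameter values here are illustrative assumptions.
    """
    rng = np.random.default_rng() if rng is None else rng
    dark = np.clip(gain * x, 0.0, 1.0) ** (1.0 / gamma)   # sensor response gamma(x)
    shot = rng.poisson(dark * 255.0) / 255.0 - dark       # signal-dependent shot noise
    read = rng.normal(0.0, read_std, size=x.shape)        # signal-independent read noise
    return np.clip(dark + shot + read, 0.0, 1.0)

x = np.full((8, 8, 3), 0.8)                    # bright synthetic patch
y = degrade(x, rng=np.random.default_rng(0))   # darkened, noisy observation
print(y.mean() < x.mean())                     # True: output is darker on average
```

Note how the shot-noise term scales with the signal itself, which is one reason dark regions in real captures are disproportionately noisy after brightening.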
FLOL’s architecture is built around a two‑stage pipeline that processes the image in both the Fourier and spatial domains. In the first stage, called Fourier Illumination Enhancement (FIE), the input RGB image is transformed with a 2‑D Fast Fourier Transform (FFT) and split into amplitude (magnitude) and phase components. The amplitude encodes global illumination information, while the phase preserves structural details. A MetaFormer‑style block (FIE‑Block) equipped with a “Free‑Process” feed‑forward network operates directly on the frequency representation. This FFN, borrowed from NAFNet’s simple‑gate design, works on a low‑dimensional feature space (16 channels) to keep computation low. The network predicts a low‑resolution “Module Map” that is upsampled and applied to the original amplitude via element‑wise division, effectively brightening the image. The phase is left unchanged, and an inverse FFT (iFFT) yields an intermediate result x_lol that has improved illumination but still contains significant noise and artifacts.
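The amplitude-division idea can be sketched in a few lines of NumPy. Here `module_map` stands in for the network's upsampled prediction (in FLOL it is learned, not handcrafted); values below 1 scale up the amplitude and hence the brightness, while the phase, and so the structure, is untouched:

```python
import numpy as np

def fie_step(x, module_map):
    """Sketch of the FIE stage: divide the FFT amplitude by a predicted
    map while keeping the phase unchanged, then invert the transform.
    `module_map` is a hypothetical stand-in for the learned prediction.
    """
    F = np.fft.fft2(x, axes=(0, 1))
    amp, phase = np.abs(F), np.angle(F)
    amp = amp / module_map                                 # element-wise division
    x_lol = np.fft.ifft2(amp * np.exp(1j * phase), axes=(0, 1)).real
    return np.clip(x_lol, 0.0, 1.0)

x = np.full((16, 16), 0.2)          # uniformly dark image
m = np.full((16, 16), 0.5)          # uniform map -> amplitude doubled
out = fie_step(x, m)
print(round(out.mean(), 3))         # 0.4: twice as bright
```

On this flat image only the DC component is non-zero, so halving the map exactly doubles the mean intensity; on real images the map varies spatially in frequency space, reshaping the global illumination.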
The second stage, the Denoiser, refines x_lol. The intermediate result and the original dark input are concatenated and fed into an encoder composed of strided 3×3 convolutions. The encoder’s output is split into a spatial branch and a frequency branch; the latter again uses FFT–iFFT cycles to capture global context. Crucially, a Signal‑to‑Noise Ratio (SNR) map is computed from the frequency branch, following prior work (e.g., SNR‑Net). The spatial and frequency outputs (O_S and O_F) are fused using the SNR map: F = O_S × R + O_F × (1 − R). This weighted combination allows the network to apply stronger denoising where the SNR is low (typically darker regions) while preserving details where the SNR is high. The decoder employs sub‑pixel convolution (PixelShuffle) for up‑sampling and adds a global residual connection that injects the original image as a prior, producing the final clean output x̂.
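The SNR-guided fusion F = O_S × R + O_F × (1 − R) is a simple convex combination per pixel, as this minimal sketch with toy branch outputs shows:

```python
import numpy as np

def snr_fuse(o_s, o_f, snr_map):
    """SNR-guided fusion F = O_S * R + O_F * (1 - R).

    R (the SNR map, normalised to [0, 1]) weights the spatial branch
    where the signal is clean and the frequency branch where it is noisy.
    """
    r = np.clip(snr_map, 0.0, 1.0)
    return o_s * r + o_f * (1.0 - r)

o_s = np.ones((4, 4))    # toy spatial-branch output
o_f = np.zeros((4, 4))   # toy frequency-branch output
r = np.eye(4)            # high SNR on the diagonal only
f = snr_fuse(o_s, o_f, r)
print(f)                 # identity pattern: O_S kept exactly where R = 1
```

Because the weights sum to 1 at every pixel, the fusion never amplifies either branch; it only interpolates between them, which keeps the output range stable.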
Implementation details emphasize efficiency: the whole model contains roughly 0.18 M parameters (≈180 k) and operates with about 7–10× fewer FLOPs than comparable state‑of‑the‑art methods. On an RTX 3080 GPU, FLOL processes a full HD (1920 × 1080) image in under 12 ms, far faster than Retinexformer (≈175 ms) and comparable to the much larger FourLLIE.
The authors evaluate FLOL on several paired real‑world datasets (LOLv2‑Real, LSRW‑Nikon, LSRW‑Huawei, UHD‑LL, MIT‑5K) and on synthetic data (LOLv2‑Synthetic). Quantitative results show competitive PSNR/SSIM values: e.g., on LOLv2‑Real, FLOL achieves 19.10 dB PSNR and 0.5833 SSIM, while using 180 × fewer parameters than UHDFour. On LSRW and UHD‑LL, FLOL reaches PSNR 25.01 dB / SSIM 0.888 and PSNR 22.10 dB / SSIM 0.910 respectively, matching or surpassing many heavyweight baselines. Unpaired image quality assessments (BRISQUE, MANIQA, TReS) also indicate that FLOL’s outputs are perceptually comparable to the best methods while being far more lightweight.
Ablation studies confirm the importance of each component: removing the Fourier branch degrades illumination recovery; omitting the SNR‑guided fusion harms denoising performance; and using full‑resolution amplitude maps instead of the low‑resolution module map increases FLOPs dramatically with minimal quality gain.
The paper acknowledges limitations: the amplitude‑only illumination enhancement may struggle with highly localized lighting variations, and the current design is optimized for single‑image processing without temporal consistency mechanisms for video. Future work is suggested to explore cross‑attention between Fourier and spatial features, incorporate self‑supervised domain adaptation to further close the synthetic‑real gap, and extend the architecture for real‑time video LLIE.
In summary, FLOL presents a well‑balanced solution that unifies frequency‑domain illumination correction and spatial‑domain denoising within a compact network. Its impressive speed (≤12 ms for 1080p) and strong performance across diverse real‑world benchmarks make it a compelling baseline for low‑light enhancement, especially for resource‑constrained platforms such as mobile devices and edge AI hardware.