Wavelet-Domain Masked Image Modeling for Color-Consistent HDR Video Reconstruction

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

High Dynamic Range (HDR) video reconstruction aims to recover fine brightness, color, and details from Low Dynamic Range (LDR) videos. However, existing methods often suffer from color inaccuracies and temporal inconsistencies. To address these challenges, we propose WMNet, a novel HDR video reconstruction network that leverages Wavelet domain Masked Image Modeling (W-MIM). WMNet adopts a two-phase training strategy: In Phase I, W-MIM performs self-reconstruction pre-training by selectively masking color and detail information in the wavelet domain, enabling the network to develop robust color restoration capabilities. A curriculum learning scheme further refines the reconstruction process. Phase II fine-tunes the model using the pre-trained weights to improve the final reconstruction quality. To improve temporal consistency, we introduce the Temporal Mixture of Experts (T-MoE) module and the Dynamic Memory Module (DMM). T-MoE adaptively fuses adjacent frames to reduce flickering artifacts, while DMM captures long-range dependencies, ensuring smooth motion and preservation of fine details. Additionally, since existing HDR video datasets lack scene-based segmentation, we reorganize HDRTV4K into HDRTV4K-Scene, establishing a new benchmark for HDR video reconstruction. Extensive experiments demonstrate that WMNet achieves state-of-the-art performance across multiple evaluation metrics, significantly improving color fidelity, temporal coherence, and perceptual quality. The code is available at: https://github.com/eezkni/WMNet


💡 Research Summary

The paper introduces WMNet, a novel deep learning framework for high‑dynamic‑range (HDR) video reconstruction from low‑dynamic‑range (LDR) inputs that simultaneously addresses two long‑standing challenges: color fidelity and temporal consistency. The core innovation is Wavelet‑Domain Masked Image Modeling (W‑MIM). Instead of masking pixels in the spatial domain, the authors decompose each LDR frame using a three‑level 2‑D Haar discrete wavelet transform (DWT) into one low‑frequency sub‑band and three high‑frequency sub‑bands at each level. A full‑zero mask is applied to all high‑frequency components, forcing the network to learn to restore color and fine details from heavily degraded inputs. Simultaneously, a random mask with a gradually increasing ratio (0 → 0.5) is applied to the low‑frequency component, following a curriculum‑learning schedule that first emphasizes color reconstruction and later introduces more complex structural learning. This self‑reconstruction pre‑training (Phase I) yields an encoder that is already adept at color and detail recovery.
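The masking scheme described above can be sketched in a few lines. The sketch below is a simplified, single‑level illustration (the paper uses a three‑level Haar DWT and operates inside a training pipeline); the function names (`haar_dwt2`, `curriculum_ratio`, `wmim_mask`) and the linear ramp schedule are assumptions for illustration, not the authors' actual implementation.

```python
import random

def haar_dwt2(img):
    """One level of the 2-D Haar DWT: returns (LL, LH, HL, HH) sub-bands."""
    h, w = len(img) // 2, len(img[0]) // 2
    LL = [[0.0] * w for _ in range(h)]
    LH = [[0.0] * w for _ in range(h)]
    HL = [[0.0] * w for _ in range(h)]
    HH = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            a, b = img[2 * r][2 * c], img[2 * r][2 * c + 1]
            d, e = img[2 * r + 1][2 * c], img[2 * r + 1][2 * c + 1]
            LL[r][c] = (a + b + d + e) / 4.0  # low-freq: coarse brightness/color
            LH[r][c] = (a - b + d - e) / 4.0  # horizontal detail
            HL[r][c] = (a + b - d - e) / 4.0  # vertical detail
            HH[r][c] = (a - b - d + e) / 4.0  # diagonal detail
    return LL, LH, HL, HH

def curriculum_ratio(step, total_steps, max_ratio=0.5):
    """Random-mask ratio on the low-frequency band, ramped 0 -> max_ratio."""
    return min(max_ratio, max_ratio * step / total_steps)

def wmim_mask(img, step, total_steps, rng=random.Random(0)):
    """Build the masked W-MIM input (single wavelet level, for brevity)."""
    LL, LH, HL, HH = haar_dwt2(img)
    # Full-zero mask on all high-frequency sub-bands, as in Phase I.
    def zeros_like(band):
        return [[0.0] * len(band[0]) for _ in band]
    LH, HL, HH = zeros_like(LH), zeros_like(HL), zeros_like(HH)
    # Curriculum random mask on the low-frequency sub-band (ratio 0 -> 0.5).
    ratio = curriculum_ratio(step, total_steps)
    LL = [[0.0 if rng.random() < ratio else v for v in row] for row in LL]
    return LL, LH, HL, HH
```

In a real pipeline the masked sub‑bands would be inverse‑transformed (or fed as coefficients) into the encoder, and the reconstruction loss would target the unmasked original.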

Phase II fine‑tunes the same encoder for the actual LDR‑to‑HDR mapping while integrating two temporal modules. The Temporal Mixture of Experts (T‑MoE) consists of several expert branches that process features from adjacent frames at the same resolution; a learned gating mechanism dynamically weights each expert per frame, enabling adaptive fusion of neighboring information and reducing flickering. The Dynamic Memory Module (DMM) stores scene‑level memory slots. For each frame, a similarity‑based read operation retrieves the most relevant contextual cues, which are then merged back into the feature stream, providing long‑range temporal context without the interference typical of batch‑level or dataset‑wide memories.
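The two temporal modules boil down to familiar attention‑style operations: T‑MoE computes softmax gate weights over expert branches, and DMM performs a similarity‑weighted read over memory slots. The minimal sketch below uses plain vectors and dot‑product similarity; the actual modules are learned networks with richer feature shapes, so `tmoe_fuse` and `dmm_read` here are illustrative assumptions, not the paper's code.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def tmoe_fuse(expert_outputs, gate_logits):
    """T-MoE-style fusion: gate-weighted sum of per-expert feature vectors."""
    weights = softmax(gate_logits)
    fused = [0.0] * len(expert_outputs[0])
    for w, feat in zip(weights, expert_outputs):
        for i, v in enumerate(feat):
            fused[i] += w * v
    return fused

def dmm_read(query, memory_slots):
    """DMM-style read: attention over scene-level memory slots by similarity."""
    sims = [sum(q * m for q, m in zip(query, slot)) for slot in memory_slots]
    attn = softmax(sims)
    out = [0.0] * len(memory_slots[0])
    for a, slot in zip(attn, memory_slots):
        for i, v in enumerate(slot):
            out[i] += a * v
    return out
```

The key design point in both cases is that the weights are computed per frame, so fusion adapts to content: a frame similar to a given expert's specialty (or memory slot) draws more from it, which is what suppresses flicker and preserves long‑range coherence.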

To evaluate the method, the authors reorganize the public HDRTV4K dataset into scene‑based splits (HDRTV4K‑Scene) and add a set of long‑duration scenes (HDRTV4K‑LongScene). Experiments on these benchmarks, using PSNR‑L, HDR‑VDP‑3, ΔE (color difference), and temporal warping error, demonstrate that WMNet outperforms previous state‑of‑the‑art approaches by a substantial margin (e.g., +1.2 dB PSNR‑L, 15 % lower ΔE, 30 % reduction in temporal error). Ablation studies confirm that both W‑MIM and the temporal modules contribute critically: removing W‑MIM degrades color accuracy, while omitting T‑MoE or DMM leads to noticeable flicker and loss of long‑range coherence.
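Of the metrics above, the color‑difference score is the simplest to make concrete. The sketch below computes the classic CIE76 ΔE (Euclidean distance in CIELAB) and averages it over a frame; note this is an assumption for illustration, as the paper may use a more recent ΔE variant (e.g., CIEDE2000), and conversion from RGB to Lab is omitted here.

```python
import math

def delta_e76(lab1, lab2):
    """CIE76 color difference between two CIELAB triplets (L*, a*, b*)."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(lab1, lab2)))

def mean_delta_e(frame_a, frame_b):
    """Average per-pixel Delta-E between two frames of Lab pixels."""
    return sum(delta_e76(p, q) for p, q in zip(frame_a, frame_b)) / len(frame_a)
```

A "15% lower ΔE" then means the reconstructed frames sit, on average, 15% closer to the ground‑truth HDR frames in Lab space.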

In summary, WMNet leverages wavelet‑domain masking to enforce robust color learning, employs a curriculum to progressively increase difficulty, and augments the backbone with expert‑based frame fusion and scene‑specific memory. This combination yields HDR video reconstructions that are both color‑accurate and temporally stable, setting a new benchmark for the field. Future work may explore lighter expert designs, multi‑scale memory hierarchies, and real‑time deployment.

