Unleashing Degradation-Carrying Features in Symmetric U-Net: Simpler and Stronger Baselines for All-in-One Image Restoration
All-in-one image restoration aims to handle diverse degradations (e.g., noise, blur, adverse weather) within a unified framework, yet existing methods increasingly rely on complex architectures (e.g., Mixture-of-Experts, diffusion models) and elaborate degradation prompt strategies. In this work, we reveal a critical insight: well-crafted feature extraction inherently encodes degradation-carrying information, and a symmetric U-Net architecture is sufficient to unleash these cues effectively. By aligning feature scales across the encoder and decoder and enabling streamlined cross-scale propagation, our symmetric design robustly preserves intrinsic degradation signals, rendering simple additive fusion in skip connections sufficient for state-of-the-art performance. Our primary baseline, SymUNet, is built on this symmetric U-Net and achieves better results across benchmark datasets than existing approaches while reducing computational cost. We further propose a semantic-enhanced variant, SE-SymUNet, which integrates direct semantic injection from frozen CLIP features via simple cross-attention to explicitly amplify degradation priors. Extensive experiments on several benchmarks validate the superiority of our methods. Both baselines, SymUNet and SE-SymUNet, establish simpler and stronger foundations for future advancements in all-in-one image restoration. The source code is available at https://github.com/WenlongJiao/SymUNet.
💡 Research Summary
All‑in‑one image restoration aims to recover high‑quality images from inputs degraded by a variety of factors—noise, haze, rain, blur, low‑light, etc.—using a single model. Recent works have increasingly relied on sophisticated mechanisms such as multimodal prompts, Mixture‑of‑Experts (MoE), diffusion priors, or vision‑language agents. While these approaches push performance boundaries, they also inflate model size, computational cost, and deployment complexity, and often obscure the intrinsic degradation cues already present in the raw image features.
The authors identify a fundamental bottleneck: most state‑of‑the‑art all‑in‑one networks adopt an asymmetric U‑Net‑like architecture. In these designs the decoder is “heavy”: after each skip connection the channel dimension doubles, and additional refinement blocks are appended. This asymmetry dilutes the degradation‑aware features extracted by the encoder, creates a mismatch between encoder and decoder feature spaces, and leads to training instability when multiple degradation types compete for representation capacity.
To address this, the paper proposes SymUNet, a strictly symmetric U‑Net. The encoder, decoder, and bottleneck all consist of the same efficient Transformer block (borrowed from Restormer). Each scale maintains an identical channel width, and skip connections are simple element‑wise addition, preserving the original feature scale and avoiding abrupt channel expansion. No auxiliary refinement modules are used; the final restored image is obtained by applying a 3×3 convolution to the last (full‑resolution) decoder output and adding the result as a residual to the input. This design yields three key advantages: (1) degradation‑specific cues are preserved end‑to‑end, (2) the network learns compact representations due to consistent channel dimensions, and (3) parameter count and FLOPs are dramatically reduced.
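The channel‑width argument above can be made concrete with a toy NumPy sketch (not the authors' code; shapes are illustrative, not the paper's actual configuration). Additive fusion keeps the decoder's channel count identical to the encoder's, whereas the concatenation skips common in asymmetric designs double it at every scale:

```python
import numpy as np

# Toy feature maps at one U-Net scale, laid out as (channels, height, width).
C, H, W = 48, 32, 32
encoder_feat = np.random.randn(C, H, W)
decoder_feat = np.random.randn(C, H, W)

# Additive skip (SymUNet-style): channel width is unchanged, so the
# decoder block at this scale can mirror the encoder block exactly.
additive = encoder_feat + decoder_feat
print(additive.shape)       # (48, 32, 32)

# Concatenation skip (common asymmetric design): channels double,
# forcing a wider, heavier decoder after every skip connection.
concatenated = np.concatenate([encoder_feat, decoder_feat], axis=0)
print(concatenated.shape)   # (96, 32, 32)
```

Since a block's parameter count grows roughly quadratically with channel width, keeping the width constant at every scale is where much of the parameter and FLOP saving comes from.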
Empirical results on three‑task (denoising, dehazing, deraining) and five‑task (adding deblurring and low‑light enhancement) benchmarks show that SymUNet outperforms all competing methods—including MoE‑based AirNet, prompt‑conditioned Restormer variants, and diffusion‑based frameworks—while using fewer parameters and less compute. In the PSNR‑FLOPs trade‑off plot SymUNet occupies the optimal top‑left quadrant, confirming its efficiency.
To demonstrate that simple semantic guidance can further boost performance without sacrificing simplicity, the authors extend SymUNet to SE‑SymUNet. A frozen CLIP ViT‑L/14 model extracts a set of patch‑level semantic tokens Z from the input image. At each decoder stage, image features f are refined by cross‑attention with Z (Semantic Guidance), and Z is simultaneously updated by attending to the refined features (Semantic Refinement). This bidirectional loop injects high‑level degradation priors (e.g., “hazy”, “rainy”, “noisy”) while adding negligible parameters. SE‑SymUNet consistently gains 0.1–0.2 dB PSNR over SymUNet, especially on tasks where semantic context is informative (low‑light, deblurring).
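The bidirectional guidance/refinement loop can be sketched with single‑head scaled dot‑product cross‑attention in NumPy. This is a simplified illustration under assumed shapes, omitting the learned query/key/value projections a real implementation would have:

```python
import numpy as np

def cross_attention(query, key_value):
    """Single-head scaled dot-product cross-attention.
    No learned projections -- for illustration only."""
    d = query.shape[-1]
    logits = (query @ key_value.T) / np.sqrt(d)          # (Nq, Nkv)
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
    return weights @ key_value                           # (Nq, d)

rng = np.random.default_rng(0)
d = 64                                # shared embedding dim (illustrative)
f = rng.standard_normal((256, d))     # flattened decoder features (HW = 256)
Z = rng.standard_normal((16, d))      # frozen CLIP semantic tokens

# Semantic Guidance: image features attend to the semantic tokens.
f_refined = f + cross_attention(f, Z)
# Semantic Refinement: tokens attend back to the refined features.
Z_updated = Z + cross_attention(Z, f_refined)

print(f_refined.shape, Z_updated.shape)  # (256, 64) (16, 64)
```

Because the CLIP backbone is frozen and only the small cross‑attention layers are trained, the added parameter cost stays negligible, consistent with the summary above.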
Ablation studies confirm that (i) symmetry of encoder‑decoder block counts, (ii) constant channel dimensions, and (iii) removal of auxiliary refinement blocks each contribute positively to performance; (iv) the additive skip connection is sufficient when degradation cues are preserved. Visualizations of learned residual features illustrate that each task’s characteristic patterns (haze veiling, rain streak orientation, noise texture) are clearly encoded in the encoder output and flow unchanged to the decoder.
The paper also discusses extensibility: the symmetric core can be combined with alternative backbones (ConvNeXt, Swin), frequency‑domain modules, or other large‑scale multimodal models (e.g., BLIP, Florence) for richer semantic priors. Potential future directions include temporal symmetry for video restoration, self‑supervised learning on unlabeled real‑world degradation data, and hardware‑aware lightweight variants for mobile deployment.
In summary, the work overturns the prevailing belief that more complex architectures are necessary for all‑in‑one restoration. By returning to a balanced, symmetric U‑Net design, it demonstrates that degradation‑aware features are naturally learnable and can be efficiently propagated, achieving state‑of‑the‑art results with a fraction of the computational budget. The modest semantic enhancement via frozen CLIP further validates that high‑level priors can be incorporated in a lightweight, plug‑and‑play manner, opening a clear path for future research toward simpler, faster, and more robust universal image restoration systems.