Multi-level Wavelet Convolutional Neural Networks
In computer vision, convolutional neural networks (CNNs) often adopt pooling to enlarge the receptive field, which has the advantage of low computational complexity. However, pooling can cause information loss and thus is detrimental to further operations such as feature extraction and analysis. Recently, dilated filtering has been proposed to trade off between receptive field size and efficiency, but the accompanying gridding effect causes sparse sampling of the input image with checkerboard patterns. To address this problem, we propose in this paper a novel multi-level wavelet CNN (MWCNN) model that achieves a better trade-off between receptive field size and computational efficiency. The core idea is to embed the wavelet transform into the CNN architecture to reduce the resolution of feature maps while increasing the receptive field. Specifically, MWCNN for image restoration is based on the U-Net architecture, and the inverse wavelet transform (IWT) is deployed to reconstruct high-resolution (HR) feature maps. The proposed MWCNN can also be viewed as an improvement over dilated filtering and a generalization of average pooling, and can be applied not only to image restoration tasks but also to any CNN requiring a pooling operation. Experimental results demonstrate the effectiveness of the proposed MWCNN for tasks such as image denoising, single image super-resolution, JPEG image artifact removal, and object classification.
💡 Research Summary
The paper addresses two longstanding issues in convolutional neural networks for vision tasks: the information loss caused by conventional pooling operations and the gridding artifacts introduced by dilated convolutions. To overcome these problems, the authors propose a Multi‑level Wavelet Convolutional Neural Network (MWCNN) that replaces pooling with the discrete wavelet transform (DWT) for down‑sampling and uses the inverse wavelet transform (IWT) for up‑sampling. By employing Haar wavelets, DWT is implemented as four fixed convolutional filters (LL, LH, HL, HH) with stride 2, which simultaneously reduces spatial resolution by a factor of two and preserves both low‑frequency (LL) and high‑frequency (LH, HL, HH) information. Because DWT is invertible, IWT can reconstruct the original feature map without any loss, unlike average or max pooling, which discard high‑frequency details.
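The DWT/IWT pair described above can be made concrete with a minimal NumPy sketch. It uses the unnormalized Haar filters (LL = [[1,1],[1,1]], LH = [[-1,-1],[1,1]], HL = [[-1,1],[-1,1]], HH = [[1,-1],[-1,1]]) applied with stride 2, here expressed as strided slicing rather than explicit convolution; the function names and the single-channel, even-sized input are illustrative assumptions, not the paper's code:

```python
import numpy as np

def haar_dwt2(x):
    """Haar DWT of a (H, W) map with even H, W.

    Each 2x2 block [[a, b], [c, d]] is mapped to its four sub-band
    coefficients; equivalent to four fixed 2x2 filters with stride 2.
    """
    a, b = x[0::2, 0::2], x[0::2, 1::2]   # top-left, top-right pixels
    c, d = x[1::2, 0::2], x[1::2, 1::2]   # bottom-left, bottom-right
    ll = a + b + c + d                    # low-frequency average band
    lh = -a - b + c + d                   # horizontal detail
    hl = -a + b - c + d                   # vertical detail
    hh = a - b - c + d                    # diagonal detail
    return ll, lh, hl, hh

def haar_iwt2(ll, lh, hl, hh):
    """Inverse Haar DWT: exactly recovers the original (2H, 2W) map."""
    h, w = ll.shape
    x = np.empty((2 * h, 2 * w), dtype=float)
    x[0::2, 0::2] = (ll - lh - hl + hh) / 4
    x[0::2, 1::2] = (ll - lh + hl - hh) / 4
    x[1::2, 0::2] = (ll + lh - hl - hh) / 4
    x[1::2, 1::2] = (ll + lh + hl + hh) / 4
    return x

x = np.random.rand(8, 8)
bands = haar_dwt2(x)
assert np.allclose(haar_iwt2(*bands), x)  # lossless round trip
```

Note that with these unnormalized filters, `ll / 4` is exactly 2×2 average pooling, which is the sense in which MWCNN generalizes average pooling: pooling keeps only the LL band and throws the other three away.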
The architecture builds upon a U‑Net backbone. In the encoder, each DWT step is followed by a shallow fully convolutional network (FCN) block of a few convolutional layers that learns non‑linear transformations on the four sub‑bands. The decoder mirrors this process: IWT restores the spatial resolution, and skip connections fuse low‑level detail with high‑level semantics. By stacking two or three levels of wavelet decomposition, the receptive field expands exponentially while the number of parameters and FLOPs increase only modestly. The authors mathematically relate MWCNN to dilated convolutions (showing that dilated filters are a special case with sparse sampling) and to average pooling (average pooling corresponds to using only the LL sub‑band). Thus MWCNN can be viewed as a more general, information‑preserving down‑sampling scheme.
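The channel bookkeeping of this multi-level scheme can be sketched as follows: one DWT step turns a C-channel map into a 4C-channel map at half the resolution, so stacking levels halves resolution and quadruples channels each time. This is a simplified NumPy sketch (per-channel Haar sub-bands concatenated along the channel axis; the learned CNN block that follows each DWT in MWCNN is deliberately omitted):

```python
import numpy as np

def dwt_downsample(x):
    """One MWCNN-style down-sampling step: (C, H, W) -> (4C, H/2, W/2).

    Each channel is split into its four Haar sub-bands (LL, LH, HL, HH),
    which are concatenated along the channel axis. No information is
    discarded, unlike pooling.
    """
    a, b = x[:, 0::2, 0::2], x[:, 0::2, 1::2]
    c, d = x[:, 1::2, 0::2], x[:, 1::2, 1::2]
    return np.concatenate(
        [a + b + c + d, -a - b + c + d, -a + b - c + d, a - b - c + d],
        axis=0,
    )

x = np.random.rand(3, 64, 64)   # e.g. an RGB input
level1 = dwt_downsample(x)      # shape (12, 32, 32)
level2 = dwt_downsample(level1) # shape (48, 16, 16)
```

Two levels already give a 4× spatial reduction with no loss, which is why the receptive field of subsequent 3×3 convolutions grows so quickly relative to their cost.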
Extensive experiments cover four representative tasks:
- Image Denoising – On standard benchmarks (Set12, BSD68), MWCNN outperforms DnCNN and MemNet by 0.2–0.3 dB in PSNR, demonstrating superior preservation of fine textures thanks to the high‑frequency sub‑bands.
- Single‑Image Super‑Resolution (SISR) – For 4× up‑scaling, MWCNN achieves state‑of‑the‑art PSNR on Set5, Set14, and BSD100, surpassing SRCNN, VDSR, DRRN, and other deep models while maintaining a relatively low computational cost.
- JPEG Artifact Removal – The network effectively restores lost high‑frequency components, yielding higher PSNR/SSIM than ARCNN and DnCNN across various compression qualities.
- Image Classification – Replacing the conventional pooling layers in a ResNet‑50‑like classifier with DWT leads to a modest (≈0.4 %) increase in Top‑1 accuracy on ImageNet, confirming that the richer multi‑scale representation benefits high‑level tasks as well.
In terms of speed, MWCNN runs 2–3× faster than very deep models such as DRRN and MemNet on a GTX 1080, and only slightly slower than lightweight models like LapSRN, yet it offers a dramatically larger receptive field and higher output quality. The authors also discuss the limitations: the current implementation uses only Haar wavelets; exploring more sophisticated bi‑orthogonal wavelets or learning the wavelet filters could further improve performance.
Overall, MWCNN demonstrates that integrating wavelet transforms into CNNs provides an elegant, lossless down‑sampling mechanism that simultaneously enlarges receptive fields and preserves high‑frequency details. This makes it a versatile drop‑in replacement for pooling in a wide range of vision applications, from low‑level restoration to high‑level classification.