Learning Deep Image Priors for Blind Image Denoising
Xianxu Hou¹, Hongming Luo¹, Jingxin Liu¹, Bolei Xu¹, Ke Sun², Yuanhao Gong¹, Bozhi Liu¹, Guoping Qiu¹,³

¹College of Information Engineering and Guangdong Key Lab for Intelligent Information Processing, Shenzhen University, China
²School of Computer Science, The University of Nottingham Ningbo China
³School of Computer Science, The University of Nottingham, UK

Abstract

Image denoising is the process of removing noise from noisy images, which is an image domain transferring task, i.e., from a single or several noise level domains to a photo-realistic domain. In this paper, we propose an effective image denoising method by learning two image priors from the perspective of domain alignment. We tackle the domain alignment on two levels: 1) the feature-level prior is designed to learn domain-invariant features for corrupted images with different noise levels; 2) the pixel-level prior is used to push the denoised images to the natural image manifold. The two image priors are based on H-divergence theory and implemented by learning classifiers in an adversarial training manner. We evaluate our approach on multiple datasets. The results demonstrate the effectiveness of our approach for robust image denoising on both synthetic and real-world noisy images. Furthermore, we show that the feature-level prior is capable of alleviating the discrepancy between different noise levels. It can be used to improve blind denoising performance in terms of distortion measures (PSNR and SSIM), while the pixel-level prior can effectively improve perceptual quality to ensure realistic outputs, which is further validated by subjective evaluation.

1. Introduction

Image denoising is a fundamental problem in low-level vision as well as an important pre-processing step for many other image restoration problems [57, 23].
It aims at recovering a noise-free image x from its noisy observation(s) y by following the degradation model y = x + v. As in much previous literature [5, 14, 57, 41, 58], v is usually assumed to be additive white Gaussian noise (AWGN) of standard deviation σ. Therefore, prior knowledge modeling on images plays an essential part in image denoising.

The main success of recent image denoising methods comes from effective image prior modeling over the input images [5, 14, 20]. State-of-the-art model-based methods such as BM3D [12] and WNNM [20] can be further extended to remove unknown noise. However, there are a few drawbacks to these methods. First, they usually involve a complex and time-consuming optimization process in the testing stage. Second, the image priors employed in most of these approaches are hand-crafted, such as nonlocal self-similarity and gradients, which are mainly based on the internal information of the input image without any external information.

In parallel, there is another type of denoising method based on discriminative learning. These methods aim to train a deep denoising network with paired training datasets (noisy and clear images) and learn the underlying noise model implicitly to achieve fast inference [6, 57, 58, 29], among which DnCNN [56] and FFDNet [58] have obtained remarkable results. However, existing discriminative learning methods are usually designed for a specific noise level and have limited flexibility. Though DnCNN-B [56] can be trained on different noise levels for blind image denoising, it still cannot generalize well to real-world noisy images. FFDNet [58] still requires a tunable noise level map as input to tackle various noise levels. Therefore, it is of great interest to develop general image priors which can help handle image denoising with a wide range of noise levels and generalize well to real-world noisy images.
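The degradation model y = x + v with AWGN can be made concrete with a short NumPy sketch (illustrative only; the function and variable names are ours, not from the paper):

```python
import numpy as np

def add_awgn(x, sigma, rng=None):
    """Corrupt a clean image x (float array) with additive white Gaussian
    noise of standard deviation sigma, i.e., the model y = x + v.
    The noisy result is deliberately not clipped, matching the training
    convention noted later in the paper."""
    rng = np.random.default_rng() if rng is None else rng
    v = rng.normal(0.0, sigma, size=x.shape)
    return x + v

# Example: a flat gray image corrupted at sigma = 25.
clean = np.full((64, 64), 128.0)
noisy = add_awgn(clean, sigma=25, rng=np.random.default_rng(0))
print(round(float((noisy - clean).std()), 1))  # close to 25
```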
To this end, we propose a new image denoising model, referred to as the Deep Image Prior Network (DIPNet), based on data-driven image priors. In particular, we consider image denoising as a domain transferring problem, i.e., from a noise domain to a photo-realistic domain. Inspired by this, we propose two image priors: 1) the feature-level prior, which is designed to help decrease the domain discrepancy between corrupted images with different noise levels for robust image denoising; 2) the pixel-level prior, which is used to push the denoised image to the photo-realistic domain for perceptual improvement. In particular, we model both priors as discriminator networks, which are trained by an adversarial training strategy to minimize the H-divergence between different image domains.

The contributions of this work can be summarized as follows:

• We propose an effective deep residual model based on data-driven image priors, namely DIPNet, for blind image denoising. Our method can achieve state-of-the-art results for both synthetic and real-world noise removal.

• We design two image priors based on adversarial training. The feature-level prior is capable of alleviating the discrepancy between different noise levels to improve denoising performance, while the pixel-level prior can effectively improve perceptual quality and produce photo-realistic results.

• Compared with previous methods, our method significantly improves generalizability when adapting from synthetic Gaussian denoising to real-world noise removal. In particular, a single model trained for blind Gaussian noise removal can outperform competing methods designed specifically for real-world noise removal.

2. Related Work

2.1. Image Denoising

A large number of image denoising methods have been proposed over recent years, and generally they can be grouped into two major categories: model-based methods and discriminative learning based methods.
Model-based Methods usually depend on human-crafted image priors such as nonlocal self-similarity [14, 41, 5], sparsity [15] and gradients [47, 52]. Two of the classic methods are BM3D [12] and WNNM [20], which are usually used as benchmark methods for image denoising. In particular, BM3D uses an enhanced sparse representation for denoising by grouping similar 2D image fragments into 3D data arrays. WNNM proposes to use weighted nuclear norm minimization for image denoising by exploiting image nonlocal self-similarity. These models can also be further extended to handle the blind denoising problem with various noise levels. In addition, a few approaches [38, 44, 60] are proposed to directly address the blind image denoising problem by modeling image noise to assist the corresponding denoising algorithms. However, these models are based on human-crafted priors which are designed under limited observations. In this paper, instead of only using the internal information of the input images, we propose to automatically learn image priors by making full use of external information.

Discriminative Learning Based Methods try to model image priors implicitly with paired (noisy and clear images) training data. These models have achieved great success in image denoising by taking advantage of the recent development of deep learning. Several approaches adopt either a plain Multilayer Perceptron (MLP) [6] or convolutional neural networks (CNN) [29, 53, 1, 34, 57, 7] to learn a non-linear mapping from noisy images to photo-realistic ones. It is worth mentioning that remarkable results have been obtained by recent deep learning based models. DnCNN [56] successfully trains a deep CNN model with batch normalization and residual learning to further boost denoising performance. Moreover, DnCNN can be extended to handle noisy images with different noise levels.
A generic image-to-image regression deep model (RBDN) [48] can be effectively used for image denoising. A deep kernel prediction network [42] is trained for denoising bursts of images taken with a handheld camera. GAN-CNN [8] proposes to use a generative adversarial network (GAN) to estimate the noise distribution for blind image denoising. Furthermore, FFDNet [58] presents a fast and flexible denoising convolutional neural network, which can handle a wide range of noise levels with a tunable noise level map. Recently, deep image prior [51] has also been proposed for general image restoration. However, most previous discriminative learning based methods have to learn multiple models to handle images with different noise levels. It remains a challenging issue to develop a single discriminative model for general image denoising for both synthetic and real-world noise.

Deep Learning on Image Transformation. Beyond image denoising, deep CNNs have been successfully applied to other image transformation tasks, where a model receives a certain input image and transforms it into the desired output. These applications include image super-resolution [13], downsampling [24], colorization [32], deblurring [35], style transfer [30], semantic segmentation [39], image synthesis [25, 19], etc. In addition, [4] analyses the trade-off between distortion and perception measures for image restoration algorithms. However, their models are designed to handle input images within a specific domain, and cannot be directly used for blind image denoising with unknown noise levels. In this work, we focus on using a single model to effectively handle a wide range of noise levels with data-driven image priors, and we also investigate the perception-distortion trade-off in terms of image denoising.

3. Method

Our goal is to develop a model which can take noisy images and produce photo-realistic ones.
The primary challenges are twofold: first, our model should be flexible and robust enough to process the same images corrupted with different noise levels; second, we must ensure that the denoised images are realistic and visually pleasing. To address these challenges, we propose to learn two image priors based on H-divergence theory, considering that the corrupted images with different noise levels, as well as the clear images, lie in different image domains.

3.1. Distribution Alignment with H-divergence

The H-divergence [2, 3] is proposed to estimate the domain divergence between two sets of unlabeled data with different distributions. It is a classifier-induced divergence, obtained by learning a hypothesis from a class of finite complexity. For simplicity, we first consider the H-divergence of two sets of samples. Thus, the problem of distribution alignment of the two domains can be formulated as a binary classification. Specifically, we define a domain as a distribution D on inputs X; thus x′ and x″ can be denoted as samples belonging to the two different domains D′ and D″ respectively. We also denote a labeling function h : X → [0, 1] as a domain classifier, which predicts the labels 0 or 1 for samples from domains D′ and D″ respectively. Let H be a hypothesis class on X, i.e., a set of possible domain classifiers h ∈ H. The H-divergence between domains D′ and D″ is defined as follows:

d_H(D′, D″) = 2 [ 1 − min_{h∈H} ( (1/N) Σ_x L_{D′}(h(x)) + (1/N) Σ_x L_{D″}(h(x)) ) ]   (1)

where L_{D′} and L_{D″} denote the losses of the label predictions h(x′) and h(x″) on domains D′ and D″ respectively, and N is the total number of samples in a given dataset. We can see that the domain distance d_H is inversely related to the loss of the optimal domain classifier h(x). In the context of deep learning, x can be defined as the output image or the hidden activations produced by a neural network f.
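As a numerical illustration of Equation (1), once the empirical errors of the best domain classifier are known, the divergence estimate reduces to simple arithmetic (a sketch assuming the 0-1 loss for L; the helper name is ours):

```python
def h_divergence_estimate(err_d1, err_d2):
    """Equation (1) with the 0-1 loss: d_H = 2 * (1 - (err_d1 + err_d2)),
    where err_d1 and err_d2 are the per-domain error rates of the BEST
    domain classifier found. Indistinguishable domains force that
    classifier toward chance (0.5 error on each), so d_H -> 0; fully
    separable domains give errors near 0, so d_H -> 2."""
    return 2.0 * (1.0 - (err_d1 + err_d2))

print(h_divergence_estimate(0.5, 0.5))  # 0.0 (domains aligned)
print(h_divergence_estimate(0.0, 0.0))  # 2.0 (domains fully separable)
```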
Therefore, in order to reduce the dissimilarity of the distributions D′ and D″, we can train f to maximize the loss of the domain classifier. As a result, we need to play a max-min game between f and h as follows:

min_f d_H(D′, D″) ⇔ max_f min_{h∈H} { (1/N) Σ_x L_{D′}(h(x)) + (1/N) Σ_x L_{D″}(h(x)) }   (2)

Furthermore, we can consider multiple sets of samples from m domains, denoted as D_1, D_2, ..., D_m. The distribution alignment can then be formulated as a similar optimization problem:

min_f d_H(D_1, D_2, ..., D_m) ⇔ max_f min_{h∈H} { (1/N) Σ_m Σ_x L_{D_m}(h(x)) }   (3)

In practice, this optimization can be achieved in an adversarial training manner in two ways. One is to adopt the generative adversarial network (GAN) framework [19] by reversing the labels of the two categories; the other is to integrate a gradient reversal layer (GRL) [17, 9] into the CNN model. In particular, the GRL is designed to leave the input unchanged during forward propagation and to reverse the gradient by multiplying it with a negative constant during backpropagation. Additionally, the GRL can be easily extended to multi-class inputs, while the GAN is more appropriate for inputs with two categories. In our work, the GRL and the GAN are used to train the feature-level and pixel-level priors respectively.

3.2. Learnable Image Priors

Motivation. Most classic image denoising models can be formulated to solve the following problem [58]:

x̂ = arg min_x (1/(2σ²)) ‖y − x‖² + λ P(x)   (4)

where the first part (1/(2σ²)) ‖y − x‖² is the data fidelity term for noise level σ, the second part P(x) is the regularization term with the image prior, which is usually predefined, and λ is the hyper-parameter balancing the two parts. A discriminative denoising model, which is adopted in this work, aims to learn a non-linear mapping function x = F(y), parameterized by W, to predict the latent clear image x from the noisy image y.
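The GRL described above is commonly implemented as a custom autograd function; a minimal PyTorch sketch (not the authors' code; the scaling constant lam is an assumed hyper-parameter) looks like:

```python
import torch

class GradReverse(torch.autograd.Function):
    """Gradient reversal layer (GRL): identity in the forward pass,
    multiplies gradients by -lam in the backward pass, so the network
    before it is trained to MAXIMIZE the domain-classifier loss."""

    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse the gradient; lam is a non-tensor input, so return None.
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

# Sanity check: forward is identity, backward flips the gradient sign.
x = torch.ones(3, requires_grad=True)
grad_reverse(x, lam=1.0).sum().backward()
print(x.grad.tolist())  # [-1.0, -1.0, -1.0]
```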
Thus, the solution of Equation 4 is given by:

x̂ = F(y, σ, λ, P; W)   (5)

The key to the success of this framework lies in the predefined image prior. This observation motivates us to learn image priors directly from image data. In particular, we propose to learn two data-driven image priors, at the feature level and at the pixel level respectively.

Feature-level Prior. Equation 5 requires a predefined noise level σ, so the trained model is not flexible enough to handle different noise levels with a single network. In order to achieve blind image denoising, we seek to incorporate noise level information by learning an image prior in the feature space. In particular, we train a multi-class discriminator on the fused features from the local and global paths (see Section 3.3) for images with different noise levels, as shown in Figure 1. We learn the feature-level prior (P_feat) via a multi-class cross-entropy loss:

L_{P_feat} = −(1/N) Σ_{i=1}^{N} log ( e^{p̂_i} / Σ_{j=1}^{m} e^{p_{i,j}} )   (6)

Here m denotes the number of noise levels, p_{i,j} is the output score of the j-th class for a given image i, and p̂_i is the output score of the correct class. We add a gradient reversal layer (GRL) before the multi-class discriminator to achieve adversarial training, as shown in Figure 1.

Pixel-level Prior. The pixel-level prior P_pix is designed to push the denoised image to the natural image manifold to ensure realistic outputs. To achieve this, we employ a patch-based discriminator under the GAN framework.

Figure 1. The dashed rectangle is used to construct the loss function to learn the feature-level prior.

We adopt a perceptual discriminator [50] to stabilize and improve the performance of the GAN by embedding the convolutional parts of a pre-trained deep classification network (Figure 2).
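The multi-class cross-entropy of Equation (6) above can be sketched in NumPy (an illustrative helper, not the authors' implementation; in practice the logits come from the discriminator behind the GRL):

```python
import numpy as np

def feature_prior_loss(logits, labels):
    """Multi-class cross-entropy of Eq. (6): logits has shape (N, m),
    where m is the number of noise levels, and labels[i] is the true
    noise-level index of image i. The discriminator minimizes this loss;
    the transformation network sees the reversed gradient via the GRL."""
    z = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_softmax = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_softmax[np.arange(len(labels)), labels].mean()

# Two samples over m = 3 noise levels: a confident correct prediction
# contributes ~0, a uniform one contributes log(3) ~ 1.0986.
logits = np.array([[10.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
print(round(feature_prior_loss(logits, np.array([0, 1])), 3))  # 0.549
```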
Specifically, the extracted features of the output image from the pre-trained network are concatenated with the output of the previous layer, and then processed by learnable blocks of convolutional operations. We use 3 strided convolutional blocks to achieve spatial downsampling and use relu1_1, relu2_1 and relu3_1 of VGG-19 [49] for feature extraction. The final classification is performed on each activation of the feature map. As the effective receptive field of each activation corresponds to an image patch of the input image [45], the discriminator actually predicts a label for each image patch. A patch-based discriminator is quite useful for modeling high frequencies in image denoising, since it restricts attention to the structure in local image patches.

The optimization of the discriminator is similar to classic binary classification. Specifically, let us denote D as the label of the input image; we assign D = 1 for denoised images and D = 0 for clear images. p^{(w,h)} represents the feature map activation of the discriminator at location (w, h). Then the pixel-level prior (P_pix) loss over N samples can be written as:

L_{P_pix} = −(1/N) Σ_{i=1}^{N} [ D_i log(p_i^{(w,h)}) + (1 − D_i) log(1 − p_i^{(w,h)}) ]   (7)

As discussed in Section 3.1, we simultaneously minimize the above loss with respect to the discriminator and maximize it with respect to the transformation network. This can be achieved by the training strategy of generative adversarial networks.

3.3. Transformation Network

Inspired by the architectural guidelines of several previous works [21, 33, 30], our image transformation network consists of 3 components: a stack of residual blocks (Figure 3) to extract low-level features of the input image, and two asymmetric paths to extract local and global features respectively. Our architecture then fuses these two paths to produce the final output.

Low-level Path.
The input noisy image is first processed by a 16-layer residual network with skip-connections to extract low-level features (Figure 3).

Figure 2. The dashed rectangle is used to construct the loss function to learn the pixel-level prior.

The "pre-activation" residual block [22] is adopted as it is much easier to train and generalizes better than the original ResNet [21]. For all the residual blocks, a kernel size of 3 × 3 and zero-padding are used to keep the spatial size of the input. We also keep the number of features constant at 32 in all the residual blocks. Moreover, a skip-connection is added between the input features and the output of the last residual block. As a result, complex patterns can be extracted with large spatial support.

Local Path and Global Path. Similar to [18, 26], the encoded features are further processed by two asymmetric networks for local and global feature extraction.

The local path is fully convolutional and consists of two residual blocks, as shown in Figure 3. It specializes in learning local features while retaining the spatial information. As argued in [21], the residual connections make it easy to learn the identity function, which is an appealing property for the transformation network considering that the output image shares many structures with the input image.

The global path uses two fully-connected layers to learn global features. Each fully-connected layer is followed by a ReLU activation function. A global average pooling layer [36] is used to ensure that our model can process images of any resolution. Finally, the global information is summarized as a fixed-dimensional vector and used to regularize the local features produced by the local path.
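A pre-activation residual block of the kind used in the low-level path might be sketched in PyTorch as follows (a generic sketch following [22]; the 32-channel width and 3 × 3 zero-padded kernels are taken from the text, while the exact layer ordering inside the block is our assumption):

```python
import torch
import torch.nn as nn

class PreActResBlock(nn.Module):
    """'Pre-activation' residual block: BN -> ReLU -> 3x3 conv, twice,
    plus an identity shortcut. Zero-padding keeps the spatial size, so
    the shortcut adds cleanly."""
    def __init__(self, channels=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)

# Spatial size and channel count are preserved.
x = torch.randn(2, 32, 16, 16)
print(PreActResBlock()(x).shape)  # torch.Size([2, 32, 16, 16])
```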
The local and global features are then fused into a common set of features, which are fed to a convolutional layer to produce the output. The fusion is achieved by a point-wise affine mixing with a ReLU non-linearity as follows:

F_c[x, y] = ReLU( Σ_{c′} w′_{cc′} G_{c′} + Σ_{c′} w_{cc′} L_{c′}[x, y] + b_c )   (8)

where F_c[x, y] is the fused activation at point [x, y] of the c-th channel, G_{c′} and L_{c′} denote the global and local features, c and c′ index the channels of the fused and input features, and w′_{cc′} and w_{cc′} are learnable weights used to compute point-wise combinations of the global and local features. This can be implemented with a common convolutional operation with 1 × 1 kernels, which yields a fused 3-d array of the same shape as the input local features.

Figure 3. Overview of our transformation network. The input noisy image is first processed by a deep residual network to compute low-level features, which are further split into two paths to learn both local and global features. Our model then fuses these two paths to produce the final output.

3.4. Loss Function

The pixel-wise mean squared error (MSE) between two images is widely used as an optimization target for the image denoising problem. However, MSE can produce unsatisfying results with over-smooth textures. In this work, we adopt the L1 loss as it encourages less blurring for image reconstruction [28]. Specifically, the L1 loss between the denoised image x̂ and the clear image x of shape [C, W, H] is calculated as:

L_1 = (1/(C·W·H)) Σ_{c=1}^{C} Σ_{w=1}^{W} Σ_{h=1}^{H} ‖x_{c,w,h} − x̂_{c,w,h}‖   (9)

Therefore, the final training losses for the feature-level and pixel-level priors of the proposed method are as follows:

L_feat = L_1 + λ_1 L_{P_feat}   (10)

L_pix = L_1 + λ_2 L_{P_pix}   (11)

In our experiments, the hyper-parameters λ_1 and λ_2 are fixed to 0.001 to balance the fidelity loss and the image prior loss.
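The point-wise affine mixing of Equation (8) can be sketched directly in NumPy (names and shapes are illustrative; in practice this is the 1 × 1 convolution the text describes):

```python
import numpy as np

def fuse(global_feat, local_feat, w_g, w_l, b):
    """Eq. (8): F_c[x,y] = ReLU(sum_c' w'_{cc'} G_{c'}
    + sum_c' w_{cc'} L_{c'}[x,y] + b_c).
    global_feat: (C_in,), local_feat: (C_in, W, H),
    w_g and w_l: (C_out, C_in), b: (C_out,). Equivalent to a 1x1
    convolution after broadcasting the global vector over the map."""
    mixed = np.tensordot(w_l, local_feat, axes=1)    # (C_out, W, H)
    mixed += (w_g @ global_feat + b)[:, None, None]  # add global + bias
    return np.maximum(mixed, 0.0)                    # ReLU

c_in, c_out, wdt, hgt = 4, 2, 8, 8
out = fuse(np.ones(c_in), np.ones((c_in, wdt, hgt)),
           np.ones((c_out, c_in)), np.ones((c_out, c_in)), np.zeros(c_out))
print(out.shape, float(out[0, 0, 0]))  # (2, 8, 8) 8.0
```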
From Figures 1 and 2, we can see that all the components can be trained jointly in an end-to-end manner using a standard SGD algorithm.

4. Experiments

We conduct experiments on synthetic images for additive white Gaussian noise removal with either known or unknown noise levels, as well as on real-world noisy image denoising.

4.1. Datasets and Experiment Setting

We train our models on the 2014 Microsoft COCO dataset [37], which is a large-scale dataset containing 82,783 training images. We randomly crop image patches of size 64 × 64 for training. The input noisy images are obtained by adding AWGN of noise level σ ∈ [15, 75], while the corresponding clear images are used as ground truth. The noisy images are not clipped, as in previous works [56, 58]. To evaluate our denoiser on Gaussian noise removal, we use 3 datasets with synthetic noise, i.e., CBSD68 [46] (68 images), Kodak24 [16] (24 images) and McMaster [59] (18 images). In addition, two benchmarks are considered to evaluate our method for real-world noise removal. Dataset1 is provided in [54] and includes 100 cropped images for evaluation; dataset2 is provided in [43] and includes noisy images of 11 static scenes. Both datasets provide a mean image as the "ground truth", with which PSNR and SSIM can be calculated.

Our DIPNet employs 3 × 3 kernels for all the convolutional layers in the transformation network and the two discriminator networks. Each convolutional layer is followed by a batch normalization layer [27] to stabilize and accelerate the deep network training. We train all the models using Adam [31] for stochastic optimization with a batch size of 64 for 30 epochs, and it takes around 24 hours for our model to converge. The initial learning rate is 10⁻³, which is smoothly annealed by the cosine-shaped learning rate schedule introduced in [40]. Our implementation is based on the deep learning framework PyTorch and a single GTX 1080Ti GPU.

4.2.
AWGN Noise Removal

Non-blind AWGN Removal. We first test our DIPNet-S on noisy images corrupted with a specific noise level σ, referred to as non-blind AWGN removal. In other words, we train separate models for different noise levels without any image priors. We compare our method with CBM3D [12] and FFDNet [58]. Table 1 reports denoising results on different datasets. We can see that our DIPNet-S can achieve state-of-the-art results and outperform the other methods.

Figure 4. Denoising comparisons on an image from the CBSD68 dataset with noise level σ = 50: clear image, noisy image, BM3D (28.28dB), DnCNN-B (28.80dB), FFDNet (28.87dB), DIPNet-BF (28.92dB).

Table 1. Non-blind denoising results (PSNR, dB) of different methods on CBSD68, Kodak24 and McMaster for AWGN with level σ = 15, 25, 35, 50 and 75.

Dataset    Method    σ=15   σ=25   σ=35   σ=50   σ=75
CBSD68     CBM3D     33.52  30.71  28.89  27.38  25.74
           FFDNet    33.87  31.21  29.58  27.92  26.24
           DIPNet-S  33.90  31.25  29.60  27.91  26.16
Kodak24    CBM3D     34.28  31.68  29.90  28.46  26.82
           FFDNet    34.63  32.13  30.57  28.98  27.27
           DIPNet-S  34.64  32.15  30.56  28.97  27.27
McMaster   CBM3D     34.06  31.66  29.92  28.51  26.79
           FFDNet    34.66  32.35  30.81  29.18  27.33
           DIPNet-S  34.67  32.35  30.88  29.19  27.35

Table 2. Blind denoising results (PSNR, dB) of different methods on CBSD68, Kodak24 and McMaster for AWGN with level σ = 15, 25, 35, 50 and 75.

Dataset    Method     σ=15   σ=25   σ=35   σ=50   σ=75   average
CBSD68     DnCNN-B    33.89  31.23  29.58  27.92  24.47  29.42
           DIPNet-B   33.88  31.16  29.49  27.89  26.01  29.69
           DIPNet-BP  33.70  31.07  29.36  27.82  25.99  29.59
           DIPNet-BF  33.86  31.24  29.59  27.93  26.29  29.78
Kodak24    DnCNN-B    34.48  32.03  30.46  28.85  25.04  30.17
           DIPNet-B   34.44  32.01  30.47  28.90  27.19  30.60
           DIPNet-BP  33.49  31.18  29.75  28.20  26.32  29.76
           DIPNet-BF  34.62  32.11  30.55  28.97  27.28  30.71
McMaster   DnCNN-B    33.44  31.51  30.14  28.61  25.10  29.76
           DIPNet-B   34.33  32.09  30.61  28.76  25.86  30.33
           DIPNet-BP  33.54  31.52  30.08  28.14  25.48  29.75
           DIPNet-BF  34.56  32.33  30.56  29.18  27.32  30.79
Moreover, the improvement generalizes well across different datasets as well as different noise levels.

Blind AWGN Removal. We further extend our model to blind image denoising with unknown noise levels. Unlike most previous works [6, 10] that need to first estimate the noise level and then select the denoising model trained with the corresponding noise level, we train three blind models in our experiments: DIPNet-B refers to the model trained without any image priors, while DIPNet-BF and DIPNet-BP refer to the models trained with the feature-level and pixel-level priors respectively. Specifically, 5 noise levels (σ = 15, 25, 35, 50, 75) are adopted to train our blind denoising models. We compare our method with the state-of-the-art method DnCNN-B on different datasets. As shown in Table 2, our DIPNet-BF can achieve state-of-the-art results for blind AWGN removal. DIPNet-BF is flexible enough to handle a wide range of noise levels effectively with a single network. In addition, we compare the visual results of different methods on an image in CBSD68 corrupted with noise level σ = 50. From Figure 4, we can see that BM3D shows slightly blurred results, while DnCNN-B and FFDNet produce over-smooth edges and textures. In contrast, DIPNet-BF produces denoised images of the best perceptual quality, with sharp edges and fine details.

Figure 5. Noise level sensitivity curves of DIPNet models trained with different noise levels. The averaged PSNR results are evaluated on CBSD68 for different input noise levels.
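The PSNR values quoted throughout (e.g., in Figure 4 and Tables 1-3) follow the standard definition; a small NumPy sketch (the helper name is ours, and an 8-bit peak of 255 is assumed):

```python
import numpy as np

def psnr(x, y, peak=255.0):
    """Peak signal-to-noise ratio in dB between a reference image x and
    an estimate y; peak is the maximum possible pixel value."""
    mse = np.mean((np.asarray(x, float) - np.asarray(y, float)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

# A uniform error of 16 gray levels gives MSE = 256 and PSNR ~ 24.05 dB.
ref = np.zeros((8, 8))
est = np.full((8, 8), 16.0)
print(round(psnr(ref, est), 2))  # 24.05
```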
Table 3. Average PSNR (dB) and SSIM of different denoising methods on real-world noisy images in dataset1 [54] and dataset2 [43].

Dataset       Metric  CBM3D   WNNM    MLP     TNRD    DnCNN   NI      Guided  DIPNet-B  DIPNet-BF  DIPNet-BP
Dataset1[54]  PSNR    37.40   36.59   38.07   38.17   36.08   37.77   38.35   38.44     38.47      37.89
              SSIM    0.9526  0.9247  0.9615  0.9640  0.9161  0.9570  0.9669  0.9669    0.9676     0.9636
Dataset2[43]  PSNR    35.19   35.77   36.46   36.61   33.86   35.49   37.15   37.23     37.45      35.94
              SSIM    0.8580  0.9381  0.9436  0.9463  0.8635  0.9126  0.9504  0.9389    0.9452     0.9217

4.3. Noise Level Sensitivity

We further conduct experiments to investigate the noise level sensitivity of the proposed DIPNet models, considering that the input noise level is usually unknown or very hard to estimate in practice. We compare the denoising performance of several different non-blind and blind DIPNet models with different input noise levels. Figure 5 shows the noise level sensitivity curves of different DIPNet models. Specifically, we consider the 5 non-blind DIPNets trained with known noise levels, e.g., "DIPNet-15" represents the DIPNet trained with the fixed noise level σ = 15. We also compare the results of DIPNet-BF and DIPNet-BP. As shown in Figure 5, we have the following two observations:

• For non-blind DIPNets, the best performance is achieved when the input noise level matches the noise level used for training. The PSNR values decrease slowly when using lower input noise levels and begin to drop significantly when the input noise levels surpass the training noise levels.

• Our DIPNet-BF and DIPNet-BP demonstrate more stable performance across different input noise levels, and their PSNR values decrease slowly with higher input noise levels. DIPNet-BF is capable of generalizing well to a wide noise level range (5, 100) when trained on only 5 fixed noise levels, and it outperforms nearly all the non-blind models across different input noise levels.
Based on the above observations, it is clear that the non-blind DIPNet-S models trained with specific noise levels are more sensitive to the input noise, especially at higher levels. DIPNet-BF demonstrates much more stable performance over a wide range of noise levels, while DIPNet-BP is more sensitive to higher-level noise.

4.4. Real Noise Removal

Furthermore, we evaluate our blind models on the real noisy image datasets provided by [43] and [54] in terms of PSNR and SSIM. We compare with CBM3D [12], DnCNN [56], WNNM [20], TNRD [11] and MLP [6], which are state-of-the-art methods for AWGN removal. We also compare with Neat Image (NI), a commercial software package for image denoising [54], and the recent Guided method [55]. Table 3 reports denoising results on the two datasets. We can see that our model trained with the feature-level prior (DIPNet-BF) achieves superior performance compared with traditional methods like CBM3D and the deep learning based DnCNN. Moreover, our method can outperform the Guided algorithm [55], which is designed specifically for realistic noise removal. We provide a visual example of different denoising methods on an image from dataset1 [54] in Figure 6. We can observe that using the feature-level prior effectively improves PSNR but still produces over-smooth textures. The pixel-level prior can help produce a sharper appearance but with lower PSNR and SSIM, due to the appearance of high-frequency artifacts.

Table 4. Subjective evaluation results on CBSD68 [46] and dataset1 [54]. We show the average preference over all the volunteers.

Dataset   DIPNet-BF  DIPNet-BP
CBSD68    0.3404     0.6596
dataset1  0.3857     0.6143

4.5. Effectiveness of Image Priors

We further demonstrate the effectiveness of the two image priors in our model. We consider the blind image denoising models trained with 5 noise levels (σ = 15, 25, 35, 50, 75) in Section 4.2. Quantitative results for AWGN and real noise removal are summarized in Table 2 and Table 3.
Two visual examples are provided in Figure 7. The first row shows AWGN removal results for an image from Kodak24; the second row shows real noise removal for an image from dataset2 [43]. We can see that DIPNet-BF, trained with the feature-level prior, provides the highest PSNR values for different noise levels, especially for higher noise levels. What's more, DIPNet-BF shows superior performance over DIPNet-B on the real noise removal task. It is clear that the feature-level prior can effectively improve generalizability when adapting from synthetic Gaussian denoising to real noise removal by encoding domain-invariant features. On the other hand, incorporating the pixel-level prior degrades the denoising performance of DIPNet in terms of PSNR. However, we observe that using the pixel-level prior can yield better texture details and sharper edges when compared to the clear images (Figure 7), while the feature-level prior can produce slightly over-smooth images. The pixel-level prior achieves perceptual improvements, producing photo-realistic images but with lower PSNR values, by introducing high-frequency artifacts.

Figure 6. Denoising results on a real noisy image with different methods. The test real images are from dataset1 [54]. PSNR/SSIM: noisy image 35.24dB/0.8674, CBM3D 37.38dB/0.9776, DnCNN-B 35.49dB/0.8827, DIPNet-BF 39.08dB/0.9778, DIPNet-BP 37.75dB/0.9590.

Figure 7. Noise removal comparisons of our methods. The first row shows AWGN removal results for an image from the Kodak24 dataset with noise level σ = 50 (noisy image 14.94dB, DIPNet-BF 26.37dB, DIPNet-B 26.32dB, DIPNet-BP 25.84dB). The second row shows real noise removal results for an image from dataset2 [43] (noisy image 36.71dB, DIPNet-BF 41.75dB, DIPNet-B 41.35dB, DIPNet-BP 39.25dB).

4.6.
Subjective Evaluation W e hav e conducted a subjecti ve ev aluation to quantify the ability of the two image priors to produce perceptu- ally convincing images. In particular , we show volunteers 4 images each time, i.e., the noisy image, ground-truth im- age and two denoised images produced by DIPNet-BF and DIPNet-BP and ask them which denoised version they pre- fer . W e ask 12 v olunteers to performance the ev aluation on CBSD68 (68 images) [46] and dataset1 (100 images) [54] for both synthetic and real noise removal respectiv ely . The experimental results of the voting test across all the v olun- teers are summarized in T able 4. W e can see that DIPNet- BP outperforms DIPNet-BF by a large margin and DIPNet- BP is superior in terms of perceptual quality . It is be- cause pixel-le vel prior can produce high-frequency details as shown in Figure 7, which are visually pleasing especially for images with high texture details. Howe ver , the gener - ated high-frequenc y details fail to exactly match the fidelity expected in the clear images. As a result, the denoising per- formances with pix el-le vel prior is inferior in terms of dis- tortion measures like PSNR and SSIM. Additional exam- ples are depicted in the supplementary material. 5. Conclusion In this paper , we have de veloped an ef fecti ve blind im- age denoising model based on data-driven image priors. The two image priors are designed from the perspective of domain alignment. Specifically the feature-lev el prior can help alleviate the domain discrepancy across different lev el noise and improve the blind image denoising performance. The pixel-le vel prior is able to push the denoised outputs to the natural image manifold for perceptual quality improve- ment. This is further confirmed by a subjectiv e e v aluation. The two image priors are learned based on adversarial train- ing of H -diver gence using the standard SGD optimization technique. 
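The adversarial H-divergence training mentioned above is commonly implemented with a domain classifier whose gradient is sign-flipped before it reaches the feature extractor, as in the gradient-reversal scheme of [17]. Below is a minimal NumPy sketch of that sign flip only; the toy linear "extractor" output, the linear classifier, and all names are our own illustration, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.normal(size=(16, 8))                   # stand-in extractor outputs
domains = rng.integers(0, 2, size=16).astype(float)   # two noise-level domains
w = rng.normal(size=8)                                # linear domain classifier

def domain_loss(features, w, domains):
    p = 1.0 / (1.0 + np.exp(-features @ w))
    return -np.mean(domains * np.log(p) + (1 - domains) * np.log(1 - p))

def grads(features, w, domains):
    """Logistic-loss gradients w.r.t. the classifier weights and the features."""
    p = 1.0 / (1.0 + np.exp(-features @ w))
    err = (p - domains) / len(domains)
    return features.T @ err, np.outer(err, w)

lr, lam = 0.01, 1.0
grad_w, grad_f = grads(features, w, domains)

# The classifier does plain gradient descent on the domain loss ...
w_new = w - lr * grad_w
# ... but the reversal layer hands the extractor -lam * grad_f, so the
# very same SGD update rule pushes the features toward domain-invariance
# (i.e., it *raises* the domain loss, making domains harder to tell apart).
features_new = features - lr * (-lam * grad_f)

# One reversed step increased the domain loss; one classifier step decreased it.
print(domain_loss(features_new, w, domains) > domain_loss(features, w, domains))
```

With a single scalar multiplication at the layer boundary, both players of the adversarial game can be trained with one ordinary SGD loop, which is what makes this formulation attractive in practice.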
Validated on various datasets, our approach achieves state-of-the-art results for both synthetic and real-world noise removal.

References

[1] Forest Agostinelli, Michael R Anderson, and Honglak Lee. Adaptive multi-column deep neural networks with application to robust image denoising. In NIPS, 2013.
[2] Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. A theory of learning from different domains. Machine Learning, 79(1-2):151–175, 2010.
[3] Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira. Analysis of representations for domain adaptation. In NIPS, 2007.
[4] Yochai Blau and Tomer Michaeli. The perception-distortion tradeoff. In CVPR, 2018.
[5] Antoni Buades, Bartomeu Coll, and J-M Morel. A non-local algorithm for image denoising. In CVPR, 2005.
[6] Harold C Burger, Christian J Schuler, and Stefan Harmeling. Image denoising: Can plain neural networks compete with BM3D? In CVPR, 2012.
[7] Chang Chen, Zhiwei Xiong, Xinmei Tian, and Feng Wu. Deep boosting for image denoising. In ECCV, 2018.
[8] Jingwen Chen, Jiawei Chen, Hongyang Chao, and Ming Yang. Image blind denoising with generative adversarial network based noise modeling. In CVPR, 2018.
[9] Yuhua Chen, Wen Li, Christos Sakaridis, Dengxin Dai, and Luc Van Gool. Domain adaptive Faster R-CNN for object detection in the wild. In CVPR, 2018.
[10] Yunjin Chen and Thomas Pock. Trainable nonlinear reaction diffusion: A flexible framework for fast and effective image restoration. IEEE TPAMI, 39(6):1256–1272, 2017.
[11] Yunjin Chen, Wei Yu, and Thomas Pock. On learning optimized reaction diffusion processes for effective image restoration. In CVPR, 2015.
[12] Kostadin Dabov, Alessandro Foi, Vladimir Katkovnik, and Karen Egiazarian. Image denoising by sparse 3-D transform-domain collaborative filtering. IEEE TIP, 16(8), 2007.
[13] Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. Image super-resolution using deep convolutional networks. IEEE TPAMI, 38(2):295–307, 2016.
[14] Weisheng Dong, Lei Zhang, Guangming Shi, and Xin Li. Nonlocally centralized sparse representation for image restoration. IEEE TIP, 22:1620–1630, 2013.
[15] Michael Elad and Michal Aharon. Image denoising via sparse and redundant representations over learned dictionaries. IEEE TIP, 15(12):3736–3745, 2006.
[16] Rich Franzen. Kodak lossless true color image suite. Source: http://r0k.us/graphics/kodak, 4, 1999.
[17] Yaroslav Ganin and Victor S. Lempitsky. Unsupervised domain adaptation by backpropagation. ICML, 2015.
[18] Michael Gharbi, Jiawen Chen, Jonathan T. Barron, Samuel W. Hasinoff, and Frédo Durand. Deep bilateral learning for real-time image enhancement. ACM Transactions on Graphics, 36(4):1–12, 2017.
[19] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, 2014.
[20] Shuhang Gu, Lei Zhang, Wangmeng Zuo, and Xiangchu Feng. Weighted nuclear norm minimization with application to image denoising. In CVPR, 2014.
[21] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[22] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In ECCV, 2016.
[23] Felix Heide, Markus Steinberger, Yun-Ta Tsai, Mushfiqur Rouf, Dawid Pajak, Dikpal Reddy, Orazio Gallo, Jing Liu, Wolfgang Heidrich, Karen Egiazarian, et al. FlexISP: A flexible camera image processing framework. ACM Transactions on Graphics (TOG), 33(6):231, 2014.
[24] Xianxu Hou, Jiang Duan, and Guoping Qiu. Deep feature consistent deep image transformations: downscaling, decolorization and HDR tone mapping. arXiv preprint arXiv:1707.09482, 2017.
[25] Xianxu Hou, Linlin Shen, Ke Sun, and Guoping Qiu. Deep feature consistent variational autoencoder. In WACV, 2017.
[26] Satoshi Iizuka, Edgar Simo-Serra, and Hiroshi Ishikawa. Let there be Color!: Joint end-to-end learning of global and local image priors for automatic image colorization with simultaneous classification. ACM Transactions on Graphics, 35(4), 2016.
[27] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. ICML, 2015.
[28] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017.
[29] Viren Jain and Sebastian Seung. Natural image denoising with convolutional networks. In NIPS, 2009.
[30] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In ECCV, 2016.
[31] Diederik P. Kingma and Jimmy Lei Ba. Adam: A method for stochastic optimization. ICLR, 2015.
[32] Gustav Larsson, Michael Maire, and Gregory Shakhnarovich. Learning representations for automatic colorization. In ECCV, 2016.
[33] Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew P Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, 2017.
[34] Stamatios Lefkimmiatis. Non-local color image denoising with convolutional neural networks. In CVPR, 2017.
[35] Lerenhan Li, Jinshan Pan, Wei-Sheng Lai, Changxin Gao, Nong Sang, and Ming-Hsuan Yang. Learning a discriminative prior for blind image deblurring. In CVPR, 2018.
[36] Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. arXiv preprint, 2013.
[37] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[38] Ce Liu, Richard Szeliski, Sing Bing Kang, C Lawrence Zitnick, and William T Freeman. Automatic estimation and removal of noise from a single image. IEEE TPAMI, 30(2):299–314, 2008.
[39] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
[40] Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. ICLR, 2017.
[41] Julien Mairal, Francis Bach, Jean Ponce, Guillermo Sapiro, and Andrew Zisserman. Non-local sparse models for image restoration. In CVPR, 2009.
[42] Ben Mildenhall, Jonathan T Barron, Jiawen Chen, Dillon Sharlet, Ren Ng, and Robert Carroll. Burst denoising with kernel prediction networks. In CVPR, 2018.
[43] Seonghyeon Nam, Youngbae Hwang, Yasuyuki Matsushita, and Seon Joo Kim. A holistic approach to cross-channel image noise modeling and its application to image denoising. In CVPR, pages 1683–1691, 2016.
[44] Tamer Rabie. Robust estimation approach for blind denoising. IEEE TIP, 14(11):1755–1765, 2005.
[45] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
[46] Stefan Roth and Michael J Black. Fields of experts: A framework for learning image priors. In CVPR, 2005.
[47] Leonid I Rudin, Stanley Osher, and Emad Fatemi. Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena, 60(1-4):259–268, 1992.
[48] Venkataraman Santhanam, Vlad I Morariu, and Larry S Davis. Generalized deep image to image regression. In CVPR, 2017.
[49] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. ICLR, 2015.
[50] Diana Sungatullina, Egor Zakharov, Dmitry Ulyanov, and Victor Lempitsky. Image manipulation with perceptual discriminators. In ECCV, pages 587–602, 2018.
[51] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Deep image prior. In CVPR, pages 9446–9454, 2018.
[52] Yair Weiss and William T Freeman. What makes a good model of natural images? In CVPR, 2007.
[53] Junyuan Xie, Linli Xu, and Enhong Chen. Image denoising and inpainting with deep neural networks. In NIPS, 2012.
[54] Jun Xu, Hui Li, Zhetong Liang, David Zhang, and Lei Zhang. Real-world noisy image denoising: A new benchmark. arXiv preprint arXiv:1804.02603, 2018.
[55] Jun Xu, Lei Zhang, and David Zhang. External prior guided internal prior learning for real-world noisy image denoising. IEEE Transactions on Image Processing, 27(6):2996–3010, 2018.
[56] Kai Zhang, Wangmeng Zuo, Yunjin Chen, Deyu Meng, and Lei Zhang. Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising. IEEE TIP, 26(7):3142–3155, 2017.
[57] Kai Zhang, Wangmeng Zuo, Shuhang Gu, and Lei Zhang. Learning deep CNN denoiser prior for image restoration. In CVPR, 2017.
[58] Kai Zhang, Wangmeng Zuo, and Lei Zhang. FFDNet: Toward a fast and flexible solution for CNN-based image denoising. IEEE TIP, 27(9):4608–4622, 2018.
[59] Lei Zhang, Xiaolin Wu, Antoni Buades, and Xin Li. Color demosaicking by local directional interpolation and nonlocal adaptive thresholding. Journal of Electronic Imaging, 20(2):023016, 2011.
[60] Fengyuan Zhu, Guangyong Chen, and Pheng-Ann Heng. From noise modeling to blind image denoising. In CVPR, 2016.