Image Reconstruction with Predictive Filter Flow


Authors: Shu Kong, Charless Fowlkes

Dept. of Computer Science, University of California, Irvine
{skong2, fowlkes}@ics.uci.edu
[Project Page], [Github], [Slides], [Poster]

Abstract

We propose a simple, interpretable framework for solving a wide range of image reconstruction problems such as denoising and deconvolution. Given a corrupted input image, the model synthesizes a spatially varying linear filter which, when applied to the input image, reconstructs the desired output. The model parameters are learned using supervised or self-supervised training. We test this model on three tasks: non-uniform motion blur removal, lossy-compression artifact reduction and single image super-resolution. We demonstrate that our model substantially outperforms state-of-the-art methods on all these tasks and is significantly faster than optimization-based approaches to deconvolution. Unlike models that directly predict output pixel values, the predicted filter flow is controllable and interpretable, which we demonstrate by visualizing the space of predicted filters for different tasks.¹

1. Introduction

Real-world images are seldom perfect. Practical engineering trade-offs entail that consumer photos are often blurry due to low light, camera shake or object motion, limited in resolution, and further degraded by image compression artifacts introduced for the sake of affordable transmission and storage. Scientific applications such as microscopy or astronomy, which push the fundamental physical limitations of light, lenses and sensors, face similar challenges. Recovering high-quality images from degraded measurements has been a long-standing problem for image analysis and spans a range of tasks such as blind image deblurring [4, 28, 13, 43], compression artifact reduction [44, 33], and single image super-resolution [39, 57].
Such image reconstruction tasks can be viewed mathematically as inverse problems [48, 22], which are typically ill-posed and massively under-constrained. Many contemporary techniques for inverse problems have focused on regularization methods which are amenable to computational optimization. While such approaches are interpretable as Bayesian estimators with particular choices of priors, they are often computationally expensive in practice [13, 43, 2]. Alternately, data-driven methods based on training deep convolutional neural networks yield fast inference but lack interpretability and guarantees of robustness [46, 59]. In this paper, we propose a new framework called Predictive Filter Flow that retains interpretability and control over the resulting reconstruction while allowing fast inference. The proposed framework is directly applicable to a variety of low-level computer vision problems involving local pixel transformations.

As the name suggests, our approach is built on the notion of filter flow introduced by Seitz and Baker [42]. In filter flow, pixels in a local neighborhood of the input image are linearly combined to reconstruct the pixel centered at the same location in the output image. However, unlike convolution, the filter weights are allowed to vary from one spatial location to the next. Filter flows are a flexible class of image transformations that can model a wide range of imaging effects (including optical flow, lighting changes, non-uniform blur, and non-parametric distortion). The original work on filter flow [42] focused on the problem of estimating an appropriately regularized/constrained flow between a given pair of images. This yielded convex but impractically large optimization problems (e.g., hours of computation to compute a single flow).

¹ Because arXiv limits file sizes, we provide high-resolution figures, and a manuscript containing them, on the project page.
Instead of solving for an optimal filter flow, we propose to directly predict a filter flow given an input image, using a convolutional neural network (CNN) to regress the filter weights. Using a CNN to directly predict a well-regularized solution is orders of magnitude faster than expensive iterative optimization.

Fig. 1 provides an illustration of our overall framework. Instead of estimating the flow between a pair of input images, we focus on applications where the model predicts both the flow and the transformed image. This can be viewed as "blind" filter flow estimation, in analogy with blind deconvolution. During training, we use a loss defined over the transformed image (rather than the predicted flow). This is closely related to so-called self-supervised techniques that learn to predict optical flow and depth from unlabeled video data [15, 16, 21]. Specifically, for the reconstruction tasks we consider, such as image super-resolution, the forward degradation process can be easily simulated to generate a large quantity of training data without manual collection or annotation.

Figure 1: Overview of our proposed framework for Predictive Filter Flow, which is readily applicable to various low-level vision problems, yielding state-of-the-art performance for non-uniform motion blur removal, compression artifact reduction and single image super-resolution. Given a corrupted input image, a two-stream CNN analyzes the image and synthesizes the weights of a spatially-varying linear filter. This filter is then applied to the input to produce a deblurred/denoised prediction. The whole framework is end-to-end trainable in a self-supervised way for tasks such as super-resolution, where corrupted images can be generated automatically. The predicted filters are easily constrained for different tasks and interpretable (here visualized in the center column by the mean flow displacement, see Fig. 6).
The lack of interpretability in deep image-to-image regression models makes it hard to provide guarantees of robustness in the presence of adversarial input [29], and to confer the reliability needed by researchers in biology and medical science [34]. Predictive filter flow differs from other CNN-based approaches in this regard, since the intermediate filter flows are interpretable and transparent [50, 12, 32], providing an explicit description of how the input is transformed into the output. It is also straightforward to inject constraints on the reconstruction (e.g., local brightness conservation) which would be nearly impossible to guarantee for deep image-to-image regression models.

To evaluate our model, we carry out extensive experiments on three different low-level vision tasks: non-uniform motion blur removal, JPEG compression artifact reduction and single image super-resolution. We show that our model surpasses the state-of-the-art methods on all three tasks. We also visualize the predicted filters, which reveals filtering operators reminiscent of classic unsharp masking filters and anisotropic diffusion along boundaries.

To summarize our contributions: (1) we propose a novel, end-to-end trainable learning framework for solving various low-level image reconstruction tasks; (2) we show this framework is highly interpretable and controllable, enabling direct post-hoc analysis of how the reconstructed image is generated from the degraded input; (3) we show experimentally that predictive filter flow outperforms state-of-the-art methods remarkably on three different tasks: non-uniform motion blur removal, compression artifact reduction and single image super-resolution.

2. Related Work

Our work is inspired by filter flow [42], which is an optimization-based method for finding a linear transformation relating nearby pixel values in a pair of images.
By imposing additional constraints on certain structural properties of these filters, it serves as a general framework for understanding a wide variety of low-level vision problems. However, filter flow as originally formulated has some obvious shortcomings. First, it requires prior knowledge to specify a set of constraints needed to produce good results, and it is not always straightforward to model or even come up with such knowledge-based constraints. Second, solving for an optimal filter flow is compute intensive: it may take up to 20 hours to compute over a pair of 500×500 images [42]. We address these issues by directly predicting flows from image data. We leverage predictive filter flow for three specific image reconstruction tasks, each of which can be framed as performing spatially variant filtering over local image patches.

Non-Uniform Blind Motion Blur Removal is an extremely challenging yet practically significant task of removing blur caused by object motion or camera shake in a blurry photo. The blur kernel is unknown and may vary over the image. Recent methods estimate blur kernels locally at the patch level and adopt an optimization method for deblurring the patches [46, 2]. [53, 18, 46] leverage prior information about smooth motion by selecting from a predefined discretized set of linear blur kernels. These methods are computationally expensive, as an iterative solver is required for deconvolution after estimating the blur kernel [9], and the deep learning approaches cannot generalize well to novel motion kernels [54, 46, 18, 41].

Compression Artifact Reduction is significant because lossy image compression is ubiquitous for reducing the size of images transmitted over the web and recorded on data storage media. However, high compression rates come with visual artifacts that degrade the image quality and thus the user experience.
Among various compression algorithms, JPEG has become the most widely accepted standard in lossy image compression, with several (non-invertible) transforms [51], i.e., downsampling and DCT quantization. Removing artifacts from JPEG compression can be viewed as a practical variant of the natural image denoising problem [6, 20]. Recent methods based on deep convolutional neural networks, trained to take the compressed image as input and output the denoised image directly, achieve good performance [10, 47, 7].

Single Image Super-Resolution aims at recovering a high-resolution image from a single low-resolution image. This problem is inherently ill-posed, as a multiplicity of solutions exists for any given low-resolution input. Many methods adopt an example-based strategy [56] requiring an optimization solver; others are based on deep convolutional neural nets [11, 30], which achieve state-of-the-art, real-time performance. The deep learning methods take as input the low-resolution image (usually a 4× upsampled one, using bicubic interpolation) and output the high-resolution image directly.

3. Predictive Filter Flow

Filter flow models image transformations I_1 → I_2 as a linear mapping where each output pixel depends only on a local neighborhood of the input. Finding such a flow can be framed as solving a constrained linear system

    I_2 = T I_1,  T ∈ Γ,    (1)

where T is a matrix whose rows act separately on a vectorized version of the source image I_1. For the model (1) to make sense, T ∈ Γ serves as a placeholder for the entire set of additional constraints on the operator which enable a unique solution that satisfies our expectations for particular problems of interest. For example, standard convolution corresponds to T being a circulant matrix whose rows are cyclic permutations of a single set of filter weights, which are typically constrained to have compact localized non-zero support.
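The convolution-as-circulant-matrix special case can be made concrete with a small sketch (our illustration, not code from the paper): build T explicitly for a 1-D signal and a single shared 3-tap kernel, and check that I_2 = T I_1 matches direct convolution.

```python
import numpy as np

# Build the circulant T for a 1-D signal of length n and one shared 3-tap
# kernel (circular boundary handling). Row r of T holds the kernel weights
# cyclically shifted so they are centered at position r.
n = 8
signal = np.arange(n, dtype=float)       # I_1, already vectorized
kernel = np.array([0.25, 0.5, 0.25])     # one shared set of filter weights

T = np.zeros((n, n))
for r in range(n):
    for k, w in enumerate(kernel):
        T[r, (r + k - 1) % n] = w

out_matrix = T @ signal                  # I_2 = T I_1
out_direct = np.array([0.25 * signal[(r - 1) % n]
                       + 0.5 * signal[r]
                       + 0.25 * signal[(r + 1) % n] for r in range(n)])
assert np.allclose(out_matrix, out_direct)
```

A filter flow replaces the single shared kernel with a different kernel in each row of T, which is exactly the spatially varying case the paper optimizes over or, here, predicts.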
From a theoretical perspective, the filter flow model (1) is simple and elegant, but directly solving Eq. 1 is intractable for image sizes we typically encounter in practice, particularly when the filters are allowed to vary spatially.

3.1. Learning to predict flows

Instead of optimizing over T directly, we seek a learnable function f_w(·), parameterized by w, that predicts the transformation T̂ specific to the image I_1 taken as input:

    I_2 ≈ T̂ I_1,  T̂ ≡ f_w(I_1).    (2)

We call this model Predictive Filter Flow. Manually designing such a function f_w(·) isn't feasible in general, therefore we learn a specific f_w under the assumption that I_1, I_2 are drawn from some fixed joint distribution. Given sampled image pairs {(I_1^i, I_2^i)}, i = 1, ..., N, we seek parameters w that minimize the difference between a recovered image Î_2 and the real one I_2, measured by some loss ℓ:

    min_w Σ_{i=1}^N ℓ(I_2^i − f_w(I_1^i) · I_1^i) + R(f_w(I_1^i)),  s.t. constraints on w.    (3)

Note that the constraints on w are different from the constraints Γ used in filter flow. In practice, we enforce hard constraints via our choice of the architecture/functional form of f, along with soft constraints via the additional regularization term R. We also adopt the commonly used L2 regularization on w to reduce overfitting. There is a range of possible choices for measuring the difference between two images; in our experiments, we simply use the robust L1 norm to measure the pixel-level difference.

Filter locality. In principle, each output pixel of I_2 in Eq. 3 can depend on all pixels of the input I_1. We introduce the structural constraint that each output pixel depends only on a corresponding local neighborhood of the input. The size of this neighborhood is thus a hyper-parameter of the model.
We note that while the predicted filter flow T̂ acts locally, the estimation of the correct local flow within a patch can depend on global context captured by large receptive fields in the predictor f_w(·).

In practice, this constraint is implemented by using the "im2col" operation to vectorize the local neighborhood patch centered at each pixel, and computing the inner product of this vector with the corresponding predicted filter. This operation is highly optimized for available hardware architectures in most deep learning libraries, and has time and space cost similar to computing a single convolution. For example, if the filter size is 20×20, the last layer of the CNN model f_w(·) outputs a three-dimensional array with a channel dimension of 400, which is comparable to feature activations at a single layer of typical CNN architectures [27, 45, 17].

Other filter constraints. Various a priori constraints on the filter flow T̂ ≡ f_w(I_1) can easily be added to enable better model training. For example, if smoothness is desired, an L2 regularization on the (1st- or 2nd-order) derivative of the filter flow maps can be inserted during training; if sparsity is desired, an L1 regularization on the filter flows can be added. In our work, we add sum-to-one and non-negativity constraints on the filters for the task of non-uniform motion blur removal, meaning that the values in each filter should be non-negative and sum to one, under the assumption that there is no lighting change. This is easily done by inserting a softmax transform across the channels of the predicted filter weights. For the other tasks, we simply let the model output free-form filters with no further constraints on the weights.
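A minimal NumPy sketch of this per-pixel filtering step (ours, not the authors' optimized implementation, which would use the im2col/unfold primitive of a deep learning library):

```python
import numpy as np

def apply_filter_flow(img, filters):
    """Apply a per-pixel filter flow: for every output pixel, take the k x k
    input patch centered there (the "im2col" vector) and form the inner
    product with that pixel's predicted filter.

    img:     (H, W) input image
    filters: (H, W, k*k) one filter per pixel; for deblurring these would be
             softmax-normalized (non-negative, sum-to-one) over the last axis
    """
    H, W, kk = filters.shape
    k = int(round(kk ** 0.5))
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            patch = padded[i:i + k, j:j + k].ravel()  # im2col for one pixel
            out[i, j] = patch @ filters[i, j]
    return out

# Sanity check: a delta filter at the center tap must reproduce the input.
rng = np.random.default_rng(0)
img = rng.random((5, 5))
delta = np.zeros((5, 5, 9))
delta[..., 4] = 1.0                                   # center of a 3x3 filter
assert np.allclose(apply_filter_flow(img, delta), img)
```

The double loop stands in for what a deep learning library does in one fused, batched operation; only the inner-product semantics matter here.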
Self-Supervision. Though the proposed framework for training Predictive Filter Flow requires paired inputs and target outputs, we note that generating training data for many reconstruction tasks can be accomplished automatically without manual labeling. Given a pool of high-quality images, we can automatically generate low-resolution, blurred or JPEG-degraded counterparts to use in training (see Section 4). This can also be generalized to so-called self-supervised training for predicting flows between video frames or stereo pairs.

3.2. Model Architecture and Training

Our basic framework is largely agnostic to the choice of architecture, learning method and loss function. In our experiments, we utilize a two-stream architecture as shown in Fig. 1. The first stream is a simple 18-layer network with 3×3 convolutional layers, skip connections [17], pooling layers and upsampling layers; the second stream is a shallow but full-resolution network with no pooling. The first stream has larger receptive fields for estimating per-pixel filters by considering long-range contextual information, while the second stream keeps the original resolution of the input image without inducing spatial information loss. Batch normalization [19] is inserted between each convolution layer and ReLU layer [38]. Predictive Filter Flow is self-supervised, so we could generate an unlimited amount of image pairs for training very large models. However, we find that a light-weight architecture trained over a moderate-scale training set performs quite well. Since our architecture is different from other feed-forward image-to-image regression CNNs, we also report the baseline performance of the two-stream architecture trained to directly predict the reconstructed image rather than the filter coefficients. For training, we crop 64×64-resolution patches to form a batch of size 56.
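The self-supervised pair generation and patch cropping described above can be sketched as follows. This is a hedged illustration, not the authors' pipeline; `degrade` is a hypothetical placeholder for the task-specific corruption (blur, JPEG encoding, or downsample/upsample).

```python
import numpy as np

def make_batch(images, degrade, patch=64, batch_size=56, rng=None):
    """Crop random patch-size windows from a pool of clean images and corrupt
    them on the fly; returns (network inputs, reconstruction targets)."""
    if rng is None:
        rng = np.random.default_rng()
    inputs, targets = [], []
    for _ in range(batch_size):
        img = images[rng.integers(len(images))]
        y = rng.integers(img.shape[0] - patch + 1)
        x = rng.integers(img.shape[1] - patch + 1)
        clean = img[y:y + patch, x:x + patch]
        inputs.append(degrade(clean))        # corrupted input
        targets.append(clean)                # clean target
    return np.stack(inputs), np.stack(targets)

rng = np.random.default_rng(0)
pool = [rng.random((128, 128)) for _ in range(4)]
noisy = lambda p: p + 0.1 * rng.standard_normal(p.shape)  # toy degradation
xb, yb = make_batch(pool, noisy, rng=rng)
assert xb.shape == yb.shape == (56, 64, 64)
```

Because the corruption is synthesized, the pool of "labels" is effectively unlimited, which is what the paper means by self-supervised training.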
Since the model adapts to patch boundary effects seen during training, at test time we apply it to non-overlapping tiles of the input image. However, we note that the model is fully convolutional, so it could be trained over larger patches to avoid boundary effects and applied to arbitrary-size inputs.

We use the ADAM optimization method during training [24], with initial learning rate 0.0005 and coefficients 0.9 and 0.999 for computing the running averages of the gradient and its square. As the training loss, we simply use the ℓ1-norm loss measuring the absolute difference over pixel intensities. We train our model from scratch on a single NVIDIA TITAN X GPU, and terminate after several hundred epochs.²

4. Experiments

We evaluate the proposed Predictive Filter Flow framework (PFF) on three low-level vision tasks: non-uniform motion blur removal, JPEG compression artifact reduction and single image super-resolution. We first describe the datasets and evaluation metrics, and then compare with state-of-the-art methods on the three tasks in separate subsections.

4.1. Datasets and Metrics

We use the high-resolution images in the DIV2K dataset [1] and the BSDS500 training set [37] for training all our models on the three tasks, resulting in a total of 1,200 training images. We evaluate each model over different datasets specific to the task. Concretely, we test our model for non-uniform motion blur removal over the dataset introduced in [2], which contains large motion blur up to 38 pixels. We evaluate over the classic LIVE1 dataset [52] for JPEG compression artifact reduction, and Set5 [5] and Set14 [58] for single image super-resolution. To quantitatively measure performance, we use Peak Signal-to-Noise Ratio (PSNR) and the Structural Similarity Index (SSIM) [52], computed over the Y channel in YCbCr color space between the output quality image and the original image.
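For concreteness, PSNR on the Y channel can be computed as below. This is a sketch under the common BT.601 luma weights; the authors' exact color conversion may differ slightly.

```python
import numpy as np

def rgb_to_y(rgb):
    """Luma (Y) from RGB in [0, 255], BT.601 weights (a standard choice)."""
    return 0.299 * rgb[..., 0] + 0.587 * rgb[..., 1] + 0.114 * rgb[..., 2]

def psnr(ref, est, peak=255.0):
    """Peak Signal-to-Noise Ratio in dB between two images."""
    mse = np.mean((ref.astype(float) - est.astype(float)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

# A uniform error of 16 gray levels gives MSE = 256, i.e. about 24.05 dB.
assert abs(psnr(np.zeros((4, 4)), np.full((4, 4), 16.0)) - 24.048) < 1e-2
```

SSIM is more involved (local means, variances and covariances under a Gaussian window); standard implementations exist in common image libraries.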
This is standard practice in the literature for quantitatively measuring recovered image quality.

² Models with early termination (~2 hours, for dozens of epochs) still achieve very good performance, but top performance appears after 1–2 days of training. The code and models can be found at https://github.com/aimerykong/predictive-filter-flow

4.2. Non-Uniform Motion Blur Removal

To train models for non-uniform motion blur removal, we generate 64×64-resolution blurry patches from clear ones using random linear kernels [46], which are of size 30×30 and have a motion vector with random orientation in [0°, 180°] and random length in [1, 30] pixels. We set the predicted filter size to 17×17, so the model outputs 17×17 = 289 filter weights at each image location. Note that we generate training pairs on the fly during training, so our model can deal with a wide range of motion blurs. This is advantageous over the methods in [46, 2], which require a predefined set of blur kernels used for deconvolution through some offline algorithm.

Figure 2: Visual comparison of our method (PFF) to CNN [Sun et al.] [46] and patch-optim [Bahat et al.] [2] on testing images released by [2]. Please be guided by the strong edges in the filter flow maps to compare visual details in the deblurred images by different methods. Also note that the bottom two rows display images from the real world, meaning they are not synthesized and there is no blur ground truth for them. Best viewed in color and zoomed in.

Table 1: Comparison on motion blur removal over the non-uniform motion blur dataset [2]. For both metrics, larger values mean better performance.

  Moderate Blur
  metric   [55]    [46]    [2]     CNN     PFF
  PSNR     22.88   24.14   24.87   24.51   25.39
  SSIM     0.68    0.714   0.743   0.725   0.786

  Large Blur
  metric   [55]    [46]    [2]     CNN     PFF
  PSNR     20.47   20.84   22.01   21.06   22.30
  SSIM     0.54    0.56    0.624   0.560   0.638
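The random linear kernel synthesis described above can be sketched as follows (a hypothetical helper in the spirit of the kernels of [46], not the authors' code): rasterize a line segment of random length and orientation and normalize it to sum to one.

```python
import numpy as np

def linear_blur_kernel(length, angle_deg, size=30):
    """Rasterize a size x size linear motion-blur kernel: a line segment of
    the given length (pixels) and orientation (degrees), normalized to sum
    to 1 so overall image brightness is preserved."""
    k = np.zeros((size, size))
    c = (size - 1) / 2.0
    theta = np.deg2rad(angle_deg)
    half = (length - 1) / 2.0
    # Sample densely along the segment and mark the pixels it passes through.
    for t in np.linspace(-half, half, max(4 * length, 1)):
        x = int(round(c + t * np.cos(theta)))
        y = int(round(c + t * np.sin(theta)))
        if 0 <= x < size and 0 <= y < size:
            k[y, x] = 1.0
    return k / k.sum()

rng = np.random.default_rng(0)
k = linear_blur_kernel(int(rng.integers(1, 31)), float(rng.uniform(0, 180)))
assert k.shape == (30, 30) and abs(k.sum() - 1.0) < 1e-9
```

Convolving a clean patch with such a kernel yields a synthetic blurry/clean training pair; drawing length and orientation at random each time covers a wide range of motions.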
In Table 1, we list the comparison with state-of-the-art methods over the test set released by [2]. There are two subsets in the dataset, one with moderate motion blur and the other with large blur. We also report our CNN model based on the proposed two-stream architecture, which outputs the quality images directly by taking the blurry ones as input. Our CNN model outperforms the one in [46], which trains a CNN to predict the blur kernel over a patch but carries out non-blind deconvolution with the estimated kernel for the final quality image. We attribute our better performance to two reasons. First, our CNN model learns a direct inverse mapping from a blurry patch to its clear counterpart based on the learned image distribution, whereas [46] only estimates the blur kernel for the patch and uses an offline optimization for non-blind deblurring, resulting in artifacts such as ringing. Second, our CNN architecture is higher fidelity than the one used in [46], as ours outputs a full-resolution result and learns internally to minimize artifacts, e.g., aliasing and ringing effects.

From the table, we can see that our PFF model outperforms all the other methods by a fair margin. To understand where our model performs better, we visualize qualitative results in Fig. 2, along with the filter flow maps output by PFF. We cannot easily visualize the 289-dimensional filters. However, since the predicted weights T̂ are positive and L1-normalized, we can treat them as a distribution, which we summarize by computing the expected flow vector

    [v_x(i,j), v_y(i,j)]ᵀ = Σ_{x,y} T̂_{ij,xy} [x − i, y − j]ᵀ,

where ij is a particular output pixel and xy indexes the input pixels. This can be interpreted as the optical flow (delta filter) which most closely approximates the predicted filter flow. We use the color legend shown in the top left of Fig. 6. The last two rows of Fig.
2 show the results on real-world blurry images for which there is no "blur-free" ground truth. We can clearly see that images produced by PFF have fewer artifacts, such as ringing around sharp edges [46, 2]. Interestingly, from the filter flow maps, we can see that the expected flow vectors are large near high-contrast boundaries and smaller in regions that are already in sharp focus or uniform in color.

Although we set the filter size to 17×17, which is much smaller than the maximum shift in the largest blur (up to 30 pixels), our model still handles large motion blur and performs better than [2]. We assume it should be possible to utilize larger filter sizes, but we did not observe further improvements when training models to synthesize larger per-pixel kernels. This suggests that a larger blurry dataset is needed to validate this point in future work.

We also considered an iterative variant of our model in which we feed the resulting deblurred image back as input to the model. However, we found relatively little improvement with additional iterations (results shown in the appendix). We conjecture that, although the model was trained with a wide range of blurred examples, the statistics of the transformed image after the first iteration are sufficiently different from the blurred training inputs. One solution could be inserting an adversarial loss to push the model to generate more fine-grained textures (as done in [30] for image super-resolution).

4.3. JPEG Compression Artifact Reduction

Similar to training for image deblurring, we generate JPEG-compressed image patches from the original uncompressed ones on the fly during training. This is easily done using a standard JPEG compression function by varying the quality factor (QF) of interest.

Figure 3: Visual comparison of our methods (PFF and CNN). Strong edges in the expected flow map (right) highlight areas where the most apparent artifacts are removed. More results can be found in the appendix. Best viewed in color and zoomed in.
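To make the degradation step concrete, here is a dependency-free stand-in for JPEG's lossy core, an 8×8 block DCT followed by coarse uniform quantization. This is our simplified sketch; the actual training data would come from a real JPEG encoder at a chosen quality factor, which also uses per-frequency quantization tables and chroma downsampling.

```python
import numpy as np

def dct_mat(n=8):
    """Orthonormal DCT-II matrix (the transform JPEG applies per 8x8 block)."""
    k, i = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    m[0] /= np.sqrt(2.0)
    return m

def degrade_block(block, q=20.0):
    """DCT -> uniform quantization -> inverse DCT; coarser q mimics lower QF."""
    D = dct_mat(block.shape[0])
    coeff = D @ block @ D.T
    coeff = np.round(coeff / q) * q        # the non-invertible step
    return D.T @ coeff @ D

rng = np.random.default_rng(0)
clean = rng.random((8, 8)) * 255.0
lossy = degrade_block(clean)
assert not np.allclose(lossy, clean)       # information was destroyed
```

Applying this blockwise to a patch produces the characteristic blocking and ringing artifacts the model is trained to remove; the (clean, lossy) pair is the training example.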
Table 2: Comparison on JPEG compression artifact reduction over the LIVE1 dataset [52]. For each quality factor, PSNR and SSIM are listed on two rows, respectively (larger is better).

  QF   JPEG    SA-DCT [14]  AR-CNN [10]  L4 [47]  CAS-CNN [7]  MWCNN [35]  PFF
  10   27.77   28.65        29.13        29.08    29.44        29.69       29.82
       0.791   0.809        0.823        0.824    0.833        0.825       0.836
  20   30.07   30.81        31.40        31.42    31.70        32.04       32.14
       0.868   0.878        0.890        0.890    0.895        0.889       0.905
  40   32.35   32.99        33.63        33.77    34.10        34.45       34.67
       0.917   0.940        0.931        —        0.937        0.930       0.949

In Table 2, we list the performance of our model and compare to the state-of-the-art methods. We note that our final PFF achieves the best performance among all the methods. Our CNN baseline model also achieves performance on par with the state of the art; though we do not show it in the table, we plot its performance in the ablation study of Fig. 4. Specifically, we study how training with a single QF or mixed QFs affects performance when testing on images compressed with a range of different QFs. We plot the detailed performance of our CNN and PFF in terms of absolute PSNR and SSIM measurements, and the increase in PSNR between the reconstructed and JPEG-compressed image. We can see that, though a model trained with QF=10 overfits the dataset, all the other models achieve generalizable and stable performance. Basically, a model trained on a single QF brings the largest performance gain over images compressed with that same QF.

Figure 4: Performance vs. training data with different compression quality factors, measured by PSNR and SSIM and their performance gains, over the LIVE1 dataset. The original JPEG compression is plotted as a baseline.

Moreover, when our model is trained with mixed quality factors, its performance
is quite stable and competitive with quality-specific models across different compression quality factors. This indicates that our model is of practical value in real-world applications. In Fig. 3, we show a qualitative comparison between CNN and PFF. The colorful edges of the output filter flow maps indicate how pixels are warped from their neighborhood in the input image. This also clearly shows where the JPEG image degrades most, e.g., the large sky region quantized by JPEG compression. Though the CNN smooths the block effect to some extent, our PFF produces the best visual quality, smoothing the block artifacts while maintaining both high- and low-frequency details.

4.4. Single Image Super-Resolution

In this work, we only generate pairs to super-resolve images 4× larger. To generate training pairs, for each original image we downsample ¼× and upsample 4× again using bicubic interpolation (with anti-aliasing). The 4×-upsampled image from the low resolution is the input to our model. Therefore, the super-resolution model is expected to learn to sharpen the input image.

In Table 3, we compare our PFF model quantitatively with other methods. We can see that our model outperforms the others on both test sets. In Fig. 5, we compare visually against bicubic interpolation, CNN and PFF. We can see from the zoomed-in regions that our PFF generates sharper boundaries and delivers an anti-aliasing functionality. The filter flow maps once again act as a guide, illustrating where smoothing happens and where sharpening happens.

Figure 5: Visual comparison of our method (PFF) to CNN; each image is super-resolved (×4). More results can be found in the appendix. Best viewed in color and zoomed in.

Table 3: Comparison on single image super-resolution (×4) over the classic Set5 [5] and Set14 [58] datasets. The metrics used here are PSNR (dB) and SSIM, listed as two rows, respectively.
         Bicubic  NE+LLE [8]  KK [23]  A+ [49]  SRCNN [11]  RDN+ [59]  PFF
  Set5   28.42    29.61       29.69    30.28    30.49       32.61      32.74
         0.8104   0.8402      0.8419   0.8603   0.8628      0.9003     0.9021
  Set14  26.00    26.81       26.85    27.32    27.50       28.92      28.98
         0.7019   0.7331      0.7352   0.7491   0.7513      0.7893     0.7904

Especially, the filter maps demonstrate through their strong colorful edges where the pixels undergo larger transforms. In the next section, we visualize the per-pixel kernels to gain an in-depth understanding.

5. Visualization and Analysis

We explored a number of techniques to visualize the predicted filter flows for the different tasks. First, we ran k-means on the predicted filters from the set of test images for each of the three tasks, respectively, to cluster the kernels into K=400 groups. Then we ran t-SNE [36] over the 400 mean filters to display them in the image plane, shown by the scatter plots in the top row of Fig. 6. Qualitative inspection shows filters that can be interpreted as performing translation or integration along lines of different orientation (non-uniform blur), filling in high-frequency detail (JPEG artifact reduction), and deformed Laplacian-like filters (super-resolution). We also examined the top 10 principal components of the predicted filters (shown in the second row grid of Fig. 6).

Figure 6: Three row-wise panels: (1) we run k-means (K=400) on all filters synthesized by the model over the test set, and visualize the 400 centroid kernels using t-SNE on a 2D plane; (2) the top ten principal components of the synthesized filters; (3) the color-coded filter flow along with the input and quality image. Each pixel's filter is assigned to the nearest centroid, and the color for the centroid is based on the 2D t-SNE embedding using the color chart shown at top left.

The 10-D principal subspace captures 99.65%, 99.99% and 99.99% of the filter energy for non-uniform blur, artifact removal and super-resolution, respectively.
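The PCA part of this analysis is straightforward to reproduce in a few lines (a sketch of the computation only, using synthetic low-rank filters in place of the model's real predictions):

```python
import numpy as np

def pca_energy(filters, ncomp=10):
    """Fraction of (centered) filter energy captured by the top `ncomp`
    principal components; filters is (num_filters, filter_dim)."""
    X = filters - filters.mean(axis=0)
    s = np.linalg.svd(X, compute_uv=False)   # singular values
    return (s[:ncomp] ** 2).sum() / (s ** 2).sum()

rng = np.random.default_rng(0)
# Synthetic stand-in: 200 filters of dimension 17*17 = 289 that are random
# combinations of only 5 basis kernels, hence intrinsically low-dimensional.
basis = rng.random((5, 289))
filters = rng.random((200, 5)) @ basis
assert pca_energy(filters) > 0.99
```

A near-unity value, as the paper reports for all three tasks, means the predicted filters live close to a very low-dimensional subspace despite having hundreds of nominal degrees of freedom.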
PCA reveals smooth, symmetric harmonic structure for super-resolution, with some intriguing vertical and horizontal features. Finally, in order to summarize the spatially varying structure of the filters, we use the 2D t-SNE embedding to assign a color to each centroid (as given by the reference color chart shown at top left), and visualize the nearest centroid for the filter at each location in the third row grid of Fig. 6. This visualization demonstrates that the filters output by our model generally vary smoothly over the image, with discontinuities along salient edges and textured regions, reminiscent of anisotropic diffusion or bilateral filtering.

In summary, these visualizations provide a transparent view of how each reconstructed pixel is assembled from the degraded input image. We view this as a notable advantage over other CNN-based models which simply perform image-to-image regression. Unlike the activations of intermediate layers of a CNN, linear filter weights have well-defined semantics that can be visualized and analyzed using the well-developed tools of linear signal processing.

6. Conclusion and Future Work

We propose a general, elegant and simple framework called Predictive Filter Flow, which has direct applications to a broad range of image reconstruction tasks. Our framework generates space-variant per-pixel filters which are easy to interpret and fast to compute at test time. Through extensive experiments over three different low-level vision tasks, we demonstrate that this approach outperforms the state-of-the-art methods.

In our experiments, we only train light-weight models over patches. However, we believe global image context is also important for these tasks and is an obvious direction for future work.
For example, the global blur structure conveys information about camera shake; super-resolution and compression artifact reduction can benefit from long-range interactions to reconstruct high-frequency detail (as in non-local means). Moreover, we expect that the interpretability of the output will be particularly appealing for interactive and scientific applications such as medical imaging and biological microscopy, where predicted filters could be directly compared to physical models of the imaging process.

Acknowledgement

This project is supported by NSF grants IIS-1618806, IIS-1253538, DBI-1262547 and a hardware donation from NVIDIA.

References

[1] E. Agustsson and R. Timofte. NTIRE 2017 challenge on single image super-resolution: Dataset and study. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, July 2017. 4
[2] Y. Bahat, N. Efrat, and M. Irani. Non-uniform blind deblurring by reblurring. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3286–3294, 2017. 1, 3, 4, 5, 6, 12, 13
[3] V. Belagiannis and A. Zisserman. Recurrent human pose estimation. In 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017), pages 468–475. IEEE, 2017. 11
[4] A. J. Bell and T. J. Sejnowski. An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7(6):1129–1159, 1995. 1
[5] M. Bevilacqua, A. Roumy, C. Guillemot, and M. L. Alberi-Morel. Low-complexity single-image super-resolution based on nonnegative neighbor embedding. In British Machine Vision Conference, 2012. 4, 7
[6] A. Buades, B. Coll, and J.-M. Morel. A non-local algorithm for image denoising. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 2, pages 60–65. IEEE, 2005. 3
[7] L. Cavigelli, P. Hager, and L. Benini.
CAS-CNN: A deep convolutional neural network for image compression artifact suppression. In Neural Networks (IJCNN), 2017 International Joint Conference on, pages 752–759. IEEE, 2017. 3, 6
[8] H. Chang, D.-Y. Yeung, and Y. Xiong. Super-resolution through neighbor embedding. In Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on, volume 1, pages I–I. IEEE, 2004. 7
[9] S. Cho, J. Wang, and S. Lee. Handling outliers in non-blind image deconvolution. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 495–502. IEEE, 2011. 3
[10] C. Dong, Y. Deng, C. Change Loy, and X. Tang. Compression artifacts reduction by a deep convolutional network. In Proceedings of the IEEE International Conference on Computer Vision, pages 576–584, 2015. 3, 6
[11] C. Dong, C. C. Loy, K. He, and X. Tang. Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(2):295–307, 2016. 3, 7
[12] F. Doshi-Velez and B. Kim. Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608, 2017. 2
[13] R. Fergus, B. Singh, A. Hertzmann, S. T. Roweis, and W. T. Freeman. Removing camera shake from a single photograph. In ACM Transactions on Graphics (TOG), volume 25, pages 787–794. ACM, 2006. 1
[14] A. Foi, V. Katkovnik, and K. Egiazarian. Pointwise shape-adaptive DCT for high-quality denoising and deblocking of grayscale and color images. IEEE Transactions on Image Processing, 16(5):1395–1411, 2007. 6
[15] R. Garg, V. K. BG, G. Carneiro, and I. Reid. Unsupervised CNN for single view depth estimation: Geometry to the rescue. In European Conference on Computer Vision, pages 740–756. Springer, 2016. 2
[16] C. Godard, O. Mac Aodha, and G. J. Brostow. Unsupervised monocular depth estimation with left-right consistency. In CVPR, volume 2, page 7, 2017. 2
[17] K.
He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016. 4
[18] M. Hradiš, J. Kotera, P. Zemčík, and F. Šroubek. Convolutional neural networks for direct text deblurring. In Proceedings of BMVC, volume 10, page 2, 2015. 3
[19] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015. 4
[20] V. Jain and S. Seung. Natural image denoising with convolutional networks. In Advances in Neural Information Processing Systems, pages 769–776, 2009. 3
[21] J. Y. Jason, A. W. Harley, and K. G. Derpanis. Back to basics: Unsupervised learning of optical flow via brightness constancy and motion smoothness. In European Conference on Computer Vision, pages 3–10. Springer, 2016. 2
[22] J. Kaipio and E. Somersalo. Statistical and Computational Inverse Problems, volume 160. Springer Science & Business Media, 2006. 1
[23] K. I. Kim and Y. Kwon. Single-image super-resolution using sparse regression and natural image prior. IEEE Transactions on Pattern Analysis & Machine Intelligence, (6):1127–1133, 2010. 7
[24] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint, 2014. 4
[25] S. Kong and C. Fowlkes. Pixel-wise attentional gating for parsimonious pixel labeling. arXiv preprint arXiv:1805.01556, 2018. 11
[26] S. Kong and C. Fowlkes. Recurrent scene parsing with perspective understanding in the loop. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018. 11
[27] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012. 4
[28] D. Kundur and D. Hatzinakos. Blind image deconvolution.
IEEE Signal Processing Magazine, 13(3):43–64, 1996. 1
[29] A. Kurakin, I. Goodfellow, and S. Bengio. Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533, 2016. 2
[30] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. P. Aitken, A. Tejani, J. Totz, Z. Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, volume 2, page 4, 2017. 3, 6, 11
[31] K. Li, B. Hariharan, and J. Malik. Iterative instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016. 11
[32] Z. C. Lipton. The mythos of model interpretability. Commun. ACM, 61(10):36–43, 2018. 2
[33] P. List, A. Joch, J. Lainema, G. Bjontegaard, and M. Karczewicz. Adaptive deblocking filter. IEEE Transactions on Circuits and Systems for Video Technology, 13(7):614–619, 2003. 1
[34] G. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoorian, J. A. van der Laak, B. van Ginneken, and C. I. Sánchez. A survey on deep learning in medical image analysis. Medical Image Analysis, 42:60–88, 2017. 2
[35] P. Liu, H. Zhang, K. Zhang, L. Lin, and W. Zuo. Multi-level wavelet-CNN for image restoration. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018. 6
[36] L. v. d. Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008. 7
[37] D. Martin, C. Fowlkes, D. Tal, and J. Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Computer Vision, 2001. ICCV 2001. Proceedings. Eighth IEEE International Conference on, volume 2, pages 416–423. IEEE, 2001. 4
[38] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines.
In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807–814, 2010. 4
[39] S. C. Park, M. K. Park, and M. G. Kang. Super-resolution image reconstruction: a technical overview. IEEE Signal Processing Magazine, 20(3):21–36, 2003. 1
[40] B. Romera-Paredes and P. H. S. Torr. Recurrent instance segmentation. In ECCV, 2016. 11
[41] C. Schuler, M. Hirsch, S. Harmeling, and B. Scholkopf. Learning to deblur. IEEE Transactions on Pattern Analysis & Machine Intelligence, (1):1–1, 2016. 3
[42] S. M. Seitz and S. Baker. Filter flow. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009. 1, 2
[43] Q. Shan, J. Jia, and A. Agarwala. High-quality motion deblurring from a single image. In ACM Transactions on Graphics (TOG), volume 27, page 73. ACM, 2008. 1
[44] M.-Y. Shen and C.-C. J. Kuo. Review of postprocessing techniques for compression artifact removal. Journal of Visual Communication and Image Representation, 9(1):2–14, 1998. 1
[45] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014. 4
[46] J. Sun, W. Cao, Z. Xu, and J. Ponce. Learning a convolutional neural network for non-uniform motion blur removal. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 769–777, 2015. 1, 3, 4, 5, 6, 12, 13
[47] P. Svoboda, M. Hradis, D. Barina, and P. Zemcik. Compression artifacts removal using convolutional neural networks. arXiv preprint arXiv:1605.00366, 2016. 3, 6
[48] A. Tarantola. Inverse Problem Theory and Methods for Model Parameter Estimation, volume 89. SIAM, 2005. 1
[49] R. Timofte, V. De Smet, and L. Van Gool. Anchored neighborhood regression for fast example-based super-resolution. In Proceedings of the IEEE International Conference on Computer Vision, pages 1920–1927, 2013. 7
[50] A. Vellido, J. D.
Martín-Guerrero, and P. J. Lisboa. Making machine learning models interpretable. In ESANN, volume 12, pages 163–172. Citeseer, 2012. 2
[51] G. K. Wallace. The JPEG still picture compression standard. IEEE Transactions on Consumer Electronics, 38(1):xviii–xxxiv, 1992. 3
[52] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004. 4, 6
[53] O. Whyte, J. Sivic, A. Zisserman, and J. Ponce. Non-uniform deblurring for shaken images. International Journal of Computer Vision, 98(2):168–186, 2012. 3
[54] L. Xu, J. S. Ren, C. Liu, and J. Jia. Deep convolutional neural network for image deconvolution. In Advances in Neural Information Processing Systems, pages 1790–1798, 2014. 3
[55] L. Xu, S. Zheng, and J. Jia. Unnatural L0 sparse representation for natural image deblurring. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1107–1114, 2013. 5, 12
[56] C.-Y. Yang, C. Ma, and M.-H. Yang. Single-image super-resolution: A benchmark. In European Conference on Computer Vision, pages 372–386. Springer, 2014. 3
[57] J. Yang, J. Wright, T. S. Huang, and Y. Ma. Image super-resolution via sparse representation. IEEE Transactions on Image Processing, 19(11):2861–2873, 2010. 1
[58] R. Zeyde, M. Elad, and M. Protter. On single image scale-up using sparse-representations. In International Conference on Curves and Surfaces, pages 711–730. Springer, 2010. 4, 7
[59] Y. Zhang, Y. Tian, Y. Kong, B. Zhong, and Y. Fu. Residual dense network for image super-resolution. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
1, 7

Appendix

In this supplementary material, we first show more visualizations to understand the predicted filter flows, then study whether the results can be refined by iteratively feeding the deblurred image to the same model for the task of non-uniform motion blur removal. We finally present more qualitative results for all three tasks studied in this paper.

1. Visualization of Per-Pixel Loading Factors

As a supplement to the principal components shown in the main paper, we can also visualize the per-pixel loading factors corresponding to each principal component. We run PCA over the testing set and show the first six principal components and the corresponding per-pixel loading factors as heatmaps in Figure 7. With this visualization technique, we can see which regions respond strongly to which component kernels. Moreover, given that the first ten principal components capture ≥99% of the filter energy (as stated in the main paper), we expect future work to predict compact per-pixel filters using low-rank techniques, which would allow incorporating long-range pixels through large predictive filters while keeping the features compact (thus greatly reducing memory consumption).

2. Iteratively Removing Motion Blur

As the deblurred images are still not perfect, we are interested in whether we can improve performance by iteratively running the model, i.e., feeding the deblurred image as input to the same model one more time. We denote this method as PFF+1. Perhaps unsurprisingly, we do not observe further improvement, as listed in Table 4; instead, this practice even hurts performance slightly. The qualitative results are shown in Figure 8, from which we can see that the second run does not generate much change in the filter flow maps.
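The operation being iterated here, applying a predicted per-pixel filter to an image and then feeding the result back through the same model, can be sketched as below. This is a minimal illustration, not the paper's code: the stand-in `toy_model` always predicts a uniform box-blur kernel (a real PFF network predicts spatially varying kernels), and the (H, W, k*k) flow layout is an assumption.

```python
import numpy as np

def apply_filter_flow(image, flow, k=3):
    """Reconstruct each output pixel as a weighted sum of its k x k input
    neighborhood, using the per-pixel weights in `flow` (shape (H, W, k*k))."""
    H, W = image.shape
    pad = k // 2
    padded = np.pad(image, pad, mode="edge")
    out = np.zeros((H, W))
    for dy in range(k):
        for dx in range(k):
            out += flow[:, :, dy * k + dx] * padded[dy:dy + H, dx:dx + W]
    return out

def toy_model(image, k=3):
    """Stand-in for the PFF network: predict the same box-blur kernel at
    every pixel, purely to make the sketch runnable."""
    H, W = image.shape
    return np.full((H, W, k * k), 1.0 / (k * k))

img = np.arange(25.0).reshape(5, 5)
once = apply_filter_flow(img, toy_model(img))     # PFF
twice = apply_filter_flow(once, toy_model(once))  # PFF+1: feed the output back in
print(once.shape, twice.shape)
```

The second pass simply repeats the predict-then-filter step on the first pass's output, which is exactly the PFF+1 protocol evaluated in Table 4.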
We believe the reason is that the deblurred images have different statistics from the original blurry inputs, and the model is not trained on such deblurred images. This suggests two natural directions for future work to improve the results: 1) training explicitly with recurrent loops and multiple losses, similar to [3, 31, 40, 26, 25], or 2) simultaneously adding an adversarial loss to force the model to hallucinate details for realistic output, which can be useful in practice as done in [30].

3. More Qualitative Results

In Figures 9, 10 and 11, we show more qualitative results for non-uniform motion blur removal, JPEG compression artifact reduction and single image super-resolution, respectively. From these comparisons, guided by the filter flow maps, we can see which regions our PFF pays attention to and how it outperforms the other methods.

Figure 7: We show the original image, the low-quality input and the high-quality output of our model, as well as the mean kernel and filter flow maps on the left panel, and the first six principal components and the corresponding loading factors as heatmaps on the right panel. Best seen in color and zoomed in.

Table 4: Comparison on motion blur removal over the non-uniform motion blur dataset [2]. PFF+1 means we run PFF one more time, taking as input the image deblurred by the same model.

Moderate Blur
metric  [55]   [46]   [2]    CNN    PFF    PFF+1
PSNR    22.88  24.14  24.87  24.51  25.39  25.28
SSIM    0.68   0.714  0.743  0.725  0.786  0.783

Large Blur
metric  [55]   [46]   [2]    CNN    PFF    PFF+1
PSNR    20.47  20.84  22.01  21.06  22.30  22.21
SSIM    0.54   0.56   0.624  0.560  0.638  0.633

Figure 8: We show deblurring results over some random testing images from the dataset released by [2].
We first feed the blurry images to the PFF model to obtain deblurred images; we then feed these deblurred images into the same PFF model again to see whether this iterative practice refines the output. However, the visualization shows that iteratively running the model changes very little, as seen from the second filter flow maps. This helps qualitatively explain why iteratively running the model does not further improve deblurring performance.

Figure 9: Visual comparison of our method (PFF) to CNN [Sun, et al.] [46] and patch-optim [Bahat, et al.] [2] on more testing images released by [2]. Please use the strong edges in the filter flow maps as a guide when comparing visual details in the images deblurred by the different methods. The last four rows show real-world blurry images without "ground-truth" blur. Note that for the last image there is very large blur caused by the motion of the football players. As our model is not trained on larger kernels that could cover this size of blur, it does not perform as well as patch-optim [Bahat, et al.] [2]. But it is clear that our model generates sharp edges in this task. Best viewed in color and zoomed in.

Figure 10: Visual comparison between CNN and our method (PFF) for JPEG compression artifact reduction. Here we compress the original images using JPEG with a quality factor (QF) of 10. Best viewed in color and zoomed in.

Figure 11: Visual comparison between CNN and our method (PFF) for single image super-resolution. Here all images are super-resolved by a factor of 4×. We show in the first column the results of bicubic interpolation. Best viewed in color and zoomed in.
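The per-pixel loading factors shown in Figure 7 can be computed, in outline, with standard PCA: fit components over the set of predicted filters, project each pixel's filter onto them, and reshape the coefficients into one heatmap per component. A minimal sketch on random stand-in filters follows; the image size, kernel size, and variable names are assumptions, not the authors' setup.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
H, W, k = 16, 16, 5
# Stand-in for predicted per-pixel filters over one image:
# one flattened k x k kernel per pixel (shapes are assumed).
filters = rng.normal(size=(H * W, k * k))

# Fit PCA over the filters and project each pixel's filter onto the
# leading components; reshaping gives a per-pixel loading map for each one.
pca = PCA(n_components=6).fit(filters)
loadings = pca.transform(filters)         # (H*W, 6) coefficients
loading_maps = loadings.reshape(H, W, 6)  # one heatmap per component

# Cumulative explained variance measures how compact the filter space is;
# the paper reports >=99% of the energy in the top ten components.
cum = np.cumsum(pca.explained_variance_ratio_)
print(loading_maps.shape, float(cum[-1]))
```

On random Gaussian filters the explained-variance ratio stays low; the paper's near-total energy capture in a 10D subspace is what motivates the low-rank filter prediction suggested in Appendix Section 1.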
