Non-local Attention Optimized Deep Image Compression


Authors: Haojie Liu, Tong Chen, Peiyao Guo

Haojie Liu*, Tong Chen*, Peiyao Guo*, Qiu Shen*, Xun Cao*, Yao Wang**, and Zhan Ma*
* Nanjing University  ** New York University

Abstract

This paper proposes a novel Non-Local Attention Optimized Deep Image Compression (NLAIC) framework, which is built on top of the popular variational auto-encoder (VAE) structure. Our NLAIC framework embeds non-local operations in the encoders and decoders for both the image and the latent feature probability information (known as the hyperprior) to capture both local and global correlations, and applies an attention mechanism to generate masks that are used to weigh the features for the image and hyperprior, which implicitly adapts the bit allocation for different features based on their importance. Furthermore, both hyperpriors and spatial-channel neighbors of the latent features are used to improve entropy coding. The proposed model outperforms the existing methods on the Kodak dataset, including learned (e.g., Balle2019 [18], Balle2018 [5]) and conventional (e.g., BPG, JPEG2000, JPEG) image compression methods, for both PSNR and MS-SSIM distortion metrics.

1. Introduction

Most recently proposed machine learning based image compression algorithms [5, 20, 17] leverage the autoencoder structure, which transforms raw pixels into compressible latent features via stacked convolutional neural networks (CNNs). These latent features are subsequently entropy coded by exploiting their statistical redundancy. Recent works have revealed that compression efficiency can be improved by exploring conditional probabilities via the contexts of spatial neighbors and hyperpriors [17, 11, 5]. Typically, rate-distortion optimization [21] is fulfilled by minimizing the Lagrangian cost J = R + λD when performing the end-to-end training.
Here, R is referred to as the entropy rate, and D is the distortion measured by either the mean squared error (MSE) or the multiscale structural similarity (MS-SSIM) [25].

Figure 1: Proposed NLAIC framework using a variational autoencoder structure with embedded non-local attention optimization in the main and hyperprior encoders and decoders.

However, existing methods still present several limitations. For example, most of the operations, such as stacked convolutions, are performed locally with a limited receptive field, even with pyramidal decomposition. Furthermore, latent features are treated with equal importance in either the spatial or channel dimension in most works, without considering the diverse visual sensitivities to various contents (such as texture and edges). Thus, attempts have been made in [11, 17] to exploit importance maps on top of latent feature vectors for adaptive bit allocation, but these methods require extra explicit signaling overhead to carry the importance maps.

In this paper, we introduce the non-local operation blocks proposed in [24] into the variational autoencoder (VAE) structure to capture both local and global correlations among pixels, and to generate attention masks that help yield more compact distributions of the latent features and hyperpriors. Different from the existing methods in [11, 17], we use non-local processing to generate attention masks at different layers (not only for quantized features), to allocate the bits intelligently through the end-to-end training.
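To make the Lagrangian cost J = R + λD above concrete, here is a toy, pure-Python sketch (the candidate rate-distortion points are made up for illustration): a larger λ weights distortion more heavily and pushes the choice toward higher-rate, lower-distortion operating points.

```python
# Toy illustration of Lagrangian rate-distortion optimization:
# among candidate operating points (R, D), pick the one minimizing J = R + lambda*D.
def rd_select(points, lam):
    """points: list of (rate_bpp, distortion) tuples; returns the point minimizing J."""
    return min(points, key=lambda p: p[0] + lam * p[1])

# Hypothetical RD curve: lower rate comes with higher distortion.
candidates = [(0.1, 9.0), (0.3, 4.0), (0.6, 2.0), (1.2, 1.0)]

low_lam = rd_select(candidates, 0.05)   # rate dominates -> low-rate point
high_lam = rd_select(candidates, 2.0)   # distortion dominates -> high-rate point
```

In end-to-end training the same trade-off is realized by minimizing J directly over network parameters rather than over a discrete candidate set; sweeping λ traces out the rate-distortion curve.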
We also improve the context modeling of the entropy engine for better latent feature compression, by using a masked 3D CNN (i.e., 5 × 5 × 5) on the latent features to generate their conditional statistics. Two different model implementations are provided: one is the "NLAIC joint" model, which uses both hyperpriors and spatial-channel neighbors of the latent features for context modeling, and the other is the "NLAIC baseline" model, with contexts only from the hyperpriors. Our joint model outperforms all existing learned and traditional image compression methods in terms of rate-distortion efficiency, with distortion measured by both MS-SSIM and PSNR. To further verify the efficiency of our framework, we also conduct ablation studies on model variants, such as removing non-local operations and attention mechanisms layer by layer, as well as a visual comparison. These additional experiments provide further evidence of the superior performance of our proposed NLAIC framework over a broad dataset.

The main contributions of this paper are highlighted as follows:

• We are the first to introduce non-local operations into a compression framework to capture both local and global correlations among the pixels in the original image and the feature maps.

• We apply an attention mechanism together with the aforementioned non-local operations to generate implicit importance masks that guide the adaptive processing of latent features. These masks essentially allocate more bits to the more important features that are critical for reducing the image distortion.

• We employ a one-layer masked 3D CNN to exploit the spatial and cross-channel correlations in the latent features, the output of which is then concatenated with the hyperpriors to estimate the conditional statistics of the latent features, enabling more efficient entropy coding.

2. Related Work

Non-local Operations.
Most traditional filters (such as Gaussian and mean filters) process the data locally, using a weighted average of spatially neighboring pixels. This usually produces over-smoothed reconstructions. Classical non-local methods for image restoration problems (e.g., low-rank modeling [9], joint sparsity [16] and non-local means [7]) have shown superior efficiency for quality improvement by exploiting non-local correlations. Recently, non-local operations have been included in deep neural networks (DNNs) for video classification [24], image restoration (e.g., denoising, artifact removal and super-resolution) [13, 27], etc., with significant performance improvements reported. It is also worth pointing out that non-local operations have been applied in other scenarios, such as intra block copy in the screen content extension of High Efficiency Video Coding (HEVC) [26].

Self Attention. The self-attention mechanism is widely used in deep learning based natural language processing (NLP) [15, 8, 23]. It can be described as a mapping strategy which queries a set of key-value pairs to produce an output. For example, Vaswani et al. [23] have proposed multi-headed attention methods which are extensively used for machine translation. For low-level vision tasks [27, 11, 17], the self-attention mechanism makes the generated features spatially adaptive and enables adaptive information allocation with emphasis on the more challenging areas (i.e., rich textures, saliency, etc.).

In image compression, quantized attention masks are commonly used for adaptive bit allocation; e.g., Li et al. [11] use 3 layers of local convolutions and Mentzer et al. [17] select one of the quantized feature channels. Unfortunately, these methods require extra explicit signaling overhead.
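The key-value mapping described above can be illustrated with a generic scaled dot-product attention sketch in numpy. This shows the general mechanism only; it is neither the multi-headed variant of [23] nor the exact operation used in NLAIC, and all array shapes here are arbitrary.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: each query produces a softmax-weighted
    average over the values, weighted by query-key similarity."""
    d = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d))   # (n_q, n_k); each row sums to 1
    return weights @ V                        # (n_q, d_v) convex combinations of values

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))   # 4 queries of dimension 8
K = rng.normal(size=(6, 8))   # 6 keys
V = rng.normal(size=(6, 3))   # 6 values of dimension 3
out = attention(Q, K, V)      # shape (4, 3)
```

Because each output row is a convex combination of the value rows, every output component stays within the range of the corresponding value column.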
Our model adopts an attention mechanism close to [11, 17], but applies multiple layers of non-local as well as convolutional operations to automatically generate the attention masks from the input image. The attention masks are applied directly to the temporary latent features to produce the final latent features to be coded. Thus, there is no need to spend extra bits to code the masks.

Image Compression Architectures. DNN based image compression generally relies on well-known autoencoders. The back-propagation scheme requires all steps to be differentiable in an end-to-end manner. Several methods (e.g., adding uniform noise [4], replacing the direct derivative with the derivative of the expectation [22], and soft-to-hard quantization [3]) have been developed to approximate the non-differentiable quantization process. On the other hand, entropy rate modeling of the quantized latent features is another critical issue for learned image compression. PixelCNNs [19] and VAEs are commonly used for entropy estimation following Bayesian generative rules. Recently, conditional probability estimation based jointly on autoregressive neighbors of the latent feature maps and hyperpriors has shown significant improvement in entropy coding.

3. Non-Local Attention Implementation

3.1. General Framework

Fig. 1 illustrates our NLAIC framework. It is built on a variational autoencoder structure [5], with non-local attention modules (NLAM) as the basic units in both the main and hyperprior encoder-decoder pairs (i.e., E_M, D_M, E_h and D_h). E_M with quantization Q is used to generate the quantized latent features, and D_M decodes the features into the reconstructed image. E_h and D_h generate much smaller side information as hyperpriors. The hyperpriors, as well as the autoregressive neighbors of the latent features, are then processed through the conditional context model P to generate the conditional probability estimates for entropy coding of the quantized latent features.

Table 1: Detailed parameter settings in NLAIC as shown in Fig. 1. "Conv" denotes a convolution layer with its kernel size and number of output channels; "s" is the stride (e.g., s2 means a down/up-sampling with stride 2); NLAM represents the non-local attention module; "×3" means cascading 3 residual blocks (ResBlock).

Main Encoder: Conv 5×5×192 s2 → ResBlock(×3) 3×3×192 → Conv 5×5×192 s2 → NLAM → Conv 5×5×192 s2 → ResBlock(×3) 3×3×192 → Conv 5×5×192 s2 → NLAM
Main Decoder: NLAM → Deconv 5×5×192 s2 → ResBlock(×3) 3×3×192 → Deconv 5×5×192 s2 → NLAM → ResBlock(×3) 3×3×192 → Deconv 5×5×192 s2 → Conv 5×5×3 s2
Hyperprior Encoder: ResBlock(×3) 3×3×192 → Conv 5×5×192 s2 → ResBlock(×3) 3×3×192 → Conv 5×5×192 s2 → NLAM
Hyperprior Decoder: NLAM → Deconv 5×5×192 s2 → ResBlock(×3) 3×3×192 → Deconv 5×5×192 s2 → ResBlock(×3) 3×3×192 → Conv 5×5×384 s1
Conditional Context Model: Masked 5×5×5×24 s1 → Conv 1×1×1×48 s1 → ReLU → Conv 1×1×1×96 s1 → ReLU → Conv 1×1×1×2 s1

Figure 2: (a) Non-local module (NLM). H × W × C denotes the size of the feature maps with height H, width W and channel C; ⊕ is the add operation and ⊗ is matrix multiplication. (b) Non-local attention module (NLAM). The main branch consists of 3 residual blocks; the mask branch combines non-local modules with residual blocks for attention mask generation. The details of the residual blocks are shown in the dashed frame.
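A side note on the quantization Q in Fig. 1: rounding is non-differentiable, and one of the training-time approximations mentioned in Section 2 is additive uniform noise [4]. A minimal sketch (function and variable names are ours):

```python
import random

def quantize(y, training):
    """Train-time proxy for rounding [4]: additive uniform noise in [-0.5, 0.5]
    keeps gradients alive while matching the width of the rounding interval
    used by hard quantization at test time."""
    if training:
        return [v + random.uniform(-0.5, 0.5) for v in y]
    return [float(round(v)) for v in y]

features = [0.2, 1.7, -3.4]
noisy = quantize(features, training=True)    # each value perturbed by at most 0.5
hard = quantize(features, training=False)    # hard rounding: [0.0, 2.0, -3.0]
```

The noisy relaxation also makes the discrete likelihoods of Section 3.4 continuous, which is why the entropy model convolves each density with U(−1/2, 1/2).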
Table 1 details the network structures and the associated parameters of the five components in the proposed NLAIC framework. The NLAM module is shown in Fig. 2(b) and explained in Sections 3.2 and 3.3.

3.2. Non-local Module

Our NLAM adopts the non-local network proposed in [24] as a basic block, as shown in Fig. 2. As shown in Fig. 2(a), the non-local module (NLM) computes the output at position i, Y_i, as a weighted average of the transformed feature values at positions j, X_j:

Y_i = (1 / C(X)) · Σ_{∀j} f(X_i, X_j) g(X_j),   (1)

where i is the location index of the output vector Y and j enumerates all accessible positions of the input X. X and Y share the same size. The function f(·) computes the correlation between X_i and X_j, and g(·) computes a representation of the input at position j. C(X) is a normalizing factor used to generate the final response, set as C(X) = Σ_{∀j} f(X_i, X_j). A variety of forms for f(·) have already been discussed in [24]; in this work, we directly use the embedded Gaussian function:

f(X_i, X_j) = e^{θ(X_i)ᵀ φ(X_j)}.   (2)

Here, θ(X_i) = W_θ X_i and φ(X_j) = W_φ X_j, where W_θ and W_φ denote cross-channel transforms implemented as 1 × 1 convolutions in our framework. The weights f(X_i, X_j) are further normalized by a softmax operation, so the operation defined in Eq. (1) can be written in matrix form [24] as:

Y = softmax(Xᵀ W_θᵀ W_φ X) g(X).   (3)

In addition, a residual connection can be applied for better convergence, as suggested in [24] and shown in Fig. 2(a):

Z_i = W_z Y_i + X_i,   (4)

where W_z is also a linear 1 × 1 convolution across all channels, and Z_i is the final output vector.

3.3. Non-local Attention Module

Importance maps have been adopted in [11, 17] to adaptively allocate information to quantized latent features.
For instance, we can give more bits to textured areas and fewer bits elsewhere, resulting in better visual quality at a similar bit rate. Such adaptive allocation can be implemented using an explicit mask, which must be signaled with additional bits. As aforementioned, the existing mask generation methods in [11, 17] are too simple to handle areas with more complex content characteristics.

Inspired by [27], we propose to use a cascade of a non-local module and regular convolutional layers to generate the attention masks, as shown in Fig. 2(b). The NLAM consists of two branches. The main branch uses conventional stacked networks to generate features, and the mask branch applies the NLM with three residual blocks [10], one 1 × 1 convolution and a sigmoid activation to produce a joint spatial-channel attention mask M:

M = sigmoid(F_NLM(X)),   (5)

where M denotes the attention mask and X the input features. F_NLM(·) represents the operation of applying the NLM followed by three residual blocks and a 1 × 1 convolution, as shown in Fig. 2(b). This attention mask M, whose elements satisfy 0 < M_k < 1, M_k ∈ ℝ, is element-wise multiplied with the feature maps from the main branch to perform the adaptive processing. Finally, a residual connection is added for faster convergence.

We avoid any batch normalization (BN) layers and use only one ReLU in our residual blocks, justified through our experimental observations. Note that in existing learned image compression methods, particularly those with superior performance [4, 5, 18, 14], GDN activation has proven more efficient than ReLU, tanh, sigmoid, leaky ReLU, etc. This may be because GDN captures the global information across all feature channels at the same pixel location. However, we use the simple ReLU function and rely on our proposed NLAM to capture both the local and global correlations.
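Putting Eqs. (1)–(5) together, the NLAM computation can be sketched in numpy as follows. This is a deliberate simplification: feature maps are flattened to C × N matrices, the 1 × 1 convolutions become C × C channel-mixing matrices, and the residual blocks of both branches are omitted (the main branch is approximated by the identity), so it illustrates the data flow rather than the full module.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def non_local_module(X, Wt, Wp, Wg, Wz):
    """Eqs. (1)-(4). X is C x N (channels x flattened positions);
    the 1x1 convolutions W_theta, W_phi, W_g, W_z act on the channel axis."""
    theta, phi, g = Wt @ X, Wp @ X, Wg @ X
    attn = softmax(theta.T @ phi, axis=-1)   # N x N similarity weights, rows sum to 1
    Y = g @ attn.T                           # Eq. (1): weighted average over all positions j
    return Wz @ Y + X                        # Eq. (4): residual connection

def nlam(X, weights):
    """Eq. (5): the mask branch (NLM + sigmoid) gates the main branch,
    followed by a residual connection. Residual blocks are omitted here,
    so the main branch is just X."""
    mask = 1.0 / (1.0 + np.exp(-non_local_module(X, *weights)))  # 0 < M < 1
    return X * mask + X

rng = np.random.default_rng(1)
C, N = 4, 9                                   # e.g. a 3x3 feature map with 4 channels
X = rng.normal(size=(C, N))
weights = [rng.normal(size=(C, C)) * 0.1 for _ in range(4)]
Z = nlam(X, weights)                          # same shape as X
```

Since each mask element lies strictly in (0, 1), the gated-plus-residual output scales every feature by a factor between 1 and 2, which is how the mask re-weights features without ever zeroing them out.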
We also find through experiments that inserting two NLAM pairs in the main encoder-decoder, and one NLAM pair in the hyperprior encoder-decoder, provides the best performance. As will be shown in Section 4, our NLAIC demonstrates state-of-the-art coding efficiency.

3.4. Entropy Rate Modeling

The previous sections presented our novel NLAM scheme for transforming the input pixels into more compact latent features. This section details the entropy rate modeling, which is critical for the overall rate-distortion efficiency.

3.4.1 Context Modeling Using Hyperpriors

Similar to [5], a non-parametric, fully factorized density model is used for the hyperpriors ẑ:

p_{ẑ|ψ}(ẑ|ψ) = Π_i ( p_{ẑ_i|ψ^(i)}(ψ^(i)) ∗ U(−1/2, 1/2) )(ẑ_i),   (6)

where ψ^(i) represents the parameters of each univariate distribution p_{ẑ_i|ψ^(i)}.

For the quantized latent features ŷ, each element ŷ_i can be modeled as a conditional Gaussian distribution:

p_{ŷ|ẑ}(ŷ|ẑ) = Π_i ( N(μ_i, σ_i²) ∗ U(−1/2, 1/2) )(ŷ_i),   (7)

where μ_i and σ_i are predicted using the distribution of ẑ. We evaluate the bits of ŷ and ẑ using:

R_ŷ = −Σ_i log₂( p_{ŷ_i|ẑ_i}(ŷ_i|ẑ_i) ),   (8)
R_ẑ = −Σ_i log₂( p_{ẑ_i|ψ^(i)}(ẑ_i|ψ^(i)) ).   (9)

Usually, we take ẑ as side information for estimating μ_i and σ_i, and ẑ occupies only a very small fraction of the bits, as shown in Fig. 3.

Figure 3: Percentage of ẑ in the entire bitstream. For models optimized with the MSE loss, ẑ occupies a smaller percentage in the joint model than in the baseline, but the outcome is reversed for models tuned with the MS-SSIM loss. The percentage of ẑ for MSE loss optimized models is noticeably higher than for those using the MS-SSIM loss.

Figure 4: Rate-distortion performance on Kodak. (a) Distortion measured by MS-SSIM (dB), where we use −10·log₁₀(1 − d) to represent the raw MS-SSIM value d in dB scale. (b) Distortion measured by PSNR.

3.4.2 Context Modeling Using Neighbors

PixelCNNs and PixelRNNs [19] have been proposed for effective modeling of the probabilistic distribution of images using local neighbors in an autoregressive way. They have been further extended for adaptive context modeling in compression frameworks with noticeable improvement. For example, Minnen et al. [18] proposed to extract autoregressive information with a 2D 5 × 5 masked convolution, which is combined with the hyperpriors using stacked 1 × 1 convolutions for probability estimation. It was the first deep-learning based method to achieve better PSNR than BPG444 at the same bit rate.

In our NLAIC, we use a one-layer 5 × 5 × 5 3D masked convolution to exploit the spatial and cross-channel correlations. For simplicity, a 3 × 3 × 3 example is shown in Fig. 5. Traditional 2D PixelCNNs need to search for a well-structured channel order to exploit the conditional probability efficiently. Instead, our proposed 3D masked convolutions implicitly exploit the correlation among adjacent channels.

Figure 5: In a 3 × 3 × 3 masked convolution, the current pixel (in purple) is predicted by the already processed pixels (in yellow, green and blue) in a 3D space. The unprocessed pixels (in white) and the current pixel are masked with zeros.

Compared to the 2D masked CNN used in [18], our 3D CNN approach significantly reduces the network parameters for the conditional context modeling.
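The two ingredients of this context model can be sketched as follows: building the causal mask of Fig. 5 for a k × k × k kernel, and evaluating the bits of a quantized symbol under the Gaussian ∗ uniform likelihood of Eqs. (7)–(9). The raster ordering over (channel, row, column) is our assumption; the paper's actual kernel layout may differ.

```python
import math
import numpy as np

def causal_mask_3d(k):
    """Mask for a k x k x k masked convolution (cf. Fig. 5): in raster order
    over (channel, row, col), positions strictly before the kernel center
    are 1; the center and everything after are masked with 0."""
    m = np.zeros((k, k, k))
    m.reshape(-1)[: (k * k * k) // 2] = 1.0
    return m

def gaussian_bits(y_hat, mu, sigma):
    """Eqs. (7)-(9): bits for a quantized symbol under N(mu, sigma^2)
    convolved with U(-1/2, 1/2), i.e. the Gaussian mass on [y-0.5, y+0.5]."""
    cdf = lambda x: 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))
    p = cdf(y_hat + 0.5) - cdf(y_hat - 0.5)
    return -math.log2(p)

mask = causal_mask_3d(3)               # 13 causal taps out of 27
good = gaussian_bits(0.0, 0.0, 0.3)    # well-predicted symbol -> a fraction of a bit
bad = gaussian_bits(1.5, 0.0, 0.3)     # poorly predicted symbol -> many bits
```

This makes the role of the context model tangible: the better μ_i and σ_i predict each symbol, the more probability mass falls on its quantization bin and the fewer bits the arithmetic coder spends on it.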
Leveraging the additional contexts from the neighbors in an autoregressive fashion, we obtain a better conditional Gaussian distribution to model the entropy:

p_ŷ(ŷ_i | ŷ_1, ..., ŷ_{i−1}, ẑ) = ( N(μ_i, σ_i²) ∗ U(−1/2, 1/2) )(ŷ_i),   (10)

where ŷ_1, ŷ_2, ..., ŷ_{i−1} denote the causal (and possibly reconstructed) elements prior to the current element ŷ_i.

4. Experiments

4.1. Training

We use the COCO [12] and CLIC [2] datasets to train our NLAIC framework. We randomly crop images into 192 × 192 × 3 patches for subsequent learning. Rate-distortion optimization (RDO) is applied for end-to-end training at various bit rates:

L = λ · d(x̂, x) + R_ŷ + R_ẑ,   (11)

where d(·) is a distortion measure between the reconstructed image x̂ and the original image x. Both negative MS-SSIM and MSE are used as distortion losses in our work, marked as "MS-SSIM opt." and "MSE opt.", respectively. R_ŷ and R_ẑ represent the estimated bit rates of the latent features and the hyperpriors, respectively. Note that all components of our NLAIC are trained together. We set the learning rates (LR) for E_M, D_M, E_h, D_h and P to 3 × 10⁻⁵ in the beginning; for P, the LR is clipped to 10⁻⁵ after 30 epochs. The batch size is set to 16 and the entire model is trained on 4 GPUs in parallel.

Figure 6: Coding efficiency comparison using JPEG as the anchor. Our NLAIC achieves the best BD-Rate gains among all popular algorithms: NLAIC (joint) 64.39%, Balle2019 59.84%, NLAIC (baseline) 57.74%, BPG (4:4:4) 56.19%, Balle2018 52.52%, JPEG2000 38.02%.

To understand the contribution of context modeling using spatial-channel neighbors, we offer two different implementations: the "NLAIC baseline" only uses the hyperpriors to estimate the means and variances of the latent features (see Eq. (7)), while the "NLAIC joint" model uses both the hyperpriors and the previously coded elements of the latent feature maps (see Eq. (10)). In this work, we first train the "NLAIC baseline" models. To train the "NLAIC joint" model, one way is to fix the main and hyperprior encoders and decoders of the baseline model and update only the conditional context model P. Compared with the "NLAIC baseline", such transfer-learning based "NLAIC joint" training provides a 3% bit rate reduction at the same distortion. Alternatively, we can use the baseline model as the starting point and refine all modules of the "NLAIC joint" system. In this way, "NLAIC joint" offers more than 9% bit rate reduction over the "NLAIC baseline" at the same quality. Thus, we choose the latter for better performance.

4.2. Performance Efficiency

We evaluate our NLAIC models by comparing the rate-distortion performance averaged over the publicly available Kodak dataset. Fig. 4 shows the performance when distortion is measured by MS-SSIM and PSNR, respectively, both widely used in image and video compression tasks. Here, PSNR represents the pixel-level distortion while MS-SSIM describes the structural similarity; MS-SSIM is reported to offer higher correlation with human perception, particularly at low bit rates [25]. As we can see, our NLAIC provides state-of-the-art performance with a noticeable margin over the existing leading methods, such as Ballé2019 [18] and Ballé2018 [5]. Specifically, as shown in Fig. 4(a), using MS-SSIM for both the loss and the final distortion measurement, the "NLAIC baseline" outperforms the existing methods, while the "NLAIC joint" model presents an even larger performance margin.

Figure 7: Ablation studies on NLAM, where we gradually remove the NLAM components and re-train the model.
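The BD-Rate [6] metric used in Fig. 6 can be computed as follows (a common implementation sketch in numpy, not necessarily the evaluation script used here): fit cubic polynomials to quality-vs-log-rate points for the two codecs and average the horizontal gap over the overlapping quality range.

```python
import numpy as np

def bd_rate(rd_anchor, rd_test):
    """Bjontegaard delta rate [6]: average rate difference (percent) between two
    rate-distortion curves at equal quality. Inputs are lists of (rate, quality)."""
    r1 = np.log10([r for r, _ in rd_anchor])
    q1 = [q for _, q in rd_anchor]
    r2 = np.log10([r for r, _ in rd_test])
    q2 = [q for _, q in rd_test]
    lo, hi = max(min(q1), min(q2)), min(max(q1), max(q2))  # overlapping quality range
    int1 = np.polyint(np.polyfit(q1, r1, 3))               # antiderivative of fitted cubic
    int2 = np.polyint(np.polyfit(q2, r2, 3))
    avg_gap = ((np.polyval(int2, hi) - np.polyval(int2, lo))
               - (np.polyval(int1, hi) - np.polyval(int1, lo))) / (hi - lo)
    return (10.0 ** avg_gap - 1.0) * 100.0                 # negative => test curve saves bits

# Made-up RD points (bpp, PSNR): the test curve reaches the same PSNR at half the rate,
# so the BD-Rate should come out close to -50%.
anchor = [(0.25, 30.0), (0.5, 33.0), (1.0, 36.0), (2.0, 39.0)]
half_rate = [(r / 2.0, q) for r, q in anchor]
saving = bd_rate(anchor, half_rate)
```

The log-rate domain is what makes the result a percentage: a constant gap of log₁₀(1/2) integrates to a factor of 0.5, i.e. a 50% rate saving at equal quality.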
For the case that uses MSE as the loss and PSNR as the distortion measurement, "NLAIC joint" still offers the best performance, as illustrated in Fig. 4(b). The "NLAIC baseline" is slightly worse than the model in [18], which uses contexts from both hyperpriors and neighbors jointly as our "NLAIC joint" does, but better than the work in [5], which only uses the hyperpriors for context modeling, making that a fair comparison. Fig. 6 compares the average BD-Rate reductions of various methods over the legacy JPEG encoder. Our "NLAIC joint" model shows 64.39% and 12.26% BD-Rate [6] reduction against JPEG420 and BPG444, respectively.

4.3. Ablation Studies

We further analyze our NLAIC in the following aspects.

Impacts of NLAM: To further discuss the efficiency of the newly introduced NLAM, we gradually remove the mask branch in the NLAM pairs and retrain our framework for performance evaluation. For this study, we use the baseline context modeling in all cases, with MSE as the loss function and PSNR as the final distortion measurement, as shown in Fig. 7. For illustrative understanding, we also provide two anchors, "Ballé2018" [5] and "NLAIC joint". However, to see the degradation caused by gradually removing the mask branch in the NLAMs, one should compare against the "NLAIC baseline" curve.

Figure 8: Prediction error of different models at a similar bit rate. Column-wise, from left to right: the latent features, the predicted mean, the predicted scale, the normalized prediction error (i.e., (feature − mean)/scale) and the distribution of the normalized prediction error. Each row represents a different model (e.g., various combinations of NLAM components and context prediction). These figures show that with NLAM and joint contexts from the hyperprior and autoregressive neighbors, the latent features capture more information (indicated by a larger dynamic range), which leads to a larger scale (standard deviation of the features), and the final normalized feature prediction error has the most compact distribution, which leads to the lowest bit rate.

Removing the mask branches of the first NLAM pair in the main encoder-decoder (referred to as "remove first") yields a PSNR drop of about 0.1 dB compared to the "NLAIC baseline" at the same bit rate. The PSNR drop is enlarged noticeably when removing the mask branches of all NLAM pairs in the main encoder-decoder (a.k.a. "remove main"). The worst performance is obtained when further disabling the mask branches of the NLAM pair in the hyperprior encoder-decoder, resulting in a traditional variational autoencoder without any non-local characteristics (i.e., "remove all").

Impacts of Joint Context Modeling: We further compare the conditional context modeling efficiency of the model variants in Fig. 8. As we can see, with embedded NLAM and joint context modeling, our "NLAIC joint" model provides more powerful latent features and a more compact normalized feature prediction error, both contributing to its leading coding efficiency.

Hyperpriors ẑ: The hyperprior ẑ makes a noticeable contribution to the overall compression performance [18, 5]. Its percentage decreases as the overall bit rate increases, as shown in Fig. 3. The percentage of ẑ for the MSE loss optimized model is higher than for the one using MS-SSIM loss optimization. Another interesting observation is that ẑ exhibits contradictory distributions between the joint and baseline models for the MSE and MS-SSIM loss based schemes, respectively. More exploration is highly desired in this aspect to understand the bit allocation of the hyperpriors in our future study.

4.4. Visual Comparison

We also evaluate our method on the BSD500 [1] dataset, which is widely used in image restoration problems.
Fig. 9 shows the results of different image codecs at similar bit rates. Our NLAIC provides the best subjective quality at a relatively smaller bit rate¹. Considering that the MS-SSIM loss optimized results demonstrate much smaller PSNR at high bit rates in Fig. 4(a), we also compare our models optimized for the respective PSNR and MS-SSIM losses in the high bit rate scenario. We find that the MS-SSIM loss optimized results exhibit worse details than the PSNR loss optimized models at high bit rates, as shown in Fig. 10. This may be because pixel distortion becomes more significant at high bit rates, while structural similarity carries more weight at low bit rates. It would be interesting to explore a better metric that covers the advantages of PSNR at high bit rates and of MS-SSIM at low bit rates for overall optimal efficiency.

5. Conclusion

In this paper, we proposed a non-local attention optimized deep image compression (NLAIC) method that achieves state-of-the-art performance. Specifically, we introduced the non-local operation to capture both local and global correlations for more compact latent feature representations. Together with the attention mechanism, we enable the adaptive processing of latent features by allocating more bits to important areas using the attention maps generated by the non-local operations. Joint contexts from autoregressive spatial-channel neighbors and hyperpriors are leveraged to improve entropy coding efficiency. Our NLAIC outperforms the existing image compression methods, including the well-known BPG, JPEG2000 and JPEG, as well as the most recent learning based schemes [18, 5, 20], in terms of both MS-SSIM and PSNR at the same bit rate.

For future study, we can make our context model deeper to improve image compression performance. Parallelization

¹ In practice, some bit rate points cannot be reached for BPG and JPEG; thus we choose the closest one to match our NLAIC bit rate.
and acceleration are also important for deploying the model in actual practice, particularly on mobile platforms. In addition, it is also meaningful to extend our framework to end-to-end video compression with more priors acquired from spatial and temporal information.

Figure 9: Visual comparison among JPEG420, BPG444, NLAIC joint MSE opt., MS-SSIM opt. and the original image, from left to right. Our method achieves the best visual quality, containing more texture without blocky or blurring artifacts. First example: (a) JPEG: 0.3014 bpp, PSNR 21.23, MS-SSIM 0.8504; (b) BPG: 0.3464 bpp, PSNR 24.84, MS-SSIM 0.9270; (c) NLAIC MSE opt.: 0.2929 bpp, PSNR 24.71, MS-SSIM 0.9277; (d) NLAIC MS-SSIM opt.: 0.3087 bpp, PSNR 23.57, MS-SSIM 0.9551; (e) Original. Second example: (a) JPEG: 0.2127 bpp, PSNR 25.17, MS-SSIM 0.8629; (b) BPG: 0.1142 bpp, PSNR 31.97, MS-SSIM 0.9581; (c) NLAIC MSE opt.: 0.1276 bpp, PSNR 34.63, MS-SSIM 0.9738; (d) NLAIC MS-SSIM opt.: 0.1074 bpp, PSNR 32.54, MS-SSIM 0.9759; (e) Original.

Figure 10: Illustrative reconstruction samples of the respective PSNR and MS-SSIM loss optimized compression. (a) MS-SSIM opt.: 0.8743 bpp, PSNR 31.49, MS-SSIM 0.9956; (b) MSE opt.: 0.8798 bpp, PSNR 35.12, MS-SSIM 0.9935; (c) Original; (d) MS-SSIM opt.: 0.6056 bpp, PSNR 28.84, MS-SSIM 0.9879; (e) MSE opt.: 0.6045 bpp, PSNR 30.43, MS-SSIM 0.9815; (f) Original.

References

[1] The Berkeley segmentation dataset and benchmark.
[2] Challenge on learned image compression 2018.
[3] E. Agustsson, F. Mentzer, M. Tschannen, L. Cavigelli, R. Timofte, L. Benini, and L. Van Gool.
Soft-to-hard vector quantization for end-to-end learning compressible representations. In Advances in Neural Information Processing Systems, pages 1141–1151, 2017.
[4] J. Ballé, V. Laparra, and E. P. Simoncelli. End-to-end optimized image compression. arXiv preprint arXiv:1611.01704, 2016.
[5] J. Ballé, D. Minnen, S. Singh, S. J. Hwang, and N. Johnston. Variational image compression with a scale hyperprior. arXiv preprint arXiv:1802.01436, 2018.
[6] G. Bjontegaard. Calculation of average PSNR differences between RD-curves. VCEG-M33, 2001.
[7] A. Buades, B. Coll, and J.-M. Morel. A non-local algorithm for image denoising. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), volume 2, pages 60–65. IEEE, 2005.
[8] O. Firat, K. Cho, and Y. Bengio. Multi-way, multilingual neural machine translation with a shared attention mechanism. arXiv preprint arXiv:1601.01073, 2016.
[9] S. Gu, L. Zhang, W. Zuo, and X. Feng. Weighted nuclear norm minimization with application to image denoising. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2862–2869, 2014.
[10] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[11] M. Li, W. Zuo, S. Gu, D. Zhao, and D. Zhang. Learning convolutional networks for content-weighted image compression. arXiv preprint arXiv:1703.10553, 2017.
[12] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
[13] D. Liu, B. Wen, Y. Fan, C. C. Loy, and T. S. Huang. Non-local recurrent network for image restoration.
In Advances in Neural Information Processing Systems, pages 1680–1689, 2018.
[14] H. Liu, T. Chen, P. Guo, Q. Shen, and Z. Ma. Gated context model with embedded priors for deep image compression. arXiv preprint arXiv:1902.10480, 2019.
[15] M.-T. Luong, H. Pham, and C. D. Manning. Effective approaches to attention-based neural machine translation. arXiv preprint, 2015.
[16] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman. Non-local sparse models for image restoration. In 2009 IEEE 12th International Conference on Computer Vision (ICCV), pages 2272–2279. IEEE, 2009.
[17] F. Mentzer, E. Agustsson, M. Tschannen, R. Timofte, and L. Van Gool. Conditional probability models for deep image compression. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, page 3, 2018.
[18] D. Minnen, J. Ballé, and G. D. Toderici. Joint autoregressive and hierarchical priors for learned image compression. In Advances in Neural Information Processing Systems, pages 10794–10803, 2018.
[19] A. v. d. Oord, N. Kalchbrenner, and K. Kavukcuoglu. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, 2016.
[20] O. Rippel and L. Bourdev. Real-time adaptive image compression. arXiv preprint arXiv:1705.05823, 2017.
[21] G. J. Sullivan, T. Wiegand, et al. Rate-distortion optimization for video compression. IEEE Signal Processing Magazine, 15(6):74–90, 1998.
[22] G. Toderici, D. Vincent, N. Johnston, S.-J. Hwang, D. Minnen, J. Shor, and M. Covell. Full resolution image compression with recurrent neural networks. CoRR, abs/1608.05148, 2016.
[23] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
[24] X. Wang, R. Girshick, A. Gupta, and K. He. Non-local neural networks.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7794–7803, 2018.
[25] Z. Wang, E. P. Simoncelli, and A. C. Bovik. Multiscale structural similarity for image quality assessment. In The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, 2003, volume 2, pages 1398–1402. IEEE, 2003.
[26] X. Xu, S. Liu, T. Chuang, Y. Huang, S. Lei, K. Rapaka, C. Pang, V. Seregin, Y. Wang, and M. Karczewicz. Intra block copy in HEVC screen content coding extensions. IEEE Journal on Emerging and Selected Topics in Circuits and Systems, 6(4):409–419, Dec 2016.
[27] Y. Zhang, K. Li, K. Li, B. Zhong, and Y. Fu. Residual non-local attention networks for image restoration. 2018.
