Efficient and Effective Context-Based Convolutional Entropy Modeling for Image Compression


Authors: Mu Li, Kede Ma, Jane You, David Zhang, Wangmeng Zuo

Mu Li, Kede Ma, Member, IEEE, Jane You, Member, IEEE, David Zhang, Fellow, IEEE, and Wangmeng Zuo, Senior Member, IEEE

Abstract—Precise estimation of the probabilistic structure of natural images plays an essential role in image compression. Despite the recent remarkable success of end-to-end optimized image compression, the latent codes are usually assumed to be fully statistically factorized in order to simplify entropy modeling. However, this assumption generally does not hold true and may hinder compression performance. Here we present context-based convolutional networks (CCNs) for efficient and effective entropy modeling. In particular, a 3D zigzag scanning order and a 3D code dividing technique are introduced to define proper coding contexts for parallel entropy decoding, both of which boil down to placing translation-invariant binary masks on the convolution filters of CCNs. We demonstrate the promise of CCNs for entropy modeling in both lossless and lossy image compression. For the former, we directly apply a CCN to the binarized representation of an image to compute the Bernoulli distribution of each code for entropy estimation. For the latter, the categorical distribution of each code is represented by a discretized mixture of Gaussian distributions, whose parameters are estimated by three CCNs. We then jointly optimize the CCN-based entropy model along with the analysis and synthesis transforms for rate-distortion performance. Experiments on the Kodak and Tecnick datasets show that our methods powered by the proposed CCNs generally achieve compression performance comparable to the state-of-the-art while being much faster.

Index Terms—Context-based convolutional networks, entropy modeling, image compression.

I. INTRODUCTION

Data compression has played a significant role in engineering for centuries [1].
Compression can be either lossless or lossy. Lossless compression allows perfect data reconstruction from compressed bitstreams, with the goal of assigning shorter codewords to more "probable" codes. Typical examples include Huffman coding [2], arithmetic coding [3], and range coding [4]. Lossy compression discards "unimportant" information of the input data, where the definition of importance is application-dependent. For example, if the data (such as images and videos) are meant to be consumed by the human visual system, importance should be measured in accordance with human perception, discarding features that are perceptually redundant while keeping those that are most visually noticeable. In lossy compression, one must face the rate-distortion trade-off, where the rate is computed from the entropy of the discrete codes [5] and the distortion is measured by a signal fidelity metric. A prevailing scheme in the context of lossy image compression is transform coding, which consists of three operations: transformation, quantization, and entropy coding.

This work is partially supported by the National Natural Science Foundation of China (NSFC) under Grants No. 61671182 and 61872118. Mu Li and Jane You are with the Department of Computing, The Hong Kong Polytechnic University, Kowloon, Hong Kong (e-mail: csmuli@comp.polyu.edu.hk; csyjia@comp.polyu.edu.hk). Wangmeng Zuo is with the School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China (e-mail: cswmzuo@gmail.com). Kede Ma is with the Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong (e-mail: kede.ma@cityu.edu.hk). David Zhang is with the School of Science and Engineering, The Chinese University of Hong Kong (Shenzhen), Shenzhen Research Institute of Big Data, and Shenzhen Institute of Artificial Intelligence and Robotics for Society, Shenzhen, China (e-mail: davidzhang@cuhk.edu.cn).
Transforms map an image to a latent code representation, which is better suited for exploiting aspects of human perception. Early transforms [6] are linear, invertible, and fixed for all bit rates; errors arise only from quantization. Recent transforms take the form of deep neural networks (DNNs) [7], aiming for nonlinear and more compressible representations. DNN-based transforms are mostly non-invertible, which may, however, encourage discarding perceptually unimportant image features during transformation. This gives us an opportunity to learn different transforms at different bit rates for optimal rate-distortion performance. Entropy coding is responsible for losslessly compressing the quantized codes into bitstreams for storage and transmission.

In either lossless or lossy image compression, a discrete probability distribution of the latent codes shared by the encoder and the decoder (i.e., the entropy model) is essential in determining the compression performance. According to Shannon's source coding theorem [5], given a vector of code intensities $y = \{y_0, \ldots, y_M\}$, the optimal code length of $y$ is $\lceil -\log_2 P(y) \rceil$, where binary symbols are assumed to construct the codebook. Without further constraints, it is intractable to estimate $P(y)$ in high-dimensional spaces, a problem commonly known as the curse of dimensionality. For this reason, most entropy coding schemes assume $y$ is fully statistically factorized with the same marginal distribution, leading to a code length of $\lceil -\sum_{i=0}^{M} \log_2 P(y_i) \rceil$. Alternatively, the chain rule in probability theory offers a more accurate approximation

$$P(y) \approx \prod_{i=0}^{M} P(y_i \mid \mathrm{PTX}(y_i, y)), \qquad (1)$$

where $\mathrm{PTX}(y_i, y) \subset \{y_0, \ldots, y_{i-1}\}$ represents the partial context of $y_i$, i.e., the codes coded before it in $y$.
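To make the factorized code-length bound concrete, here is a minimal sketch (toy per-code probabilities; the function name is ours, not part of the paper):

```python
import math

def ideal_code_length(probs):
    """Total code length in bits under a fully factorized entropy model:
    ceil(-sum_i log2 P(y_i)), the Shannon source-coding bound with a
    binary codebook."""
    return math.ceil(-sum(math.log2(p) for p in probs))

# Hypothetical per-code probabilities for a short code vector.
probs = [0.5, 0.25, 0.25, 0.125]
length = ideal_code_length(probs)  # -log2 terms: 1 + 2 + 2 + 3 = 8 bits
```

A context-based model as in Eqn. (1) replaces each marginal $P(y_i)$ with the sharper conditional $P(y_i \mid \mathrm{PTX}(y_i, y))$, so its probabilities are never smaller on average, and the resulting code length is never longer.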
A representative example is the context-based adaptive binary arithmetic coding (CABAC) [8] in H.264/AVC, which considers the two nearest codes as partial context and obtains noticeable improvements over previous image/video compression standards. As the size of $\mathrm{PTX}(y_i, y)$ becomes large, it is difficult to estimate this conditional probability by constructing histograms. Recent methods such as PixelRNN [9] and PixelCNN [10] take advantage of DNNs in modeling long-range relations to increase the size of the partial context, but are computationally intensive.

In this work, we present context-based convolutional networks (CCNs) for effective and efficient entropy modeling. Given $y$, we specify a 3D zigzag coding order such that the most relevant codes of $y_i$ can be included in its context. Parallel computation during entropy encoding is straightforward, as the context of each code is known and readily available. However, this is not always the case during entropy decoding. The partial context of $y_i$ should first be decoded sequentially for the estimation of $P(y_i \mid \mathrm{PTX}(y_i, y))$, which is prohibitively slow. To address this issue, we introduce a 3D code dividing technique, which partitions $y$ into multiple groups in compliance with the proposed coding order. The codes within each group are assumed to be conditionally independent given their respective contexts, and therefore can be decoded in parallel. In the context of CCNs, this amounts to applying properly designed translation-invariant binary masks to convolution filters.

To validate the proposed CCNs, we combine them with arithmetic coding [3] for entropy modeling. For lossless image compression, we convert the input grayscale image to eight binary planes and train a CCN to predict the Bernoulli distribution of $y_i$ by optimizing the entropy loss in information theory [11].
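The eight-plane binarization used in the lossless pipeline can be sketched as follows (a toy illustration of standard bit-plane decomposition; the helper name is ours):

```python
def to_bit_planes(pixels, n_bits=8):
    """Decompose a list of 8-bit grayscale pixel values into n_bits binary
    planes, most significant plane first: plane r holds bit (n_bits-1-r)
    of each pixel. Each plane is then a binary code map the CCN models
    with a Bernoulli distribution per code."""
    return [[(p >> (n_bits - 1 - r)) & 1 for p in pixels]
            for r in range(n_bits)]

# 170 = 0b10101010, so its bits alternate across the eight planes.
planes = to_bit_planes([0, 255, 170])
```

Stacking the planes along the channel axis yields the 3D code block $y$ over which the zigzag coding order is defined.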
For lossy image compression, we parameterize the categorical distribution of $y_i$ with a discretized mixture of Gaussian (MoG) distributions, whose parameters (i.e., mixture weights, means, and variances) are estimated by three CCNs. The CCN-based entropy model is jointly optimized with the analysis and synthesis transforms (i.e., mappings between the raw pixel space and the latent code space) over a database of training images, trading off the rate and the distortion. Experiments on the Kodak and Tecnick datasets show that our methods for lossless and lossy image compression perform favorably against image compression standards and DNN-based methods, especially at low bit rates.

II. RELATED WORK

In this section, we provide a brief overview of entropy models and lossy image compression methods based on DNNs. For traditional image compression techniques, we refer interested readers to [12], [13], [14].

A. DNN-Based Entropy Modeling

The first and most important step in entropy modeling is to estimate the probability $P(y)$. For most image compression techniques, $y$ is assumed to be statistically independent, whose entropy can be easily computed through the marginal distributions [7], [15], [16], [17]. Arguably, natural images undergoing a highly nonlinear analysis transform still exhibit strong statistical redundancies [18]. This suggests that incorporating context into probability estimation has great potential for improving the performance of entropy coding.

DNN-based context modeling for natural languages and images has attracted considerable attention in the past decade. In natural language processing, recurrent neural networks (RNNs) [19] and long short-term memory (LSTM) [20] are two popular tools to model long-range dependencies. In image processing, PixelRNN [9] and PixelCNN [10] are among the first attempts to exploit long-range pixel dependencies for image generation.
The above-mentioned methods are computationally inefficient, requiring one forward propagation to generate (or estimate the probability of) a single pixel. To speed up PixelCNN, Reed et al. [21] proposed Multiscale PixelCNN, which is able to sample a twice-larger intermediate image conditioned on the initial image. This process may be iterated to generate the final high-resolution result. When viewing Multiscale PixelCNN as an entropy model, we must losslessly compress and send the initial image as side information to the decoder for entropy decoding.

Only recently have DNNs for context-based entropy modeling become an active research topic. Ballé et al. [18] introduced a scale prior, which stores a variance parameter for each $y_i$ as side information. Richer side information generally leads to more accurate entropy modeling. However, this type of information must also be quantized, compressed, and considered part of the codes, and it is difficult to trade off the bits saved by the improved entropy model against the bits introduced by storing this side information. Li et al. [22] extracted a small code block for each $y_i$ as its context, and adopted a simple DNN for entropy modeling. The method suffers from heavy computational complexity similar to PixelRNN [9]. Li et al. [23] and Mentzer et al. [24] implemented parallel entropy encoding with masked DNNs. However, sequential entropy decoding has to be performed due to the context dependence, which remains painfully slow. In contrast, our CCN-based entropy model permits parallel entropy encoding and decoding, making it more attractive for practical applications.

B. DNN-Based Lossy Image Compression

A major problem in end-to-end lossy image compression is that the gradients of the quantization function are zero almost everywhere, making gradient descent-based optimization ineffective.
Different strategies have been proposed to alleviate the zero-gradient problem resulting from quantization. From a signal processing perspective, the quantizer can be approximated by additive i.i.d. uniform noise with the same width as the quantization bin [25]. A desirable property of this approximation is that the resulting density is a continuous relaxation of the probability mass function of $y$ [7]. Another line of research introduced continuous functions (without the zero-gradient problem) to approximate the quantization function: the step quantizer is used in the forward pass, while its continuous proxy is used in the backward pass. Toderici et al. [26] learned an RNN to compress small-size images in a progressive manner, and later tested their models on large-size images [27]. Johnston et al. [28] exploited adaptive bit allocation and perceptual losses to boost compression performance, especially in terms of MS-SSIM [29].

The joint optimization of rate-distortion performance is another crucial issue in DNN-based image compression. The methods in [26], [27], [28] treat entropy coding as a post-processing step. Ballé et al. [7] explicitly formulated DNN-based image compression under the framework of rate-distortion optimization. Assuming $y$ is statistically factorized, they learned piecewise linear density functions to compute differential entropy as an approximation to discrete entropy. In a subsequent work [18], each $y_i$ is assumed to follow a zero-mean Gaussian with its own variance separately predicted using side information. Minnen et al. [30] combined autoregressive and hierarchical priors, leading to improved rate-distortion performance. Theis et al. [15] introduced a continuous upper bound on the discrete entropy with a Gaussian scale mixture. Rippel et al.
[17] described pyramid-based analysis and synthesis transforms with adaptive code length regularization for real-time image compression. An adversarial loss [31] is incorporated to generate visually realistic results at low bit rates [17].

III. CCNs FOR ENTROPY MODELING

In this section, we present in detail the construction of CCNs for entropy modeling. We work with a fully convolutional network, consisting of $T$ layers of convolutions followed by point-wise nonlinear activation functions, and assume the standard raster coding order (see Fig. 1). In order to perform efficient context-based entropy coding, two assumptions are made on the network architecture:

• For a code block $y \in \mathcal{Q}^{M \times H \times W}$, where $M$, $H$, and $W$ denote the dimensions along the channel, height, and width directions, respectively, the corresponding output of the $t$-th convolution layer $v^{(t)}$ has a size of $M \times H \times W \times N_t$, where $N_t$ denotes the number of feature blocks used to represent $y$.

• Let $\mathrm{CTX}(y_i(p,q), y)$ be the set of codes encoded before $y_i(p,q)$ (i.e., the full context), and $\mathrm{SS}(v^{(t)}_{i,j}(p,q))$ be the set of codes in the receptive field of $v^{(t)}_{i,j}(p,q)$ that contribute to its computation (i.e., the support set). Then, $\mathrm{SS}(v^{(t)}_{i,j}(p,q)) \subset \mathrm{CTX}(y_i(p,q), y)$.

Assumption I establishes a one-to-many correspondence between the input code block $y$ and the output feature representation $v^{(T)}$. In other words, the feature $v^{(t)}_{i,j}(p,q)$ in the $i$-th channel and $j$-th feature block at spatial location $(p,q)$ is uniquely associated with $y_i(p,q)$. Assumption II ensures that the computation of $v^{(t)}_i(p,q)$ depends only on a subset of $\mathrm{CTX}(y_i(p,q), y)$. Together, the two assumptions guarantee the legitimacy of context-based entropy modeling in fully convolutional networks, which can be achieved by placing translation-invariant binary masks on convolution filters.
We start with the case of a 2D code block, where $y \in \mathcal{Q}^{H \times W}$, and define the masked convolution at the $t$-th layer as

$$v^{(t)}_i(p,q) = \sum_{j=1}^{N_t} \left( u^{(t)}_j \ast \left( m^{(t)} \odot w^{(t)}_{i,j} \right) \right)(p,q) + b^{(t)}_i, \qquad (2)$$

where $\ast$ and $\odot$ denote 2D convolution and the Hadamard product, respectively, $w^{(t)}_{i,j}$ is a 2D convolution filter, $m^{(t)}$ is the corresponding 2D binary mask, and $b^{(t)}_i$ is the bias. According to Assumption I, the input $u^{(t)}_i$ and the output $v^{(t)}_i$ are of the same size as $y$. The input code block $y$ corresponds to $u^{(0)}_0$.

Fig. 1. Illustration of 2D masked convolution in the input layer of the proposed CCN for entropy modeling. A raster coding order (left to right, top to bottom) and a convolution kernel size of $5 \times 5$ are assumed here. The orange and blue dashed regions indicate the full context of the orange and blue codes, respectively. In the right panel, we highlight the support sets of the two codes in corresponding colors, which share the same mask.

For the input layer of a fully convolutional network, the set of codes used to produce $v^{(0)}_i(p,q)$ is $\Omega_{p,q} = \{y(p+\mu, q+\nu)\}$, where $(\mu, \nu) \in \Psi$, the set of local indices centered at $(0,0)$. We choose

$$\mathrm{SS}(v^{(0)}_i(p,q)) = \mathrm{CTX}(y(p,q), \Omega_{p,q}) \subset \mathrm{CTX}(y(p,q), y), \qquad (3)$$

which can be achieved by setting

$$m^{(0)}(\mu,\nu) = \begin{cases} 1, & \text{if } \Omega_{p,q}(\mu,\nu) \in \mathrm{CTX}(y(p,q), y) \\ 0, & \text{otherwise}. \end{cases} \qquad (4)$$

Fig. 1 illustrates the concepts of the full context $\mathrm{CTX}(y(p,q), y)$, the support set $\mathrm{SS}(v^{(0)}(p,q))$, and the translation-invariant mask $m^{(0)}$. At the $t$-th layer, if we let $m^{(t)} = m^{(0)}$, then for a code $y(p+\mu, q+\nu) \in \mathrm{CTX}(y(p,q), y)$, we have

$$\mathrm{SS}(u^{(t)}_j(p+\mu, q+\nu)) \subset \mathrm{CTX}(y(p+\mu, q+\nu), y) \subset \mathrm{CTX}(y(p,q), y), \qquad (5)$$

where the first inclusion follows by induction and the second follows from the definition of context.
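The input-layer mask of Eqn. (4) under the raster coding order, and the hidden-layer variant that also passes the center tap (Eqn. (6)), can be sketched as follows (illustrative only; the function name is ours):

```python
def raster_mask(size, include_center):
    """Translation-invariant binary mask for an S x S filter under the
    raster coding order. With local offsets centered at (size//2, size//2),
    positions strictly above the center row, or in the center row and to
    the left of center, belong to the context and get 1; everything else
    gets 0. include_center=True additionally passes the center tap, which
    is legal in hidden layers since u^{(t)}(p, q) for t > 0 is itself
    computed only from the context of y(p, q)."""
    c = size // 2
    mask = [[0] * size for _ in range(size)]
    for mu in range(size):
        for nu in range(size):
            if mu < c or (mu == c and nu < c):
                mask[mu][nu] = 1
    if include_center:
        mask[c][c] = 1
    return mask

m0 = raster_mask(5, include_center=False)  # input-layer mask, as in Eqn. (4)
mt = raster_mask(5, include_center=True)   # hidden-layer mask
```

Element-wise multiplying each filter by the appropriate mask before convolving, as in Eqn. (2), enforces Assumption II at every layer.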
That is, as long as $y(p+\mu, q+\nu)$ is in the context of $y(p,q)$, we are able to compute $v^{(t)}_i(p,q)$ from $u^{(t)}_j(p+\mu, q+\nu)$ without violating Assumption II. In addition, $u^{(t)}_j(p,q)$ for $t > 0$ is also generated from $\mathrm{CTX}(y(p,q), y)$, and can be used to compute $v^{(t)}_i(p,q)$. Therefore, we may modify the mask at the $t$-th layer as

$$m^{(t)}(\mu,\nu) = \begin{cases} m^{(0)}(\mu,\nu), & \text{if } (\mu,\nu) \neq (0,0) \\ 1, & \text{otherwise}. \end{cases} \qquad (6)$$

A. Proposed Strategies for Parallel Entropy Decoding

With the translation-invariant masks designed in Eqn. (4) and Eqn. (6), the proposed CCN can efficiently encode $y$ in parallel. However, it remains difficult to parallelize the computation in entropy decoding. As shown in Fig. 2 (a) and (b), the two nearby codes in the same row (highlighted in orange and blue, respectively) cannot be decoded simultaneously

Fig. 2. Illustration of code dividing techniques in conjunction with different coding orders for a 2D code block. The orange and blue dots represent two neighbouring codes. The gray dots denote codes that have already been encoded, while the white circles represent codes yet to be encoded. (a) Raster coding order adopted in many compression methods. (b) Support sets of the orange and blue codes, respectively. It is clear that the orange code is in the support set of the blue one, and therefore should be decoded first. (c) Code dividing scheme for the raster coding order. By removing the dependencies among codes in each row, the orange and blue codes can be decoded in parallel. However, the orange code is excluded from the support set of the blue one, which may hinder entropy estimation accuracy. (d) Zigzag coding order and its corresponding code dividing scheme. The two codes in the orange squares that are important for the orange code in entropy prediction are retained in its partial context.
(e) Support sets of the orange and blue codes in compliance with the zigzag coding order.

because the orange code is in the support set (or context) of the blue code under the raster coding order. To speed up entropy decoding, we may further remove dependencies between codes at the risk of model accuracy. Specifically, we partition $y$ into $K$ groups, namely $\mathrm{GP}_0(y), \ldots, \mathrm{GP}_{K-1}(y)$, and assume the codes within the same group are statistically independent. This results in a partial context $\mathrm{PTX}(y(p,q), y) = \{\mathrm{GP}_0(y), \ldots, \mathrm{GP}_{k-1}(y)\}$ for $y(p,q) \in \mathrm{GP}_k(y)$. In other words, all codes in the $k$-th group share the same partial context, and can be decoded in parallel. Note that code dividing schemes are largely constrained by pre-specified coding orders. For example, if we use a raster coding order, it is straightforward to divide $y$ by row.

Fig. 3. Illustration of the proposed 3D zigzag coding order and 3D code dividing technique. (a) Each group in the shape of a diagonal plane is highlighted in green. Specifically, $\mathrm{GP}_k(y) = \{y_r(p,q) \mid r+p+q = k\}$ is encoded earlier than $\mathrm{GP}_{k+1}(y)$. Within $\mathrm{GP}_k(y)$, we first process codes along the line $p+q = k$ by gradually decreasing $p$. We then process codes along the line $p+q = k-1$ in the same order. The procedure continues until we sweep codes along the last line $p+q = \max(k-r, 0)$ in $\mathrm{GP}_k(y)$. (b) Support set of the orange code with a spatial filter size of $3 \times 3$. Zoom in for improved visibility.

Fig. 4. Illustration of masked codes with $M = 6$, $r = 2$, and a filter size of $3 \times 3$. Blue dots represent codes activated by the mask and red dots indicate the opposite. The only difference lies in the green diagonal plane. (a) Input layer. (b) Hidden layer.
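The 2D zigzag code dividing scheme of Fig. 2 (d) can be sketched as follows (a toy illustration; the function name is ours):

```python
def zigzag_groups(H, W):
    """Partition a 2D H x W code block into anti-diagonal groups
    GP_k = {(p, q) : p + q = k}. All codes in GP_k share the partial
    context {(p', q') : p' + q' < k}, so — under the conditional
    independence assumption within a group — an entire group can be
    decoded in one parallel step."""
    groups = [[] for _ in range(H + W - 1)]
    for p in range(H):
        for q in range(W):
            groups[p + q].append((p, q))
    return groups

gps = zigzag_groups(3, 4)  # a 3 x 4 block yields 6 groups
```

Decoding then proceeds group by group: the CCN evaluates the distributions of all codes in $\mathrm{GP}_k$ from the already-decoded groups $\mathrm{GP}_0, \ldots, \mathrm{GP}_{k-1}$, rather than one code at a time.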
In this case, $y(p, q-1)$ ($p$ and $q$ index the vertical and horizontal directions, respectively), which is extremely important in predicting the probability of $y(p,q)$ according to CABAC [8], has been excluded from its partial context. To make a good trade-off between modeling efficiency and accuracy, we switch to a zigzag coding order as shown in Fig. 2 (d), where $\mathrm{GP}_k(y) = \{y(p,q) \mid p+q = k\}$ and $\mathrm{PTX}(y(p,q), y) = \{y(p', q') \mid p' + q' < k\}$. As such, we retain the most relevant codes in the partial context for better entropy modeling (see Fig. 9 for quantitative results). Accordingly, the mask at the $t$-th layer becomes

$$m^{(t)}(\mu,\nu) = \begin{cases} m^{(0)}(\mu,\nu), & \text{if } \mu + \nu \neq 0 \\ 1, & \text{otherwise}. \end{cases} \qquad (7)$$

[Fig. 5(a) diagram: grayscale image → binarized image planes → CCN of MConv layers ($5 \times 5$ filters; feature blocks $16 \times 1$ at the input, $16 \times 16$ in the hidden layers, $1 \times 16$ at the output) → mean estimates of Bernoulli distributions. Fig. 5(b)/(c) diagrams: code block → CCN → $P(\mathrm{GP}_k(y))$ → arithmetic encoder → bitstream, and bitstream → arithmetic decoder (with the CCN applied to already-decoded groups) → decoded code block.]

Fig. 5. Proposed lossless image compression method. (a) gives the CCN-based entropy model for lossless image compression. The grayscale image $x$ is first converted to a bit-plane representation $y$, which is fed to the network to predict the mean estimates of the Bernoulli distributions $P(y_r(p,q) \mid \mathrm{SS}(v_r(p,q)))$. Each convolution layer is followed by a parametric ReLU nonlinearity, except for the last layer, where a sigmoid function is adopted. From the mean estimates, we find that for the most significant bit-planes, our model makes more confident predictions closely approximating local image structures.
For the least significant bit-planes, our model is less confident, producing mean estimates close to $0.5$. MConv: masked convolution used in our CCNs, with filter size ($S \times S$) and the number of feature blocks (output × input). (b) and (c) show arithmetic encoding and decoding with the learned CCN, respectively.

Now we extend our discussion to a 3D code block, where $y \in \mathcal{Q}^{M \times H \times W}$. Fig. 3 (a) shows the proposed 3D zigzag coding order and 3D code dividing technique. Specifically, $y$ is divided into $K = M + H + W - 2$ groups in the shape of diagonal planes, where the $k$-th group is specified by $\mathrm{GP}_k(y) = \{y_r(p,q) \mid r+p+q = k\}$. The partial context of $y_r(p,q) \in \mathrm{GP}_k(y)$ is defined as $\mathrm{PTX}(y_r(p,q), y) = \{y_{r'}(p', q') \mid r' + p' + q' < k\}$. We then write the masked convolution in the 3D case as

$$v^{(t)}_{i,r}(p,q) = \sum_{j=1}^{N_t} \sum_{s=1}^{M} \left( u^{(t)}_{j,s} \ast \left( m^{(t)}_{r,s} \odot w^{(t)}_{i,j,r,s} \right) \right)(p,q) + b^{(t)}_{i,r}, \qquad (8)$$

where $\{i, j\}$ and $\{r, s\}$ are indices for the feature block and the channel, respectively. In the 2D case, each layer shares the same mask ($M = 1$). When extending to 3D code blocks, each channel in a layer shares a mask, and there are a total of $M$ 3D masks. For the input layer, the set of codes used to produce $v^{(0)}_{i,r}(p,q)$ is $\Omega_{p,q} = \{y_s(p+\mu, q+\nu)\}$, $(\mu,\nu) \in \Psi$, $0 \le s$
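The 3D code dividing into $K = M + H + W - 2$ diagonal-plane groups can be sketched as follows (illustrative; the function name is ours — the within-group traversal order of Fig. 3 (a) is omitted, since codes in a group are assumed conditionally independent and their enumeration order does not affect parallel decoding):

```python
def zigzag_groups_3d(M, H, W):
    """Partition an M x H x W code block into K = M + H + W - 2
    diagonal-plane groups GP_k = {(r, p, q) : r + p + q = k}, following
    the proposed 3D zigzag coding order. The partial context of every
    code in GP_k is the union of GP_0, ..., GP_{k-1}."""
    groups = [[] for _ in range(M + H + W - 2)]
    for r in range(M):
        for p in range(H):
            for q in range(W):
                groups[r + p + q].append((r, p, q))
    return groups

gps3 = zigzag_groups_3d(2, 3, 3)  # K = 2 + 3 + 3 - 2 = 6 groups
```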