Autoencoder based image compression: can the learning be quantization independent?

This paper explores the problem of learning transforms for image compression via autoencoders. Usually, the rate-distortion performances of image compression are tuned by varying the quantization step size. In the case of autoencoders, this would in principle require learning one transform per rate-distortion point at a given quantization step size. Here, we show that comparable performances can be obtained with a unique learned transform, the different rate-distortion points being reached by varying the quantization step size at test time. This approach saves a lot of training time.

Authors: Thierry Dumas (Sirocco), Aline Roumy (Sirocco), Christine Guillemot (Sirocco)

AUTOENCODER BASED IMAGE COMPRESSION: CAN THE LEARNING BE QUANTIZATION INDEPENDENT?

Thierry Dumas, Aline Roumy and Christine Guillemot
INRIA Rennes Bretagne-Atlantique
thierry.dumas@inria.fr, aline.roumy@inria.fr, christine.guillemot@inria.fr

ABSTRACT

This paper explores the problem of learning transforms for image compression via autoencoders. Usually, the rate-distortion performances of image compression are tuned by varying the quantization step size. In the case of autoencoders, this would in principle require learning one transform per rate-distortion point at a given quantization step size. Here, we show that comparable performances can be obtained with a unique learned transform. The different rate-distortion points are then reached by varying the quantization step size at test time. This approach saves a lot of training time.

Index Terms — Image compression, deep autoencoders, quantization.

1. INTRODUCTION

Image coding standards all use linear and invertible transforms to convert an image into coefficients with low statistical dependencies, i.e. suited for scalar quantization. Notably, the discrete cosine transform (DCT) is the most commonly used, for two reasons: (i) it is image-independent, implying that the DCT does not need to be transmitted, and (ii) it approaches the optimal orthogonal transform in terms of rate-distortion, assuming that natural images can be modeled by zero-mean Gaussian-Markov processes with high correlation [1]. Deep autoencoders have been shown to be promising tools for finding alternative transforms [2, 3, 4]. Autoencoders learn the encoder-decoder non-linear transform from natural images.

In the best image compression algorithms based on autoencoders [5, 6, 7], one transform is learned per rate-distortion point at a given quantization step size. Then, the quantization step size remains unchanged at test time so that the training and test conditions are identical. By contrast, image coding standards implement adaptive quantizations [8, 9]. Should the quantization be imposed during the training? To answer this, we propose an approach where the transform and the quantization are learned jointly. Then, we investigate whether, at test time, the compression falls apart when the coefficients obtained with the learned transform are quantized using quantization step sizes which differ from those in the training stage. The code to reproduce our numerical results and train the autoencoders is available online¹.

This work has been supported by the French Defense Procurement Agency (DGA).
¹ www.irisa.fr/temics/demos/visualization_ae/visualizationAE.htm

Matrices and tensors are denoted by bold letters. $\|\mathbf{X}\|_F$ is the Frobenius norm of $\mathbf{X}$. $\mathbf{X} \odot \mathbf{Z}$ is the elementwise multiplication between $\mathbf{X}$ and $\mathbf{Z}$.

2. JOINT LEARNING OF THE TRANSFORM AND THE QUANTIZATION

Section 2 introduces an efficient autoencoder for image compression. Then, it details our proposal for learning jointly this autoencoder transform and the quantization.

2.1. Autoencoder for image compression

An autoencoder is a neural network with an encoder $g_e$, parametrized by $\boldsymbol{\theta}$, that computes a representation $\mathbf{Y}$ from the data $\mathbf{X}$, and a decoder $g_d$, parametrized by $\boldsymbol{\phi}$, that gives a reconstruction $\hat{\mathbf{X}}$ of $\mathbf{X}$, see Figure 1. Autoencoders can be used for denoising or dimensionality reduction. When an autoencoder is used for compression, the representation is also quantized, leading to the quantized representation $\hat{\mathbf{Y}} = \mathcal{Q}(\mathbf{Y})$. If an autoencoder has fully-connected layers [10, 11, 12], the number of parameters depends on the image size. This implies that one autoencoder has to be trained per image size. To avoid this, an architecture without fully-connected layers is chosen. It exclusively comprises convolutional layers and non-linear operators. In this case, $\mathbf{Y} \in \mathbb{R}^{h \times w \times m}$ is a set of $m$ feature maps of size $n = h \times w$, see Figure 1.

[Fig. 1: Illustration of an autoencoder for image compression.]

The basic autoencoder training minimizes the image reconstruction error [13]. In order to create a rate-distortion optimization, the authors in [6] add the minimization of the entropy of the quantized representation. Moreover, a bit allocation is performed by learning a normalization for each feature map of $\mathbf{Y}$. The encoder followed by the normalizations at the encoder side, parametrized by $\boldsymbol{\varphi}_e$, is denoted $g_e(\,.\,; \boldsymbol{\theta}, \boldsymbol{\varphi}_e)$. Similarly, the normalizations at the decoder side, parametrized by $\boldsymbol{\varphi}_d$, followed by the decoder, are denoted $g_d(\,.\,; \boldsymbol{\varphi}_d, \boldsymbol{\phi})$. Finally, this leads to (1):

$$\min_{\boldsymbol{\theta}, \boldsymbol{\varphi}_e, \boldsymbol{\varphi}_d, \boldsymbol{\phi}} \mathbb{E}\left[ \left\| \mathbf{X} - g_d\big(\mathcal{Q}(g_e(\mathbf{X}; \boldsymbol{\theta}, \boldsymbol{\varphi}_e)); \boldsymbol{\varphi}_d, \boldsymbol{\phi}\big) \right\|_F^2 + \gamma \sum_{i=1}^{m} H_i \right], \quad H_i = -\frac{1}{n} \sum_{j=1}^{n} \log_2\big(\hat{p}_i(\hat{y}_{ij})\big), \quad \gamma \in \mathbb{R}_+^* \tag{1}$$

$\hat{p}_i$ is the probability mass function of the coefficients $\{\hat{y}_{ij}\}_{j=1...n}$ of the $i$-th quantized feature map. The expectation $\mathbb{E}[\,.\,]$ is approximated by averaging over a training set of images. Unfortunately, $\mathcal{Q}$ makes minimization (1) unusable. Indeed, the derivative of any quantization with respect to its input is 0 at every point where it is defined. Consequently, $\boldsymbol{\theta}$ and $\boldsymbol{\varphi}_e$ cannot be learned via gradient-based methods [14]. To get around this issue, [6] fixes the quantization step size to 1 and approximates the uniform scalar quantization with the addition of a uniform noise of support $[-0.5, 0.5]$. Note that, even though the quantization step size is fixed, the bit allocation varies over the different feature maps via the normalizations. In the next section, we consider instead removing the normalizations and learning explicitly the quantization step size for each feature map of $\mathbf{Y}$.
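As a concrete illustration, here is a minimal NumPy sketch (not the authors' released code; shapes and names are made up) contrasting the hard uniform scalar quantizer with the additive-noise surrogate of [6]:

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize(y, delta=1.0):
    # Uniform scalar quantization: piecewise constant, so its derivative
    # with respect to y is 0 almost everywhere and gradients cannot flow.
    return delta * np.round(y / delta)

def noisy_quantize(y, delta=1.0):
    # Training-time surrogate of [6]: replace rounding by additive uniform
    # noise of support [-0.5 * delta, 0.5 * delta], which is differentiable
    # in y (the noise does not depend on y).
    return y + rng.uniform(-0.5 * delta, 0.5 * delta, size=y.shape)

y = rng.laplace(loc=0.0, scale=1.0, size=(8, 8, 4))  # stand-in for Y
print(np.mean((quantize(y) - y) ** 2))        # error of the hard quantizer
print(np.mean((noisy_quantize(y) - y) ** 2))  # error of the noisy surrogate
```

Under the usual high-resolution approximation, both perturbations have the same variance ($\delta^2/12$ for step size $\delta$), which is what makes the noise a reasonable training proxy for the quantization error.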
2.2. Learning the quantization step sizes

We address the problem of optimizing the quantization step size for each feature map of $\mathbf{Y}$. Because of the quantization, the function to be minimized is an implicit function of the quantization step sizes $\{\delta_i\}_{i=1...m}$. The target is to make it an explicit function of $\{\delta_i\}_{i=1...m}$. For $q \in \{..., -\delta_i, 0, \delta_i, ...\}$,

$$\hat{p}_i(q) = \int_{q - 0.5\delta_i}^{q + 0.5\delta_i} p_i(t)\, dt = \delta_i\, \tilde{p}_i(q) \tag{2}$$

where $\tilde{p}_i = p_i * l_i$, $p_i$ is the probability density function of the coefficients $\{y_{ij}\}_{j=1...n}$ of the $i$-th feature map, and $l_i$ denotes the probability density function of the continuous uniform distribution of support $[-0.5\delta_i, 0.5\delta_i]$. The normalizations are removed from (1) and, using (2), (1) becomes (3):

$$\min_{\boldsymbol{\theta}, \boldsymbol{\phi}} \mathbb{E}\left[ \left\| \mathbf{X} - g_d\big(g_e(\mathbf{X}; \boldsymbol{\theta}) + \mathbf{E}; \boldsymbol{\phi}\big) \right\|_F^2 + \gamma \sum_{i=1}^{m} \tilde{h}_i \right], \quad \tilde{h}_i = -\log_2(\delta_i) - \frac{1}{n} \sum_{j=1}^{n} \log_2\big(\tilde{p}_i(y_{ij} + \varepsilon_{ij})\big) \tag{3}$$

The $i$-th matrix of $\mathbf{E} \in \mathbb{R}^{h \times w \times m}$ contains $n$ realizations $\{\varepsilon_{ij}\}_{j=1...n}$ of $E_i$, $E_i$ being a continuous random variable of probability density function $l_i$. In (3), the function to be minimized is differentiable with respect to $\boldsymbol{\theta}$, which can thus be learned via gradient-based methods. However, $\{\delta_i\}_{i=1...m}$ cannot yet be learned, as the function to be minimized in (3) is not differentiable with respect to $\{\delta_i\}_{i=1...m}$. This is resolved using the change of variable $E_i = \delta_i T$, where $T$ is a random variable following the continuous uniform distribution of support $[-0.5, 0.5]$. Now, the minimization over $\{\delta_i\}_{i=1...m}$ is feasible, see (4):

$$\min_{\boldsymbol{\theta}, \boldsymbol{\phi}, \delta_1, ..., \delta_m} \mathbb{E}\left[ \left\| \mathbf{X} - g_d\big(g_e(\mathbf{X}; \boldsymbol{\theta}) + \boldsymbol{\Delta} \odot \mathbf{T}; \boldsymbol{\phi}\big) \right\|_F^2 + \gamma \sum_{i=1}^{m} \tilde{h}_i \right], \quad \tilde{h}_i = -\log_2(\delta_i) - \frac{1}{n} \sum_{j=1}^{n} \log_2\big(\tilde{p}_i(y_{ij} + \delta_i \tau_{ij})\big) \tag{4}$$

The $i$-th matrix of $\mathbf{T} \in \mathbb{R}^{h \times w \times m}$ contains $n$ realizations $\{\tau_{ij}\}_{j=1...n}$ of $T$. All the coefficients in the $i$-th matrix of $\boldsymbol{\Delta} \in \mathbb{R}^{h \times w \times m}$ are equal to $\delta_i$. A detail has been left out so far: $\tilde{p}_i$ is unknown. In a similar manner to [5, 6], $\tilde{p}_i$ can be replaced by a function $\tilde{f}_i$, parametrized by $\boldsymbol{\psi}^{(i)}$, and $\boldsymbol{\psi}^{(i)}$ is learned such that $\tilde{f}_i$ fits $\tilde{p}_i$. In the end, we have three groups of parameters: $\{\boldsymbol{\theta}, \boldsymbol{\phi}\}$, $\{\delta_i\}_{i=1...m}$ and $\{\boldsymbol{\psi}^{(i)}\}_{i=1...m}$. These three groups are learned by alternating three different stochastic gradient descents. All the training heuristics are detailed in the code¹.
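For illustration, the rate term of (4) can be evaluated as in the following sketch. Here `log2_f_tilde(v, i)` is a hypothetical callable standing for the learned density model $\tilde{f}_i$; in the actual training, this computation runs inside an automatic-differentiation framework so that each $\delta_i$ receives gradients, and the three parameter groups are updated in alternation as described above.

```python
import numpy as np

rng = np.random.default_rng(0)

def rate_term(y, delta, log2_f_tilde):
    # Evaluates the rate term of (4): for each feature map i,
    #   h_i = -log2(delta_i) - mean_j log2( f~_i(y_ij + delta_i * tau_ij) ),
    # which is an explicit function of every step size delta_i.
    h_total = 0.0
    for i in range(y.shape[-1]):
        tau = rng.uniform(-0.5, 0.5, size=y[..., i].shape)
        noisy = y[..., i] + delta[i] * tau
        h_total += -np.log2(delta[i]) - np.mean(log2_f_tilde(noisy, i))
    return h_total

# Toy usage: a unit-Laplace density as a stand-in for every learned f~_i.
def log2_laplace(v, i):
    return np.log2(0.5) - np.abs(v) / np.log(2.0)

y = rng.laplace(size=(8, 8, 4))  # stand-in for Y
print(rate_term(y, np.full(4, 1.0), log2_laplace))
```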
Section 2 has developed an approach for learning explicitly the transform and a quantization step size for each feature map of $\mathbf{Y}$. Before evaluating this approach in Section 4, Section 3 studies what would happen if, at test time, the coefficients in $\mathbf{Y}$ were quantized using quantization step sizes that differ from those in the training stage. This first requires understanding the internal structure of $\mathbf{Y}$ after the training.

3. INSIDE THE LEARNED REPRESENTATION

This section studies the different feature maps of $\mathbf{Y}$ after the training. To this end, a deep convolutional autoencoder must first be built and trained. $g_e$ is the composition of a convolutional layer, a generalized divisive normalization (GDN) [15], a convolutional layer, a GDN and a convolutional layer. $g_d$ is the reverse composition, replacing each GDN with an inverse generalized divisive normalization (IGDN) [15] and each convolutional layer with a transpose convolutional layer [16]. It is important to stress that $m = 128$, $\mathbf{X}$ has one channel, and the convolutional strides and paddings are chosen such that $h$ and $w$ are 16 times smaller than respectively the height and the width of $\mathbf{X}$. Therefore, the number of pixels in $\mathbf{X}$ is twice the number of coefficients in $\mathbf{Y}$. The training set contains 24000 luminance images of size 256 × 256 that are extracted from ImageNet [17]. The minimization is (4), with $\gamma = 10000.0$. Note that, if a GDN were placed immediately after $g_e$, an IGDN were placed immediately before $g_d$ and, $\forall i \in [|1, m|]$, $\delta_i = 1.0$ were fixed rather than learned, the autoencoder architecture and the training would correspond to [6].

[Fig. 2: Normed histogram of the i-th feature map of Y; (a) i = 50, (b) i = 125.]

[Fig. 3: Histogram of the m − 1 scales provided by the fitting.]

3.1. Distribution of the learned representation

After the training, a test set of 24 luminance images of size 512 × 768 is created from the Kodak suite². Here, $\mathbf{X}$ refers to a test luminance image. Figure 2 shows the normed histogram of the 50th feature map of $\mathbf{Y} = g_e(\mathbf{X}; \boldsymbol{\theta})$ and that of its 125th feature map, averaged over the test set. Every feature map of $\mathbf{Y}$, except the 90th, has a normed histogram similar to those displayed.

² r0k.us/graphics/kodak/

To be more precise, let us write $f(\,.\,; \mu, \lambda)$ for the probability density function of the Laplace distribution with mean $\mu \in \mathbb{R}$ and scale $\lambda \in \mathbb{R}_+^*$:

$$f(x; \mu, \lambda) = \frac{1}{2\lambda} \exp\left( -\frac{|x - \mu|}{\lambda} \right)$$

$\forall i \in [|1, m|], i \neq 90$, there exist $\mu_i \in \mathbb{R}$ and $\lambda_i \in \mathbb{R}_+^*$ such that $f(\,.\,; \mu_i, \lambda_i)$ fits well the normed histogram of the $i$-th feature map of $\mathbf{Y}$. Note that most of the $m - 1$ scales belong to $[0.5, 2.0]$, see Figure 3.
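For concreteness, a standard way to obtain such a fit (not necessarily the exact procedure behind Figure 3, which is detailed in the code¹) is the maximum-likelihood Laplace estimate: the sample median for the location and the mean absolute deviation for the scale.

```python
import numpy as np

def fit_laplace(coeffs):
    # Maximum-likelihood Laplace fit: the location estimate is the sample
    # median and the scale estimate is the mean absolute deviation around it.
    mu = np.median(coeffs)
    lam = np.mean(np.abs(coeffs - mu))
    return mu, lam

rng = np.random.default_rng(0)
mu_hat, lam_hat = fit_laplace(rng.laplace(loc=0.3, scale=1.2, size=100000))
print(mu_hat, lam_hat)  # close to (0.3, 1.2)
```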
For transformed coefficients having a zero-mean Laplace distribution, [18] proves that a uniform reconstruction quantizer (URQ) with constant decision offsets approaches the optimal scalar quantizer in terms of squared-error distortion for any quantization step size. Yet, in our case, (i) the $m - 1$ Laplace probability density functions are not zero-mean, and (ii) uniform scalar quantizers are used instead of this URQ. Point (i) is not problematic, as an extra set of luminance images is used to compute an approximation $\mu_i \in \mathbb{R}$ of the mean of the $i$-th feature map of $\mathbf{Y}$; then, at test time, the $i$-th feature map of $\mathbf{Y}$ is centered via $\mu_i$ before being quantized. Note that $\{\mu_i\}_{i=1...m}$ does not depend on the test luminance images, thus incurring no transmission cost. Regarding point (ii), it must be noted that the decoder mapping of the URQ is exactly the decoder mapping of the uniform scalar quantization with the same quantization step size. Since our case comes close to the requirements of the proof in [18], at test time, the rate-distortion trade-off should not collapse as the quantization step sizes deviate from the learned values. This will be verified in Section 4.

3.2. Internal structure of the learned representation

The shortcoming of the previous fitting is that it does not reveal what information each matrix of $\mathbf{Y}$ encodes. To discover it, further visualizations are needed. The most common way of exploring a deep convolutional neural network (CNN) trained for image recognition is to look at the image, at the CNN input, resulting from the maximization over its pixels of a given neural activation in the CNN [19, 20, 21]. Precisely, [19, 20] maximize over the image pixels a given neural activation at the CNN output, i.e. a class probability. This shows what image features characterize this class according to the CNN. In our case, the maximization over the image pixels of a given coefficient in $\mathbf{Y}$ does not yield interpretable images. Indeed, the coefficients in $\mathbf{Y}$ are not bounded. This may explain why the maximization often returns saturated images.

Alternatively, the information the $j$-th feature map of $\mathbf{Y}$ encodes, $j \in [|1, m|]$, can be seen as follows. $\forall i \in [|1, m|]$, all the coefficients in the $i$-th feature map of $\mathbf{Y}$ are set to $\mu_i$. This way, the feature maps of $\mathbf{Y}$ contain no significant information. Then, a single coefficient in the $j$-th feature map of $\mathbf{Y}$ is set to $\alpha \in \mathbb{R}$ and $\hat{\mathbf{X}} = g_d(\mathcal{Q}(\mathbf{Y}); \boldsymbol{\phi})$ is displayed. $\alpha$ is selected such that it is near one of the two tails of the Laplace distribution of the $j$-th feature map of $\mathbf{Y}$.

[Fig. 4: 64 × 64 crop at the top-left of X̂; (a) α = 8.0, (b) α = −8.0, (c) α = 20.0, (d) α = −20.0. j = 50 in (a) and (b), j = 125 in (c) and (d).]

Figure 4 shows the 64 × 64 crop at the top-left of $\hat{\mathbf{X}}$ when the single coefficient is located at the top-left corner of the $j$-th feature map of $\mathbf{Y}$, $j \in \{50, 125\}$. We see that the 50th feature map of $\mathbf{Y}$ encodes a spatially localized image feature whereas its 125th feature map encodes a spatially extended image feature. Moreover, the image feature is turned into its symmetrical feature, with respect to the mean pixel intensity, by moving $\alpha$ from the right tail of the Laplace distribution of the $j$-th feature map of $\mathbf{Y}$ to the left tail. This linear behaviour is observed for each feature map of $\mathbf{Y}$.
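A minimal sketch of this probing procedure, assuming a hypothetical callable `decoder` wrapping the trained $g_d$ and an array `mu` holding the per-map means $\{\mu_i\}$:

```python
import numpy as np

def probe_feature_map(decoder, mu, j, alpha, h=32, w=48):
    # Fill every feature map of Y with its approximate mean mu_i so that Y
    # carries no significant information, then plant a single coefficient
    # alpha at the top-left corner of the j-th map and decode.
    y = np.broadcast_to(mu, (h, w, mu.shape[0])).copy()
    y[0, 0, j] = alpha
    # Rounding stands in for the uniform scalar quantization Q; `decoder`
    # is a hypothetical callable wrapping the trained g_d.
    return decoder(np.round(y))
```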
It is interesting to see that, given the fitting in Section 3.1, $\mathbf{Y}$ is similar, in terms of distribution, to the DCT coefficients of blocks of prediction error samples in H.265 [9]. However, when looking at the information each feature map of $\mathbf{Y}$ encodes, $\mathbf{Y}$ has nothing to do with these DCT coefficients.

4. EXPERIMENTS

We now evaluate, in terms of rate-distortion performances: (i) whether the way of learning the quantization matters, and (ii) whether, at test time, it is efficient to quantize the coefficients obtained with the learned transform using quantization step sizes which differ from those in the training stage. This is done by comparing three cases.

The 1st case follows the approach in [6]. One transform is learned per rate-distortion point, the bit allocation being learned via the normalizations. In detail, an autoencoder is trained for each $\gamma \in S = \{10000.0, 12000.0, 16000.0, 24000.0, 40000.0, 72000.0, 96000.0\}$. During the training and at test time, the quantization step size is fixed to 1.0.

In the 2nd case, a unique transform is learned, the bit allocation being done by learning a quantization step size per feature map. More precisely, a single autoencoder is trained for $\gamma = 10000.0$ and $\{\delta_i\}_{i=1...m}$ is learned, see Section 2. At test time, the rate varies as the quantization step sizes are set to the learned quantization step sizes multiplied by $\beta \in B = \{1.0, 1.25, 1.5, 2.0, 3.0, 4.0, 6.0, 8.0, 10.0\}$.

In the 3rd case, a unique transform is learned, the bit allocation being learned via the normalizations. In detail, a single autoencoder is trained for $\gamma = 10000.0$ and, during the training, the quantization step size is 1.0. At test time, the rate varies as the quantization step size spans $B$.

In the 2nd case, the autoencoder has the architecture described at the beginning of Section 3. In the 1st and 3rd cases, a GDN is also placed after $g_e$ and an IGDN is placed before $g_d$. The autoencoders are trained on 24000 luminance images of size 256 × 256 extracted from ImageNet. Then, at test time, the 24 luminance images from the Kodak suite are inserted into the autoencoders. The rate is estimated via the empirical entropy of the quantized coefficients, assuming that the quantized coefficients are i.i.d. Note that, for the 2nd and the 3rd cases, we have also implemented a binarizer and a binary arithmetic coder to compress the quantized coefficients losslessly, see the code¹. The difference between the estimated rate and the exact rate via the lossless coding is always smaller than 0.04 bpp.

[Fig. 5: Rate-distortion curves averaged over the 24 luminance images from the Kodak suite.]

Figure 5 shows the rate-distortion curves averaged over the 24 luminance images. The JPEG2000 curve is obtained using ImageMagick. The H.265 [22] curve is computed via the version HM-16.15. There is hardly any difference between the 2nd and the 3rd case. This means that the explicit learning of the transform and the quantization step sizes is equivalent to learning the transform and the normalizations while the quantization step size is imposed.
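As a sketch of the test-time pipeline in the 2nd case (names and shapes are illustrative; `mu` holds the per-map means and `delta` the learned step sizes), centering, $\beta$-scaled quantization and the i.i.d. empirical-entropy rate estimate might look as follows:

```python
import numpy as np

def quantize_centered(y, mu, delta, beta):
    # Center each feature map by its precomputed mean mu_i, then apply
    # uniform scalar quantization with the learned step size delta_i
    # scaled by the rate-control factor beta.
    step = beta * delta  # shape (m,), broadcast over (h, w, m)
    return step * np.round((y - mu) / step) + mu

def rate_bpp(y_hat, mu, delta, beta, num_pixels):
    # Empirical-entropy rate estimate, assuming the quantized coefficients
    # of each feature map are i.i.d., expressed in bits per pixel.
    step = beta * delta
    bits = 0.0
    for i in range(y_hat.shape[-1]):
        q = np.round((y_hat[..., i] - mu[i]) / step[i]).astype(np.int64)
        _, counts = np.unique(q, return_counts=True)
        p = counts / counts.sum()
        bits -= (p * np.log2(p)).sum() * q.size
    return bits / num_pixels

rng = np.random.default_rng(0)
y = rng.laplace(size=(32, 48, 4))            # stand-in for Y
mu, delta = np.zeros(4), np.full(4, 1.0)
y_hat = quantize_centered(y, mu, delta, beta=2.0)
print(rate_bpp(y_hat, mu, delta, beta=2.0, num_pixels=2 * y.size))
```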
Note that, in the 2nd case, the learning of $\{\delta_i\}_{i=1...m}$ involves 128 parameters whereas, in the 3rd case, that of $\{\boldsymbol{\varphi}_e, \boldsymbol{\varphi}_d\}$ involves 33024 parameters. The 2nd and the 3rd case perform as well as the 1st case. The minimization (4) and the training in [6] both provide learned transforms which can be used with various quantization step sizes at test time. It is convenient not to train one autoencoder per compression rate, as a single training takes 4 days on an NVIDIA GTX 1080. Finally, we see that the learned transforms yield better rate-distortion performances than JPEG2000. The quality of image reconstruction for the experiment in Figure 5, and for another experiment on luminance images created from the BSDS300 [23], can be seen online¹.

5. CONCLUSION

Using a unique transform learned via autoencoders and various quantization step sizes at test time, it is possible to compress as well as when learning one transform per rate-distortion point at a given quantization step size. Moreover, the learned transforms outperform other image compression algorithms based on transforms.

6. REFERENCES

[1] Thomas Wiegand and Heiko Schwarz, "Source coding: part I of fundamentals of source and video coding," Foundations and Trends in Signal Processing, vol. 4, no. 1-2, pp. 1-222, January 2011.
[2] George Toderici, Sean M. O'Malley, Sung Jin Hwang, Damien Vincent, David Minnen, Shumeet Baluja, Michele Covell, and Rahul Sukthankar, "Variable rate image compression with recurrent neural networks," in ICLR, 2016.
[3] Karol Gregor, Frederic Besse, Danilo J. Rezende, Ivo Danihelka, and Daan Wierstra, "Towards conceptual compression," arXiv preprint arXiv:1604.08772, April 2016.
[4] Thierry Dumas, Aline Roumy, and Christine Guillemot, "Image compression with stochastic winner-take-all auto-encoder," in ICASSP, 2017.
[5] Lucas Theis, Wenzhe Shi, Andrew Cunningham, and Ferenc Huszár, "Lossy image compression with compressive autoencoders," in ICLR, 2017.
[6] Johannes Ballé, Valero Laparra, and Eero P. Simoncelli, "End-to-end optimized image compression," in ICLR, 2017.
[7] Oren Rippel and Lubomir Bourdev, "Real-time adaptive image compression," in ICML, 2017.
[8] Michael W. Marcellin, Margaret A. Lepley, Ali Bilgin, Thomas J. Flohr, Troy T. Chinen, and James H. Kasner, "An overview of quantization in JPEG 2000," Image Communication, vol. 17, no. 1, pp. 73-84, January 2002.
[9] Thomas Wiegand and Heiko Schwarz, "Video coding: part II of fundamentals of source and video coding," Foundations and Trends in Signal Processing, vol. 10, no. 1-3, pp. 1-346, December 2016.
[10] Ruslan R. Salakhutdinov and Geoffrey E. Hinton, "Semantic hashing," in SIGIR Workshop on Information Retrieval and Applications of Graphical Models, 2007.
[11] L. Deng, M. Seltzer, D. Yu, A. Acero, A. Mohamed, and G. Hinton, "Binary coding of speech spectrograms using a deep auto-encoder," in INTERSPEECH, 2010.
[12] Alex Krizhevsky and Geoffrey E. Hinton, "Using very deep autoencoders for content-based image retrieval," in ESANN, 2011.
[13] Yoshua Bengio, "Learning deep architectures for AI," Foundations and Trends in Machine Learning, vol. 2, no. 1, pp. 1-127, 2009.
[14] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams, "Learning representations by back-propagating errors," Nature, vol. 323, pp. 533-536, October 1986.
[15] Johannes Ballé, Valero Laparra, and Eero P. Simoncelli, "Density modeling of images using a generalized normalization transform," in ICLR, 2016.
[16] Vincent Dumoulin and Francesco Visin, "A guide to convolution arithmetic for deep learning," arXiv preprint arXiv:1603.07285, March 2016.
[17] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Fei-Fei Li, "ImageNet: a large-scale hierarchical image database," in CVPR, 2009.
[18] Gary J. Sullivan, "Efficient scalar quantization of exponential and Laplacian random variables," IEEE Transactions on Information Theory, vol. 42, pp. 1365-1374, September 1996.
[19] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman, "Deep inside convolutional networks: visualizing image classification models and saliency maps," in ICLR, 2014.
[20] Anh Nguyen, Jason Yosinski, and Jeff Clune, "Deep neural networks are easily fooled: high confidence predictions for unrecognizable images," in CVPR, 2015.
[21] Jason Yosinski, Jeff Clune, Anh Nguyen, Thomas Fuchs, and Hod Lipson, "Understanding neural networks through deep visualization," in ICML, 2015.
[22] Gary J. Sullivan, Jens-Rainer Ohm, Woo-Jin Han, and Thomas Wiegand, "Overview of the High Efficiency Video Coding (HEVC) standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1649-1667, December 2012.
[23] David Martin, Charless Fowlkes, Doron Tal, and Jitendra Malik, "A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics," in Proc. Int'l Conf. Computer Vision, 2001.
