Learning for Video Compression
Authors: Zhibo Chen, Tianyu He, Xin Jin, Feng Wu
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY

Learning for Video Compression

Zhibo Chen, Senior Member, IEEE, Tianyu He, Xin Jin, Feng Wu, Fellow, IEEE

Abstract—One key challenge to learning-based video compression is that motion predictive coding, a very effective tool for video compression, can hardly be trained into a neural network. In this paper we propose the concept of PixelMotionCNN (PMCNN), which includes motion extension and hybrid prediction networks. PMCNN can model spatiotemporal coherence to effectively perform predictive coding inside the learning network. On the basis of PMCNN, we further explore a learning-based framework for video compression with additional components such as iterative analysis/synthesis and binarization. Experimental results demonstrate the effectiveness of the proposed scheme. Although entropy coding and complex configurations are not employed in this paper, we still demonstrate superior performance compared with MPEG-2 and achieve comparable results with the H.264 codec. The proposed learning-based scheme provides a possible new direction to further improve the compression efficiency and functionalities of future video coding.

Index Terms—video coding, learning, PixelMotionCNN.

I. INTRODUCTION

Video occupies about 75% of the data transmitted on world-wide networks, and that percentage has been steadily growing and is projected to continue to grow further [1]. Meanwhile, the introduction of ultra-high definition (UHD), high dynamic range (HDR), wide color gamut (WCG), high frame rate (HFR), and future immersive video services has dramatically increased the challenge. Therefore, the need for highly efficient video compression technologies is always pressing and urgent.
Since the proposal of the concept of hybrid coding by Habibi in 1974 [2] and of the hybrid spatial-temporal coding framework by Forchheimer in 1981 [3], the Hybrid Video Coding (HVC) framework has been widely adopted in most popular existing image/video coding standards, such as JPEG, H.261, MPEG-2, H.264, and H.265. Video coding performance has improved by around 50% every 10 years, at the cost of increased computational complexity and memory. The framework now faces great challenges in further significantly improving coding efficiency and in dealing efficiently with novel sophisticated and intelligent media applications such as face/body recognition, object tracking, and image retrieval. There are therefore strong requirements to explore new video coding directions and frameworks as potential candidates for future video coding schemes, especially considering the outstanding development of machine learning technologies.

This work was supported in part by the National Key Research and Development Program of China under Grant No. 2016YFC0801001, the National Program on Key Basic Research Projects (973 Program) under Grant 2015CB351803, and NSFC under Grants 61571413, 61632001, and 61390514. (Corresponding author: Zhibo Chen.) Zhibo Chen, Tianyu He, Xin Jin and Feng Wu are with the CAS Key Laboratory of Technology in Geo-spatial Information Processing and Application System, University of Science and Technology of China, Hefei 230027, China (e-mail: chenzhibo@ustc.edu.cn; hetianyu@mail.ustc.edu.cn; jinxustc@mail.ustc.edu.cn; fengwu@ustc.edu.cn). Copyright © 2018 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending an email to pubs-permissions@ieee.org.

Fig. 1: Visualization of reconstructed video compared with MPEG-2 under a compression ratio of about 575: (a) MPEG-2, (b) Ours. It is worth noting that entropy coding is not employed for our results, even though it is commonly done in standard video compression codecs.

Recently, some learning-based image compression schemes have been proposed [4]–[6] using various types of neural networks, such as auto-encoders, recurrent networks, and adversarial networks, demonstrating a new direction for image/video compression. For example, the first learning-based image compression work [4], [7] was introduced in 2016 and demonstrated better performance than the first image coding standard, JPEG. However, all learning-based methods proposed so far were developed for still image compression, and there is still no published work for video compression. One key bottleneck is that motion compensation, a very effective tool for video coding, can hardly be trained into a neural network (or would be tremendously more complex than conventional motion estimation) [8]. Therefore, there exists some research on replacing individual modules (e.g., sub-pel interpolation, up-sampling filtering, post-processing) in the HVC framework with learning-based modules [9]–[11]. However, such partial replacements remain within the heuristically optimized HVC framework and cannot successfully deal with the aforementioned challenges.

In this paper, we first propose the concept of PixelMotionCNN (PMCNN), which models spatiotemporal coherence to effectively deal with the aforementioned bottleneck of training motion compensation into a neural network, and we explore a learning-based framework for video compression. Specifically, we construct a neural network to predict each block of a video sequence conditioned on previously reconstructed frames as well as the reconstructed blocks above and to the left of the current block. The difference between the predicted and original pixels is then analyzed and synthesized iteratively to produce a compact discrete representation.
Consequently, a bitstream is obtained that can be used for storage or transmission. To the best of our knowledge, this is the first fully learning-based video compression framework.

Fig. 2: The pipeline of the proposed learning-based encoder. The encoder converts video into a compressed format (bitstream), and the decoder converts the bitstream back into a reconstructed video. Note that, at decoding time, we only have access to the reconstructed data (Spatiotemporal Rec-Memory) instead of the original data; therefore, the decoder is included in the encoder to produce reconstructed data for sequential encoding. We do not employ lossless compression (entropy coding) in this paper; it can be complemented as shown by the dashed line in the figure. The details of each component are described in Section IV.

The remainder of this paper is organized as follows. Section II introduces the related work. In Section III, we explain spatiotemporal modeling with PixelMotionCNN, and the learning-based video compression framework is illustrated in Section IV. We present the experimental results and analysis in Section V, and then conclude in Section VI.

II. RELATED WORK

Recently, two kinds of research have tried to apply machine learning techniques to the image/video compression problem. One is codec-based improvement, which introduces learning-based optimization modules combined with traditional image/video codecs; the other is pure learning-based compression frameworks, which at the current stage mainly focus on learning-based image compression schemes.

A. Codec-based Improvements

Lossy image/video codecs, such as JPEG and High Efficiency Video Coding (HEVC) [12], have had a profound impact on image/video compression.
Considering the recent success of neural network-based methods in various domains, a stream of CNN-based improvements for these codecs has been proposed to further increase coding efficiency. Most of these works focus on enhancing the performance [10], [13] or reducing the complexity [14], [15] of a codec by replacing manually designed functions with learning-based approaches. Similarly, a series of works adopts CNNs in post-processing to reduce compression artifacts [16]–[18]. Encouraged by positive results in the domain of super-resolution, another line of work encodes down-sampled content with a codec and then up-samples the decoded output with a CNN for reconstruction [11], [19].

B. Learning-based Image Compression

End-to-end image compression has surged for almost two years, opening up a new avenue for lossy compression. The majority of these methods adopt an autoencoder-like scheme. In the works of [4], [7], [20], [21], a discrete and compact representation is obtained by applying quantization to the bottleneck of an auto-encoder. To achieve variable bit rates, the model progressively analyzes and synthesizes residual errors with several auto-encoders. Progressive codes are essential to Rate-Distortion Optimization (RDO), since a higher quality can be attained by adding additional bits. On the basis of these progressive analyzers, Baig et al. introduced an inpainting scheme that exploits the spatial coherence exhibited by neighboring blocks to reduce redundancy within an image [22]. Ballé et al. relaxed the discontinuous quantization step with additive uniform noise to alleviate the non-differentiability, and developed an effective non-linear transform coding framework in the context of compression [5]. Similarly, Theis et al. replaced quantization with a smooth approximation and optimized a convolutional auto-encoder with an incremental training strategy [6]. Instead of optimizing for pixel fidelity (e.g., Mean Square Error) as most codecs do, Chen et al.
optimized coding parameters to minimize the semantic difference between the original image and the compressed one [23].

Compared with images, video contains high temporal correlation between frames. Several works focus on frame interpolation [24] or frame extrapolation [25]–[27] to leverage this correlation and increase frame rate. On the other hand, some efforts have been made to estimate optical flow between frames with [28] or without [29] supervision as a foundation for early-stage video analysis. Moreover, Spatial Transformer Networks (STN) [30], [31] also capture temporal correlation and provide the ability to spatially transform feature maps by applying a parametric transformation to blocks of a feature map, allowing the network to zoom, rotate, and skew the input. However, the quality of the synthesized images in these methods is not high enough to be directly applied to video coding.

Encouraged by the aforementioned achievements, we take advantage of the successful exploration in learning-based image compression and further explore a learning-based framework for video compression.

Fig. 3: Detailed architecture of our video compression scheme.

III. SPATIOTEMPORAL MODELING WITH PIXELMOTIONCNN

In a typical scene, there are great spatiotemporal correlations between the pixels in a video sequence. One effective approach to de-correlate highly correlated neighboring signal samples is to model the spatiotemporal distribution of pixels in the video. Since videos are generally encoded and decoded in sequence, the modeling problem can be solved by estimating a product of conditional distributions (conditioned on reconstructed values) instead of modeling a joint distribution of the pixels. Oord et al. [32] introduced PixelCNN for generative image modeling, which can be regarded as a kind of intra-frame prediction for de-correlating neighboring pixels inside one frame.
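The product-of-conditionals idea can be made concrete with a toy sketch: under an autoregressive factorization, the (log-)probability of a frame decomposes into a sum of per-block conditionals evaluated in scan order. The `cond_prob` callable below is a hypothetical stand-in for any such conditional model, not part of the paper:

```python
import math

def frame_log_prob(blocks, cond_prob):
    """Joint log-probability of a frame's blocks under an autoregressive
    factorization: p(f) = prod_j p(b_j | b_1, ..., b_{j-1}), evaluated in
    raster-scan order.  cond_prob(block, context) must return a probability
    in (0, 1]."""
    log_p, context = 0.0, []
    for b in blocks:                       # raster-scan order
        log_p += math.log(cond_prob(b, tuple(context)))
        context.append(b)                  # becomes conditioning for later blocks
    return log_p
```

With a uniform stand-in conditional, the joint log-probability is simply the number of blocks times the per-block log-probability, which is the additivity the factorization buys.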
To further de-correlate neighboring frames of the same sequence, we here propose PixelMotionCNN (PMCNN), which sequentially predicts the blocks in a video sequence.

A. Model

We consider the circumstance where videos are encoded and decoded frame-by-frame in chronological order, and block-by-block in a raster scan order. We define a video sequence {f_1, f_2, ..., f_I} as a collection of I frames ordered along the time axis. Each frame comprises J blocks sequentialized in a raster scan order, formulated as f_i = {b_1^i, b_2^i, ..., b_J^i}. Inspired by PixelCNN, we can factorize the distribution of each frame:

p(f_i) = \prod_{j=1}^{J} p(b_j^i \mid f_1, \ldots, f_{i-1}, b_1^i, \ldots, b_{j-1}^i)    (1)

where p(b_j^i | f_1, ..., f_{i-1}, b_1^i, ..., b_{j-1}^i) is the probability of the j-th block b_j^i in the i-th frame, given the previous frames f_1, ..., f_{i-1} as well as the blocks b_1^i, ..., b_{j-1}^i above and to the left of the current block.

Note that, after transmission of the bitstream, we only have access to the reconstructed data instead of the original data at the decoding stage. Therefore, as Figure 2 illustrates, we sequentially analyze and synthesize each block, saving all the reconstructed frames {\hat{f}_1, \hat{f}_2, ..., \hat{f}_I} and blocks {\hat{b}_1^i, \hat{b}_2^i, ..., \hat{b}_J^i} in the Spatiotemporal Rec-Memory. Instead of conditioning on previously generated content as PixelCNN does, PMCNN learns to predict the conditional probability distribution p(b_j^i | \hat{f}_1, ..., \hat{f}_{i-1}, \hat{b}_1^i, ..., \hat{b}_{j-1}^i) conditioned on previously reconstructed content.

Spatiotemporal modeling is essential to our learning-based video compression pipeline, as it offers excellent de-correlation capability for sources. The experiments in Section V demonstrate its effectiveness compared with pure spatial or temporal modeling schemes.

B. Architectural Components

Considering motion information in the temporal direction, we assume that pixels in neighboring frames are correlated along motion trajectories, and that there usually exists a linear or non-linear displacement between them. Therefore, as illustrated in Figure 3, we first employ motion extension to approximate the linear part of the motion displacement, yielding an extended frame \bar{f}_i. The non-linear part of the motion displacement and the reconstructed blocks (above and to the left of the current block) are then jointly modeled with a convolutional neural network. This scheme leverages spatiotemporal coherence to provide a prediction progressively, and it reduces the amount of information to be transmitted without any other side information (e.g., motion vectors).

Fig. 4: Comparison between (a) motion estimation and (b) motion extension. The query frame (the start point of the blue dashed arrow) is divided into blocks, and each block is compared with the blocks in the former frame (the end point of the blue dashed arrow) to form a motion vector (the black arrow). The black dashed arrow in (b) has the same value as the black arrow and directs where the values in \bar{f}_i should be copied from. Note that the calculation of motion vectors in (b) depends only on reconstructed content, so their transmission can be omitted, which differs from (a), where motion vectors are transmitted as side information.

Motion Extension. The objective of motion extension is to extend the motion trajectory obtained from the previous two reconstructed frames \hat{f}_{i-2}, \hat{f}_{i-1}. With v_x, v_y, x, y ∈ Z, we first determine a motion vector (v_x, v_y) between \hat{f}_{i-2} and \hat{f}_{i-1} by block matching with a 4 × 4 block size.
We then fill in the whole extended frame \bar{f}_i by copying blocks from \hat{f}_{i-1} according to the motion trajectory estimated for the corresponding block in \hat{f}_{i-1}. For instance, the values of the block \bar{b}^i centered at (x, y) in the extended frame \bar{f}_i are copied from \hat{b}^{i-1} centered at (x − v_x, y − v_y). We repeat this operation to obtain the complete values of \bar{f}_i as the extended frame. It is important to note that there are two intrinsic differences between motion extension and the motion estimation [33] used in traditional video coding schemes:

• We employ motion extension as preprocessing to generate an extended input for PMCNN, using previously reconstructed reference frames to generate the current coding block. This is essentially different from the motion estimation used in traditional codecs, which uses the current coding block to search for a matching block in reconstructed reference frames.

• In general, traditional codecs transmit motion vectors as side information, since they indicate where the estimate of the current coding block comes from. By contrast, our proposed scheme does not need to transmit motion vectors.

Our scheme can always obtain a complete frame without gaps; although motion trajectories between frames are generally non-linear, the remaining error can easily be optimized by the neural network.

Hybrid Prediction. In particular, we employ a convolutional neural network that accepts the extended frame as its input and outputs an estimate of the current block. Previous work [34] has shown that ConvLSTM has the potential to model temporal correlation while preserving spatial invariance. Moreover, residual learning (Res-Block) [35] is a powerful technique for training very deep convolutional neural networks. We here exploit the strengths of ConvLSTM and Res-Block to sequentially connect features of \hat{f}_{i-2}, \hat{f}_{i-1} and \bar{f}_i.
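As a concrete illustration of the motion-extension step above, a minimal sketch follows. The 4×4 block matching comes from the text; the search range, the nearest-first tie-breaking (so flat regions prefer zero motion), and the forward block-copy strategy are our assumptions, since the paper does not specify them:

```python
import numpy as np

def motion_extension(rec_prev2, rec_prev1, block=4, search=4):
    """Sketch of motion extension: estimate per-block motion between the two
    previous reconstructed frames f_hat_{i-2} and f_hat_{i-1}, then extend
    each trajectory one step to synthesize the extended frame f_bar_i.
    Nothing here needs transmitting: both inputs exist at the decoder."""
    h, w = rec_prev1.shape
    extended = rec_prev1.copy()              # start from f_hat_{i-1}: no gaps
    # candidate offsets, nearest-first, so ties resolve toward zero motion
    offsets = sorted(((dy, dx) for dy in range(-search, search + 1)
                               for dx in range(-search, search + 1)),
                     key=lambda o: (abs(o[0]) + abs(o[1]), o))
    for y in range(0, h - block + 1, block):
        for x in range(0, w - block + 1, block):
            cur = rec_prev1[y:y + block, x:x + block].astype(np.int64)
            best, best_err = (0, 0), None
            for dy, dx in offsets:           # block matching against f_hat_{i-2}
                sy, sx = y + dy, x + dx
                if 0 <= sy <= h - block and 0 <= sx <= w - block:
                    ref = rec_prev2[sy:sy + block, sx:sx + block].astype(np.int64)
                    err = int(((cur - ref) ** 2).sum())
                    if best_err is None or err < best_err:
                        best_err, best = err, (dy, dx)
            vy, vx = -best[0], -best[1]      # motion from f_hat_{i-2} to f_hat_{i-1}
            if (vy, vx) != (0, 0):           # extend trajectory: copy block forward
                ty = min(max(y + vy, 0), h - block)
                tx = min(max(x + vx, 0), w - block)
                extended[ty:ty + block, tx:tx + block] = \
                    rec_prev1[y:y + block, x:x + block]
    return extended
```

For a bright stripe moving 4 px per frame, the sketch places the stripe one further step along its trajectory in the extended frame, which is exactly the approximation of the linear motion component that the hybrid prediction network then refines.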
An estimate of the current frame, as well as the blocks above and to the left of the current block, is then fed into several Convolution-BatchNorm-ReLU modules [36]. As described in Section III-A, the spatiotemporal coherence is modeled concurrently in this scheme, producing a prediction \tilde{b}_j^i of the current block.

In this section, we defined the form of PMCNN and described its detailed architecture (all parameters are given in Appendix A). In the next section, we give a comprehensive explanation of our framework, which employs PMCNN for predictive coding.

IV. LEARNING FOR VIDEO COMPRESSION

Our scheme for video compression can be divided into three components: predictive coding, iterative analysis/synthesis, and binarization. Note that entropy coding is not employed in this paper, even though it is commonly applied in standard video compression codecs and may give further performance gain. In this section, we give details about each component of our scheme and introduce the various modes used in our framework.

Predictive Coding. We utilize PMCNN for predictive coding, i.e., to create a prediction \tilde{b}_j^i of a block of the current frame based on previously encoded frames as well as the blocks above and to the left of it. This prediction is subtracted from the original values to form a residual r_j^i. A successful prediction decreases the energy of the residual compared with the original frame, so the data can be represented with fewer bits [37]. Our learning objective for PMCNN can be defined as follows:

L_{vcnn} = \frac{1}{B \times J} \sum_{i=1}^{B} \sum_{j=1}^{J} (b_j^i - \tilde{b}_j^i)^2,    (2)

where B is the batch size, J is the total number of blocks in each frame, \tilde{b}_j^i denotes the output of PMCNN, and the superscript and subscript refer to the i-th frame and the j-th block, respectively.

Iterative Analysis/Synthesis. Several works put effort into directly encoding pixel values [4], [5]. By contrast, we encode the difference between the predicted and the original pixel values.
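The predictive-coding objective of Eq. (2) can be sketched as below. Treating the per-block squared error as a per-pixel mean is our assumption, since the normalization inside each block is not spelled out in the text:

```python
import numpy as np

def pmcnn_loss(blocks, predictions):
    """Sketch of Eq. (2): mean squared error between the original blocks
    b_j^i and the PMCNN predictions, averaged over the batch (B) and the
    J blocks of each frame.  Inputs have shape (B, J, block_h, block_w)."""
    residual = blocks - predictions          # r_j^i = b_j^i - prediction
    return float(np.mean(residual ** 2))     # averages over B, J and pixels
```

A successful prediction shrinks the residual's energy, which is what lets the subsequent analysis/synthesis stages represent it with fewer bits.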
We adopt the model of Toderici et al. [4], which is composed of several LSTM-based auto-encoders with connections between adjacent stages. The residuals between reconstruction and target are analyzed and synthesized iteratively to provide variable-rate compression. Each stage n produces a compact representation of its input residual r_j^{i(n)} that is to be transmitted. We can write the loss function of iterative analysis/synthesis as follows:

L_{res} = \frac{1}{B \times J} \sum_{i=1}^{B} \sum_{j=1}^{J} \Big( r_j^{i(1)} - \sum_{m=1}^{S} \hat{r}_j^{i(m)} \Big)^2,    (3)

where

r_j^{i(1)} = b_j^i - \tilde{b}_j^i,    (4)

r_j^{i(n)} = r_j^{i(1)} - \sum_{m=1}^{n-1} \hat{r}_j^{i(m)},    (5)

\hat{r}_j^{i(n)} indicates the output of the n-th stage, and S is the total number of stages (8 in this paper).

Fig. 5: Thumbnails of the test sequences: (a) Akiyo, (b) BasketballDrill, (c) Claire, (d) Drawing, (e) Foreman, (f) FourPeople, (g) Pairs, (h) Silent.

Binarization. Binarization is where a significant amount of data reduction is actually attained, since such a many-to-one mapping reduces the number of possible signal values at the cost of introducing some numerical error. Unfortunately, binarization is an inherently non-differentiable operation that cannot be optimized with gradient-based techniques. However, some researchers have tackled this problem by mathematical approximation [4]–[6]. Following Raiko et al. [38] and Toderici et al. [7], we add probabilistic quantization noise in the forward pass and keep the gradients unchanged in the backward pass:

\epsilon = 1 - c_{in} with probability (1 + c_{in})/2, and \epsilon = -c_{in} - 1 otherwise,    (6)

where c_{in} ∈ [−1, 1] represents the input of the binarizer. The output of the binarizer can thus be formulated as c_{out} = c_{in} + \epsilon. Note that the binarizer takes no parameters in our scheme.
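Equations (3)–(6) can be sketched together: each stage encodes what the previous stages left unreconstructed, and the binarizer stochastically rounds its input to {−1, +1} while remaining unbiased in expectation. The toy `analyze`/`synthesize` callables below stand in for the LSTM-based auto-encoder stages and are our assumptions:

```python
import numpy as np

def binarize(c_in, rng):
    """Eq. (6): add noise eps = 1 - c_in with probability (1 + c_in)/2,
    else eps = -c_in - 1, so that c_out = c_in + eps lies in {-1, +1} and
    E[c_out] = c_in.  In training, gradients pass through unchanged."""
    eps = np.where(rng.random(c_in.shape) < (1.0 + c_in) / 2.0,
                   1.0 - c_in, -c_in - 1.0)
    return c_in + eps

def iterative_analysis_synthesis(r1, analyze, synthesize, stages=8, seed=0):
    """Eqs. (3)-(5): stage n sees r^(n) = r^(1) - sum_{m<n} r_hat^(m) and
    emits a binary code; summing the stage reconstructions approximates
    r^(1).  `analyze` must map into [-1, 1] (e.g., via tanh)."""
    rng = np.random.default_rng(seed)
    recon_sum = np.zeros_like(r1)
    codes = []
    for _ in range(stages):
        r_n = r1 - recon_sum                 # residual left for this stage
        code = binarize(analyze(r_n), rng)   # compact bits to transmit
        codes.append(code)
        recon_sum = recon_sum + synthesize(code)  # r_hat^(n)
    return codes, recon_sum
```

Truncating the code list after any stage still yields a usable reconstruction, which is the variable-rate (progressive) property exploited later for rate control.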
At test time, in order to generate the extended frame, we need to encode the first two frames directly (without PMCNN). Similarly, we encode the first row and the first column of blocks in each frame conditioned only on the previous frames {\hat{f}_1, ..., \hat{f}_{i-1}}, since they have no spatial neighborhood to be used for prediction.

A. Objective Function

Although there is potential to use various metrics in this framework for gradient-based optimization, we employ MSE as the metric for simplicity. During the training phase, the PMCNN, the iterative analyzer/synthesizer, and the binarizer are jointly optimized to learn a compact representation of the input video sequence. The overall objective can be formulated as:

L_{total} = L_{vcnn} + L_{res},    (7)

where L_{vcnn} and L_{res} represent the learning objectives of PMCNN and of iterative analysis/synthesis, respectively.

B. Spatially Progressive Coding

We design a spatially progressive coding scheme for the test phase, performing a variable number of iterations for each block, determined by a quality metric (e.g., PSNR). This spatially progressive coding scheme enables adaptively allocating different numbers of bits to different blocks, which can further improve coding performance, similar to rate control in traditional video coding frameworks. In this paper, we perform spatially progressive coding in the simplest way: we continue to progressively encode the residual of a block until the MSE between the reconstructed block and the original block drops below a threshold.

C. Temporally Progressive Coding

In addition to spatially progressive coding, we also employ a temporally progressive coding scheme that exploits the invariance between adjacent frames to progressively determine the coding method of each block in a frame. We achieve this by optionally encoding each block according to a specific metric.
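A minimal sketch of these two test-time controls, combining the block-skip test just introduced with the stage-by-stage quality test of the spatially progressive scheme; the thresholds and the `encode_stage` callable (one pass of the iterative analyzer/synthesizer) are stand-ins, not values from the paper:

```python
import numpy as np

def mse(a, b):
    return float(np.mean((a - b) ** 2))

def choose_coding_mode(block, colocated_prev, encode_stage,
                       skip_thresh=1.0, quality_thresh=0.01, max_stages=8):
    """Decide how to code one block.  Temporally progressive: skip entirely
    if the co-located block of the previous frame is close enough (only a
    1-bit flag would be sent).  Spatially progressive: otherwise keep adding
    analysis/synthesis stages until the reconstruction MSE drops below the
    quality threshold.  Returns ('skip',) or ('code', stages_used)."""
    if mse(block, colocated_prev) < skip_thresh:
        return ('skip',)
    recon = np.zeros_like(block)
    for n in range(1, max_stages + 1):
        recon = recon + encode_stage(block - recon)  # refine the residual
        if mse(block, recon) < quality_thresh:
            return ('code', n)
    return ('code', max_stages)
```

With a toy stage that recovers half of its input residual, a block far from its temporal neighbor keeps accumulating stages until the quality threshold is met, while an unchanged block is skipped at the cost of a single flag bit.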
We likewise implement this scheme in the simplest way in this paper, similar to the skip mode defined in traditional video coding frameworks [39]. In particular, for each block we first determine whether the block should be encoded at all by calculating the MSE between b_j^i and b_j^{i-1}. The encoding process is skipped if the MSE is lower than a threshold, and a flag (one bit) is transmitted to the decoder to indicate the selected mode. The flag is encoded by arithmetic coding and is the only overhead (< 1% of the bitstream) in our framework. Note that the encoding of the flag has no effect on the training of the entire model, since it is an out-of-loop operation.

V. EXPERIMENTAL RESULTS

In this section, we first evaluate the performance of PMCNN; we then compare our video compression scheme with traditional video codecs.

Dataset. In general, adjacent frames of the same sequence are highly correlated, resulting in limited diversity. Therefore, we first collect an image dataset for pre-training the iterative analyzer/synthesizer module, as it is used to compress residuals between the reconstructed frame and the target frame. The image dataset contains 530,000 color images collected from Flickr. Each image is down-sampled to 256x256 to enhance the texture complexity. We further perform data augmentation, including random rotation and color perturbation, during training. Secondly, to model the spatiotemporal distribution of pixels in video sequences, we jointly train the PMCNN module (initialized) and the iterative analyzer/synthesizer module (pre-trained) on a video dataset containing 10,000 sequences sampled from UCF-101.

Fig. 6: Quantitative analysis of PMCNN. Each column represents the PSNR/MS-SSIM performance on a test sequence. We can observe that PMCNN leverages spatiotemporal dependencies and surpasses the performance of the other prediction schemes.
For the test set, we collect 8 representative sequences from the MPEG/VCEG common test sequences [40], as shown in Figure 5, covering various content categories (e.g., global motion, local motion, different motion amplitudes, and different texture complexities). In line with the image dataset, all sequences are resized to 256x192 according to the 4:3 aspect ratio. Note that our video compression scheme is entirely block based (fixed 32x32 blocks in this paper), including PMCNN, which sequentially predicts blocks, and the iterative analyzer/synthesizer, which progressively compresses the residuals between reconstruction and target. Therefore, it can easily be extended to high-resolution scenarios.

Implementation Details. We adopt a 32x32 block size for PMCNN and the iterative analyzer/synthesizer, based on verification and comparison in preliminary experiments. Although variable block size coding typically yields higher performance than a fixed block size in traditional codecs [41], we verify the effectiveness of our method with a fixed block size for simplicity. We first pretrain PMCNN on the aforementioned video dataset using the Adam optimizer [42] with a learning rate of 10^-3 for 20 epochs; the learning rate is decreased by a factor of 10 every 5 epochs. All parameters are initialized with Kaiming initialization [43]. Our iterative analyzer/synthesizer model is pretrained on 32x32 image blocks randomly cropped from the collected image dataset, using the Adam optimizer with a learning rate of 10^-3 for 40 epochs, dropping the learning rate by a factor of 10 every 10 epochs. The pretrained PMCNN and the pretrained iterative analyzer/synthesizer are then merged and fine-tuned on the video dataset with a learning rate of 10^-4 for 10 epochs. Since a block-based video compression framework inevitably introduces block artifacts, we also implement a simple deblocking technique [44] to handle them.

Baseline.
As the first work on learning-based video compression, we compare our scheme with two representative HVC codecs: MPEG-2 (v1.2) [45] and H.264 (JM 19.0) [46]. Both take YUV 4:2:0 video as input and output. We encode the first frame in Intra-frame mode and the remaining frames in Predicted-frame mode with fixed QP. Rate control is disabled for both codecs.

Evaluation Metric. To assess the performance of our model, we report PSNR between the original videos and the reconstructed ones. Following Toderici et al. [4], we also adopt Multi-Scale Structural Similarity (MS-SSIM) [47] as a perceptual metric. Note that these metrics are applied on RGB channels, and the reported results in this paper are averaged over each test sequence.

TABLE I: Equivalent bit-rate savings (based on PSNR) of different learning-based prediction modes with respect to the No-Pred mode.

Sequences   Spatial-Pred   Temporal-Pred   PMCNN
Akiyo       -9.57%         -42.56%         -66.31%
Foreman     -3.82%         -18.47%         -56.60%
Silent      -12.12%        -53.23%         -72.84%
Average     -6.78%         -30.06%         -57.10%

Fig. 7: Quantitative analysis of our learning-based video compression framework. It is worth noting that entropy coding is not employed for our results, even though it is commonly done in standard video compression codecs.

A. Quantitative Analysis of Our Scheme

Our proposed PMCNN model leverages spatiotemporal coherence and provides a hybrid prediction \tilde{b}_j^i conditioned on the previously reconstructed frames \hat{f}_1, ..., \hat{f}_{i-1} and blocks \hat{b}_1^i, ..., \hat{b}_{j-1}^i. To evaluate the impact of each condition, we train our PMCNN conditioned on each individual dependency. We refer to the model conditioned only on the blocks \hat{b}_1^i, ..., \hat{b}_{j-1}^i as 'Spatial-Pred', the model conditioned only on the frames \hat{f}_1, ..., \hat{f}_{i-1} as 'Temporal-Pred', and the model trained on none of these dependencies as 'No-Pred'.
We compare each case with PMCNN on the first 30 frames of three representative sequences featuring local motion (Akiyo), global motion (Foreman), and different motion amplitudes (Silent). Figure 6 demonstrates the efficiency of the proposed PMCNN framework: the model simultaneously conditioned on spatial and temporal dependencies ('PMCNN') outperforms the two patterns conditioned on an individual dependency ('Temporal-Pred' and 'Spatial-Pred') and the one conditioned on none of them ('No-Pred'). This is natural, because PMCNN is modeled on stronger prior knowledge, while 'Temporal-Pred' and 'Spatial-Pred' model only the temporal motion trajectory or the spatial content relevance, respectively.

Fig. 8: Subjective comparison between various codecs under the same bit-rate. Rows show the sequences Akiyo, BasketballDrill, Claire, Drawing, Foreman, FourPeople, Pairs, and Silent; columns show (a) Raw, (b) MPEG-2, (c) H.264, (d) Our Scheme.

We further use BD-rate [48] (bit-rate savings) to calculate the equivalent bit-rate savings between two compression schemes. As Table I illustrates, compared with the individual dependencies ('Spatial-Pred' and 'Temporal-Pred'), PMCNN effectively exploits spatiotemporal coherence and outperforms both significantly.

Our bitstream mainly consists of two parts: the quantized representation generated by iterative analysis/synthesis, and the flags that indicate the selected mode for temporally progressive coding (< 1% of the bitstream). Moreover, we employ a sequence header, a mode header, and a frame header in the bitstream for synchronization. In our experiments, the percentage of skipped blocks is about 25%∼89% (influenced by the motion complexity of the video content); Drawing is the sequence with the smallest percentage of skipped blocks, while Claire achieves the largest. A similar skip detection scheme is also applied in traditional hybrid coding frameworks, in a rate-distortion-optimized manner that is more complex than our proposed scheme.

TABLE II: Equivalent bit-rate savings (based on PSNR) of Our Scheme with respect to modern codecs.

Sequences        MPEG-2    H.264
Akiyo            -52.57%   -6.12%
BasketballDrill  -28.31%   +31.99%
Claire           -58.41%   -40.25%
Drawing          -30.41%   +56.71%
Foreman          -39.73%   +56.67%
FourPeople       -80.01%   -11.71%
Pairs            -41.87%   +8.85%
Silent           -56.01%   -30.74%
Average          -48.415%  +8.175%

In addition, we replaced our LSTM-based analyzer/synthesizer with a series of classic convolutional layers (the same number of layers as in our scheme). The results indicate that the LSTM-based analyzer/synthesizer achieves a 9.27%∼13.21% bit-rate reduction over the convolutional analyzer/synthesizer.

We also verify our trained network on three high-resolution sequences without retraining. Compared with the H.264 codec, there is around a further 8.25% BD-rate increase when applying the scheme directly to higher-resolution sequences. We can expect an improvement from retraining the network, and it is also important to apply variable block sizes in the scheme, especially for higher-resolution content.

B. Comparison with Traditional Video Codecs

As the first work on learning-based video compression, we quantitatively analyze the performance of our framework and compare it with modern video codecs. We provide a quantitative comparison with traditional video codecs in Table II and Figure 7, as well as a subjective quality comparison in Figure 8. The experimental results illustrate that our proposed scheme outperforms MPEG-2 significantly, with a 48.415% BD-rate reduction and a corresponding 2.39 dB BD-PSNR [49] improvement on average, and demonstrates comparable results with the H.264 codec, with around an 8.175% BD-rate increase and a corresponding 0.41 dB BD-PSNR drop on average.
We achieve this despite the following facts:
• We do not perform entropy coding in the aforementioned experiments, since it is not the main contribution of this paper and more research is needed to design an optimized entropy coder for a learning-based video compression framework. To demonstrate the potential of this direction, we tried the simple entropy coding method described in [4] without any specific optimization; an average performance gain of 12.57% is obtained compared with our scheme without entropy coding.
• We do not perform the complex prediction-mode selection or adaptive transformation schemes that have been developed over decades in traditional video coding.

Although affected by the above factors, and although our learning-based video compression framework is still in too early a stage to compete with the latest H.265/HEVC video coding technology [12], it shows great improvement over the first widely deployed video codec, MPEG-2, and has enormous potential in the following aspects:
• We provide a possible new direction to overcome the limitations of hand-crafted heuristics in HVC by learning the parameters of the encoder/decoder.
• The gradient-based optimization used in our framework can be seamlessly integrated with various metrics (loss functions), including perceptual fidelity and semantic fidelity, which is infeasible for HVC. For instance, a neural-network-based object tracking algorithm can be employed as a semantic metric for surveillance video compression.
• We require less side information to be transmitted than HVC. The only overhead in our scheme is the flag (< 1% of the bitstream) used for temporally progressive coding. By contrast, HVC requires considerable side information (e.g., motion vectors, block partitions, prediction-mode information, etc.) to signal its sophisticated coding modes.

We measure the running time of our scheme and the traditional codecs on the same machine (CPU: i7-4790K, GPU: NVIDIA GTX 1080).
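Such a wall-clock comparison can be set up with a simple timing harness that runs each encoder over the same frames and takes the best of several runs. The sketch below is only illustrative: the two "codecs" are stand-in callables, not the actual H.264 reference software or the learned encoder.

```python
import time

def time_encoder(encode_fn, frames, repeats=3):
    """Best wall-clock time (seconds) over several runs, to reduce
    the effect of OS scheduling noise on the measurement."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for frame in frames:
            encode_fn(frame)
        best = min(best, time.perf_counter() - start)
    return best

# Placeholder "codecs": any callable taking one frame works here.
fast_codec = lambda frame: sum(frame) % 256        # stand-in for a fast codec
slow_codec = lambda frame: [b ^ 1 for b in frame]  # stand-in for a heavier one

frames = [list(range(1000))] * 30  # 30 dummy "frames"
t_fast = time_encoder(fast_codec, frames)
t_slow = time_encoder(slow_codec, frames)
print(f"relative complexity: {t_slow / t_fast:.1f}x")
```

In practice, one would substitute calls to the actual encoders (e.g., invoking the JM reference software and the learned model on the same test sequences) for the placeholder lambdas.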
The overall computational complexity of our implementation is about 141 times that of H.264 (JM 19.0). It should be noted that our scheme is a preliminary exploration of a learning-based framework for video compression, and each part is implemented without any optimization. We believe the overall computational complexity can be reduced in the future through algorithmic optimization targeting specific AI hardware, and existing techniques can also be adopted accordingly, e.g., parallel processing such as wavefront parallelism, efficient network architectures (e.g., ShuffleNet, MobileNet), or network model compression techniques (e.g., pruning, distillation).

We also observe that our approach shows unstable performance across the test sequences (especially in the case of global motion). This is reasonable, since the coding modes adopted in our algorithm are still very simple and unbalanced. However, we successfully demonstrate the potential of this framework and provide a possible new direction for video compression.

VI. CONCLUSION

We propose the concept of PMCNN, which models spatiotemporal coherence to effectively perform predictive coding, and explore a learning-based framework for video compression. Although it lacks entropy coding, the scheme still achieves promising results, demonstrating a possible new direction for video compression. In the future, we will apply our scheme to high-resolution video sequences and expect further gains, since many aspects of this framework can still be improved, including entropy coding,
variable block size coding, enhanced prediction, advanced post-processing techniques, and the integration of various metrics (e.g., perceptual/semantic metrics).

APPENDIX A
DETAILED ARCHITECTURE FOR PMCNN

We provide all parameters of PMCNN in Table III. The notation is consistent with the paper. In addition, ‘BN’ denotes batch normalization, ‘DeConv’ denotes a deconvolution layer, ‘Concat’ denotes concatenating feature maps along the last dimension, and ‘DilConv’ denotes a dilated convolution layer.

TABLE III: Detailed architecture for PMCNN.

Layer | Type             | Input                                    | Size/Stride | Dilation | BN | Activation | Output    | Output Size
1     | Motion Extension | f̂_{i-2}, f̂_{i-1}                        | -           | -        | -  | -          | f̄_i       | 256×192×3
2     | Conv             | f̂_{i-2}, f̂_{i-1}, f̄_i                  | 4×4 / 2     | -        | Y  | ReLU       | conv2     | 128×96×96
3     | ResBlock ×4      | conv2                                    | -           | -        | -  | -          | rb3       | 128×96×96
4     | Conv             | rb3                                      | 4×4 / 2     | -        | Y  | ReLU       | conv4     | 64×48×192
5     | ResBlock ×8      | conv4                                    | -           | -        | -  | -          | rb5       | 64×48×192
6     | Conv             | rb5                                      | 4×4 / 2     | -        | Y  | ReLU       | conv6     | 32×24×192
7     | ResBlock ×12     | conv6                                    | -           | -        | -  | -          | rb7       | 32×24×192
8     | Conv             | rb7                                      | 4×4 / 2     | -        | Y  | ReLU       | conv8     | 16×12×96
9     | ConvLSTM         | conv8                                    | 3×3 / 1     | -        | N  | -          | convlstm9 | 16×12×32
10    | DeConv           | convlstm9                                | 5×5 / 2     | -        | Y  | ReLU       | deconv10  | 32×24×32
11    | Pooling          | f̄_i                                      | 5×5 / 8     | -        | N  | -          | pooling11 | 32×24×32
12    | Conv             | pooling11                                | 4×4 / 1     | -        | Y  | ReLU       | conv12    | 32×24×32
13    | Concat           | deconv10, conv12                         | -           | -        | -  | -          | concat13  | 32×24×64
14    | ResBlock ×12     | concat13                                 | -           | -        | -  | -          | rb14      | 32×24×64
15    | DeConv           | rb14                                     | 5×5 / 2     | -        | Y  | ReLU       | deconv15  | 64×48×32
16    | Pooling          | f̄_i                                      | 5×5 / 4     | -        | N  | -          | pooling16 | 64×48×32
17    | Conv             | pooling16                                | 4×4 / 1     | -        | Y  | ReLU       | conv17    | 64×48×32
18    | Concat           | deconv15, conv17                         | -           | -        | -  | -          | concat18  | 64×48×64
19    | ResBlock ×8      | concat18                                 | -           | -        | -  | -          | rb19      | 64×48×64
20    | DeConv           | rb19                                     | 5×5 / 2     | -        | Y  | ReLU       | deconv20  | 128×96×16
21    | Pooling          | f̄_i                                      | 5×5 / 2     | -        | N  | -          | pooling21 | 128×96×16
22    | Conv             | pooling21                                | 4×4 / 1     | -        | Y  | ReLU       | conv22    | 128×96×16
23    | Concat           | deconv20, conv22                         | -           | -        | -  | -          | concat23  | 128×96×32
24    | ResBlock ×4      | concat23                                 | -           | -        | -  | -          | rb24      | 128×96×32
25    | DeConv           | rb24                                     | 5×5 / 2     | -        | Y  | Tanh       | deconv25  | 256×192×3
26    | Conv             | f̄_i, deconv25                            | 4×4 / 1     | -        | N  | Tanh       | conv26    | 256×192×3
27    | ConvBlock ×8     | b̂^i_{j-9}, b̂^i_{j-8}, b̂^i_{j-1}, conv26 | -           | -        | -  | Tanh       | cb27      | 64×64×3
28    | Crop             | cb27                                     | -           | -        | -  | -          | output    | 32×32×3

ResBlock
1 | Conv | input        | 3×3 / 1 | - | Y | ReLU | conv1  | -
2 | Conv | conv1        | 3×3 / 1 | - | N | ReLU | conv2  | -
3 | Add  | input, conv2 | -       | - | - | -    | output | -

ConvBlock
1 | DilConv | input                      | 3×3 / 1 | 1 | Y | ReLU | conv1  | -
2 | DilConv | input                      | 3×3 / 1 | 2 | Y | ReLU | conv2  | -
3 | DilConv | input                      | 3×3 / 1 | 4 | Y | ReLU | conv3  | -
4 | DilConv | input                      | 3×3 / 1 | 8 | Y | ReLU | conv4  | -
5 | Concat  | conv1, conv2, conv3, conv4 | -       | - | N | -    | output | -

REFERENCES

[1] Cisco Systems, "Cisco visual networking index: Forecast and methodology, 2016–2021," Cisco Systems white paper.
[2] A. Habibi, "Hybrid coding of pictorial data," IEEE Transactions on Communications, vol. 22, no. 5, pp. 614–624, 1974.
[3] R. Forchheimer, "Differential transform coding: A new hybrid coding scheme," in Proc. Picture Coding Symp. (PCS-81), Montreal, Canada, 1981, pp. 15–16.
[4] G. Toderici, D. Vincent, N. Johnston, S. Jin Hwang, D. Minnen, J. Shor, and M. Covell, "Full resolution image compression with recurrent neural networks," in Computer Vision and Pattern Recognition (CVPR), July 2017.
[5] J. Ballé, V. Laparra, and E. P. Simoncelli, "End-to-end optimized image compression," in International Conference on Learning Representations (ICLR), 2017.
[6] L. Theis, W. Shi, A. Cunningham, and F. Huszár, "Lossy image compression with compressive autoencoders," in International Conference on Learning Representations (ICLR), 2017.
[7] G. Toderici, S. M. O'Malley, S. J. Hwang, D. Vincent, D. Minnen, S. Baluja, M. Covell, and R. Sukthankar, "Variable rate image compression with recurrent neural networks," in International Conference on Learning Representations (ICLR), 2016.
[8] J. Ohm and M. Wien, "Future video coding: Coding tools and developments beyond HEVC," in Tutorial at International Conference on Image Processing (ICIP), Sept 2017.
[9] A. Prakash, N. Moran, S. Garber, A. DiLillo, and J. Storer, "Semantic perceptual image compression using deep convolution networks," in Data Compression Conference (DCC). IEEE, 2017, pp. 250–259.
[10] N. Yan, D. Liu, H. Li, and F. Wu, "A convolutional neural network approach for half-pel interpolation in video coding," in International Symposium on Circuits and Systems (ISCAS). IEEE, 2017.
[11] F. Jiang, W. Tao, S. Liu, J. Ren, X. Guo, and D. Zhao, "An end-to-end compression framework based on convolutional neural networks," IEEE Transactions on Circuits and Systems for Video Technology, 2017.
[12] G. J. Sullivan, J. Ohm, W.-J. Han, and T. Wiegand, "Overview of the high efficiency video coding (HEVC) standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1649–1668, 2012.
[13] R. Song, D. Liu, H. Li, and F. Wu, "Neural network-based arithmetic coding of intra prediction modes in HEVC," in International Conference on Visual Communications and Image Processing (VCIP). IEEE, 2017.
[14] T. Li, M. Xu, and X. Deng, "A deep convolutional neural network approach for complexity reduction on intra-mode HEVC," in International Conference on Multimedia and Expo (ICME). IEEE, 2017, pp. 1255–1260.
[15] X. Yu, Z. Liu, J. Liu, Y. Gao, and D. Wang, "VLSI friendly fast CU/PU mode decision for HEVC intra encoding: Leveraging convolution neural network," in International Conference on Image Processing (ICIP). IEEE, 2015, pp. 1285–1289.
[16] C. Dong, Y. Deng, C. Change Loy, and X. Tang, "Compression artifacts reduction by a deep convolutional network," in International Conference on Computer Vision (ICCV), 2015, pp. 576–584.
[17] T. Wang, M. Chen, and H. Chao, "A novel deep learning-based method of improving coding efficiency from the decoder-end for HEVC," in Data Compression Conference (DCC). IEEE, 2017, pp. 410–419.
[18] Y. Dai, D. Liu, and F. Wu, "A convolutional neural network approach for post-processing in HEVC intra coding," in International Conference on Multimedia Modeling (ICMM). Springer, 2017, pp. 28–39.
[19] Y. Li, D. Liu, H. Li, L. Li, F. Wu, H. Zhang, and H. Yang, "Convolutional neural network-based block up-sampling for intra frame coding," IEEE Transactions on Circuits and Systems for Video Technology, 2017.
[20] K. Gregor, F. Besse, D. J. Rezende, I. Danihelka, and D. Wierstra, "Towards conceptual compression," in Advances in Neural Information Processing Systems (NIPS), 2016, pp. 3549–3557.
[21] N. Johnston, D. Vincent, D. Minnen, M. Covell, S. Singh, T. Chinen, S. J. Hwang, J. Shor, and G. Toderici, "Improved lossy image compression with priming and spatially adaptive bit rates for recurrent networks," arXiv preprint arXiv:1703.10114, 2017.
[22] M. H. Baig, V. Koltun, and L. Torresani, "Learning to inpaint for image compression," in Advances in Neural Information Processing Systems (NIPS), 2017.
[23] Z. Chen and T. He, "Learning based facial image compression with semantic fidelity metric," arXiv preprint, 2018.
[24] S. Santurkar, D. Budden, and N. Shavit, "Generative compression," arXiv preprint arXiv:1703.01467, 2017.
[25] M. Mathieu, C. Couprie, and Y. LeCun, "Deep multi-scale video prediction beyond mean square error," in International Conference on Learning Representations (ICLR), 2016.
[26] W. Lotter, G. Kreiman, and D. Cox, "Deep predictive coding networks for video prediction and unsupervised learning," in International Conference on Learning Representations (ICLR), 2017.
[27] X. Jin, Z. Chen, S. Liu, and W. Zhou, "Augmented coarse-to-fine video frame synthesis with semantic loss," in Chinese Conference on Pattern Recognition and Computer Vision (PRCV). Springer, 2018, pp. 439–452.
[28] A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. van der Smagt, D. Cremers, and T. Brox, "FlowNet: Learning optical flow with convolutional networks," in International Conference on Computer Vision (ICCV), 2015, pp. 2758–2766.
[29] Z. Ren, J. Yan, B. Ni, B. Liu, X. Yang, and H. Zha, "Unsupervised deep learning for optical flow estimation," in AAAI, vol. 3, 2017, p. 7.
[30] M. Jaderberg, K. Simonyan, A. Zisserman et al., "Spatial transformer networks," in Advances in Neural Information Processing Systems (NIPS), 2015, pp. 2017–2025.
[31] S. K. Sønderby, C. K. Sønderby, L. Maaløe, and O. Winther, "Recurrent spatial transformer networks," arXiv preprint, 2015.
[32] A. v. d. Oord, N. Kalchbrenner, and K. Kavukcuoglu, "Pixel recurrent neural networks," in International Conference on Machine Learning (ICML), 2016.
[33] Z. Chen, J. Xu, Y. He, and J. Zheng, "Fast integer-pel and fractional-pel motion estimation for H.264/AVC," Journal of Visual Communication and Image Representation, vol. 17, no. 2, pp. 264–290, 2006.
[34] S. Xingjian, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W.-c. Woo, "Convolutional LSTM network: A machine learning approach for precipitation nowcasting," in Advances in Neural Information Processing Systems (NIPS), 2015, pp. 802–810.
[35] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
[36] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in International Conference on Machine Learning (ICML), 2015, pp. 448–456.
[37] J. O'Neal, "Predictive quantizing systems (differential pulse code modulation) for the transmission of television signals," Bell Labs Technical Journal, vol. 45, no. 5, pp. 689–721, 1966.
[38] T. Raiko, M. Berglund, G. Alain, and L. Dinh, "Techniques for learning binary stochastic feedforward neural networks," in International Conference on Learning Representations (ICLR), 2015.
[39] G. J. Sullivan and T. Wiegand, "Video compression: From concepts to the H.264/AVC standard," Proceedings of the IEEE, vol. 93, no. 1, pp. 18–31, 2005.
[40] "Common test conditions and software reference configurations, JCTVC-L1100," ITU-T/ISO/IEC Joint Collaborative Team on Video Coding (JCT-VC), Tech. Rep., 2013.
[41] D. Vaisey and A. Gersho, "Variable block-size image coding," in International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 12. IEEE, 1987, pp. 1051–1054.
[42] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," in International Conference on Learning Representations (ICLR), 2015.
[43] K. He, X. Zhang, S. Ren, and J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification," in International Conference on Computer Vision (ICCV), 2015, pp. 1026–1034.
[44] P. List, A. Joch, J. Lainema, G. Bjontegaard, and M. Karczewicz, "Adaptive deblocking filter," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 614–619, 2003.
[45] ITU-T and ISO/IEC JTC 1, "Generic coding of moving pictures and associated audio information - Part 2: Video," 1994.
[46] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra, "Overview of the H.264/AVC video coding standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 560–576, 2003.
[47] Z. Wang, E. P. Simoncelli, and A. C. Bovik, "Multiscale structural similarity for image quality assessment," in Conference Record of the Thirty-Seventh Asilomar Conference on Signals, Systems and Computers, vol. 2. IEEE, 2003, pp. 1398–1402.
[48] G. Bjontegaard, "Calculation of average PSNR differences between RD-curves," Doc. VCEG-M33, ITU-T Q6/16, Austin, TX, USA, April 2001.
[49] G. Bjontegaard, "Improvements of the BD-PSNR model," VCEG-AI11, ITU-T Q.6/SG16, 34th VCEG Meeting, Berlin, Germany, 2008.