Deep Video Precoding
Eirina Bourtsoulatze, Aaron Chadha, Ilya Fadeev, Vasileios Giotsas, and Yiannis Andreopoulos

Abstract—Several groups worldwide are currently investigating how deep learning may advance the state-of-the-art in image and video coding. An open question is how to make deep neural networks work in conjunction with existing (and upcoming) video codecs, such as MPEG H.264/AVC, H.265/HEVC, VVC, Google VP9 and AOMedia AV1, AV2, as well as existing container and transport formats, without imposing any changes at the client side. Such compatibility is a crucial aspect when it comes to practical deployment, especially when considering that the video content industry and hardware manufacturers are expected to remain committed to supporting these standards for the foreseeable future. We propose to use deep neural networks as precoders for current and future video codecs and adaptive video streaming systems. In our current design, the core precoding component comprises a cascaded structure of downscaling neural networks that operates during video encoding, prior to transmission. This is coupled with a precoding mode selection algorithm for each independently-decodable stream segment, which adjusts the downscaling factor according to scene characteristics, the utilized encoder, and the desired bitrate and encoding configuration. Our framework is compatible with all current and future codec and transport standards, as our deep precoding network structure is trained in conjunction with linear upscaling filters (e.g., the bilinear filter), which are supported by all web video players.
Extensive evaluation on FHD (1080p) and UHD (2160p) content and with widely-used H.264/AVC, H.265/HEVC and VP9 encoders, as well as a preliminary evaluation with the current test model of VVC (v.6.2rc1), shows that coupling such standards with the proposed deep video precoding allows for 8% to 52% rate reduction under encoding configurations and bitrates suitable for video-on-demand adaptive streaming systems. The use of precoding can also lead to encoding complexity reduction, which is essential for cost-effective cloud deployment of complex encoders like H.265/HEVC, VP9 and VVC, especially when considering the prominence of high-resolution adaptive video streaming.

Index Terms—video coding, neural networks, downscaling, upscaling, adaptive streaming, DASH/HLS

I. INTRODUCTION

In just a few years, technology has completely overhauled the way we consume television, feature films and other prime content. For example, Ofcom reported in July 2018 that there are now more UK subscriptions to Netflix, Amazon and NOW TV than to traditional pay TV services.¹ The proliferation of over-the-top (OTT) streaming content has been matched by an appetite for high-resolution content. For example, 50% of US homes will have UHD/4K TVs by 2020. At the same time, costs of 4K camera equipment have been falling rapidly. Looking ahead, 8K TVs were introduced at the 2018 CES by several major manufacturers, and several broadcasters announced they will begin 8K broadcasts in time for the 2020 Olympic games in Japan.

All authors are with iSIZE Ltd., 3 Falconet Court, 123 Wapping High Street, London, E1W 3NX, United Kingdom, email: yiannis@isize.co. This paper has been presented in part at the International Broadcasting Conference, IBC 2019, Amsterdam, The Netherlands.

¹ https://www.ofcom.org.uk/about-ofcom/latest/media/media-releases/2018/streaming-overtakes-pay-tv
Alas, for most countries, even the delivery of FHD (1080p) content is still plagued by broadband infrastructure problems. To get round this problem, OTT content providers resort to adaptive streaming technologies, such as Dynamic Adaptive Streaming over HTTP (DASH) and HTTP Live Streaming (HLS), where the streaming server offers bitrate/resolution ladders via the so-called "manifest" file [1], [2]. This allows the client device to switch to a range of lower resolutions and bitrates when the connection bandwidth does not suffice to support the high-quality/full-resolution video bitstream [1]. In order to produce the bitrate/resolution ladders, high-resolution frames are downscaled to lower resolutions, with the frame dimensions being reduced with decreased bitrate. Within all current adaptive streaming systems, this is done using standard downscaling filters, such as the bicubic filter, and the chosen resolution per bitrate stays constant throughout the content's duration and is indicated in the manifest file. At the client side, if the chosen bitrate corresponds to low-resolution frames, these frames are upscaled after decoding to match the resolution capabilities of the client's device. Unfortunately, the impact on visual quality from the widely-used bicubic downscaling and the lack of dynamic resolution adaptation per bitrate can be quite severe [3]. In principle, this could be remedied by post-decoding learnable video upscaling solutions similar to learnable super-resolution techniques for still images [4], [5]. However, their deployment requires substantial changes to the client device, which are usually too cumbersome and complex to make in practice (a.k.a. the hidden technical debt of machine learning [6]). For example, convolutional neural network (CNN) based upscalers with tens of millions of parameters cannot be supported by mainstream CPU-based web browsers that support DASH and HLS video playback.
A. Precoding for Video Communications

Precoding was initially proposed for MIMO wireless communications as the means to preprocess the transmitted signal and perform transmit diversity [7]. Precoding is similar to channel equalization, but the key difference is that it shapes (precodes) the signal according to the operation of the utilized receiver prior to channel coding. While channel equalization aims to minimize channel errors, a precoder aims to minimize the error in the receiver output.

In this paper, we introduce the concept of precoding for adaptive video streaming. As illustrated in Fig. 1, precoding for video is done by preprocessing the input video frames of each independently-decodable group of pictures (GOP) prior to standard encoding, while allowing a standard video decoder and player to decode and display them without requiring any modifications. The key idea of precoding is to minimize the distortion at the output of the client's player. To achieve this, we leverage the support for multiple resolutions at the player side and introduce a multi-scale precoding convolutional neural network (CNN) that progressively downscales input high-resolution frames over multiple scale factors.

Fig. 1. Proposed deep video precoding framework. The precoding module performs dynamic resolution adaptation during encoding, prior to streaming. The optimal scale factor is chosen by our adaptive precoding mode selection algorithm, which operates per GOP, bitrate and codec, and is passed to the server-based manifest file that points all adaptive streaming clients to the available content chunks (GOPs) and their bitrate and resolution settings.
A mode selection algorithm then selects the best precoding mode (i.e., resolution) to use per GOP, based on the GOP frame content, bitrate and codec characteristics. The precoding CNN is designed to compact information in such a way that the aliasing and blurring artifacts generated during linear upscaling are mitigated. This is because, in the vast majority of devices, video upscaling is performed by means of linear filters. Thus, our proposed deep video precoding solution focuses on matching standard video players' built-in linear upscaling filter implementations, such as the bilinear filter.²

Our experiments with standard FHD (1080p) and UHD (2160p) test content from the XIPH repository and well-established implementations of H.264/AVC, H.265/HEVC and VP9 encoders show that, by using deep precoding modes for downscaling, we can significantly reduce the distortion of video playback compared to conventional downscaling filters and fixed downscaling modes used in standard DASH/HLS streaming systems. We further demonstrate through extensive experimentation that the proposed adaptive precoding mode selection achieves 8% to 52% bitrate reduction for FHD and UHD content encoding at typical bitrate ranges used in commercial deployments. An important effect of the proposed precoding is that video encoding can be accelerated, since many GOPs tend to be shrunk by the precoder to only 6%-64% of their original size, depending on the selected downscaling factor. This is beneficial when considering cloud deployments of such encoders, especially in view of upcoming standards with increased encoding complexity.

² Despite an abundance of possible upscaling filters, in order to remain efficient over a multitude of client devices, most web browsers only support the bilinear filter for image and video upscaling in YUV colorspace, e.g., see the Chromium source code that uses libyuv (https://chromium.googlesource.com).
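The 6%-64% range quoted above follows directly from the geometry of downscaling: both frame dimensions shrink by the scale factor s, so the pixel count shrinks by a factor of 1/s². A small illustrative sketch (the scale factors listed are the ones used by the precoding streams described in Sec. III):

```python
def size_fraction(s: float) -> float:
    """Fraction of the original pixel count remaining after downscaling
    both frame dimensions by scale factor s (illustrative only)."""
    return 1.0 / (s * s)

# Representative precoding scale factors from Sec. III-C.
for s in [5/4, 4/3, 3/2, 2, 5/2, 3, 4]:
    print(f"s = {s:>5}: {100 * size_fraction(s):5.1f}% of original pixels")
```

For s = 5/4 this gives 64% and for s = 4 it gives 6.25%, consistent with the 6%-64% range reported in the text.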
In summary, our contributions are as follows:
• the concept of deep precoding for video delivery is introduced as the means of enhancing the rate-distortion characteristics of any video codec without requiring any changes on the client/decoder side;
• a multi-scale precoding CNN is proposed, which downscales high-resolution frames over multiple scale factors and is trained to mitigate the aliasing and blurring artifacts generated by standard linear upscaling filters;
• an adaptive precoding mode selection algorithm is proposed, which adaptively selects the optimal resolution prior to encoding.

It is important to emphasize that deep video precoding is a source encoding optimization framework carried out at the server side, prior to transmission, in order to optimally "shape" the input signal by deriving the best downscaled representation according to the input GOP segment, bitrate, codec and upscaling capabilities at the receiver, without considering transport conditions. Fig. 1 illustrates how our proposed deep video precoding can be used in conjunction with client-driven adaptive streaming systems like the widely-deployed DASH/HLS standards. As shown in Fig. 1, precoding for each content, codec and target bitrate is independent of how DASH/HLS players will switch between encoding ladders to adapt the video bitrate to the transport bandwidth, latency and buffer conditions. That is, once the DASH/HLS-compliant stream is produced by our approach for each encoding bitrate and the manifest file is created with the corresponding bitrate and resolution information per video segment, the adaptive bitrate streaming, stream caching and stream switching mechanisms that cope with bitrate and channel fluctuations can operate as usual. The key difference in our case is that the client receives, decodes and upscales bespoke representations created by the proposed precoder.
Hence, our proposal for deep video precoding is done in its entirety during content encoding and remains agnostic to the transport conditions experienced during each individual video bitstream delivery.

B. Paper Organization

The remainder of this paper is organized as follows. In Sec. II, we review related work. The proposed multi-scale deep precoding network architecture design, loss function, and implementation and training details are presented in Sec. III. The precoding mode selection algorithm is presented in Sec. IV. Experimental results are given in Sec. V, and Sec. VI concludes the paper.

II. RELATED WORK

Content-adaptive encoding has emerged as a popular solution for quality or bitrate gains in standards-based video encoding. Most commercial providers have already demonstrated content-adaptive encoding solutions, typically in the form of bitrate adaptation based on combinations of perceptual metrics, i.e., lowering the encoding bitrate for scenes that are deemed to be simple enough for a standard encoder to process. Such solutions can also be extended to other encoder parameter adaptations, and their essence is in the coupling of a visual quality profile to a pre-cooked encoder-specific tuning recipe.

A. Resolution Adaptation in Image and Video Coding

It has been known for some time that reducing the input resolution in image or video coding can improve visual quality for lower bitrates, as the encoder operates better at the "knee" of its rate-distortion curve [8]. Starting from non-learnable designs, Tsaig et al. [9] explored the design of optimal decimation and interpolation filters for block image coders like JPEG and showed that low-bitrate image coding between 0.05 bits-per-pixel (bpp) and 0.35 bpp benefits from such designs. Kopf et al. [10] proposed a content-adaptive method, wherein filter kernel coefficients are adapted with respect to image content. Oztireli et al.
[11] proposed an optimization framework to minimize structural similarity between the nearest-neighbor upsampled low-resolution image and the high-resolution image. Recently, Katsavounidis et al. [3], [12] proposed the notion of the dynamic optimizer in video encoding: each scene is downscaled to a range of resolutions and is subsequently compressed to a range of bitrates. After upscaling to full resolution, the convex hull of bitrates/qualities is produced in order to select the best operating resolution for each video segment. Quality can be measured with a wide range of metrics, ranging from simple peak signal-to-noise ratio (PSNR) to complex fusion-based metrics like the video multimethod assessment fusion (VMAF) metric [13]. While Bjontegaard distortion-rate (BD-rate) [14] gains of 30% have been shown in experiments for H.264/AVC and VP9, the dynamic optimizer requires very significant computational resources, while it still uses non-learnable downscaling filters. A learnable CNN-based downscaling method for image compaction was proposed by Li et al. [15], where the downscaling CNN is trained jointly with either a fixed linear upscaling filter or a trainable upscaling CNN. While that work also explored the concept of learnable downscaling, it is designed for fixed-ratio downscaling and does not provide content, bitrate and codec adaptivity as our proposed deep precoding approach does. However, it forms an important learnable downscaling framework that we use as one of the benchmarks for our work.

B. Super-resolution Methods

Overall, while the above methods have shown the possibility of rate saving via image and video downscaling, they have not managed to significantly outperform classical bicubic downscaling within the context of practical encoding.
This has led most researchers and practitioners to conclude that downscaling with bicubic or Lanczos filters is the best approach, and instead the focus has shifted to upscaling solutions at the client (i.e., decoder) side that learn to recover image detail assuming such downscaling operators. For example, Georgis et al. [16] proposed backprojection-based upscaling that is tailored to Gaussian kernel downscaling and showed that such approaches can be beneficial for FHD and UHD/4K encoding with H.264/AVC and HEVC up to 10 Mbps and 5 Mbps, respectively. The majority of recent works in this field consider CNN-based upscaling methods. This has been largely motivated by the success of deep CNN architectures for single-image super-resolution, which have set the state-of-the-art, with recent architectures like VDSR [17], EDSR [18], FSRCNN [19], DRCN [20] and DBPN [4] achieving several dB higher PSNR in the luminance channel of standard image benchmarks for lossless image upscaling. Thus, Afonso et al. propose a spatio-temporal resolution adaptation where a CNN-based super-resolution model is used to reconstruct full-resolution content [21]. Li et al. [22] introduce the block adaptive resolution coding framework for intra-frame coding, where each block within a frame is either downscaled or coded at original resolution and then upscaled with a trained CNN at the decoder side. This concept was later extended to include P and B frames as well [23]. Differently from the previous methods that operate in the pixel domain, Liu et al. perform down- and upsampling in the residue domain and design an upsampling CNN for residue super-resolution (SR) with the help of the motion-compensated prediction signal [24].
However, regardless of the domain where they operate, the common principle of all these works is that they use hand-crafted filters for downscaling and perform upscaling at the decoder side using codec-tailored CNN models integrated in the coding standard, which requires modifications to the codec.

C. Neural-network Representation and Coding Methods

While most research efforts have focused on learning optimal upscaling filters [25], inspired by the success of autoencoders for image compression [26], [27], some recent works revisit the problem of joint downscaling and upscaling using deep CNN-based methods. Shocher et al. [28] recently proposed an upscaling method using deep learning, which trains an image-specific CNN with high- and low-resolution pairs of patches extracted from the test image itself. Weber et al. [29] used convolutional filters to preserve important visual details, and Hou et al. [30] recently proposed a deep perceptual loss based method. Kim et al. [31] proposed an image downscaling/upscaling method using a deep convolutional autoencoder, and achieved state-of-the-art results on standard image benchmarks. An end-to-end image compression framework for low bitrates compatible with existing image coding standards was introduced in [32]. It comprises two CNNs: one for image compaction prior to encoding and one for post-processing after decoding. Adaptive spatio-temporal decomposition prior to encoding, followed by CNN-based spatio-temporal upscaling after decoding, was proposed by Afonso et al. and was validated with H.265/HEVC encoding [21]. Finally, WaveOne recently proposed video encoding with deep neural networks [33] and demonstrated quality gains against a conventional video encoder without B frames, focusing on very-high-bitrate encoding (20 Mbps or higher for FHD).
While these are important achievements, most of these proposals are still outperformed by post-2013 video encoders, like HEVC and VP9, when utilized with their most advanced video buffering verifier (VBV) encoding configurations and appropriate constant rate factor tuning [34]. In addition, all these proposals require advanced GPU capabilities on both the client and the server side that cannot be supported by existing video players, as they break away from current standards. Therefore, despite the significant advances that may be offered by all these methods in their future incarnations, they do not consider the stringent complexity and standards compatibility constraints imposed when dealing with adaptive video streaming under DASH- or HLS-compliant clients like web browsers. Our work fills this gap by offering deep video precoding as the means to optimize existing video encoders, with the entirety of the precoding process taking place on the server side and not requiring any change in the video transport, decoding and display side.

III. MULTI-SCALE PRECODING NETWORKS

While any downscaling method can be used with the video precoding framework of Fig. 1, to enhance the performance of our proposal in a data-driven manner, we introduce a multi-scale precoding neural network. The precoding network comprises a series of CNN precoding blocks that progressively downscale high-resolution (HR) video frames over multiple scale factors. We design the precoding CNN to compact information such that a standard linear upscaler at the video player side will be able to recover it in the best possible way. This is the complete opposite of recent image upscaling architectures that assume simple bicubic downscaling and an extremely complex super-resolution CNN architecture at the video player side.
For example, EDSR [5] comprises over 40 million parameters and would be highly impractical on the client side for 30-60 frame-per-second (fps) FHD/UHD video. In the following subsections, we describe the design of the proposed multi-scale precoding networks, including the network architecture, loss function, and details of implementation and training.

Fig. 2. The architecture of our multi-scale precoding network for video downscaling, comprising a root mapping R and precoding streams P_1, P_2, ..., P_M. The luminance channel of each input video frame is downsampled to multiple lower resolutions by the precoding network at the server via the precoding streams.

A. Network Architecture

The overall architecture of the precoding network is depicted in Fig. 2. It consists of a "root" mapping R followed by M parallelized precoding streams P_m. The network progressively downscales individual luminance frames x ∈ R^{H×W} (where H and W are the height and width, respectively) over the scale factors in S. Considering that the human eye is most sensitive to luma information, we intentionally process only the luminance (Y) channel with the precoding network and not the chrominance (Cb, Cr) channels, in order to avoid unnecessary computation. Dong et al. [35] support this claim empirically and, additionally, find that training a network on all three channels can actually worsen performance due to the network falling into a bad local minimum. We also note that this permits chroma subsampling (e.g., YUV420), as the chrominance channels (Cb, Cr) are downscaled independently using the standard bicubic filter.

1) Root mapping: The root mapping R, illustrated in Fig. 3, comprises two convolutional layers and extracts a set of high-dimensional feature maps r ∈ R^{H×W×K} from the input x, where K is the number of output channels of the root mapping.
The root mapping R extracts less abstract features (such as edges) that are common amongst all precoding streams and scale factors. Therefore, this module is shared between all precoding streams, which helps in reducing complexity.

2) Precoding stream: The extracted feature maps r are passed to the precoding streams. As depicted in Fig. 3, a precoding stream P_m comprises a sequence of N_m precoding blocks, which progressively downscale the input over a subset of N_m scale factors, S_m = {s_m1, s_m2, ..., s_mN_m} ⊆ S, where 1 < s_m1 < s_m2 < ... < s_mN_m. The allocation of scale factors to a precoding stream is done in such a way that: (i) complexity is shared equally between streams for efficient parallel processing; and (ii) the ratio of most of the consecutive pairs of scales s_m(n−1), s_mn within a stream is constant, i.e., α_mn = s_mn / s_m(n−1) = α_m ∈ Z+. Such construction of precoding streams exploits the inter-scale correlations among similar scales and enables parameter sharing within each precoding stream, thus further reducing the computational complexity. Most importantly, it renders our network amenable to standard encoding resolution ladders, such as those used in DASH/HLS streaming [2].

Fig. 3. Root mapping R and m-th precoding stream P_m. The root mapping extracts high-dimensional feature maps r and is shared by all precoding streams. The precoding stream P_m contains a sequence of precoding blocks and progressively downsamples the input high-resolution frames over a set of N_m scale factors.

Given a precoding stream P_m, the n-th constituent precoding block receives a function of the output map p_m(n−1) ∈ R^{H/s_m(n−1) × W/s_m(n−1) × K} from the preceding block and outputs the embedding p_mn ∈ R^{H/s_mn × W/s_mn × K}.
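As an illustration of this stream construction, the sketch below computes the output resolution after each precoding block for a 1080p input, together with the consecutive-scale ratios α_mn. The stream partition used here is the one the paper reports in its implementation details (S_1 = {4/3, 2, 4}, S_2 = {3/2, 3, 6}, S_3 = {5/4, 5/2}); note that, as stated above, the ratio is constant for most, but not all, consecutive pairs:

```python
# Illustrative sketch: output shapes and inter-scale ratios of the
# precoding streams for an FHD (1920x1080) luminance frame.
H, W = 1080, 1920

STREAMS = {
    "P1": [4/3, 2, 4],
    "P2": [3/2, 3, 6],
    "P3": [5/4, 5/2],
}

def output_resolutions(scales, h=H, w=W):
    """Resolution (H/s, W/s) produced after each precoding block,
    rounded to integer pixel counts for illustration."""
    return [(round(h / s), round(w / s)) for s in scales]

def consecutive_ratios(scales):
    """Ratios alpha_mn = s_mn / s_m(n-1) between successive scales."""
    return [b / a for a, b in zip(scales, scales[1:])]

for name, scales in STREAMS.items():
    print(name, output_resolutions(scales), consecutive_ratios(scales))
```

For stream P2 the ratio is exactly 2 between every pair of consecutive scales, which is what enables parameter sharing within the stream; P1 illustrates the "most, but not all" caveat (ratios 3/2 and 2).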
Notably, we utilize a global residual learning strategy, where we use a skip connection and perform a pixel-wise summation between the root feature maps r (pre-activation function and after linear downscaling to the correct resolution with a linear downscaling operator D↓s_mn) and p_mn. Similar global residual learning implementations have also been adopted by SR models [36]-[38]. In our case, our precoding stream effectively follows a pre-activation configuration [39] without batch normalization. We find empirically that convergence during training is generally faster with global residual learning, as the precoding blocks only have to learn the residual map to remove distortion introduced by downscaling operations.

The resulting downscaled feature map is finally mapped by a single convolutional layer F_mn to y_mn. Importantly, the sequence of precoding blocks only operates in a higher-dimensional feature space without any heavy bottlenecks. It is then the job of the convolutional layers F_mn to aggregate block outputs into single-channel representations of the input x, downscaled by a factor of s_mn. For N_m precoding blocks and a set of N_m scale factors S_m, the precoding stream outputs a set of N_m corresponding downscaled representations Y_m = {y_m1, y_m2, ..., y_mN_m} of the input x.

Fig. 4. Precoding block design, comprising a series of 3×3 and 1×1 convolutions. The linear downscaling operation D↓α_mn is only performed when the downscaling to the target resolution cannot be achieved via stride in the first convolutional layer. The linear mapping learned from the first layer (pre-activation function) is passed to the output of the second 3×3 convolutional layer (post-activation function) with a skip connection and pixel-wise summation.
The output activations in Y_m are clipped (rescaled) between the minimum and maximum pixel intensities, and each representation can thus be passed to the codec as a downscaled low-resolution (LR) frame. These frames can then be individually upscaled to the original resolution using a linear upscaler U↑s on the client side, such as bilinear, Lanczos or bicubic, where ↑s indicates upscaling by scale factor s. We refer to the upscaled frame generated from the downscaled frame y_mn as x̂_mn and denote the set of N_m upscaled representations of x as X̂_m = {x̂_m1, x̂_m2, ..., x̂_mN_m}.

3) Precoding block: Our precoding block, which constitutes the primary component of our network, is illustrated in Fig. 4. The precoding block consists of alternating 3×3 and 1×1 convolutional layers, where each layer is followed by a parametric ReLU (PReLU) [40] activation function. The 1×1 convolution is used as an efficient means for channel reduction in order to reduce the overall number of multiply-accumulates (MACs) for computation. The n-th precoding block in the m-th precoding stream is effectively responsible for downscaling the original high-resolution frame by a factor of s_mn. In order to maintain low complexity, it is important to downscale to the target resolution as early as possible. Therefore, we group all downsampling operations with the first convolutional layer in each precoding block. Denoting the input to the precoding block as i_mn, the output of the first convolutional layer C_mn of the precoding block as c_mn (as labelled in Fig. 4), and the spatial stride as k:

c_mn = C_mn(i_mn; k = α_mn),          if α_mn ∈ Z+
c_mn = C_mn(D↓α_mn(i_mn); k = 1),     otherwise          (1)

In other words, downscaling is implemented in the first convolutional layer with a stride if α_mn is an integer; otherwise we use a preceding linear downscaling operation D↓α_mn (bilinear or bicubic).
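The dispatch in (1) can be sketched as follows. Here `conv` and `linear_downscale` are hypothetical stand-ins for a strided convolution layer and a bilinear/bicubic resize; only the control flow is taken from the paper:

```python
# Sketch of Eq. (1): fold downscaling into the convolution stride when
# the per-block ratio alpha is an integer; otherwise linearly downscale
# first and apply a stride-1 convolution.
def precoding_block_first_layer(i_mn, alpha, conv, linear_downscale):
    if float(alpha).is_integer():
        # Integer ratio: strided convolution performs the downscaling.
        return conv(i_mn, stride=int(alpha))
    # Fractional ratio (e.g. 4/3, 3/2, 5/4): linear pre-downscale, then
    # a stride-1 convolution.
    return conv(linear_downscale(i_mn, alpha), stride=1)
```

This keeps all downsampling in the first layer of the block, which is what lets the rest of the block operate at the (cheaper) target resolution.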
The aim of the precoding block is to reduce aliasing artifacts in a data-driven manner. Considering that the upscaling operations are linear and, therefore, heavily constrained, the network is asymmetric and the precoding structure cannot simply learn to pseudo-invert the upscaling. If that were the case, it would simply result in a traditional linear anti-aliasing filter, i.e., a lowpass filter that removes the high-frequency components of the image in a globally-tuned manner, such that the image can be properly resolved. Removing the high frequencies from the image leads to blurring. Adaptive deblurring is an inverse problem, as there is typically limited information in the blurred image to uniquely determine a viable input. As such, within our proposed precoding structure, we can respectively model the anti-aliasing and deblurring processes as a function composition of linear and non-linear mappings. As illustrated in Fig. 4, this is implemented with a skip connection between linear and non-linear paths, as utilized in ResNet [41] and its variants, again following the pre-activation structure. In order to ensure that the output of the non-linear path has full range (−∞, ∞), we can initialize the final PReLU before the skip connection such that it approximates an identity function.

B. Loss Function

Given the luminance channel of a ground-truth frame x ∈ R^{H×W} (with Y ranging between 16 and 235 as per ITU-R BT.601 conversion), our goal is to learn the parameters of the root R and all precoding streams P_1, P_2, ..., P_M. We denote the root module parameters as θ and the parameters of the m-th precoding stream as φ_m.
For the m-th precoding stream with downscaling over N_m scale factors and I training samples per batch, the composite loss function L_m can be defined as:

L_m(X̂_m, x; θ, φ_m) = (1/I) Σ_{i=1}^{I} Σ_{n=1}^{N_m} ( ||x̂_mn^(i) − x^(i)||_1 + λ ||∇x̂_mn^(i) − ∇x^(i)||_1 )          (2)

The first term represents the L1 loss between each generated upscaled frame x̂_mn = U↑s_mn(y_mn) and the ground-truth frame x, summed over all N_m scales, where y_mn = F_mn ∘ P_m,↓s_mn ∘ R(x; θ, φ_m) and P_m,↓s_mn is the part of the m-th precoding stream that includes all precoding blocks up to the n-th block and is responsible for downscaling the input feature maps r by the scale factor s_mn. The second term represents the L1 loss between the first-order derivatives of the generated high-resolution and ground-truth frames. As the first-order derivatives correspond to edge extraction, the second term acts as an edge-preservation regularization for improving perceptual quality. We set the weight coefficient λ to 0.5 for all experiments; empirically, this was found to produce the best visual quality in output frames. Contrary to recent work [15], [31], we do not add a loss function constraint between the downscaling and upscaling, as our upscaling is only linear and, therefore, already heavily constrains the downscaled frames. Most importantly, we do not include the codec in the training process and train end-to-end (i.e., without encoding/transcoding), such that the model does not learn codec-specific dependencies and is able to generalize to multiple codecs. Finally, as we train the parallelized multi-scale precoding network over all streams synchronously, our final loss function is the summation of (2) over all M precoding streams:

L(X̂, x; θ, φ) = Σ_{m=1}^{M} L_m(X̂_m, x; θ, φ_m)          (3)

where X̂ = X̂_1 ∪ X̂_2 ∪ ... ∪ X̂_M and φ = {φ_1, φ_2, ..., φ_M}.
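For a single frame at a single scale, the two terms of (2) can be sketched with NumPy as follows. Here `np.gradient` stands in for the first-order derivative operator ∇, and taking the mean over pixels is our own normalization choice; the batch average and the sum over scales from (2) are omitted:

```python
import numpy as np

def precoding_loss(x_hat, x, lam=0.5):
    """Sketch of the per-frame, per-scale loss in Eq. (2):
    L1(x_hat, x) + lam * L1(grad x_hat, grad x), with lam = 0.5 as in
    the paper. x_hat is the upscaled frame, x the ground truth."""
    l1 = np.mean(np.abs(x_hat - x))
    # First-order derivatives along the vertical and horizontal axes.
    gy_hat, gx_hat = np.gradient(x_hat)
    gy, gx = np.gradient(x)
    l1_grad = np.mean(np.abs(gy_hat - gy)) + np.mean(np.abs(gx_hat - gx))
    return l1 + lam * l1_grad
```

The gradient term penalizes edge mismatch only: a frame that differs from the ground truth by a constant offset incurs no gradient penalty, whereas blurred edges do.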
C. Implementation and Training Details

In our proposed multi-scale precoding network, we initialize all kernels using Xavier initialization [42]. We use PReLU as the activation function, as indicated in Figs. 3 and 4. We use zero padding to ensure that all layers are the same size, so that downscaling is only controlled by a downsampling operation such as a stride or a linear downscaling filter. The root mapping R comprises a single 3×3 and a single 1×1 convolutional layer. We set the number of channels in all 1×1 and 3×3 convolutional layers to 4 and 8, respectively (excluding F_mn, which uses a kernel size of 3×3 but with only a single output channel).

Our final parallelized implementation comprises three precoding streams P_1, P_2 and P_3, with the set of scale factors S \ {1} partitioned into three subsets: S_1 = {4/3, 2, 4}, S_2 = {3/2, 3, 6} and S_3 = {5/4, 5/2}. Collectively, these include all representative scale factors used in DASH/HLS streaming systems [2], as well as additional scale factors that offer higher flexibility in our adaptive mode selection algorithm. In terms of complexity, for a 1920×1080×4-dimensional feature map (assuming no downscaling), a single precoding block requires approximately only 1.33G MACs for downscaling and 640 parameters. Our final implementation requires only 3.38G MACs and 5.5K parameters over all scales per FHD input frame (1920×1080), including the root mapping and all precoding streams.

We train the root module and all precoding streams end-to-end with linear upscaling, without the codec, on images from the DIV2K [43] training set. We train all models with the Adam optimizer with a batch size of 32 for 200k iterations. The initial learning rate is set to 0.001 and decayed by a factor of 0.1 at 100k iterations. We use data augmentation during training by randomly flipping the images, and train with 120×120 random crops extracted from the DIV2K images.
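The flip-and-crop augmentation described above can be sketched as follows. The function name and the use of NumPy's `Generator` API are our own; the paper specifies only random flips and 120×120 random crops:

```python
import numpy as np

def augment(img, rng, crop=120):
    """Random crop + random horizontal flip, a sketch of the data
    augmentation used when training on DIV2K images."""
    h, w = img.shape[:2]
    top = rng.integers(0, h - crop + 1)
    left = rng.integers(0, w - crop + 1)
    patch = img[top:top + crop, left:left + crop]
    if rng.random() < 0.5:
        patch = patch[:, ::-1]  # horizontal flip
    return patch
```

Usage: `patch = augment(frame, np.random.default_rng(0))` yields a 120×120 training patch from a full-resolution luminance frame.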
All experiments were conducted in TensorFlow on NVIDIA K80 GPUs. We do not use TensorFlow's built-in linear image resizing functions; instead, we rewrite all linear upscaling/downscaling functions from scratch, such that they match the standard FFmpeg and OpenCV implementations.

IV. ADAPTIVE VIDEO PRECODING

Given the trained multi-scale precoding network, the mode selection algorithm operates on a per-GOP, per-bitrate and per-codec basis. The goal of mode selection is to determine the optimal precoding scale factor for each GOP. While one could use operational rate-distortion (RD) models for H.264/AVC or H.265/HEVC [44] to evaluate the best precoding modes, such models cannot encapsulate the complex and sequence-dependent RD behavior of each codec preset. On the other hand, exhaustive search solutions like the Netflix dynamic optimizer [3] require numerous highly-complex encodings per bitrate and scale factor. This makes them impractical for high-volume/low-cost encoding systems. Our approach provides a middle ground between these two extremes by computing operational rate-distortion characteristics of the video encoder for each precoding mode and video segment in an efficient manner. The precoding mode selection algorithm is outlined in Algorithm 1 and comprises three steps. The first step is to obtain the rate-distortion characteristics of each precoding mode (scale). Let S = {s_1, s_2, ..., s_N} be the complete set of scale factors, which may include the native (i.e., full) resolution. Each GOP segment g is precoded into all possible scales in S with the multi-scale precoding network. Let h_i denote the precoded version of GOP g using scale factor s_i ∈ S. All precoded versions of GOP g are then encoded with the video encoder's preset and encoding parameters.
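To make this first step concrete, the following toy sketch produces one (R, D) point per integer scale factor. Block-average downscaling, uniform quantization as a stand-in "codec", and pixel-repetition upscaling are all simplifications of this sketch, standing in for the actual precoder, encoder and bilinear upscaler:

```python
import numpy as np

def toy_encode(frame, step=8.0):
    # Stand-in "codec": uniform quantization; the count of nonzero
    # quantized samples serves as a crude rate proxy.
    q = np.round(frame / step)
    return q * step, int(np.count_nonzero(q))

def rd_point(frame, s):
    # One (rate, distortion) point for integer scale factor s:
    # downscale by block averaging, "encode", upscale by pixel
    # repetition, then measure MSE against the input frame.
    h, w = frame.shape
    small = frame[:h - h % s, :w - w % s].reshape(h // s, s, w // s, s).mean(axis=(1, 3))
    rec, rate = toy_encode(small)
    up = np.repeat(np.repeat(rec, s, axis=0), s, axis=1)
    crop = frame[:up.shape[0], :up.shape[1]]
    return rate, float(np.mean((crop - up) ** 2))
```

Larger scale factors trade rate for distortion, which is exactly the trade-off the mode selection of Algorithm 1 navigates.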
As such, per scale factor s_i, we obtain a single GOP encoding which, after decoding and upscaling to the native resolution, produces a single rate-distortion point (R_i, D_i) on the RD plane.³ This provides a significant reduction of the required encodings versus approaches like the Netflix dynamic optimizer, which needs a convex hull of several rate-distortion points per scale factor s_i, i.e., several encodings per scale factor [3]. To accelerate this step even further and avoid unnecessary computational overhead, we introduce a "footprinting" process: instead of encoding all frames of the GOP, we perform selective encoding of only a few frames, e.g., keeping only every n-th frame in the GOP. This significantly speeds up the initial encoding step, especially if multiple precoding modes are considered, as n times fewer frames need to be precoded and encoded per scale factor. We store the set of tuples R = {(R_i, D_i, h_i, s_i) : i = 1, ..., |S|}. Once all the RD points are obtained, we start the pruning process. First, we eliminate all precoding modes whose RD points do not yield a monotonically decreasing RD curve. That is, for every precoding mode i, if there exists a precoding mode j such that R_j ≤ R_i and D_j < D_i, precoding mode s_i is pruned out. If, after this elimination procedure, the number of remaining precoding modes is greater than two, we further prune out precoding modes by eliminating those modes whose RD points do not lie on the convex hull of the remaining RD points [45], [3]. We refer to the set of pruned tuples as R_pruned ⊆ R. After the pruning stage, we re-encode the remaining precoded GOP representations h_i using constant bitrate (CBR) encoding. The bitrate used for the CBR encoding is equal to the average of the bitrates of all RD points remaining after the elimination step.
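The Step-2 pruning just described can be sketched as follows; here points are plain (rate, distortion) pairs, whereas Algorithm 1 carries the precoded GOP and scale factor alongside each point:

```python
def prune_rd(points):
    # Step-2-style pruning of (rate, distortion) tuples: keep only
    # points forming a monotonically decreasing RD curve, then (if
    # more than two survive) keep only those on the lower convex hull.
    pts = sorted(points)                      # ascending rate
    mono = []
    for r, d in pts:
        if not mono or d < mono[-1][1]:       # distortion must strictly drop
            mono.append((r, d))
    if len(mono) <= 2:
        return mono
    hull = []
    for p in mono:
        while len(hull) >= 2:
            (r1, d1), (r2, d2) = hull[-2], hull[-1]
            # Pop hull[-1] if it lies on or above the chord from
            # hull[-2] to p (i.e., it is not on the lower convex hull).
            if (r2 - r1) * (p[1] - d1) - (d2 - d1) * (p[0] - r1) <= 0:
                hull.pop()
            else:
                break
        hull.append(p)
    return hull
```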
After decoding and upscaling to the native resolution, we obtain a new set of tuples, R′_pruned, corresponding to the CBR encoding. This final step of the algorithm essentially remaps the RD points that remain after the pruning process to a common bitrate value, which is enforced by the CBR encoding. The optimal precoding mode s* is then selected as the mode providing the lowest distortion among this new set of remapped RD points.

³We use mean squared error (MSE) to measure distortion. Even though other distortion measures could be used, MSE is fast to compute and is automatically provided by all encoders.

Algorithm 1: Algorithm for adaptive mode selection
Input: GOP segment g, complete set of scale factors S
Output: optimal precoding mode s*, optimal precoded version h* of GOP segment g
1:  procedure MODESELECTION(g, S)
    Step 1: Extract RD points for all precoding modes
2:    R ← {}
3:    for each s ∈ S do
4:      h ← PRECODE(g, s)
5:      (e, R) ← ENCODE(h, preset, params)
6:      ĥ ← DECODE(e)
7:      ĝ ← UPSCALE(ĥ)
8:      D ← ||ĝ − g||₂²
9:      R ← R ∪ {(R, D, h, s)}
10:   end for
    Step 2: Prune out RD points
11:   R_sorted ← ((R_i, D_i, h_i, s_i) ∈ R | R_i > R_{i−1}), i = 1, ..., |S|
12:   R_pruned ← {R_sorted[1]}; D_ref ← D_1
13:   if |S| > 1 then
14:     for i = 2 to |S| do
15:       (R_i, D_i) ← R_sorted[i]
16:       if D_i < D_ref then
17:         R_pruned ← R_pruned ∪ {(R_i, D_i, h_i, s_i)}
18:         D_ref ← D_i
19:       end if
20:     end for
21:   end if
22:   if |R_pruned| > 2 then
23:     R_pruned ← CONVEXHULL(R_pruned)
24:   end if
    Step 3: Re-encode remaining RD points with CBR
25:   R′_pruned ← {}
26:   for each (R, D, h, s) ∈ R_pruned do
27:     (e, R) ← ENCODE(h, preset, params, 'CBR')
28:     ĥ ← DECODE(e)
29:     ĝ ← UPSCALE(ĥ)
30:     D ← ||ĝ − g||₂²
31:     R′_pruned ← R′_pruned ∪ {(R, D, h, s)}
32:   end for
33:   (R*, D*, h*, s*) ← min_D(R′_pruned)
34: end procedure

For illustration purposes, we demonstrate in Fig. 5a-5d the operation of the mode selection algorithm on the first GOP (first 30 frames) of the aspen FHD sequence when the latter is encoded with H.264/AVC with a target bitrate of 500kbps. For downscaling, we use nine scale factors S = {1, 5/4, 4/3, 3/2, 2, 5/2, 3, 4, 6} with our trained multi-scale precoding network (described in Section III), while upscaling is performed using the bilinear filter. Fig. 5a shows the set of RD points obtained after precoding the GOP for each scale factor in S and encoding with H.264/AVC via VBV encoding with the CRF values of Table I, which is representative of real-world streaming presets [34], and both maximum bitrate and buffer size set to 500kbps. As shown in Fig. 5b, the RD points (and the corresponding precoding modes) that do not provide a monotonically increasing rate-PSNR curve are eliminated from the RD plane.

Fig. 5. Illustration of the operation of the proposed precoding mode selection algorithm during the encoding of the first 30 frames of the aspen FHD video sequence. (a) RD points corresponding to all encodings of the first GOP with H.264/AVC for all scale factors in S. (b) Remaining RD points after pruning the points in (a) that do not provide a monotonically increasing RD curve. (c) RD points remaining after eliminating the points that do not lie on the convex hull of the RD points in (b). (d) Remapping of the RD points in (c) with CBR encoding at the average bitrate after the pruning process.

Since there are more than two points remaining after this step, we next prune out the points that do not lie on the convex hull, which leaves us with two candidate precoding modes, s = 6 and s = 5/2. We finally re-encode the GOP, precoded with the two modes, with CBR encoding at 567kbps, which is the average bitrate of the two points in Fig. 5c. This results in the two RD points shown in Fig. 5d.
The selected precoding mode is s* = 6, since it yields the highest PSNR value of the two points. Notice that the CBR encoding acts as an RD remapping and leads to both modes obtaining lower PSNR values than their PSNRs under VBV encoding. However, the absolute PSNR values are not relevant, since we are looking for the maximum PSNR under CBR; the final precoding mode is applied on the entire GOP and the encoding uses the VBV encoding mode preset. Finally, it is also of interest to notice that, for this low-bitrate case, the s = 1 case (full resolution) is immediately shown to be suboptimal, and the mode selection is left to choose between the higher-quality/higher-bitrate mode s = 5/2 and the lower-quality/lower-bitrate mode s = 6, with the latter prevailing when remapping the two points into CBR encoding at their average rate.

V. EXPERIMENTAL RESULTS

In this section, we evaluate our proposed adaptive video optimization framework in scenarios with the widest practical impact. We first compare the performance of the proposed multi-scale precoding network with the performance of standard linear downscaling filters for individual precoding modes. Then, we evaluate the entire adaptive video precoding framework by comparing it to standard video codecs via their FFmpeg open-source implementations, as well as to external highly-regarded video encoders that act as third-party anchors. The use of FFmpeg encoding libraries instead of the reference software libraries provided by the MPEG/ITU-T or AOMedia standardization bodies allows for the use of the same VBV model architecture for all tests and corresponds to a widely-used streaming-oriented scenario found in systems deployed around the world.
A. Content and Test Settings

The test content comprises 16 FHD (1920 × 1080) and 14 UHD (3840 × 2160) standard video sequences in 8-bit YUV420 format from the XIPH collection⁴, which have also been used in AOMedia standardization efforts. The FHD test content comprises the sequences aspen, blue sky, controlled burn, rush field cuts, sunflower, rush hour, old town cross, crowd run, tractor, touchdown, riverbed, red kayak, west wind easy, pedestrian area, ducks take off, park joy, which have frame rates between 25fps and 50fps. The UHD sequences used in the tests are Netflix BarScene, Netflix Boat, Netflix BoxingPractice, Netflix Crosswalk, Netflix Dancers, Netflix DinnerScene, Netflix DrivingPOV, Netflix FoodMarket, Netflix FoodMarket2, Netflix Narrator, Netflix RitualDance, Netflix RollerCoaster, Netflix Tango, Netflix TunnelFlag, all at 60fps. The 4K (4096 × 2160) source sequences were cropped to the central 3840 × 2160 (UHD) portion and converted to 8-bit YUV420 format (using x265 lossless compression) prior to encoding, in order to produce the UHD sequences. Performance is measured in terms of average PSNR and average VMAF, calculated with the tools made available by Netflix [13]. Average PSNR is the arithmetic mean of the PSNR values of all YUV channels of each frame. Similarly to PSNR, VMAF is measured per frame and the average VMAF is obtained by taking the arithmetic mean over all frames.

⁴https://media.xiph.org/video/derf/

TABLE I
CRF values for libx264 and libx265, and speed setting for libvpx-vp9, under our utilized precoding modes.

  s     libx264   libx265   libvpx-vp9
  1       23        -          -
  5/4     23        -          -
  4/3     23       23        speed=1
  3/2     23       23        speed=1
  2       18       18        speed=1
  5/2     18       18        speed=1
  3       18       18        speed=1
  4       18       18        speed=1
  6       18       18        speed=1
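The average-PSNR computation described above (arithmetic mean over the YUV channels of every frame) can be sketched as follows; the 8-bit peak value and the frame representation as (Y, U, V) tuples are assumptions of this sketch:

```python
import numpy as np

def psnr(ref, rec, peak=255.0):
    # Per-channel PSNR from MSE; identical channels give infinity.
    mse = np.mean((ref.astype(np.float64) - rec.astype(np.float64)) ** 2)
    return float('inf') if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

def average_psnr(ref_frames, rec_frames):
    # Arithmetic mean of the PSNR values of all YUV channels of all
    # frames, matching the "average PSNR" definition used above.
    vals = [psnr(r[c], x[c])
            for r, x in zip(ref_frames, rec_frames)
            for c in range(3)]
    return sum(vals) / len(vals)
```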
For the experiments of Section V-B, we used only the first 240 frames of these UHD sequences, as well as the standard two-pass rate control settings in FFmpeg [46]. This corresponds to the typical configuration used in general-purpose video encoding. For the experiments of Section V-C, we used the single-pass VBV encoding settings for libx264/libx265/libvpx-vp9 that correspond to OTT streaming scenarios and can be found in the FFmpeg documentation [47], together with the downscaling factors and CRF values of Table I.

TABLE II
BD-rate (ΔR) and BD-PSNR (ΔP) results for representative downscaling factors.

        H.264/AVC                          H.265/HEVC
        bicubic           Lanczos          bicubic           Lanczos
  s     ΔR       ΔP       ΔR       ΔP     ΔR       ΔP       ΔR       ΔP
  5/2  -24.70%  0.61dB   -19.21%  0.45dB  -25.17%  0.55dB   -18.84%  0.39dB
  2    -18.85%  0.56dB   -14.71%  0.42dB  -19.25%  0.52dB   -14.46%  0.37dB
  3/2  -17.11%  0.45dB   -11.75%  0.31dB  -13.18%  0.32dB   -8.26%   0.20dB

TABLE III
BD-rate (ΔR) and BD-VMAF (ΔV) for representative downscaling factors.

        H.264/AVC                      H.265/HEVC
        bicubic        Lanczos         bicubic        Lanczos
  s     ΔR      ΔV     ΔR      ΔV     ΔR      ΔV     ΔR      ΔV
  5/2  -39.74%  7.86  -34.30%  6.49   -39.73%  7.03  -33.75%  5.74
  2    -30.32%  5.81  -27.57%  5.18   -30.20%  5.12  -27.41%  4.57
  3/2  -23.21%  3.43  -21.73%  3.18   -18.66%  2.61  -17.67%  2.46

Fig. 6. Rate-distortion curves in terms of (a) PSNR and (b) VMAF for FHD content encoded with H.264/AVC and scale factor s = 5/2. Curves shown: H.264/AVC (libx264) with the proposed precoding, with bicubic downscaling, and with Lanczos downscaling.
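The BD-rate figures in Tables II and III follow the Bjontegaard metric [14]. A minimal NumPy sketch (cubic fit of log-rate versus quality, integrated over the overlapping quality range) is given below; this is a generic sketch of the metric, not the exact tooling used for the tables:

```python
import numpy as np

def bd_rate(rates_a, psnrs_a, rates_b, psnrs_b):
    # Bjontegaard delta-rate: average relative bitrate difference of
    # curve B versus curve A over their overlapping quality range
    # (negative means B needs fewer bits for the same quality).
    la, lb = np.log10(rates_a), np.log10(rates_b)
    pa = np.polyfit(psnrs_a, la, 3)
    pb = np.polyfit(psnrs_b, lb, 3)
    lo = max(min(psnrs_a), min(psnrs_b))
    hi = min(max(psnrs_a), max(psnrs_b))
    ia = np.polyval(np.polyint(pa), hi) - np.polyval(np.polyint(pa), lo)
    ib = np.polyval(np.polyint(pb), hi) - np.polyval(np.polyint(pb), lo)
    avg_diff = (ib - ia) / (hi - lo)
    return (10.0 ** avg_diff - 1.0) * 100.0
```

For example, a curve that reaches the same PSNR at half the bitrate everywhere yields a BD-rate of -50%.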
While PSNR has been used for decades as a visual quality metric, VMAF is a relatively recent perceptual video quality metric adopted by the video streaming community, which has been shown to correlate very well with human perception of video quality. It is a self-interpretable [0, 100] visual quality scoring metric that uses a pretrained fusion approach to merge several state-of-the-art individual visual quality scores into a single metric. Both PSNR and VMAF are calculated on native-resolution frames after decoding and upscaling with the bilinear filter, which is supported by all video players and web browsers.

B. Evaluation of Precoding Modes

We first evaluate the performance of our proposed multi-scale precoding network against the bicubic and Lanczos filters, which are the two standard downscaling filters supported by all mainstream encoding libraries like FFmpeg. We focus on three indicative scale factors on FHD content, opting for a very common scenario of H.264/AVC encoding under its FFmpeg libx264 implementation. Specifically, we use the "medium" preset, the two-pass rate control mode [46], GOP=30, and a bitrate range of 0.5-10 Mbps. BD-rate [14] gains with respect to PSNR and VMAF are shown in Table II and Table III, respectively. Our precoding consistently outperforms bicubic and Lanczos downscaling for all modes. For PSNR, its BD-rate gains range from 8% to 25%, while, for VMAF, rate reductions of 18%-40% are obtained. Indicative rate-distortion curves with respect to PSNR and VMAF for the s = 5/2 scaling factor are presented in Fig. 6a and Fig. 6b, showing that the proposed precoding network consistently outperforms conventional downscaling filters. While the gain increases at higher bitrates, substantial gain is observed in the low-bitrate region as well. Specifically, for PSNR, the BD-rate and BD-PSNR gains over bicubic (Lanczos) downscaling for the 0.5-2Mbps rate region (zoomed part of the curve in Fig.
6a) are 10.54% (6.9%) and 0.27dB (0.18dB), respectively. For VMAF, the BD-rate and BD-VMAF gains over bicubic (Lanczos) downscaling are 29.29% (24.74%) and 5.94 (4.91), respectively, for the same low-bitrate region. Example segments of a frame encoded at 5000kbps with the proposed precoding, and with Lanczos and bicubic downscaling, are shown in Fig. 7 and Fig. 8. The improvement in visual fidelity demonstrated in the figures is also captured by the (approximately) 10-point average VMAF difference shown at the 5Mbps point of Fig. 6b. Several of the FHD video sequences encoded with a variation of the proposed precoding and H.265/HEVC (and the corresponding H.265/HEVC encoded results) are also available for visual inspection at www.isize.co/portfolio/demo. They can be played with any player, since our proposal does not require any change at the streaming or client side.

Fig. 7. Two segments of frame 25 of the crowd run FHD sequence encoded at 5000kbps with the settings corresponding to Fig. 6. The precoded stream preserves the lettering and overall shapes significantly better than Lanczos and bicubic downscaling. Best viewed under magnification.

Fig. 8. Two segments of frame 77 of the rush field cuts FHD sequence encoded at 5000kbps with the settings of Fig. 6. The precoded stream preserves the geometric structures significantly better than Lanczos and bicubic downscaling. Best viewed under magnification.

TABLE IV
Average BD-rate (ΔR) and BD-PSNR (ΔP) for 16 FHD test sequences and settings described in Section V-C. The proposed precoding, iSIZE, in conjunction with H.264/AVC, H.265/HEVC and VP9 is evaluated against standalone H.264/AVC, H.265/HEVC and VP9 encodings as well as commercial solutions including AWS MediaConvert (AVC and HEVC) and AWS Elastic Transcoder (VP9).

        AVC + iSIZE        HEVC + iSIZE       VP9 + iSIZE
        ΔR       ΔP        ΔR      ΔP         ΔR       ΔP
  AVC  -14.80%  0.72dB      -       -          -        -
  HEVC   -       -        -8.09%  0.27dB       -        -
  VP9    -       -          -       -        -30.70%  1.21dB
  AWS  -18.23%  0.86dB      -       -        -19.62%  1.17dB

TABLE V
Average BD-rate (ΔR) and BD-VMAF (ΔV) for 16 FHD test sequences and settings described in Section V-C. The proposed precoding, iSIZE, in conjunction with H.264/AVC, H.265/HEVC and VP9 is evaluated against standalone H.264/AVC, H.265/HEVC and VP9 encodings as well as commercial solutions including AWS MediaConvert (AVC and HEVC) and AWS Elastic Transcoder (VP9).

        AVC + iSIZE      HEVC + iSIZE     VP9 + iSIZE
        ΔR       ΔV      ΔR       ΔV      ΔR       ΔV
  AVC  -26.29%   8.47     -        -       -        -
  HEVC   -        -     -15.57%   3.03     -        -
  VP9    -        -       -        -     -25.81%   6.07
  AWS  -41.60%  12.18     -        -     -19.52%   6.75

C. Evaluation of Adaptive Precoding for Video-on-Demand Encoding

Since precoding can be applied to any codec and any video resolution, there is a virtually unlimited range of tests that can be carried out to assess its performance in a multitude of scenarios of interest. Here, we focus on test conditions appropriate for highly-optimized video-on-demand (VOD) encoding systems that are widely deployed today for video delivery. Our evaluation focuses on average bitrate-PSNR and bitrate-VMAF curves for our test FHD and UHD sequences, and we present the results for: H.264/AVC in Fig. 9 and Fig. 10; H.265/HEVC in Fig. 11 and Fig. 12; and VP9 in Fig. 13 and Fig. 14. The corresponding BD-rate gains of our approach, which we term "iSize", in conjunction with each of these encoders vs.
the corresponding encoder implementations, averaged over all test sequences, are presented in Tables IV-VII. For the proposed precoding method, we use footprinting with a speed-up factor of 5, i.e., only every 5th frame is processed during the selection of the best precoding mode, and the same encoding configuration is used as for the corresponding baseline encoder. Regarding H.264/AVC and H.265/HEVC, we use the highly-optimized "slower" preset and VBV encoding for libx264/libx265, with GOP=90, and the widely-used crf=23 configuration for VBV (see footnote 4 for further details). For our approach, we employed (per codec) the precoding modes and CRF values shown in Table I. To illustrate that our gains are achieved over a commercially competitive VOD encoding setup, for H.264/AVC we also include results with the high-performing AWS MediaConvert encoder⁵ using the MULTIPASS HQ H.264/AVC profile and its recently-announced high-performance QVBR mode with the default quality level of 7. The results of Tables IV-VII show that, against the H.264/AVC libx264 implementation, the average rate saving of our approach for both FHD and UHD resolution under both metrics (PSNR and VMAF) is 35%; the corresponding saving of our approach against H.264/AVC AWS MediaConvert is 44%. For H.265/HEVC libx265, the average saving of our approach is 15%.

Regarding VP9, we employed VBV encoding with min-max rate (see more details at [49]), GOP=90 frames, maxrate=1.45 × minrate, speed=1 for lower-resolution encoding (see Table I) and speed=2 for the full-resolution encoding anchor, since we only utilize downscaled versions with 6% to 64% of the video pixels of the original resolution. Additional bitrate reduction may be achievable by utilizing two-pass encoding in libvpx-vp9, but we opted for VBV encoding to keep our comparison balanced with the VP9 implementation provided by the AWS Elastic Transcoder, which was used as our external benchmark for VP9. The settings of the Elastic Transcoder jobs were based on the built-in presets⁶, which we customized to match the desired output video codec, resolution, bitrate, and GOP size, and we set the framerate according to the input video framerate. Such customization is necessary because the built-in presets do not follow the input video parameters and they serve mainly as boilerplates. The results of Tables IV-VII show that, against the VP9 libvpx-vp9 implementation, the average rate saving of our approach for both FHD and UHD resolution under both metrics (PSNR and VMAF) is 35%; the corresponding saving of our approach against VP9 AWS Elastic Transcoder is 36%.

TABLE VI
Average BD-rate (ΔR) and BD-PSNR (ΔP) for 14 UHD test sequences and settings described in Section V-C. The proposed precoding, iSIZE, in conjunction with H.264/AVC, H.265/HEVC and VP9 is evaluated against standalone H.264/AVC, H.265/HEVC and VP9 encodings as well as commercial solutions including AWS MediaConvert (AVC and HEVC) and AWS Elastic Transcoder (VP9).

        AVC + iSIZE        HEVC + iSIZE       VP9 + iSIZE
        ΔR       ΔP        ΔR       ΔP        ΔR       ΔP
  AVC  -52.30%  4.17dB      -        -         -        -
  HEVC   -       -        -17.76%  0.58dB      -        -
  VP9    -       -          -        -       -48.82%  2.03dB
  AWS  -47.25%  3.77dB      -        -       -36.50%  2.02dB

TABLE VII
Average BD-rate (ΔR) and BD-VMAF (ΔV) for 14 UHD test sequences and settings described in Section V-C. The proposed precoding, iSIZE, in conjunction with H.264/AVC, H.265/HEVC and VP9 is evaluated against standalone H.264/AVC, H.265/HEVC and VP9 encodings as well as commercial solutions including AWS MediaConvert (AVC and HEVC) and AWS Elastic Transcoder (VP9).

        AVC + iSIZE      HEVC + iSIZE     VP9 + iSIZE
        ΔR       ΔV      ΔR       ΔV      ΔR       ΔV
  AVC  -46.58%  15.52     -        -       -        -
  HEVC   -        -     -19.68%   4.27     -        -
  VP9    -        -       -        -     -33.32%   5.82
  AWS  -68.70%  35.11     -        -     -67.77%  16.65

⁵AWS tools do not support H.265/HEVC, so no corresponding benchmark is presented for that encoder from implementations external to FFmpeg. However, libx265 is well recognized as a state-of-the-art implementation and is frequently used in encoding benchmarks [48].
⁶https://docs.aws.amazon.com/elastictranscoder/latest/developerguide/preset-settings.html

D. Further Comparisons and Discussion

1) Evaluation of adaptive precoding on HD (720p) content: To examine the performance of our approach for lower-resolution inputs, we carried out a smaller-scale evaluation of our proposed adaptive precoding method on four 1280 × 720 HD video sequences, namely ducks take off, in to tree, old town cross and park joy from the XIPH collection. The average BD-rate gains of our approach versus standalone H.264/AVC encoding, for bitrates in the region of 0.5-2.5Mbps and the same settings as in Section V-C, were found to be 3.01% and 0.11dB for PSNR, and 2.27% and 0.68 for VMAF. These gains are modest in comparison to those obtained for FHD and UHD content, which indicates that our proposal is more suitable for high-resolution content.

2) Evaluation of adaptive precoding in conjunction with the current VVC Test Model (VTM): Beyond comparisons with existing standards, to show that our proposed deep video precoding can offer gains even over upcoming video coding standards, we evaluate our method in conjunction with the current VTM (version 6.2rc1).
As the utilized VTM is very slow (2-10 minutes per frame on single-core CPU execution), for this evaluation we limited our tests to seven FHD sequences from the XIPH collection (aspen, controlled burn, old town cross, crowd run, rush field cuts, touchdown, tractor) and two bitrates: 1.8Mbps and 3Mbps. The utilized settings were as provided in the VTM software⁷, with the default rate control enabled and set to the target bitrates, and with IntraPeriod set to 64 frames. The results are summarized in Table VIII. We can see that a rate reduction of 8%-9% is offered at a slight increase of PSNR and VMAF. This first evaluation illustrates that our proposal can achieve gains even against the current VTM model of the Joint Video Experts Team. We note that the VVC standard is still under development, and these results are expected to change depending on the features included in the final version of the standard.

3) Evaluation against other CNN-based downscaling frameworks: To compare the core CNN designs of our precoding framework against recent work on CNN-based downscaling, we investigate the performance of our proposed precoding network under a fixed downscaling ratio of s = 2 followed by bilinear upscaling on four standard test image datasets: Set5, Set14, BSDS100 and Urban100. Our benchmarks comprise bicubic and CNN-based [15] downscaling coupled with bicubic and Lanczos upscaling filters, which are more complex than our bilinear upscaling. As summarized in Table IX, with the exception of Li et al.
[15] on BSDS100 followed by Lanczos upscaling, our approach outperforms all reference methods on all datasets, and achieves this result with the very lightweight bilinear upscaling at the client side. In addition, our framework has significantly lower complexity than Li et al. [15]: for a 1920 × 1080 input frame and s = 2, our precoding network requires only 3.38G MACs and 5.5K parameters over all scales, compared to the 153G MACs and 30.6K parameters required by Li et al. [15].

4) Impact of edge preservation loss: The purpose of the edge preservation loss in Eq. (2) is to ensure structural preservation for the non-integer downscaling ratios that utilize bilinear downscaling instead of a stride. We explore the impact of the edge preservation loss by computing average PSNR and SSIM over the DIV2K validation set for λ ∈ {0, 0.5, 2, 5} and indicative scaling factors. The results are reported in Table X. Without the edge preservation loss (λ = 0), the average PSNR and SSIM for scaling factor s = 3/2, which utilizes bilinear downscaling, are 35.86dB and 0.957, respectively. Increasing λ to 0.5, the PSNR and SSIM increase to 36.6dB and 0.962, respectively. A similar increase is exhibited for scaling factor s = 4/3, where PSNR increases from 37.96dB to 38.23dB. However, we notice that as the weight increases, the metrics saturate, and there can be a detrimental effect on integer ratios, such as s = 4, where the mean absolute error (MAE) and the feed-through from higher scaling factors suffice to ensure fidelity to the input frame. Therefore, in order to ensure a balance in performance over all scaling factors, we set λ = 0.5 in all our results.

⁷https://vcgit.hhi.fraunhofer.de/jvet/VVCSoftware_VTM

Fig. 9. Performance comparison of H.264/AVC encoding with proposed adaptive precoding versus standalone H.264/AVC encoding and the AWS MediaConvert H.264/AVC encoder (QVBR mode) on FHD content: (a) PSNR and (b) VMAF.

Fig. 10. Performance comparison of H.264/AVC encoding with proposed adaptive precoding versus standalone H.264/AVC encoding and the AWS MediaConvert H.264/AVC encoder (QVBR mode) on UHD content: (a) PSNR and (b) VMAF.

Fig. 11. Performance comparison of H.265/HEVC encoding with proposed adaptive precoding versus standalone H.265/HEVC encoding on FHD content: (a) PSNR and (b) VMAF.

Fig. 12. Performance comparison of H.265/HEVC encoding with proposed adaptive precoding versus standalone H.265/HEVC encoding on UHD content: (a) PSNR and (b) VMAF.

Fig. 13. Performance comparison of VP9 encoding with proposed adaptive precoding versus standalone VP9 encoding and the AWS Elastic Transcoder VP9 encoder on FHD content: (a) PSNR and (b) VMAF.

Fig. 14. Performance comparison of VP9 encoding with proposed adaptive precoding versus standalone VP9 encoding and the AWS Elastic Transcoder VP9 encoder on UHD content: (a) PSNR and (b) VMAF.

TABLE VIII
Average PSNR, VMAF and obtained bitrate for VVC encoding (VTM v.6.2rc1) with proposed precoding versus standalone VVC encoding on FHD content.

                  VVC + iSize           VVC
  PSNR           27.22dB  28.18dB   27.15dB  28.04dB
  VMAF           94.06    96.57     91.48    95.72
  Bitrate (Mbps)  1.636    2.748     1.796    2.988
E. Runtime Analysis for Cloud-based VOD Encoding

Since VOD encoding configurations are typically deployed over a cloud implementation, it is of interest to benchmark the complexity of encoding with our deep video precoding modes versus the corresponding plain encoder. Table XI presents benchmarks for typical precoding scales on an AWS t2.2xlarge instance, with each precoding and encoding running in two threads on two of the Intel Xeon 3.3GHz CPUs. The results correspond to the average processing of each GOP (comprising 90 frames and excluding I/O). By comparing the standard encoding time per video coding standard (scale 1) with our precoding and encoding time for the remaining scales, we see that, as the scale increases, the encoding time reduces by up to a factor of five, while the precoding time remains quasi-constant. The rate savings versus downscaling by these factors using bicubic and Lanczos filters are shown in Table II and Table III (for FHD content). These results indicate that, especially for complex encoding standards like H.265/HEVC and VP9 that require long encoding times for high-quality VOD streaming systems, combining precoding with the appropriate ratio may allow for a more efficient realization on a cloud platform, with a substantial reduction in rate (or improvement in quality) compared to using linear filters. In this context, precoding effectively acts as a data-driven pre-encoding compaction mechanism in the pixel domain, which allows for accelerated encoding, with the client linearly upscaling to the full resolution and producing high-quality video.

VI. CONCLUSION

We propose the concept of deep video precoding based on convolutional neural networks, with the current focus being on downscaling-based compaction under DASH/HLS adaptation.
A key aspect of our approach is that it remains backward compatible with existing systems and does not require any change to the streaming, decoder, and display components of a VOD solution. Given that our approach does not alter the encoding process, it offers an additional optimization dimension going beyond content-adaptive encoding and codec parameter optimization. Indeed, experiments show that it brings benefits on top of such well-known optimizations: under high-performing two-pass and VBV-based FHD and UHD video encoding, our precoding offers 8%-52% bitrate saving versus leading AVC/H.264, HEVC and VP9 encoders, with lower gains offered for HD (720p) content. An early-stage evaluation against the VVC Test Model v6.2rc1 showed that our approach may also be beneficial for advanced encoding frameworks currently under consideration by video coding standardization bodies. In addition, a comparison against a state-of-the-art CNN-based downscaling framework paired with bicubic or Lanczos upscaling showed that our proposal offers a better downscaler even when the less-complex bilinear filter is used for upscaling at the client side. The compaction features of our solution ensure that not only is bitrate saved, but also that video encoding complexity reduction can be achieved, especially for HEVC and VP9 VOD encoding. Future work will consider how to extend the notion of precoding beyond adaptive streaming systems by learning to adaptively preprocess video inputs such that they are optimally recovered by current decoders under specified perceptual quality metrics.

REFERENCES

[1] I. Sodagar, “The MPEG-DASH standard for multimedia streaming over the internet,” IEEE Multimedia, vol. 18, no. 4, pp. 62–67, 2011.
[2] D. Weinberger, “Choosing the right video bitrate for streaming HLS and DASH,” Feb. 2015. [Online]. Available: https://bitmovin.com/video-bitrate-streaming-hls-dash/
[3] I. Katsavounidis and L.
Guo, “Video codec comparison using the dynamic optimizer framework,” in Applications of Digital Image Processing XLI, vol. 10752, International Society for Optics and Photonics. SPIE, 2018, pp. 266–281.
[4] M. Haris, G. Shakhnarovich, and N. Ukita, “Deep back-projection networks for super-resolution,” in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 1664–1673.
[5] B. Lim, S. Son, H. Kim, S. Nah, and K. M. Lee, “Enhanced deep residual networks for single image super-resolution,” in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition Workshops (CVPRW), July 2017, pp. 1132–1140.
[6] D. Sculley, G. Holt, D. Golovin, E. Davydov, T. Phillips, D. Ebner, V. Chaudhary, M. Young, J.-F. Crespo, and D. Dennison, “Hidden technical debt in machine learning systems,” in Advances in Neural Information Processing Systems, 2015, pp. 2503–2511.

TABLE IX
EVALUATION OF THE PROPOSED PRECODING NEURAL NETWORK COUPLED WITH BILINEAR UPSCALING FILTER AGAINST BICUBIC AND CNN-BASED DOWNSCALING METHODS PAIRED WITH DIFFERENT UPSCALING FILTERS IN TERMS OF RECONSTRUCTION QUALITY (PSNR) ON STANDARD IMAGE DATASETS FOR SCALING FACTOR s = 2.

Downscaling:  Bicubic   CNN-CR-sep [15]  Bicubic   CNN-CR-sep [15]  Bicubic   CNN-CR-sep [15]  Our precoding
Upscaling:    Bilinear  Bilinear         Bicubic   Bicubic          Lanczos   Lanczos          Bilinear
Set5          32.26     33.55            33.67     33.36            33.72     33.37            34.81
Set14         28.97     30.27            30.01     30.30            30.05     30.32            30.81
BSDS100       28.70     29.80            29.57     29.86            29.61     29.87            29.67
Urban100      25.60     26.99            26.56     27.09            26.60     27.11            27.65
Average       27.38     28.63            28.32     28.70            28.36     28.72            28.94

TABLE X
EFFECT OF THE EDGE PRESERVATION LOSS ON PSNR/SSIM FOR INDICATIVE SCALING FACTORS ON THE DIV2K VALIDATION DATASET.

        λ = 0               λ = 0.5             λ = 2               λ = 5
 s      PSNR     SSIM       PSNR     SSIM       PSNR     SSIM       PSNR     SSIM
 4/3    37.96dB  0.97247    38.23dB  0.97384    38.21dB  0.97367    38.13dB  0.97331
 3/2    35.86dB  0.95689    36.61dB  0.96223    36.61dB  0.96216    36.60dB  0.96211
 3      29.88dB  0.84640    29.82dB  0.84634    29.77dB  0.84569    29.74dB  0.84548
 4      28.47dB  0.79431    28.38dB  0.79346    28.28dB  0.79204    28.19dB  0.79088

TABLE XI
AVERAGE RUNTIME (MSEC PER GOP ON CPU) FOR REPRESENTATIVE DOWNSCALING FACTORS AND MULTIPLE ENCODING BITRATES.

        Precoding        AVC              HEVC              VP9
 s      FHD    UHD       FHD    UHD      FHD     UHD       FHD    UHD
 1      0      0         1451   4928     11002   54016     3072   594400
 3/2    671    4067      745    1982     7634    32565     2107   318445
 2      775    4468      529    1854     7476    31786     1763   221522
 5/2    703    4408      358    1008     6292    14692     1015   156259

[7] A. Wiesel, Y. C. Eldar, and S. Shamai, “Linear precoding via conic optimization for fixed MIMO receivers,” IEEE Trans. on Signal Processing, vol. 54, no. 1, pp. 161–176, Jan 2006.
[8] C. Weidmann and M. Vetterli, “Rate distortion behavior of sparse sources,” IEEE Transactions on Information Theory, vol. 58, no. 8, pp. 4969–4992, 2012.
[9] Y. Tsaig, M. Elad, P. Milanfar, and G. H. Golub, “Variable projection for near-optimal filtering in low bit-rate block coders,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 15, no. 1, pp. 154–160, 2005.
[10] J. Kopf, A. Shamir, and P. Peers, “Content-adaptive image downscaling,” ACM Trans. on Graphics (TOG), vol. 32, no. 6, p. 173, 2013.
[11] A. C. Öztireli and M. Gross, “Perceptually based downscaling of images,” ACM Trans. on Graphics (TOG), vol. 34, no. 4, p. 77, 2015.
[12] C. G. Bampis, Z. Li, I. Katsavounidis, T.-Y. Huang, C. Ekanadham, and A. C. Bovik, “Towards perceptually optimized end-to-end adaptive video streaming,” arXiv preprint arXiv:1808.03898, 2018.
[13] Z. Li, A. Aaron, I. Katsavounidis, A. Moorthy, and M. Manohara, “Toward a practical perceptual video quality metric,” The Netflix Tech Blog, vol. 6, 2016.
[14] G.
Bjontegaard, “Calculation of average PSNR differences between RD-curves,” VCEG-M33, 2001.
[15] Y. Li, D. Liu, H. Li, L. Li, Z. Li, and F. Wu, “Learning a convolutional neural network for image compact-resolution,” IEEE Trans. on Image Processing, vol. 28, no. 3, pp. 1092–1107, March 2019.
[16] G. Georgis, G. Lentaris, and D. Reisis, “Reduced complexity superresolution for low-bitrate video compression,” IEEE Trans. on Circuits and Systems for Video Technology, vol. 26, no. 2, pp. 332–345, 2016.
[17] J. Kim, J. Kwon Lee, and K. Mu Lee, “Accurate image super-resolution using very deep convolutional networks,” in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 1646–1654.
[18] B. Lim, S. Son, H. Kim, S. Nah, and K. Mu Lee, “Enhanced deep residual networks for single image super-resolution,” in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2017, pp. 136–144.
[19] C. Dong, C. C. Loy, and X. Tang, “Accelerating the super-resolution convolutional neural network,” in Proc. of the European Conference on Computer Vision (ECCV). Springer, 2016, pp. 391–407.
[20] J. Kim, J. Kwon Lee, and K. Mu Lee, “Deeply-recursive convolutional network for image super-resolution,” in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 1637–1645.
[21] M. Afonso, F. Zhang, and D. R. Bull, “Video compression based on spatio-temporal resolution adaptation,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 29, no. 1, pp. 275–280, 2018.
[22] Y. Li, D. Liu, H. Li, L. Li, F. Wu, H. Zhang, and H. Yang, “Convolutional neural network-based block up-sampling for intra frame coding,” IEEE Trans. on Circuits and Systems for Video Technology, vol. 28, no. 9, pp. 2316–2330, Sep. 2018.
[23] J. Lin, D. Liu, H. Yang, H. Li, and F.
Wu, “Convolutional neural network-based block up-sampling for HEVC,” IEEE Trans. on Circuits and Systems for Video Technology, 2018.
[24] K. Liu, D. Liu, H. Li, and F. Wu, “Convolutional neural network-based residue super-resolution for video coding,” in Proc. of IEEE Visual Communications and Image Processing (VCIP), Dec 2018.
[25] D. Liu, Y. Li, J. Lin, H. Li, and F. Wu, “Deep learning-based video coding: A review and a case study,” arXiv preprint, 2018.
[26] L. Theis, W. Shi, A. Cunnigham, and F. Huszár, “Lossy image compression with compressive autoencoders,” in Proc. of the Int. Conf. on Learning Representations (ICLR), 2017.
[27] O. Rippel and L. Bourdev, “Real-time adaptive image compression,” in Proc. Int. Conf. on Machine Learning (ICML), vol. 70, Aug. 2017, pp. 2922–2930.
[28] A. Shocher, N. Cohen, and M. Irani, ““Zero-shot” super-resolution using deep internal learning,” in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3118–3126.
[29] N. Weber, M. Waechter, S. C. Amend, S. Guthe, and M. Goesele, “Rapid, detail-preserving image downscaling,” ACM Trans. on Graphics (TOG), vol. 35, no. 6, p. 205, 2016.
[30] X. Hou, J. Duan, and G. Qiu, “Deep feature consistent deep image transformations: Downscaling, decolorization and HDR tone mapping,” arXiv preprint arXiv:1707.09482, 2017.
[31] H. Kim, M. Choi, B. Lim, and K. Mu Lee, “Task-aware image downscaling,” in Proc. of the European Conference on Computer Vision (ECCV), 2018, pp. 399–414.
[32] F. Jiang, W. Tao, S. Liu, J. Ren, X. Guo, and D. Zhao, “An end-to-end compression framework based on convolutional neural networks,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 28, no. 10, pp. 3007–3018, Oct 2018.
[33] O. Rippel, S. Nair, C. Lew, S. Branson, A. G. Anderson, and L. Bourdev, “Learned video compression,” arXiv preprint arXiv:1811.06981, 2018.
[34] A. Asan, I.-H.
Mkwawa, L. Sun, W. Robitza, and A. C. Begen, “Optimum encoding approaches on video resolution changes: A comparative study,” in 2018 25th IEEE International Conference on Image Processing (ICIP). IEEE, 2018, pp. 1003–1007.
[35] C. Dong, C. C. Loy, K. He, and X. Tang, “Image super-resolution using deep convolutional networks,” IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 38, no. 2, pp. 295–307, 2016.
[36] W.-S. Lai, J.-B. Huang, N. Ahuja, and M.-H. Yang, “Deep Laplacian pyramid networks for fast and accurate super-resolution,” in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 624–632.
[37] Y. Tai, J. Yang, and X. Liu, “Image super-resolution via deep recursive residual network,” in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 3147–3155.
[38] Y. Tai, J. Yang, X. Liu, and C. Xu, “MemNet: A persistent memory network for image restoration,” in Proc. of the IEEE Int. Conf. on Computer Vision (ICCV), 2017, pp. 4539–4547.
[39] K. He, X. Zhang, S. Ren, and J. Sun, “Identity mappings in deep residual networks,” in Proc. of the European Conf. on Computer Vision (ECCV). Springer, 2016, pp. 630–645.
[40] ——, “Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification,” in Proc. of the IEEE Int. Conf. on Computer Vision (ICCV), ser. ICCV ’15. Washington, DC, USA: IEEE Computer Society, 2015, pp. 1026–1034.
[41] ——, “Deep residual learning for image recognition,” in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
[42] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in Proc. of the 13th Int. Conf. on Artificial Intelligence and Statistics, 2010, pp. 249–256.
[43] E. Agustsson and R. Timofte, “NTIRE 2017 challenge on single image super-resolution: Dataset and study,” in Proc. of the IEEE Conf.
on Computer Vision and Pattern Recognition Workshops, 2017, pp. 126–135.
[44] S. Ma, W. Gao, and Y. Lu, “Rate-distortion analysis for H.264/AVC video coding and its application to rate control,” IEEE Trans. on Circuits and Systems for Video Technology, vol. 15, no. 12, pp. 1533–1544, Dec 2005.
[45] A. Ortega and K. Ramchandran, “Rate-distortion methods for image and video compression,” IEEE Signal Processing Magazine, vol. 15, no. 6, pp. 23–50, Nov 1998.
[46] “FFmpeg,” https://trac.ffmpeg.org/wiki/Encode/H.264#twopass.
[47] “FFmpeg,” https://trac.ffmpeg.org/wiki/Encode/H.264#AdditionalInformationTips.
[48] J. D. Cock, A. Mavlankar, A. Moorthy, and A. Aaron, “A large-scale video codec comparison of x264, x265 and libvpx for practical VOD applications,” in Proc. of SPIE Applications of Digital Image Processing XXXIX, vol. 9971, 2016.
[49] “FFmpeg,” https://trac.ffmpeg.org/wiki/Encode/VP9.