A Mamba-based Perceptual Loss Function for Learning-based UGC Transcoding
Zihao Qi†, Chen Feng†, Fan Zhang†, Xiaozhong Xu§, Shan Liu§ and David Bull†
†Visual Information Laboratory, University of Bristol, Bristol, UK, BS1 5DD
{zihao.qi, chen.feng, fan.zhang, dave.bull}@bristol.ac.uk
§Tencent Media Lab, Palo Alto, CA 94306, USA
{xiaozhongxu, shanl}@tencent.com

Abstract: In user-generated content (UGC) transcoding, source videos typically suffer various degradations due to prior compression, editing, or suboptimal capture conditions. Consequently, existing video compression paradigms that solely optimize for fidelity relative to the reference become suboptimal, as they force the codec to replicate the inherent artifacts of the non-pristine source. To address this, we propose a novel perceptually inspired loss function for learning-based UGC video transcoding that redefines the role of the reference video, shifting it from a ground-truth pixel anchor to an informative contextual guide. Specifically, we train a lightweight neural quality model based on a Selective Structured State-Space Model (Mamba), optimized using a weakly-supervised Siamese ranking strategy. The proposed model is then integrated into the rate-distortion optimization (RDO) process of two neural video codecs (DCVC and HiNeRV) as a loss function, aiming to generate reconstructed content with improved perceptual quality. Our experiments demonstrate that this framework achieves substantial coding gains over both autoencoder and implicit neural representation-based baselines, with 8.46% and 12.89% BD-rate savings, respectively.

Index Terms: Perceptual loss, UGC transcoding, learning-based video coding, neural video codecs.

I. INTRODUCTION

The explosive growth of user-generated content (UGC) in recent years has fundamentally reshaped Internet video traffic.
Industry reports [1] indicate that more than 500 hours of video are uploaded worldwide every minute, serving a user base that exceeds 1.06 billion. Unlike professionally-generated content (PGC), which originates from high-end capture devices with pristine master copies, UGC is typically captured by amateur users using mobile devices and undergoes lossy processing before it reaches a video sharing platform. As illustrated in Fig. 1, a source video (S) will typically be captured under suboptimal conditions and then compressed immediately on a mobile device to create a non-pristine reference (R). This reference is then uploaded to and transcoded (to produce a distorted video D) by UGC service providers, ready for streaming. It should be noted that the UGC transcoding process fundamentally differs from conventional coding as, in the former case, the input content (R) is inherently degraded by prior processing.

Fig. 1: Illustration of the UGC video delivery pipeline. The Source content captured by a user is directly compressed on the user device for storage, as non-pristine References. The latter is then uploaded onto UGC streaming platforms and further compressed into Distorted videos before being transmitted to the viewer.

Most existing compression frameworks (both conventional and learning-based) are fidelity-oriented, following a rate-distortion optimization (RDO) strategy with the primary objective of minimizing the difference between the reconstructed video and the input reference [2]. While this assumption is valid in cases where reference videos are pristine, it introduces a critical problem in UGC transcoding, because the reference video R often contains visible compression artifacts, noise, and/or editing flaws.

(The authors thank the funding from Tencent (US), University of Bristol, and the UKRI MyWorld Strength in Places Programme (SIPF00006/1).)
The RDO process will force the codec to replicate these distortions, therefore resulting in low-quality reconstructions [3].

To address this issue, based on recent advances in quality assessment for UGC transcoding [4], we propose a paradigm shift in the optimization objective for UGC transcoding. Here, we i) use the non-pristine reference as semantic context rather than as a pixel-level ground truth, and ii) introduce a perceptually-aware loss tailored for UGC transcoding. This new perceptual transcoding loss (PT-Loss) is based on a lightweight Selective Structured State-Space (Mamba) network, which has been trained via a weakly-supervised Siamese ranking strategy to predict perceptual quality degradation relative to the pristine source content, based only on non-pristine references and distorted video sequences. By integrating this loss into the transcoding process, a video codec can be guided to achieve improved coding gain over fidelity-based baselines. The main contributions of this paper are summarized below:

• Perceptual loss for transcoding: We propose a novel Mamba-based perceptual loss that treats the non-pristine reference as informative guidance rather than as a strict anchor. This allows the codec to deviate from reference fidelity in favor of perceptual enhancement.

• Model-agnostic generalization: We demonstrate the versatility of our loss by integrating it into two different neural video coding frameworks: an autoencoder-based codec (DCVC [5]) and an implicit neural representation (INR) based codec (HiNeRV [6]).

• Significant coding gains: Experimental results on a UGC transcoding dataset, BVI-UGC [3], show that both codecs with the proposed transcoding loss achieve consistent BD-rate gains (up to 12.89% BD-rate savings) over the original baselines. Qualitative results also confirm the performance improvement from the perceptual perspective.

II. RELATED WORKS

A. Video Compression

Conventional video compression remains widely used across most applications, with video coding standards evolving over the past few decades [7-10]. Recently, however, learning-based coding tools and end-to-end neural video codecs have demonstrated their potential to compete with, or even outperform, conventional video codecs. While early approaches such as DVC [11] replace individual modules in the traditional compression pipeline with neural networks, more recent approaches extend this idea by adopting more advanced architectures, including deep context modeling [5], hierarchical latent representations [11], and implicit neural representations [6, 12, 13]. It is noted that, for both conventional and learning-based methods, the underlying optimization principle remains the same: the codec optimizes rate-distortion performance based on fidelity-driven quality metrics, such as PSNR or SSIM, which are computed between the reconstructed and reference videos. While this optimization strategy has proven effective for high-quality reference videos, it has been widely reported to pose a fundamental limitation for UGC video transcoding [3]. In such a scenario, the reference video itself already contains various visible artifacts, causing fidelity-driven optimization to preserve and even reinforce undesired distortions that exist in the reference content.

B. Video Quality Assessment

Objective video quality assessment (VQA) methods are invariably used to guide the evaluation and optimization of video compression. Existing metrics can be divided into two major categories according to the availability of reference content: full-reference (FR) and no-reference (NR)¹. No-reference (NR) VQA metrics predict video quality without access to the reference. While effective for general quality estimation, it is difficult for NR models to measure fine-grained degradations relative to a source video [3].
Full-reference (FR) VQA, on the other hand, explicitly compares a video to its reference and has long been used to guide rate-distortion optimization in compression. Simple metrics like PSNR and SSIM [14, 15] provide straightforward fidelity measurements, while perceptually inspired metrics, such as VMAF [16], incorporate human visual system characteristics to better align with subjective judgments. Recent deep learning-based FR models, including DeepVQA [17], LPIPS [18], C3DVQA [19], and Compressed UGC VQA [20], as well as unsupervised or weakly-supervised approaches like RankDVQA [21], have shown improved performance and generalization across datasets. However, existing FR methods have inherent limitations: they are optimized solely to approximate the reference, so the compression process guided by these metrics tends to reproduce the distortions in the reference.

¹Although reduced reference (RR) VQA methods do exist in the literature, they are not commonly deployed in practical applications of video coding.

C. Perceptual Optimization

Optimizing compression towards the reference does not always lead to high perceptual quality of the reconstructed videos. Several prior works have explored perceptual quality optimization by mitigating artifacts or enhancing visual details with deep learning-based methods. In early works, generative adversarial networks were employed to assist in intra prediction, reducing spatial redundancy [22-25], and to reconstruct high-frequency details [26]. Beyond architectural or module-level enhancements, some contributions modify the optimization objectives by introducing perceptually-aware losses instead of relying solely on reference-fidelity distortion measures such as PSNR and SSIM [27-30]. While these methods have been shown to be effective in improving the perceptual quality of PGC videos, they are neither specially designed for UGC nor tailored to address the challenges faced during transcoding.
In particular, recent work [31] introduces a quality saturation detection framework for UGC compression that incorporates perceptual considerations into the rate-distortion optimization process. While such approaches demonstrate the potential of perceptually guided compression, their effectiveness depends strongly on the quality of the input content.

III. METHOD

To address the issues mentioned above in UGC transcoding, we propose a perceptually inspired loss function, PT-Loss, tailored specifically for transcoding scenarios. This loss is designed to process both the non-pristine reference content fed into the transcoding codec and the reconstructed output produced by the neural codec, as illustrated in Fig. 2(A).

A. Network Architecture

To serve as a practical loss function, the quality model should satisfy two requirements: (i) high sensitivity to perceptual artifacts and (ii) low computational overhead to minimize training latency. The network architecture for the proposed loss function is illustrated in Fig. 2(B). Following [4, 21], the network extracts features at multiple levels from two co-located reference and distorted spatio-temporal patches, R and D, with a resolution of 256×256×12. In contrast to [4, 21], considering the second requirement mentioned above, we replace the prior Swin Transformer-based design [21] with a Selective Structured State-Space Model (Mamba) [32] in order to efficiently learn spatio-temporal correlations by decoupling spatial feature extraction and temporal selective scanning. This enables linear complexity O(L), while effectively capturing long-range temporal dependencies, which provides the capability to distinguish temporal consistency from flickering artifacts. This makes it particularly suitable for use as a perceptual loss under extended temporal context and significantly reduces inference overhead compared to attention-based designs [33]. The network then fuses the multi-level features and outputs a patch-level quality index. The quality indices of all the patches in the video are then aggregated into a clip-level score via a differentiable pooling layer.

Fig. 2: (A) Illustration of the PT-Loss integration into both DCVC and HiNeRV codecs. (B) The architecture of the PT-Loss network.

B. Training Strategy

In order to train the PT-Loss function and allow it to use a distorted reference R as the optimization context rather than as a ground truth in the transcoding process, the loss must distinguish perceptual degradations between the (in practice absent) pristine source S and the distorted content D, rather than calculating the fidelity to the non-pristine reference R. To achieve this objective, we follow the learning strategy in [4] and employ a weakly supervised, ranking-based training strategy to optimize the proposed PT-Loss. Similarly, due to the lack of large-scale human-annotated datasets developed for UGC transcoding, we produce training patch pairs while keeping their corresponding pristine sources, which are used to generate reliable quality labels using proxy metrics [4]. Specifically, distorted reference videos R and transcoded videos D are generated from pristine source videos S through a two-stage compression process, simulating realistic UGC delivery. The quality label of each transcoded video D is computed using VMAF [34] between its corresponding source S (rather than R) and D.
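The ranking-based training signal described above can be sketched as follows. This is a minimal NumPy illustration of a hinge-style ranking objective over Siamese score pairs with VMAF proxy labels; the margin and the label-difference threshold used here are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def ranking_loss(score_a: np.ndarray, score_b: np.ndarray,
                 vmaf_a: np.ndarray, vmaf_b: np.ndarray,
                 margin: float = 1.0, threshold: float = 6.0) -> float:
    """Hinge-style ranking loss over a batch of Siamese patch pairs.

    Pairs whose proxy (VMAF) labels differ by less than `threshold`
    are treated as ambiguous and skipped; for the remaining pairs,
    the predicted score of the higher-quality patch should exceed
    the other score by at least `margin`.
    """
    sign = np.sign(vmaf_a - vmaf_b)               # +1 if patch A is better
    valid = np.abs(vmaf_a - vmaf_b) >= threshold  # keep confident pairs only
    hinge = np.maximum(0.0, margin - sign * (score_a - score_b))
    return float(np.mean(hinge[valid])) if valid.any() else 0.0
```

A pair that is already ranked correctly by more than the margin contributes zero loss, so training focuses on pairs the model currently mis-orders.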
Although the annotations are not obtained via direct subjective scoring, their reliability has been confirmed in prior studies [4, 21] when used in a ranking-inspired learning process. Essentially, this process distills source-aware knowledge from a full-reference metric (VMAF) into a proxy network that relies solely on the non-pristine reference. In this work, a total of 252 pristine source sequences were used, extracted from the BVI-DVC dataset [35], the CVPR 2022 CLIC challenge training dataset [36] and the YouTube-UGC 2K database [37], alongside six self-captured videos to further extend UGC content diversity. These source sequences S were first compressed using the fast implementation of H.264/AVC [7], x264² [38], with QP = 30, 37 and 42 to produce non-pristine references R. The latter were then transcoded using three different codecs (x264, AV1, VP9 [39]) with quantization parameters 32-42 (x264) at an interval of 2 and 38-63 (AV1, VP9) at an interval of 5. The generated reference and distorted video pairs were then randomly cropped into 256×256×12 patch pairs, which were annotated using the method mentioned above. This results in a total of 604,800 patch pairs that are used to train the proposed PT-Loss via the Siamese ranking-based learning strategy proposed in [3, 21]. During training, we sampled single-source and cross-source patch pairs at an 8:2 ratio to encourage the model to learn robust, content-invariant perceptual features. After convergence, the model is further fine-tuned using only single-source patch pairs, as accurately modeling quality differences within the same content is more critical for the loss function than cross-content ranking performance.

²It is noted that we only used conventional codecs to produce the training material due to their fast coding speeds. Further improvement may be achieved if neural codecs are used here, but this remains future work.

C. Integration into Neural Video Codecs

To validate the effectiveness of the proposed loss function, we integrated it into two different neural video codecs as the basis for rate-quality optimization: an autoencoder-based codec, DCVC [5], and an overfitting (or implicit neural representation based) codec, HiNeRV [6], as shown in Fig. 2(A).

Specifically, DCVC adopts a learning-based inter-frame predictive coding paradigm, which employs a pre-trained image codec for the independent spatial compression of anchor frames and a specialized P-frame model that leverages inter-frame correlations. To integrate the proposed perceptual loss, we focus exclusively on fine-tuning the P-frame model on the training split of the Vimeo-90k septuplet dataset [40], while keeping other modules fixed. The total loss is a combination of the original MSE loss and the proposed PT-Loss, L_PT:

L_total = (1 − α) L_MSE + α L_PT.   (1)

TABLE I: BD-rate (measured in PSNR/VMAF) results on the BVI-UGC dataset across different reference groups. Negative values indicate bitrate savings compared to the baseline. The complexity is computed by taking the original video codecs as baselines. Since HiNeRV contains multiple scales of models, its complexities are presented as Smallest-Largest models.
| Codec           | Low PSNR | Low VMAF | Medium PSNR | Medium VMAF | High PSNR | High VMAF | Overall PSNR | Overall VMAF | MACs     | Params      | EncT   |
|-----------------|----------|----------|-------------|-------------|-----------|-----------|--------------|--------------|----------|-------------|--------|
| DCVC            | 0.00%    | 0.00%    | 0.00%       | 0.00%       | 0.00%     | 0.00%     | 0.00%        | 0.00%        | 2268G    | 7.94M       | 100.0% |
| +[4]            | -8.54%   | -12.19%  | -7.22%      | -7.97%      | -7.91%    | -6.29%    | -7.90%       | -7.93%       | +0.0G    | +0.0M       | 100.0% |
| +PT-Loss (ours) | -12.98%  | -12.49%  | -8.11%      | -9.21%      | -7.84%    | -6.93%    | -8.46%       | -8.64%       | +0.0G    | +0.0M       | 100.0% |
| HiNeRV          | 0.00%    | 0.00%    | 0.00%       | 0.00%       | 0.00%     | 0.00%     | 0.00%        | 0.00%        | 43-718G  | 0.76-12.8M  | 100.0% |
| +[4]            | -15.95%  | -16.18%  | -13.46%     | -11.67%     | -10.51%   | -7.76%    | -12.70%      | -10.74%      | +5.2G    | +4.6M       | 195.0% |
| +PT-Loss (ours) | -16.15%  | -16.46%  | -13.44%     | -11.72%     | -10.65%   | -7.83%    | -12.89%      | -10.83%      | +1.4G    | +0.8M       | 121.3% |

Here α is a hyperparameter to balance L_MSE and L_PT. To address the magnitude discrepancy between the different objectives, we align the magnitude of L_PT to that of L_MSE adaptively by calculating the ratio of their mean values over the first five training iterations. The weighting coefficient α is then applied to balance their contributions.

In contrast to DCVC, HiNeRV is an implicit neural representation (INR) based video codec that represents an entire video as a continuous neural representation and performs compression by overfitting the network parameters to a target video sequence. During the per-video overfitting, the original HiNeRV optimizes a combined loss consisting of ℓ1 and MS-SSIM. In our integration, we replaced the MS-SSIM loss with the proposed PT-Loss, as given below:

L_total = (1 − α) L_1 + α L_PT.   (2)

IV. RESULTS AND DISCUSSION

In order to evaluate the two neural video codecs in conjunction with the proposed loss function, we conducted an experiment using the BVI-UGC dataset [3], which is specifically designed for UGC video transcoding.
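The adaptive magnitude alignment used in the combined losses of Eqs. (1) and (2) can be sketched as follows. This is an illustrative sketch, not the authors' implementation; the class name and interface are hypothetical, and only the weighting scheme (freeze a scale from the mean-loss ratio over the first few iterations, then blend with α) follows the paper.

```python
import numpy as np

class BalancedLoss:
    """Combine a fidelity loss and the PT loss as
    (1 - alpha) * fidelity + alpha * scale * pt, where `scale` aligns
    the PT magnitude to the fidelity magnitude using the ratio of
    their mean values over the first `warmup_steps` iterations."""

    def __init__(self, alpha: float = 0.2, warmup_steps: int = 5):
        self.alpha = alpha
        self.warmup_steps = warmup_steps
        self.fid_hist, self.pt_hist = [], []
        self.scale = 1.0

    def __call__(self, fidelity: float, pt: float) -> float:
        if len(self.fid_hist) < self.warmup_steps:
            self.fid_hist.append(fidelity)
            self.pt_hist.append(pt)
            if len(self.fid_hist) == self.warmup_steps:
                # freeze the alignment scale once warmup is complete
                self.scale = np.mean(self.fid_hist) / (np.mean(self.pt_hist) + 1e-12)
        return (1.0 - self.alpha) * fidelity + self.alpha * self.scale * pt
```

Freezing the scale after warmup keeps the two terms commensurate without letting the balance drift as training progresses.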
The dataset contains 60 uncompressed (HD) source videos spanning 15 common UGC categories, covering a wide range of real-world content characteristics, including various artifacts, motion blur, defocus, over- and under-exposure, and screen content, as well as both landscape and portrait formats. For each source sequence, the database also contains three non-pristine reference sequences (compressed by x264) at various quantization levels, QP = 30, 37 and 42, simulating high-, medium-, and low-quality reference conditions, respectively.

Due to the high computational cost associated with video compression, in this experiment we selected a subset of 15 source sequences from BVI-UGC, following the selection strategy proposed in [41] to ensure wide feature coverage and a uniform feature distribution across the database [42], with one representative video from each UGC category. BVI-UGC is chosen because it provides high-quality source videos and has been previously used in research on UGC transcoding [3]; other available UGC datasets [37, 43-46] do not provide high-quality source videos. The corresponding non-pristine references are used as the input to the transcoding process when using the tested video codecs. This results in three groups of reference sequences, each containing 15 videos. When measuring the video quality of the transcoded content, instead of using the non-pristine references, the uncompressed source videos are employed as the anchor in order to ensure a reliable quality evaluation.

We compared the integrated video codecs, annotated as DCVC+PT-Loss and HiNeRV+PT-Loss, with their original counterparts (based on their original training losses), DCVC and HiNeRV. For both codecs, the weighting parameter α is set to 0.2. To measure video quality, we employ two quality metrics, PSNR and VMAF: the former focuses on pixel-wise differences, while the latter is a perceptual metric that correlates well with human perception.
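Of the two evaluation metrics, PSNR can be computed directly from the frame data, as in this minimal sketch (VMAF, by contrast, requires the libvmaf tool and its trained model):

```python
import numpy as np

def psnr(reference: np.ndarray, distorted: np.ndarray, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio (dB) between two frames of equal shape."""
    mse = np.mean((reference.astype(np.float64) - distorted.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    return 10.0 * np.log10(max_val ** 2 / mse)
```

In the evaluation above, `reference` would be a frame of the uncompressed source S, not the non-pristine reference R.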
As mentioned above, both quality metrics are calculated between the transcoded content and its corresponding uncompressed source (rather than the non-pristine reference input to compression) to ensure the reliability of the quality evaluation (see Fig. 1).

A. Quantitative Results

TABLE I summarizes the average BD-rate results between each integrated codec and its original baseline, with quality measured in PSNR and VMAF for all 15 tested source sequences. It can be observed that, by incorporating the proposed loss, significant overall coding gains have been achieved, with 8.46% (in PSNR) and 8.64% (in VMAF) BD-rate savings over DCVC, and 12.89% and 10.83% over HiNeRV. Specifically, the performance improvement is more evident when the quality of the input reference content is lower. This observation is consistent with the motivation of this work. It is also noted that, when the PT-Loss is integrated into HiNeRV, more significant coding gains are achieved than with DCVC. This may be due to the overfitting nature of INR-based codecs.

To further verify coding gains across the full bitrate range, we provide the rate-distortion (R-D) curves for both anchor codecs on the whole test database and the three reference groups, as shown in Fig. 3. It can be observed that the proposed loss function achieves consistent coding gains in most cases. Notably, these improvements are more pronounced in the low-quality reference groups.

As mentioned in Subsection III-A, one of the primary requirements for a loss function is low computational complexity.
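For reference, BD-rate figures such as those in TABLE I are conventionally computed by fitting each codec's rate-quality curve and integrating the gap between the fits over the overlapping quality range. The following is a common cubic-fit sketch of this calculation; it is an illustrative implementation, not necessarily the exact variant used by the authors.

```python
import numpy as np

def bd_rate(rate_anchor, q_anchor, rate_test, q_test) -> float:
    """Bjontegaard delta-rate (%): average bitrate difference of the
    test codec vs. the anchor over their overlapping quality range.
    Negative values indicate bitrate savings for the test codec."""
    lr_a, lr_t = np.log(rate_anchor), np.log(rate_test)
    p_a = np.polyfit(q_anchor, lr_a, 3)   # log-rate as a cubic in quality
    p_t = np.polyfit(q_test, lr_t, 3)
    lo = max(min(q_anchor), min(q_test))  # overlapping quality interval
    hi = min(max(q_anchor), max(q_test))
    int_a = np.polyval(np.polyint(p_a), hi) - np.polyval(np.polyint(p_a), lo)
    int_t = np.polyval(np.polyint(p_t), hi) - np.polyval(np.polyint(p_t), lo)
    avg_log_diff = (int_t - int_a) / (hi - lo)
    return float((np.exp(avg_log_diff) - 1.0) * 100.0)
```

The quality axis here can be PSNR or VMAF, matching the two columns reported in TABLE I.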
In TABLE I, we provide the complexity figures of encoders with and without the proposed PT-Loss, in terms of parameter count, MACs and relative encoding time (against the corresponding baselines). We did not include decoding complexity figures, as the loss function only applies in the encoding process, while the decoding complexity remains unchanged for both codecs. For the DCVC codec, as the perceptual loss is exclusively utilized as a training objective to supervise the offline optimization of the encoder and decoder, it does not contribute to the online inference process; the encoding complexity therefore also remains the same. For HiNeRV, due to the overfitting nature of the codec, the new loss function is used in the encoding (training) process, introducing a 21.3% encoding time overhead, a 0.8M parameter count increase, and 1.4G additional MACs.

Fig. 3: Overall and group-wise compression results on the BVI-UGC dataset across different reference groups (PSNR and VMAF against bits per pixel, for the high, medium, low and overall groups).

Fig. 4: Visual comparison between DCVC and DCVC+PT-Loss. Example 1: original reference QP42 (33.96dB), DCVC bpp=0.1701 (33.23dB), +PT-Loss bpp=0.1681 (34.33dB). Example 2: original reference QP37 (32.56dB), DCVC bpp=0.1165 (26.53dB), +PT-Loss bpp=0.1139 (26.66dB).

B. Qualitative Results

To further compare the perceptual quality between the proposed method and the baselines, we have also provided visual comparison examples in Fig. 4 and Fig. 5.
In each example, the baseline results and those based on PT-Loss share near-identical bitrates. It can be observed that for both baseline codecs, the integration of the proposed PT-Loss results in improved perceptual quality, particularly in structured details such as text and edges. These improvements are also verified by the higher PSNR/VMAF values measured against the original source sequences.

Fig. 5: Visual comparison between HiNeRV and HiNeRV+PT-Loss. Example 1: original reference QP42 (24.35dB), HiNeRV bpp=0.0999 (23.06dB), +PT-Loss bpp=0.1001 (23.83dB). Example 2: original reference QP42 (24.35dB), HiNeRV bpp=0.0502 (21.43dB), +PT-Loss bpp=0.0501 (22.04dB).

C. Ablation Study

To validate the effectiveness of the Mamba-based PT-Loss, we conducted a comparative study against the Swin Transformer-based transcoding quality model (denoted as RankDVQA-UGC) in [4], which was directly used as a training loss for both DCVC and HiNeRV. TABLE I reports the BD-rate results on the BVI-UGC dataset across different reference quality groups. For both DCVC and HiNeRV, integrating PT-Loss consistently yields higher overall BD-rate savings than using RankDVQA-UGC as the perceptual loss, with more evident coding gains in the low-quality reference group. More importantly, PT-Loss is associated with much lower computational complexity than RankDVQA-UGC, which makes it more suitable for use as a loss function. Specifically, RankDVQA-UGC, relying on a Swin Transformer-based fusion, incurs a complexity overhead of +5.2 GMACs and +4.6M parameters, while our Mamba-based PT-Loss reduces this to +1.4 GMACs and +0.8M parameters.

Fig. 6: Impact of the perceptual loss weight (α = 0.0, 0.1, 0.2, 0.3, 0.4) on overall compression performance (PSNR and VMAF against bits per pixel) when integrated into HiNeRV.
To investigate the impact of the perceptual loss weight on compression performance, we conducted an ablation study by varying the weight α in the total loss function. Specifically, we evaluated α = {0.0, 0.1, 0.2, 0.3, 0.4} in Equation 2 for HiNeRV, where α = 0.0 corresponds to the original HiNeRV setup without PT-Loss. As shown in Fig. 6, optimal rate-quality performance for both PSNR and VMAF is achieved when α = 0.2. This confirms our selection of α.

V. CONCLUSION

In this paper, we have proposed a perceptually-inspired loss specifically targeted at UGC transcoding, which is based on a lightweight, Mamba-based neural network. By redefining the non-pristine reference as a semantic guide rather than employing it as a rigid pixel anchor, our approach enables a host video codec to escape the fidelity trap. Extensive experiments on both autoencoder-based (DCVC) and overfitting-based (HiNeRV) architectures demonstrate that our loss achieves substantial coding gains (up to 12.89% BD-rate savings), particularly when encoding heavily degraded content. Future work should focus on exploring generative compression architectures explicitly designed to reconstruct video quality that surpasses that of the distorted reference.

REFERENCES

[1] IndustryResearch, "User-generated content (UGC) platforms market size, trends and forecast," 2024.
[2] D. Bull and F. Zhang, Intelligent Image and Video Compression: Communicating Pictures. Academic Press, 2021.
[3] Z. Qi, C. Feng, F. Zhang, X. Xu, S. Liu, and D. Bull, "BVI-UGC: A video quality database for user-generated content transcoding," arXiv preprint arXiv:2408.07171, 2024.
[4] Z. Qi, C. Feng, D. Danier, F. Zhang, X. Xu, S. Liu, and D. Bull, "Full-reference video quality assessment for user generated content transcoding," in 2024 Picture Coding Symposium (PCS). IEEE, 2024, pp. 1-5.
[5] J. Li, B. Li, and Y.
Lu, "Deep contextual video compression," Advances in Neural Information Processing Systems, vol. 34, pp. 18114-18125, 2021.
[6] H. M. Kwan, G. Gao, F. Zhang, A. Gower, and D. Bull, "HiNeRV: Video compression with hierarchical encoding-based neural representation," NeurIPS, vol. 36, pp. 72692-72704, 2023.
[7] T. Wiegand, G. J. Sullivan, G. Bjøntegaard, and A. Luthra, "Overview of the H.264/AVC video coding standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 560-576, 2003.
[8] G. J. Sullivan, J.-R. Ohm, W.-J. Han, and T. Wiegand, "Overview of the high efficiency video coding (HEVC) standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1649-1668, 2012.
[9] B. Bross, Y.-K. Wang, Y. Ye, S. Liu, J. Chen, G. J. Sullivan, and J.-R. Ohm, "Overview of the versatile video coding (VVC) standard and its applications," IEEE Transactions on Circuits and Systems for Video Technology, vol. 31, no. 10, pp. 3736-3764, 2021.
[10] Y. Chen, D. Mukherjee, J. Han, A. Grange, Y. Xu, S. Parker, C. Chen, H. Su, U. Joshi, C.-H. Chiang et al., "An overview of coding tools in AV1: the first video codec from the Alliance for Open Media," APSIPA Transactions on Signal and Information Processing, vol. 9, p. e6, 2020.
[11] G. Lu, W. Ouyang, D. Xu, X. Zhang, C. Cai, and Z. Gao, "DVC: An end-to-end deep video compression framework," in Proc. of the IEEE/CVF Conf. on Computer Vision and Pattern Recognition, 2019, pp. 11006-11015.
[12] H. Chen, B. He, H. Wang, Y. Ren, S. N. Lim, and A. Shrivastava, "NeRV: Neural representations for videos," Advances in Neural Information Processing Systems, vol. 34, pp. 21557-21568, 2021.
[13] G. Gao, S. Teng, T. Peng, F. Zhang, and D. Bull, "GIViC: Generative implicit video compression," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 17356-17367.
[14] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, "Image quality assessment: from error visibility to structural similarity," IEEE Trans. on Image Processing, vol. 13, no. 4, pp. 600-612, 2004.
[15] Z. Wang, E. P. Simoncelli, and A. C. Bovik, "Multiscale structural similarity for image quality assessment," in The Thirty-Seventh Asilomar Conf. on Signals, Systems & Computers, 2003, vol. 2. IEEE, 2003, pp. 1398-1402.
[16] F. Zhang, A. Katsenou, C. Bampis, L. Krasula, Z. Li, and D. Bull, "Enhancing VMAF through new feature integration and model combination," in 2021 Picture Coding Symposium (PCS). IEEE, 2021, pp. 1-5.
[17] W. Kim, J. Kim, S. Ahn, J. Kim, and S. Lee, "Deep video quality assessor: From spatio-temporal visual sensitivity to a convolutional neural aggregation network," in Proc. of the European Conf. on Computer Vision, 2018, pp. 219-234.
[18] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, "The unreasonable effectiveness of deep features as a perceptual metric," in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, 2018, pp. 586-595.
[19] M. Xu, J. Chen, H. Wang, S. Liu, G. Li, and Z. Bai, "C3DVQA: Full-reference video quality assessment with 3D convolutional neural network," in ICASSP 2020-2020 IEEE International Conf. on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 4447-4451.
[20] Y. Li, L. Feng, J. Xu, T. Zhang, Y. Liao, and J. Li, "Full-reference and no-reference quality assessment for compressed user-generated content videos," in 2021 IEEE International Conf. on Multimedia & Expo Workshops (ICMEW). IEEE, 2021, pp. 1-6.
[21] C. Feng, D. Danier, F. Zhang, and D. Bull, "RankDVQA: Deep VQA based on ranking-inspired hybrid training," in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024, pp. 1648-1658.
[22] L. Zhu, S. Kwong, Y. Zhang, S. Wang, and X. Wang, "Generative adversarial network-based intra prediction for video coding," IEEE Transactions on Multimedia, vol. 22, no. 1, pp. 45-58, 2019.
[23] F. Zhang, C. Feng, and D. R. Bull, "Enhancing VVC through CNN-based post-processing," in 2020 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 2020, pp. 1-6.
[24] F. Zhang, D. Ma, C. Feng, and D. R. Bull, "Video compression with CNN-based postprocessing," IEEE MultiMedia, vol. 28, no. 4, pp. 74-83, 2021.
[25] D. Ramsook, A. Kokaram, N. Birkbeck, Y. Su, and B. Adsumilli, "A deep learning post-processor with a perceptual loss function for video compression artifact removal," in 2022 Picture Coding Symposium (PCS). IEEE, 2022, pp. 85-89.
[26] J. Wang, X. Deng, M. Xu, C. Chen, and Y. Song, "Multi-level wavelet-based generative adversarial network for perceptual quality enhancement of compressed video," in European Conference on Computer Vision. Springer, 2020, pp. 405-421.
[27] V. Veerabadran, R. Pourreza, A. Habibian, and T. S. Cohen, "Adversarial distortion for learned video compression," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2020, pp. 168-169.
[28] S. Zhang, M. Mrak, L. Herranz, M. G. Blanch, S. Wan, and F. Yang, "DVC-P: Deep video compression with perceptual optimizations," in 2021 International Conference on Visual Communications and Image Processing (VCIP). IEEE, 2021, pp. 1-5.
[29] D. Ma, F. Zhang, and D. R. Bull, "CVEGAN: a perceptually-inspired GAN for compressed video enhancement," Signal Processing: Image Communication, vol. 127, p. 117127, 2024.
[30] J. Ballé, L. Versari, E. Dupont, H. Kim, and M. Bauer, "Good, cheap, and fast: Overfitted image compression with Wasserstein distortion," in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 23259-23268.
[31] X. Xiong, E. Pavez, A. Ortega, and B. Adsumilli, "Rate-distortion optimization with alternative references for UGC video compression," in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1-5.
[32] A. Gu and T. Dao, "Mamba: Linear-time sequence modeling with selective state spaces," in First Conference on Language Modeling (COLM), 2024.
[33] K. Li, X. Li, Y. Wang, Y. He, Y. Wang, L. Wang, and Y. Qiao, "VideoMamba: State space model for efficient video understanding," in European Conference on Computer Vision. Springer, 2024, pp. 237-255.
[34] Z. Li, A. Aaron, I. Katsavounidis, A. Moorthy, and M. Manohara, "Toward a practical perceptual video quality metric," The Netflix Tech Blog, vol. 6, no. 2, 2016.
[35] D. Ma, F. Zhang, and D. Bull, "BVI-DVC: A training database for deep video compression," IEEE Trans. on Multimedia, 2021.
[36] CLIC, "Workshop and challenge on learned image compression (CLIC2022)," in CVPR Workshop, 2022.
[37] Y. Wang, S. Inguva, and B. Adsumilli, "YouTube UGC dataset for video compression research," in 2019 IEEE 21st International Workshop on Multimedia Signal Processing (MMSP). IEEE, 2019, pp. 1-5.
[38] L. Merritt and R. Vanam, "x264: A high performance H.264/AVC encoder," 2006.
[39] D. Mukherjee, J. Bankoski, A. Grange, J. Han, J. Koleszar, P. Wilkins, Y. Xu, and R. Bultje, "The latest open-source video codec VP9 - an overview and preliminary results," in 2013 Picture Coding Symposium (PCS). IEEE, 2013, pp. 390-393.
[40] T. Xue, B. Chen, J. Wu, D. Wei, and W. T. Freeman, "Video enhancement with task-oriented flow," International Journal of Computer Vision (IJCV), vol. 127, no. 8, pp. 1106-1125, 2019.
[41] F. Zhang, F. M. Moss, R. Baddeley, and D. R. Bull, "BVI-HD: A video quality database for HEVC compressed and texture synthesized content," IEEE Trans. on Multimedia, vol. 20, no. 10, pp. 2620-2630, 2018.
[42] S.
W inkler, “ Analysis of public image and video databases for quality assessment, ” IEEE Journal of Selected T opics in Signal Pr ocessing , vol. 6, no. 6, pp. 616–625, 2012. [43] V . Hosu, F . Hahn, M. Jenadeleh, H. Lin, H. Men, T . Szirányi, S. Li, and D. Saupe, “The konstanz natural video database (kon vid-1k), ” in 2017 Ninth International Conf. on Quality of Multimedia Experience (QoMEX) . IEEE, 2017, pp. 1–6. [44] Z. Sinno and A. C. Bovik, “Large-scale study of perceptual video quality , ” IEEE T rans. on Image Pr ocessing , vol. 28, no. 2, pp. 612–627, 2018. [45] X. Y u, N. Birkbeck, Y . W ang, C. G. Bampis, B. Adsumilli, and A. C. Bovik, “Predicting the quality of compressed videos with pre-existing distortions, ” IEEE T rans. on Image Pr ocessing , vol. 30, pp. 7511–7526, 2021. [46] Y . Li, S. Meng, X. Zhang, S. W ang, Y . W ang, and S. Ma, “UGC-VIDEO: perceptual quality assessment of user-generated videos, ” in 2020 IEEE Conf. on Multimedia Information Pr o- cessing and Retrieval (MIPR) . IEEE, 2020, pp. 35–38.
