Non-reference metrics (NRMs) can assess the visual quality of images and videos without a reference, making them well-suited for the evaluation of user-generated content. Nonetheless, rate-distortion optimization (RDO) in video coding is still mainly driven by full-reference metrics, such as the sum of squared errors, which treat the input as an ideal target. A way to incorporate NRMs into RDO is through linearization (LNRM), where the gradient of the NRM with respect to the input guides bit allocation. While this strategy improves the quality predicted by some metrics, we show that it can yield limited gains or degradations when evaluated with other NRMs. We argue that NRMs are highly non-linear predictors with locally unstable gradients that can compromise the quality of the linearization; furthermore, optimizing a single metric may exploit model-specific biases that do not generalize across quality estimators. Motivated by this observation, we extend the LNRM framework to optimize ensembles of NRMs and, to further improve robustness, we introduce a smoothing-based formulation that stabilizes NRM gradients prior to linearization. Our framework is well-suited to hybrid codecs, and we advocate for its use with overfitted codecs, where it avoids iterative evaluations and backpropagation of neural network-based NRMs, reducing encoder complexity relative to direct NRM optimization. We validate the proposed approach on AVC and Cool-chic, using the YouTube UGC dataset. Experiments demonstrate consistent bitrate savings across multiple NRMs with no decoder complexity overhead and, for Cool-chic, a substantial reduction in encoding runtime compared to direct NRM optimization.
Non-reference metrics (NRMs) [1], which assess the perceptual quality of images and videos without a pristine reference, are important for the compression of user-generated content (UGC) [2]. UGC is typically noisy due to motion blur, non-ideal exposure, and artifacts from prior compression. Once uploaded to platforms like YouTube and TikTok, UGC is re-encoded at different bitrates and resolutions to support adaptive streaming [3]. In such pipelines, where the original content is unreliable, NRMs are popular for evaluating perceptual quality [2,4]. Despite their widespread adoption for quality evaluation, NRMs are still rarely used as bit-allocation objectives during compression [5]. Instead, rate-distortion optimization (RDO) in hybrid (i.e., block-based) [6] and overfitted [7] codecs is mostly driven by full-reference metrics (FRMs), such as the sum of squared errors (SSE) [8], which assume the input is an ideal reference. Hence, distortion converges to perfect quality (as measured by the FRM) as bitrate increases [5]. While appropriate for pristine content, FRMs encourage the preservation of artifacts when applied to UGC, which leads to suboptimal compression [9].
This work was funded in part by a YouTube gift. *Equal contribution.
Thus, UGC setups can benefit from using NRMs in RDO. The strategy to incorporate them depends on the codec architecture. In overfitted codecs [7], the NRM can be included directly in the loss function and optimized end-to-end via gradient descent, as we show in Section 4. However, overfitted codecs require multiple evaluations of the RD cost during encoding. Since modern NRMs are mostly implemented as deep neural networks, their repeated evaluation and backpropagation add substantial encoder complexity and runtime overhead [10]. For hybrid encoders, given that most NRMs map the input to a single value, obtaining per-pixel or per-block importance is not straightforward. As a result, RDO with NRMs for hybrid encoders may require iterative encoding, decoding, and metric evaluation. To address this issue, [5] proposed using the gradient of the metric evaluated at the input to guide bit allocation, a strategy termed linearized NRM (LNRM), which drastically reduces the complexity of accounting for NRMs in hybrid codecs.
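To make the LNRM idea concrete, the following is a minimal numpy sketch of gradient-guided bit allocation. The "NRM" here is a hypothetical toy metric with a closed-form input gradient (a real NRM would be a deep network whose gradient is obtained by backpropagation); the block size and weight normalization are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def toy_nrm(x):
    # Hypothetical stand-in for an NRM (higher = better quality).
    # tanh is chosen only so the input gradient has a closed form.
    return np.tanh(x).mean()

def toy_nrm_grad(x):
    # Closed-form gradient of toy_nrm with respect to the input.
    return (1.0 - np.tanh(x) ** 2) / x.size

def block_weights(grad, block=4):
    # Per-block importance: mean squared gradient magnitude in each block,
    # normalized so the weights average to 1 (preserves the rate target).
    h, w = grad.shape
    g2 = grad ** 2
    wts = g2.reshape(h // block, block, w // block, block).mean(axis=(1, 3))
    return wts / wts.mean()

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 16))          # stand-in for a luma patch
wts = block_weights(toy_nrm_grad(x))   # 4x4 grid of per-block weights
# A hybrid encoder would then minimize, per block b:
#   J_b = wts[b] * SSE_b + lambda * R_b
# i.e., a weighted-SSE RDO that needs no NRM evaluation inside the loop.
```

The key computational point is visible here: the NRM (and its gradient) is evaluated once at the input, after which RDO reduces to standard weighted-SSE mode decisions.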
While LNRM makes NRM-based RDO computationally efficient, NRMs are highly non-linear mappings from images to scalars, and their input gradients may exhibit local instability, i.e., the metric's response can vary sharply even for very small input changes [11]. Moreover, NRMs are imperfect quality estimators with model-specific biases, so that different NRMs produce different results for the same input. Experimentally, we observe, with both hybrid and overfitted codecs, that optimizing an LNRM derived from a specific (target) NRM often leads to significant improvements for the target NRM, while yielding marginal improvements (or even degradations) for other NRMs (Figure 1). Although in some settings this might be desirable, gains that are consistent across an ensemble of NRMs are more likely to reflect genuine perceptual quality improvements.
In this work, we address these issues along two directions. First, we extend the LNRM formulation [5] to ensembles of metrics, thereby reducing sensitivity to the choice of NRM and encouraging improvements across multiple, potentially unrelated quality predictors. Second, we introduce a smoothing-based optimization strategy, in which for a given NRM we average the scores obtained over small Gaussian perturbations of the input before linearization. This smoothed gradient strategy, similar to methods developed in the machine learning literature [14,15], computes an average of slightly perturbed gradients, attenuating sharp local variations and favoring directions that remain stable under small input perturbations. This improves the reliability of the first-order linearization underlying LNRM. Our approach can be incorporated into hybrid codecs, improving performance not only for the target NRMs but also across an ensemble of metrics. We further argue for its use with overfitted codecs, where linearization substantially reduces encoding complexity relative to direct metric optimization. When combined with smoothing, further computational gains are obtained by avoiding iterative evaluations of the smoothed objective. We validate our framework using AVC [16] and Cool-chic [7] for several NRMs [17][18][19]. We test frames sampled from the YouTube UGC dataset [2], setting the RDO to optimize different NRMs using our approach, considering both ensembles of metrics and smoothed objectives. For Cool-chic, we also consider direct NRM optimization. Results in Section 4 show that our method can yield improvements across multiple NRMs.
[Figure 1(d) caption: Normalized scores (between 0 and 1) reported by multiple NRMs. While optimizing the bit-allocation for QualiCLIP improves the quality as predicted by some metrics, it yields limited improvements or even degradations in others.]
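The two proposed ingredients, smoothing and ensembling, can be sketched together in a few lines of numpy. The gradient functions below are hypothetical closed-form stand-ins for backpropagation through real NRM networks, and the per-metric normalization and noise scale are illustrative assumptions rather than the paper's exact settings.

```python
import numpy as np

def smoothed_ensemble_grad(x, grad_fns, sigma=0.05, n_samples=8, seed=0):
    """Average the input gradients of several NRMs over small Gaussian
    perturbations of x (SmoothGrad-style smoothing), so the resulting
    bit-allocation signal is stable under small input changes and is
    not dominated by any single quality predictor."""
    rng = np.random.default_rng(seed)
    acc = np.zeros_like(x)
    for _ in range(n_samples):
        noisy = x + sigma * rng.normal(size=x.shape)
        for g in grad_fns:
            gi = g(noisy)
            # Normalize each metric's gradient scale before ensembling.
            acc += gi / (np.abs(gi).mean() + 1e-12)
    return acc / (n_samples * len(grad_fns))

# Two toy "NRMs" with analytic gradients (assumptions, not real metrics):
g1 = lambda x: (1.0 - np.tanh(x) ** 2) / x.size   # grad of mean(tanh(x))
g2 = lambda x: np.cos(x) / x.size                 # grad of mean(sin(x))

x = np.linspace(-1.0, 1.0, 64).reshape(8, 8)
g = smoothed_ensemble_grad(x, [g1, g2])
# g then replaces the raw single-metric gradient in the LNRM linearization.
```

Because the averaging happens once, before linearization, the encoder's RDO loop never re-evaluates the smoothed objective, which is where the computational saving over direct NRM optimization comes from.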