Evaluating quality metrics through the lenses of psychophysical measurements of low-level vision


Image and video quality metrics, such as SSIM, LPIPS, and VMAF, aim to predict perceived visual quality and are often assumed to reflect principles of human vision. However, relatively few metrics explicitly incorporate models of human perception; most rely on hand-crafted formulas or data-driven training to approximate perceptual alignment. In this paper, we introduce a set of tests for full-reference quality metrics that evaluate their ability to capture key aspects of low-level human vision: contrast sensitivity, contrast masking, and contrast matching. These tests provide an additional framework for assessing both established and newly proposed metrics. We apply the tests to 34 existing quality metrics and highlight patterns in their behavior, including the ability of LPIPS and MS-SSIM to predict contrast masking, the tendency of SSIM to overemphasize high spatial frequencies (a bias mitigated in MS-SSIM), and the general inability of metrics to model supra-threshold contrast constancy. Our results demonstrate how these tests can reveal properties of quality metrics that are not easily observed with standard evaluation protocols.


💡 Research Summary

This paper introduces a psychophysics‑inspired evaluation framework for full‑reference image and video quality metrics, aiming to reveal how well these metrics capture fundamental low‑level properties of human vision. While traditional assessment relies on correlation with subjective scores (MOS, DMOS, JOD), such approaches mask the underlying perceptual mechanisms that drive human quality judgments. To address this gap, the authors design a suite of synthetic tests that mimic classic psychophysical experiments: (1) contrast detection, (2) contrast masking (both phase‑coherent sinusoidal maskers and phase‑incoherent broadband noise), (3) flicker detection (temporal modulation), and (4) contrast matching across spatial frequencies and across color directions.

In the detection tests, a metric plays the role of an observer: a test image containing a stimulus (e.g., a Gabor patch of varying spatial frequency and contrast) is compared against a uniform reference. By sweeping the stimulus parameters, each metric produces a two‑dimensional “detection surface.” The authors compare these surfaces to the human contrast sensitivity function (CSF) and to masking curves, quantifying agreement with an “Alignment Score” (higher is better). In the matching tests, the metric is asked to find the stimulus contrast that yields the same quality score as a reference stimulus; the deviation from human matching data is measured by RMSE.
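The detection procedure above can be sketched in a few lines: any metric (a callable comparing two images) is treated as an observer, and the stimulus contrast is swept until the metric's score crosses a fixed criterion. Everything below — the Gabor generator, the criterion value, and the plain-RMSE stand-in metric — is an illustrative assumption, not the paper's implementation:

```python
import numpy as np

def gabor(size, freq_cpd, contrast, ppd=60.0, sigma_deg=0.5, mean_lum=0.5):
    """Gabor patch (cosine carrier under a Gaussian envelope) in normalized luminance."""
    ax = (np.arange(size) - size / 2) / ppd            # pixel positions in degrees
    xx, yy = np.meshgrid(ax, ax)
    carrier = np.cos(2 * np.pi * freq_cpd * xx)
    envelope = np.exp(-(xx**2 + yy**2) / (2 * sigma_deg**2))
    return mean_lum * (1 + contrast * carrier * envelope)

def detection_threshold(metric, freq_cpd, criterion, size=256):
    """Smallest swept contrast at which metric(test, reference) reaches `criterion`."""
    reference = np.full((size, size), 0.5)             # uniform field, mid gray
    for c in np.logspace(-3, 0, 50):                   # sweep contrast 0.001 .. 1.0
        if metric(gabor(size, freq_cpd, c), reference) >= criterion:
            return c
    return np.nan                                      # stimulus never "detected"

# Toy stand-in metric: plain RMSE (PSNR-like). NOT one of the paper's 34 metrics.
rmse = lambda a, b: np.sqrt(np.mean((a - b) ** 2))

# Sweeping frequency traces the metric's implied "CSF": threshold vs. frequency.
thresholds = {f: detection_threshold(rmse, f, criterion=0.01) for f in (1, 4, 16)}
```

For a pixel-wise metric like RMSE the resulting thresholds are nearly flat across spatial frequency, illustrating the paper's point that simple pixel-wise error does not reproduce the band-pass shape of the human CSF.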

The framework is applied to 34 full‑reference metrics, covering traditional (PSNR, SSIM, MS‑SSIM, GMSD, VIFp, DSS, NLPD), color‑specific (FSIMc, VSI, sCIELab, CIEDE2000, ICtCp, HyAB), deep‑learning‑based (WaDIQaM, LPIPS‑Alex, LPIPS‑VGG, DISTS, AHIQ, TOPIQ), and video‑oriented (VMAF, SpeedQA, FUNQUE, HDR‑VDP‑3, FovVideoVDP, ColorVideoVDP, CVVDP‑ML) families.

Key findings:

  • Contrast detection – HDR‑VDP‑3, MS‑SSIM, and VMAF achieve the highest alignment with the human CSF. SSIM, despite its popularity, over‑emphasizes high spatial frequencies, resulting in poor alignment. PSNR performs poorly, confirming that simple pixel‑wise error does not reflect human detection thresholds.

  • Contrast masking – LPIPS (both Alex and VGG backbones) and MS‑SSIM best predict human masking curves for both sinusoidal and broadband noise maskers. Traditional structure‑based metrics (GMSD, VIFp, DSS) show limited sensitivity to masking, indicating that they lack an explicit suppression mechanism.

  • Flicker detection – Only video‑specific metrics can be evaluated; VMAF, SpeedQA, and FUNQUE obtain moderate alignment scores, while static‑image metrics are inapplicable.

  • Contrast matching (supra‑threshold) – The majority of metrics, including deep‑learning ones, fail to reproduce human contrast‑constancy across spatial frequencies; RMSE values are high. Color‑direction matching is better captured by dedicated color difference formulas (CIEDE2000, ICtCp, HyAB), whereas most quality metrics show large variations across color axes, reflecting a lack of explicit color‑contrast modeling.

  • Deep‑learning metrics – LPIPS shows strong performance on detection and masking but weak on matching, suggesting that learned feature spaces encode near‑threshold sensitivity but not supra‑threshold constancy. DISTS behaves similarly. WaDIQaM performs poorly across the board, likely due to training objectives that do not emphasize low‑level vision.

  • Color and video metrics – ColorVideoVDP and CVVDP‑ML (both saliency‑aware and transformer‑based variants) excel in color‑direction matching and achieve respectable flicker scores, demonstrating the benefit of incorporating perceptual color models.
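The contrast-matching test behind these findings can be framed as one-dimensional root finding: search for the test contrast whose metric score (against a uniform field) equals the score of the reference stimulus. The sketch below uses bisection with a frequency-blind RMSE stand-in (a hypothetical choice, not one of the paper's metrics); for such a metric the match simply equals the reference contrast, whereas a CSF-aware metric would deviate, especially at near-threshold contrasts:

```python
import numpy as np

def grating(freq_cpd, contrast, size=256, ppd=60.0, mean_lum=0.5):
    """Full-field vertical sinusoidal grating in normalized luminance."""
    x = (np.arange(size) - size / 2) / ppd             # positions in degrees
    row = mean_lum * (1 + contrast * np.cos(2 * np.pi * freq_cpd * x))
    return np.tile(row, (size, 1))

def match_contrast(metric, ref_freq, ref_contrast, test_freq, tol=1e-4):
    """Bisection: find the test contrast whose metric score (vs. a uniform
    field) equals the reference grating's score -- the metric's 'match'."""
    uniform = np.full((256, 256), 0.5)
    target = metric(grating(ref_freq, ref_contrast), uniform)
    lo, hi = 0.0, 1.0                                  # metric grows with contrast
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if metric(grating(test_freq, mid), uniform) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

rmse = lambda a, b: np.sqrt(np.mean((a - b) ** 2))

# A frequency-blind metric "matches" the reference contrast almost exactly;
# human observers also show this constancy well above threshold.
m = match_contrast(rmse, ref_freq=4, ref_contrast=0.3, test_freq=16)
```

Comparing each metric's matches against human contrast-matching data across frequencies (or color directions) yields the RMSE figure the paper reports for this test.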

Overall, the study demonstrates that many widely used quality metrics do not faithfully model basic visual processes such as contrast constancy, and that explicit incorporation of psychophysical models (CSF, masking, color‑contrast) can markedly improve alignment with human perception. The proposed test suite provides a systematic, interpretable complement to conventional MOS‑based benchmarking, enabling researchers to diagnose specific perceptual shortcomings of a metric and to guide the design of next‑generation quality predictors that embed low‑level human vision principles. Future work may extend the framework to high‑dynamic‑range, chromatic aberration, and complex distortion interactions, and explore integrating CSF‑based loss functions into deep learning‑based quality models.

