Guiding Perception-Reasoning Closer to Human in Blind Image Quality Assessment

Reading time: 5 minutes
...

📝 Original Info

  • Title: Guiding Perception-Reasoning Closer to Human in Blind Image Quality Assessment
  • ArXiv ID: 2512.16484
  • Date: 2025-12-18
  • Authors: Yuan Li, Yahan Yu, Youyuan Lin, Yong-Hao Yang, Chenhui Chu, Shin’ya Nishida

📝 Abstract

Humans assess image quality through a perception-reasoning cascade, integrating sensory cues with implicit reasoning to form self-consistent judgments. In this work, we investigate how a model can acquire both human-like and self-consistent reasoning capability for blind image quality assessment (BIQA). We first collect human evaluation data that capture several aspects of the human perception-reasoning pipeline. Then, we adopt reinforcement learning, using human annotations as reward signals to guide the model toward human-like perception and reasoning. To enable the model to internalize self-consistent reasoning capability, we design a reward that drives the model to infer the image quality purely from self-generated descriptions. Empirically, our approach achieves score prediction performance comparable to state-of-the-art BIQA systems under general metrics, including the Pearson and Spearman correlation coefficients. In addition to the rating score, we assess human-model alignment using ROUGE-1 to measure the similarity between model-generated and human perception-reasoning chains. On over 1,000 human-annotated samples, our model reaches a ROUGE-1 score of 0.512 (cf. 0.443 for the baseline), indicating substantial coverage of human explanations and marking a step toward human-like interpretable reasoning in BIQA.

💡 Deep Analysis

📄 Full Content

Blind image quality assessment (BIQA) aims to simulate how humans perceive and evaluate the visual quality of an image. To understand which visual features are extracted from the image in perception and how these features are logically integrated into an overall judgment, researchers have explored a wide range of computational approaches.

Early studies [2,6,13,36] achieved some level of success by focusing on low- and mid-level visual features. Using contrastive learning or visual feature encoders, these methods could effectively predict numerical quality scores. However, as multimodal models [3,4,19,32,35] evolve, the research community has begun to seek interpretability beyond score regression, expecting models not only to rate image quality but also to articulate why an image appears better or worse. This shift has inspired a new wave of pioneering work that bridges vision and language. Recent studies [5,7,16,28,29,33,37] have explored semantic-based classification, textual explanation generation, and multimodal regression guided by visual features. These models laid the foundation for multimodal BIQA, demonstrating the potential of combining image and text representations. Nevertheless, they still fall short of replicating the human process of perception and reasoning. Rather than integrating perceptual cues into a logical judgment, they often generate both explanations and scores directly from image embeddings, so the two are not logically connected. As a result, these models may appear to reason, yet their "reasoning" remains shallow and directly coupled with the visual input.

As illustrated in Fig. 1, existing multimodal large language model (MLLM)-based BIQA models can be roughly divided into two categories. The first category is supervised fine-tuning (SFT) models, which lack a genuine reasoning process and treat image captions merely as by-products of rating. The second category is reinforcement learning (RL) models, where the reasoning process is generated jointly with the quality score, but without human supervision. To build a system that more closely resembles humans, we propose to supervise the perception-reasoning stage with human annotations in an RL framework. Before detailing the method, we first formalize the concept of human perception-reasoning in BIQA: (1) Perception: the visual image is transformed into internal representations, including low-level visual features and high-level semantic features; (2) Reasoning: these representations are integrated into a coherent quality judgment. By explicitly modeling this two-stage process, we enable the system to analyze intermediate textual information, simulate human perceptual focus, and reconstruct human evaluation criteria, thereby enhancing interpretability.
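The paper does not spell out its exact reward formulation at this point, but a minimal sketch of the idea could combine (a) alignment between the model's perception-reasoning text and the human annotation and (b) self-consistency between the score predicted from the image and the score re-inferred from the model's own description. The function names, the equal weighting, and the [0, 1] score normalization below are illustrative assumptions, not the paper's definition.

```python
from collections import Counter

def token_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 between two texts (a rough stand-in for ROUGE-1)."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def perception_reasoning_reward(model_description: str,
                                human_annotation: str,
                                score_from_image: float,
                                score_from_description: float,
                                w_align: float = 0.5,
                                w_self: float = 0.5) -> float:
    """Composite reward: human-likeness of the description plus self-consistency."""
    # Human-likeness: how well the model's perception-reasoning text covers
    # the human annotation.
    align = token_f1(human_annotation, model_description)
    # Self-consistency: the quality score inferred purely from the
    # self-generated description should agree with the score predicted from
    # the image (both assumed normalized to [0, 1]).
    self_consistency = 1.0 - min(abs(score_from_image - score_from_description), 1.0)
    return w_align * align + w_self * self_consistency
```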

Our contributions are fourfold:

• We collect human annotations on 1,495 images spanning eight dimensions related to image quality, which directly capture the human perception and reasoning process.

• We design new reward functions that enable the model to effectively evolve toward self-consistent and human-like reasoning under these human-annotated signals.

• We introduce ROUGE-1 as a metric for evaluating the alignment between model-generated and human perception-reasoning chains, providing a new direction for measuring human-aligned reasoning in BIQA (see the sketch after this list).

• Our model achieves competitive performance under both image-based and caption-based conditions, offering a step toward interpretable, human-aligned BIQA.
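As a rough illustration of the ROUGE-1 alignment metric, the sketch below averages the ROUGE-1 F-measure over paired model and human perception-reasoning chains using the `rouge-score` package; the function name and data format are assumptions, not the paper's evaluation code.

```python
from rouge_score import rouge_scorer

def mean_rouge1(model_chains: list[str], human_chains: list[str]) -> float:
    """Average ROUGE-1 F-measure between model-generated and human chains."""
    scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)
    scores = [
        scorer.score(human, model)["rouge1"].fmeasure  # score(target, prediction)
        for model, human in zip(model_chains, human_chains)
    ]
    return sum(scores) / len(scores)
```

A higher value indicates that the model's explanations cover more of the human annotations; the paper reports 0.512 for its model versus 0.443 for the baseline on over 1,000 annotated samples.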

A variety of publicly available datasets [7-9, 12, 14, 15, 18, 29] have greatly advanced research in image quality assessment (IQA). These datasets include large collections of both real-world and synthetically degraded images, significantly broadening the scope of quality evaluation studies. From the perspective of annotation, in addition to the mandatory mean opinion score (MOS), some datasets provide additional supervisory signals. For example, KonIQ [12] supplements MOS with score distributions and feature-level attributes such as contrast, colorfulness, and sharpness, while Q-Pathway [29] introduces rich textual annotations that describe low-level perceptual attributes and visual content. We categorize these as perceptual features, which reflect how humans perceive visual quality through sensory cues.
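To make the shape of these supervisory signals concrete, a hypothetical annotation record combining them might look like the following; the field names are illustrative and not taken from KonIQ or Q-Pathway.

```python
from dataclasses import dataclass, field

@dataclass
class IQAAnnotation:
    """Illustrative record combining the supervision types described above."""
    image_id: str
    mos: float                                    # mean opinion score
    score_distribution: list[float] = field(default_factory=list)  # per-rating-level vote histogram
    attributes: dict[str, float] = field(default_factory=dict)     # e.g. contrast, colorfulness, sharpness
    description: str = ""                         # free-text perceptual/content annotation
```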

However, image features are inherently complex, and the same texture pattern may influence perceived quality differently across contexts. What remains largely missing are authentic annotations that capture reasoning-level cues: how humans interpret and integrate perceptual information to form consistent judgments of quality. To address this gap, we collect human evaluation data that explicitly describe how specific perceptual features affect perceived quality, providing a foundation for modeling the reasoning process underlying human IQA.

Early researchers primarily focused on predicting a numerical quality score. Most existing methods [2,6,20,22] adopt a visual encoder-regression framework,
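As a minimal sketch of what such a visual encoder-regression framework typically looks like, assuming a PyTorch setup with a ResNet-50 backbone and an MLP regression head (the backbone choice and head sizes are illustrative, not those of the cited methods):

```python
import torch
import torch.nn as nn
from torchvision import models

class EncoderRegressionBIQA(nn.Module):
    """Visual encoder + regression head for numerical quality-score prediction."""

    def __init__(self):
        super().__init__()
        backbone = models.resnet50(weights=None)   # visual feature encoder
        feat_dim = backbone.fc.in_features
        backbone.fc = nn.Identity()                # drop the classification head
        self.encoder = backbone
        self.regressor = nn.Sequential(            # map pooled features to a score
            nn.Linear(feat_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        return self.regressor(self.encoder(images)).squeeze(-1)

# Example: predicted quality scores for a batch of 224x224 RGB images.
# model = EncoderRegressionBIQA()
# scores = model(torch.randn(4, 3, 224, 224))
```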

