Blind image quality assessment (BIQA) plays a crucial role in evaluating and optimizing visual experience. Most existing BIQA approaches fuse shallow and deep features extracted from backbone networks, while overlooking their unequal contributions to quality prediction. Moreover, while various vision encoder backbones are widely adopted in BIQA, effective quality decoding architectures remain underexplored. To address these limitations, this paper investigates the contributions of shallow and deep features to BIQA, and proposes an effective quality feature decoding framework via GCN-enhanced \underline{l}ayer \underline{i}nteraction and MoE-based \underline{f}eature d\underline{e}coupling, termed \textbf{Life-IQA}. Specifically, the GCN-enhanced layer interaction module uses the GCN-enhanced deepest-layer features as the query and the penultimate-layer features as the key and value, and then performs cross-attention to achieve feature interaction. Moreover, a MoE-based feature decoupling module is proposed to decouple the fused representations through different experts specialized for specific distortion types or quality dimensions. Extensive experiments demonstrate that Life-IQA achieves a more favorable accuracy-cost balance than a vanilla Transformer decoder and achieves state-of-the-art performance on multiple BIQA benchmarks. The code is available at: \href{https://github.com/TANGLONG2/Life-IQA/tree/main}{\texttt{Life-IQA}}.
Image quality assessment (IQA) has long been an important task in computer vision, which aims to produce quality predictions that are consistent with human perceptual assessments [1]-[7]. Among IQA tasks, blind image quality assessment (BIQA) quantifies the absolute quality of images in the absence of pristine references [8]-[12], and is widely used in various applications. Existing BIQA methods commonly utilize pretrained backbone networks to extract multi-scale features from distorted images and then fuse them to predict quality [13]-[17]. The underlying assumption of this strategy is that different hierarchical features provide complementary information. Specifically, shallow-level features are thought to capture local details and texture information, making them sensitive to local distortions.
To quantify the actual contribution of features from different stages to BIQA, six pretrained models are evaluated on the authentic distortion dataset LIVEC [18], including three convolutional neural networks (CNNs), i.e., ResNet50 [19], EfficientNet [20], and MobileNet [21], and three Transformer models, i.e., Swin Transformer [22], MaxViT [23], and PVT [24]. To separately investigate the quality-representation capacity of the features from different layers, the features from each stage are directly fed into a simple prediction head, which consists of a global average pooling (GAP) layer and a linear regression layer. As illustrated in Figure 2, the regression results on LIVEC [18] clearly demonstrate that the predictive performance of shallow-level features (stage 1, stage 2) from all backbone networks is significantly inferior to that of deep-level features (stage 3, stage 4). To further investigate the discriminative ability of the features for specific distortion types, the features from stage 1 and stage 4 are visualized using t-SNE on the synthetic distortion dataset KADID-10k [25]. As shown in Figure 3, the stage-1 features are entangled and difficult to distinguish, leading to severely blurred class boundaries, while the stage-4 features form highly compact and well-separated cluster structures. These experiments indicate that shallow features contribute less to BIQA than deep features. Because of the limited data in IQA, models struggle to learn effective quality representations from shallow features dominated by details, which indicates that completely and directly fusing features from all stages may introduce noise and redundancy.
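For clarity, the per-stage probing protocol described above can be sketched as follows, assuming a torchvision ResNet-50 backbone; the stage split, channel widths, and the omission of training details are illustrative choices rather than the exact experimental code.

\begin{verbatim}
# Minimal sketch of the per-stage probing head: features from one backbone
# stage -> global average pooling (GAP) -> a single linear regression layer.
import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights

class StageProbe(nn.Module):
    """GAP + linear head attached to one stage of a frozen ResNet-50."""
    def __init__(self, stage: int):
        super().__init__()
        backbone = resnet50(weights=ResNet50_Weights.IMAGENET1K_V2)
        stem = [backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool]
        stages = [backbone.layer1, backbone.layer2, backbone.layer3, backbone.layer4]
        self.features = nn.Sequential(*stem, *stages[:stage])
        for p in self.features.parameters():
            p.requires_grad = False          # only the head is trained
        channels = [256, 512, 1024, 2048][stage - 1]
        self.gap = nn.AdaptiveAvgPool2d(1)
        self.head = nn.Linear(channels, 1)   # scalar quality score

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f = self.features(x)                 # (B, C, H, W) stage features
        f = self.gap(f).flatten(1)           # (B, C)
        return self.head(f).squeeze(-1)      # (B,) predicted quality

# Probe stage-1 vs. stage-4 features on a dummy batch.
imgs = torch.randn(4, 3, 224, 224)
print(StageProbe(stage=1)(imgs).shape, StageProbe(stage=4)(imgs).shape)
\end{verbatim}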
Based on the aforementioned analysis, we avoid the unstable and often ineffective traditional multi-scale feature fusion and focus on enabling efficient interaction among deeper, semantically richer features. To this end, we propose Life-IQA, an effective quality feature decoding framework built upon GCN-enhanced Layer Interaction and MoE-based Feature Decoupling. In Life-IQA, the deepest-stage features, enhanced through a graph convolutional network (GCN), serve as query tokens that carry high-level quality semantics, while the penultimate-stage features act as keys and values that provide structured local evidence; cross-attention-based interaction is then performed between them. Furthermore, a mixture-of-experts (MoE) module decouples the fused representations, allowing different experts to specialize in particular distortion types or quality dimensions. This design yields a dynamic, data-efficient decoding mechanism tailored to the characteristics of BIQA. Overall, the main contributions are summarized as follows.
• We propose a cross-attention-based layer interaction method for BIQA, which extracts the features of the last two stages and fuses them via cross-attention layers, as sketched below.
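To make the proposed interaction concrete, the following is a minimal PyTorch-style sketch of the GCN-enhanced layer interaction, in which GCN-enhanced deepest-stage tokens act as queries and penultimate-stage tokens as keys and values for cross-attention. The similarity-based graph construction, token dimensions, and residual connections are assumptions for illustration, not the exact Life-IQA implementation.

\begin{verbatim}
import torch
import torch.nn as nn
import torch.nn.functional as F

class GCNLayerInteraction(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.gcn_weight = nn.Linear(dim, dim)   # graph convolution: A @ X @ W
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)

    def forward(self, deepest, penultimate):
        # deepest: (B, N4, C) stage-4 tokens; penultimate: (B, N3, C) stage-3 tokens
        # 1) GCN enhancement of the deepest-stage tokens over a similarity graph.
        d = F.normalize(deepest, dim=-1)
        adj = F.softmax(torch.bmm(d, d.transpose(1, 2)), dim=-1)   # (B, N4, N4)
        query = F.relu(torch.bmm(adj, self.gcn_weight(deepest))) + deepest
        # 2) Cross-attention: enhanced deepest tokens query the penultimate tokens.
        kv = self.norm_kv(penultimate)
        fused, _ = self.cross_attn(self.norm_q(query), kv, kv)
        return fused + query                                       # (B, N4, C)

# Dummy stage-4/stage-3 token maps already projected to a shared width.
f4, f3 = torch.randn(2, 49, 512), torch.randn(2, 196, 512)
print(GCNLayerInteraction()(f4, f3).shape)   # torch.Size([2, 49, 512])
\end{verbatim}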
With the success of the Vision Transformer (ViT) [26], its powerful capability for modeling global dependencies has also been applied to the BIQA task [27], [28]. Golestaneh et al. [29] constructed a complete encoder-decoder pipeline. However, the query generation relied solely on the single CLS token output by the ViT encoder. This approach not only limited the diversity of query perspectives but also made the effectiveness of the decoding process heavily dependent on the representation quality of the upstream encoder. Therefore, designing decoders with richer information sources and more flexible structures is key to unlocking the full potential of BIQA.
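As a schematic illustration of this limitation (not the cited method's actual code), the snippet below decodes patch tokens with a single CLS-token query, so every decoded feature must pass through one aggregated query vector.

\begin{verbatim}
import torch
import torch.nn as nn

dim = 768
decoder = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
patch_tokens = torch.randn(2, 196, dim)   # ViT patch tokens (keys/values)
cls_token = torch.randn(2, 1, dim)        # single CLS token as the sole query
decoded, _ = decoder(cls_token, patch_tokens, patch_tokens)
print(decoded.shape)   # torch.Size([2, 1, 768]): one query, one view of the image
\end{verbatim}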
The MoE model is a neural network architecture that expands model capacity by routing each input to a small set of specialized expert subnetworks selected by a gating function, so that capacity grows without a proportional increase in per-input computation.
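A minimal sketch of such an MoE block with top-k gating is given below; the expert and gate designs, the pooled input, and the routing scheme are illustrative assumptions rather than the exact Life-IQA feature decoupling module.

\begin{verbatim}
import torch
import torch.nn as nn
import torch.nn.functional as F

class QualityMoE(nn.Module):
    def __init__(self, dim: int = 512, num_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
            for _ in range(num_experts)
        ])
        self.gate = nn.Linear(dim, num_experts)
        self.top_k = top_k

    def forward(self, x):
        # x: (B, C) pooled fused quality representation
        weights, idx = self.gate(x).topk(self.top_k, dim=-1)   # pick top-k experts
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k:k + 1] * expert(x[mask])
        return out

print(QualityMoE()(torch.randn(8, 512)).shape)   # torch.Size([8, 512])
\end{verbatim}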