Designing a speech quality assessment (SQA) system for estimating the mean opinion score (MOS) of multi-rate speech with varying sampling frequencies (16-48 kHz) is a challenging task. The challenge arises from the limited availability of MOS-labeled training data comprising multi-rate speech samples. While self-supervised learning (SSL) models have been widely adopted in SQA to boost performance, a key limitation is that they are pretrained on 16 kHz speech and therefore discard high-frequency information present at higher sampling rates. To address this issue, we propose a spectrogram-augmented SSL method that incorporates high-frequency features (up to a 48 kHz sampling rate) through a parallel-branch architecture. We further introduce a two-step training scheme: the model is first pretrained on a large 48 kHz dataset and then fine-tuned on a smaller multi-rate dataset. Experimental results show that leveraging high-frequency information overlooked by SSL features is crucial for accurate multi-rate SQA, and that the proposed two-step training substantially improves generalization when multi-rate data is limited.
Speech quality assessment (SQA) is the task of evaluating how well human or synthetic speech is perceived by a listener. There are two main approaches to SQA: subjective and objective. Subjective methods involve human listeners rating the speech, typically on the mean opinion score (MOS) scale, where listeners rate quality from 1 (bad) to 5 (excellent). Objective methods use algorithms to predict human perception and are more efficient and reproducible. These include intrusive methods such as PESQ [1] and POLQA [2], which compare a degraded speech signal to its clean reference, and non-intrusive methods that assess quality using only the degraded signal.
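For concreteness, the snippet below shows how an intrusive score is obtained with the open-source `pesq` package (an implementation of ITU-T P.862); the file names are placeholders, and the example only illustrates why intrusive metrics require a clean reference.

```python
# Minimal illustration of intrusive scoring with the open-source
# `pesq` package (an ITU-T P.862 implementation); file names are
# placeholders and both files are assumed to be 16 kHz mono.
import soundfile as sf
from pesq import pesq

ref, fs = sf.read("clean_reference.wav")  # clean reference signal
deg, _ = sf.read("degraded.wav")          # degraded signal under test

# PESQ supports only 8 kHz ('nb') and 16 kHz ('wb') input -- one more
# reason such metrics do not cover high-fidelity 24/48 kHz speech.
score = pesq(fs, ref, deg, mode="wb")     # MOS-LQO estimate
print(f"PESQ (wideband): {score:.2f}")
```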
Since clean reference signals are rarely available in real-world scenarios, non-intrusive SQA methods are popular. Recent state-of-the-art non-intrusive SQA models [3,4,5,6] leverage self-supervised learning (SSL) representations extracted from large-scale pretrained models, such as Wav2Vec2, HuBERT, and WavLM [7,8,9]. In this framework, an SSL model is pretrained on vast amounts of unlabeled data and provides generic representations that can be exploited for downstream tasks such as SQA. However, a key limitation is that current SSL models are typically pretrained on 16 kHz speech. As a result, high-fidelity recordings (e.g., 24 kHz or 48 kHz) must be downsampled to 16 kHz before feature extraction, which discards perceptually important high-frequency information and negatively impacts SQA performance.
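The sketch below illustrates this bottleneck with a typical SSL feature-extraction pipeline (the model choice, file name, and tensor shapes are illustrative): 48 kHz audio must be resampled to 16 kHz, so everything above the 8 kHz Nyquist limit is lost before the encoder ever sees the signal.

```python
# Sketch of the 16 kHz bottleneck: SSL front ends such as Wav2Vec2
# expect 16 kHz input, so 48 kHz audio must be resampled first,
# discarding all content above the 8 kHz Nyquist limit.
import torch
import torchaudio
from transformers import Wav2Vec2Model

wav, sr = torchaudio.load("speech_48k.wav")            # (1, T) at 48 kHz
wav16 = torchaudio.functional.resample(wav, sr, 16000) # high band removed here

ssl = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
with torch.no_grad():
    out = ssl(wav16)
feats = out.last_hidden_state                          # (1, frames, 768)
# The SSL features cover only the 0-8 kHz band of the original recording.
```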
Developing a generalized SSL-based multi-rate SQA method that predicts MOS across different sampling rates is an interesting yet challenging task for three reasons. (a) First, SSL-based models lack access to high-band information. (b) Second, there is a scarcity of multi-rate datasets: most MOS-labeled corpora are collected at a single sampling rate, limiting the availability of suitable training data. (c) Third, the range-equalizing bias complicates cross-dataset learning [10]. Human raters typically use the full MOS scale even when the variance in perceived quality is limited, leading to misaligned MOS distributions across datasets. For example, a MOS rating of 5 for a 16 kHz sample may not correspond to the same perceived quality as a MOS rating of 5 for a 48 kHz sample. Therefore, it is difficult to directly combine MOS-labeled datasets recorded at different sampling rates and use the combined dataset for model training.
Recently, a multi-rate MOS-labeled subjective dataset containing recordings at 16, 24, and 48 kHz within a single evaluation was released [11] as part of the AudioMOS 2025 challenge, aiming to tackle the issue of multi-rate SQA. However, its limited size makes it challenging to train a generalizable multi-rate SQA model. In this work, we show that SSL-based multi-rate SQA methods trained only on the AudioMOS dataset struggle to generalize when evaluated on diverse external datasets. To address this limitation, we propose SA-SSL-MOS, a spectrogram-augmented SSL-based model for non-intrusive MOS prediction. The proposed method augments 16 kHz SSL features with spectrogram features to preserve high-frequency information. By effectively combining the two, SA-SSL-MOS retains the robustness and performance of SSL-based approaches while still recovering high-frequency information from high-fidelity recordings. Furthermore, we introduce a two-step pretraining-finetuning framework that enables effective use of limited multi-rate MOS data. From this, we investigate two research questions. First, does high-frequency information improve MOS prediction for high-fidelity recordings? Second, given the dataset limitations, does a pretraining strategy improve generalization to unseen speech recordings? Our contributions in this article are as follows: (1) we propose SA-SSL-MOS, a method for high-fidelity multi-rate speech quality assessment; (2) we demonstrate that incorporating high-frequency information significantly improves objective speech quality prediction; (3) we show that an SSL-based multi-rate SQA method trained on limited AudioMOS data suffers in generalization, and we introduce a two-step training strategy that improves generalization to out-of-distribution datasets.
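To make the parallel-branch idea concrete, the following is a minimal sketch of such an architecture; the feature dimensions, mel-spectrogram settings, pooling, and fusion scheme are assumptions for illustration, not the exact configuration of SA-SSL-MOS.

```python
# Hedged sketch of the parallel-branch idea: a 16 kHz SSL branch plus a
# full-band (48 kHz) spectrogram branch whose fused features feed a MOS
# regressor. All hyperparameters here are illustrative assumptions.
import torch
import torch.nn as nn
import torchaudio

class ParallelBranchMOS(nn.Module):
    def __init__(self, ssl_model, ssl_dim=768, n_mels=128, hidden=256):
        super().__init__()
        self.ssl = ssl_model                       # pretrained, 16 kHz input
        self.melspec = torchaudio.transforms.MelSpectrogram(
            sample_rate=48000, n_fft=2048, n_mels=n_mels)  # keeps the high band
        self.spec_enc = nn.GRU(n_mels, hidden, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(ssl_dim + hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))                  # scalar MOS estimate

    def forward(self, wav48):                      # wav48: (B, T) at 48 kHz
        wav16 = torchaudio.functional.resample(wav48, 48000, 16000)
        h_ssl = self.ssl(wav16).last_hidden_state.mean(dim=1)  # (B, ssl_dim)
        spec = self.melspec(wav48).transpose(1, 2)             # (B, frames, n_mels)
        _, h_spec = self.spec_enc(spec)                        # (1, B, hidden)
        fused = torch.cat([h_ssl, h_spec.squeeze(0)], dim=-1)
        return self.head(fused).squeeze(-1)                    # (B,) MOS estimates
```

In this sketch the spectrogram branch operates on the full 48 kHz waveform, so the fused representation sees the high band that the SSL branch discards.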
Let $x$ denote a speech clip and $y$ its corresponding MOS label. A speech quality dataset can be represented as $D = \{(x_n, y_n)\}_{n=1}^{N}$, where $N$ is the total number of clips. Our goal is to design a regressor function $f_{\boldsymbol{\theta}}(x)$ with parameters $\boldsymbol{\theta}$ that predicts $y$ for a given input $x$. The regressor is typically implemented as a deep neural network (DNN), which learns its parameters in a data-driven manner.
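The parameters are typically estimated by minimizing a regression loss over $D$; a common choice, assumed here for illustration since this section does not specify the exact objective, is the mean squared error:

$\hat{\boldsymbol{\theta}} = \arg\min_{\boldsymbol{\theta}} \frac{1}{N} \sum_{n=1}^{N} \left( f_{\boldsymbol{\theta}}(x_n) - y_n \right)^2.$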
We use the SSL-based MOS prediction method of [5] as our baseline due to its design simplicity and high performance. The baseline performs a layer selection and is referred to as ‘SSL-Layer-MOS’ in this article, following the architectural design from [12] and a comprehensive study of different layers.
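A minimal sketch of the layer-selection idea is given below; the selected layer index and the linear head are illustrative assumptions rather than the exact baseline configuration.

```python
# Hedged sketch of the layer-selection idea behind SSL-Layer-MOS:
# expose all transformer layers of the SSL encoder, pick one layer
# (e.g., chosen on a validation set), mean-pool over time, and regress
# MOS with a linear head. Layer index and head are assumptions.
import torch
import torch.nn as nn

class SSLLayerMOS(nn.Module):
    def __init__(self, ssl_model, layer=9, ssl_dim=768):
        super().__init__()
        self.ssl = ssl_model
        self.layer = layer                 # validation-selected layer index
        self.head = nn.Linear(ssl_dim, 1)  # linear MOS regressor

    def forward(self, wav16):              # wav16: (B, T) at 16 kHz
        out = self.ssl(wav16, output_hidden_states=True)
        h = out.hidden_states[self.layer]  # (B, frames, ssl_dim)
        return self.head(h.mean(dim=1)).squeeze(-1)  # (B,) MOS estimates
```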