In this study, we have presented a novel approach to predicting the Short-Time Objective Intelligibility (STOI) metric using a bottleneck transformer architecture. Traditional methods for calculating STOI typically require clean reference speech, which limits their applicability in the real world. To address this limitation, deep learning-based nonintrusive speech assessment models have garnered significant interest. Many studies have achieved commendable performance, but there is room for further improvement. We propose a bottleneck transformer that incorporates convolution blocks for learning frame-level features and a multi-head self-attention (MHSA) layer to aggregate the information. These components enable the transformer to focus on the key aspects of the input data. Our model shows higher correlation and lower mean squared error, in both seen and unseen scenarios, than a state-of-the-art model that uses self-supervised learning (SSL) and spectral features as inputs.
Speech assessment refers to the process of evaluating various attributes of speech signals, such as quality and intelligibility. Speech assessment metrics are indicators that quantitatively measure specific attributes of speech signals. Speech assessment is mainly divided into two categories: subjective assessment, which requires human listeners, and objective assessment, which does not. Objective assessment is further divided into two subcategories, intrusive and nonintrusive assessment. The former requires a clean reference signal to calculate the assessment score, while the latter does not. In most cases where a large amount of data is available, the clean reference signal is not, so neither subjective nor intrusive assessment is feasible. To overcome this problem, several approaches have been proposed to estimate speech intelligibility as surrogates for human listening tests and intrusive assessment.
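To make the distinction concrete, the sketch below computes an intrusive score with the open-source pystoi package; the file names are illustrative. The clean reference waveform is an explicit input, which is exactly what nonintrusive predictors must do without.

```python
# Intrusive assessment sketch: STOI needs the clean reference signal
# alongside the degraded one. Uses the open-source pystoi package;
# file names are illustrative.
import soundfile as sf
from pystoi import stoi

clean, fs = sf.read("clean_utterance.wav")       # clean reference (often unavailable)
degraded, _ = sf.read("degraded_utterance.wav")  # signal under test

# STOI returns a score roughly in [0, 1]; higher means more intelligible.
score = stoi(clean, degraded, fs, extended=False)
print(f"STOI: {score:.3f}")
```

A nonintrusive predictor, by contrast, must estimate such a score from the degraded signal alone.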
The researchers in [1] proposed Quality-Net, which uses the magnitude spectrogram as input to bidirectional long short-term memory (BiLSTM) modules. It estimates the Perceptual Evaluation of Speech Quality (PESQ) score [2] at the utterance level, using a weighted sum of utterance-level and frame-level mean squared errors as the objective function. The researchers in [3] introduced STOI-Net, which also uses the magnitude spectrogram as input. STOI-Net combines convolutional neural networks (CNN) and bidirectional long short-term memory (BiLSTM) with a multiplicative attention mechanism (CNN-BiLSTM-ATTN), trained with the same objective function employed by Quality-Net [1]. STOI-Net's predictions showed a high correlation with the ground-truth STOI scores. Further works adopted a multi-task setup, jointly predicting objective evaluation scores, such as the speech transmission index (STI) and short-time objective intelligibility (STOI) [4], and subjective ratings from human listening tests. In MOSA-Net [5], cross-domain features (spectral and temporal features) and latent representations from a self-supervised learning (SSL) HuBERT [6] model were used to predict objective quality and intelligibility scores simultaneously. MOSA-Net can quite accurately predict objective quality (PESQ) and intelligibility (STOI) scores. Later, an improved version of MOSA-Net, called MTI-Net [7], was developed to simultaneously predict subjective intelligibility (SI), STOI, and word error rate (WER) scores.
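For reference, the utterance-plus-frame-level objective shared by Quality-Net and STOI-Net can be sketched as follows; the notation is ours rather than the original papers', and the exact weighting may differ in detail.

```latex
% Sketch of a weighted utterance- plus frame-level MSE objective (our notation):
% Q_s: ground-truth score of utterance s; \hat{Q}_s: predicted utterance score;
% \hat{q}_{s,t}: frame-level prediction; T_s: number of frames; \alpha: frame weight.
O = \frac{1}{S}\sum_{s=1}^{S}\left[\left(\hat{Q}_s - Q_s\right)^2
    + \frac{\alpha}{T_s}\sum_{t=1}^{T_s}\left(\hat{q}_{s,t} - Q_s\right)^2\right]
```

The frame-level term acts as a regularizer, pushing every frame prediction toward the utterance-level label.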
Considerable research has also been devoted to mean opinion score (MOS) prediction. Recent efforts include MOS-Net [8], a CNN-BiLSTM-based model designed to estimate the quality of speech. MB-Net [9] uses two separate networks to predict the mean quality score of an utterance and the difference between the mean score and an individual listener's score. QUAL-Net [10] adopts the same overall architecture and features as MTI-Net [7] but uses a simpler CNN for feature extraction.
Further work has been done in the medical field, where DNN-based models are used in hearing aids (HA) to predict evaluation metrics such as the Hearing Aid Speech Quality Index (HASQI) [11] and the Hearing Aid Speech Perception Index (HASPI) [12]. MBI-Net [13] resembles MTI-Net [7] and takes spectral features along with a hearing-loss pattern as inputs. It has two branches that use different input channels; each is fed into a feature extractor that combines spectral, learnable filter bank (LFB), and SSL features to estimate subjective intelligibility scores. MBI-Net+ [14], an enhanced version, incorporates HASPI in its objective function to improve intelligibility prediction. It uses Whisper model embeddings and speech metadata as inputs and employs a classifier to identify speech signals enhanced by various methods.
In this study, we propose a model for STOI prediction that combines a convolution block (conv block), a bottleneck transformer [15], and dense layers. The conv block extracts and refines the input features. The bottleneck transformer captures short- and long-term contexts while removing redundant information. The dense layers predict the STOI score. Experimental results show that the predicted scores correlate more strongly with the ground-truth STOI scores in both seen conditions (as explained in Section V) and unseen conditions (where test speakers and utterances are not involved in training). These results confirm that the proposed model performs better than the baseline model.
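A minimal PyTorch sketch of this pipeline is given below. It reflects our reading of the description above rather than the authors' released code: all layer sizes are illustrative, and the positional encodings used in [15] are omitted for brevity.

```python
# Sketch of the proposed pipeline (illustrative, not the authors' code):
# a conv block extracts frame-level features, a bottleneck-transformer-style
# block (1x1 conv -> MHSA -> 1x1 conv with a residual connection, after [15])
# aggregates context, and dense layers regress the utterance-level STOI score.
import torch
import torch.nn as nn

class BottleneckTransformerBlock(nn.Module):
    def __init__(self, channels, bottleneck=64, heads=4):
        super().__init__()
        self.reduce = nn.Conv1d(channels, bottleneck, kernel_size=1)
        self.mhsa = nn.MultiheadAttention(bottleneck, heads, batch_first=True)
        self.expand = nn.Conv1d(bottleneck, channels, kernel_size=1)
        self.norm = nn.BatchNorm1d(channels)

    def forward(self, x):                      # x: (batch, channels, time)
        h = self.reduce(x).transpose(1, 2)     # -> (batch, time, bottleneck)
        h, _ = self.mhsa(h, h, h)              # global self-attention over frames
        h = self.expand(h.transpose(1, 2))     # -> (batch, channels, time)
        return torch.relu(self.norm(h) + x)    # residual connection

class STOIPredictor(nn.Module):
    def __init__(self, n_features=257, channels=128):
        super().__init__()
        self.conv_block = nn.Sequential(       # frame-level feature extraction
            nn.Conv1d(n_features, channels, kernel_size=3, padding=1),
            nn.BatchNorm1d(channels),
            nn.ReLU(),
        )
        self.bot = BottleneckTransformerBlock(channels)
        self.head = nn.Sequential(             # dense layers -> STOI in [0, 1]
            nn.Linear(channels, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid()
        )

    def forward(self, spec):                   # spec: (batch, n_features, time)
        h = self.bot(self.conv_block(spec))
        return self.head(h.mean(dim=2))        # average over time, then predict

model = STOIPredictor()
scores = model(torch.randn(2, 257, 300))       # two spectrograms, 300 frames each
```

The 1x1 convolutions around the MHSA layer shrink and restore the channel dimension, which is what keeps the attention cheap while the residual path preserves the refined conv features.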
The remainder of the paper is organized as follows. Section II reviews the datasets used, Section III presents the related works, Section IV presents the proposed method, Section V presents the experiments and results, and Section VI concludes with a discussion of future prospects of this work.