In this study, we have presented a novel approach to predicting the Short-Time Objective Intelligibility (STOI) metric using a bottleneck transformer architecture. Traditional methods for calculating STOI typically require clean reference speech, which limits their applicability in the real world. To address this limitation, deep learning-based nonintrusive speech assessment models have garnered significant interest. Many studies have achieved commendable performance, but there is room for further improvement. We propose a bottleneck transformer that incorporates convolution blocks for learning frame-level features and a multi-head self-attention (MHSA) layer to aggregate the information. These components enable the transformer to focus on the key aspects of the input data. Our model shows higher correlation and lower mean squared error, in both seen and unseen scenarios, than a state-of-the-art model that uses self-supervised learning (SSL) and spectral features as inputs.
Speech assessment refers to the process of evaluating various attributes of speech signals, such as quality and intelligibility. Speech assessment metrics are indicators that quantitatively measure specific attributes of speech signals. Speech assessment is mainly divided into two categories: subjective assessment, which requires human listeners, and objective assessment, which does not. Objective assessment is further divided into two subcategories, intrusive and nonintrusive assessment. The former requires a clean reference signal to calculate the assessment score, while the latter does not. In most cases where a large amount of data is available, the clean reference signal is not, so neither subjective nor intrusive assessment is feasible. To overcome this problem, several approaches have been proposed to estimate speech intelligibility as surrogates for human listening tests and intrusive assessment.
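To make the distinction concrete, the sketch below computes an intrusive score with the open-source pystoi package; the file names are illustrative. The clean reference waveform is an explicit input, which is exactly what nonintrusive predictors must do without.

```python
# Intrusive assessment sketch: STOI needs the clean reference signal
# alongside the degraded one. Uses the open-source pystoi package;
# file names are illustrative.
import soundfile as sf
from pystoi import stoi

clean, fs = sf.read("clean_utterance.wav")       # clean reference (often unavailable)
degraded, _ = sf.read("degraded_utterance.wav")  # signal under test

# STOI returns a score roughly in [0, 1]; higher means more intelligible.
score = stoi(clean, degraded, fs, extended=False)
print(f"STOI: {score:.3f}")
```

A nonintrusive predictor, by contrast, must estimate such a score from the degraded signal alone.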
The researchers in [1] proposed Quality-Net, which uses the magnitude spectrogram as input to bidirectional long short-term memory (BiLSTM) modules. It estimates the Perceptual Evaluation of Speech Quality (PESQ) score [2] at the utterance level, using a weighted sum of utterance-level and frame-level mean squared errors as the objective function. The researchers in [3] introduced STOI-Net, which also uses the magnitude spectrogram as input. STOI-Net combines convolutional neural networks (CNN) and bidirectional long short-term memory (BiLSTM) with a multiplicative attention mechanism (CNN-BiLSTM-ATTN), trained with the same objective function employed by Quality-Net [1]. STOI-Net's predictions showed a high correlation with the ground-truth STOI scores. Further works adopted a multi-task setup, jointly predicting objective evaluation scores, such as the speech transmission index (STI) and short-time objective intelligibility (STOI) [4], and subjective ratings from human listening tests. In MOSA-Net [5], cross-domain features (spectral and temporal features) and latent representations from a self-supervised learning (SSL) HuBERT [6] model were used to predict objective quality and intelligibility scores simultaneously. MOSA-Net can quite accurately predict objective quality (PESQ) and intelligibility (STOI) scores. Later, an improved version of MOSA-Net, called MTI-Net [7], was developed to simultaneously predict subjective intelligibility (SI), STOI, and word error rate (WER) scores.
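For reference, the utterance-plus-frame-level objective shared by Quality-Net and STOI-Net can be sketched as follows; the notation is ours rather than the original papers', and the exact weighting may differ in detail.

```latex
% Sketch of a weighted utterance- plus frame-level MSE objective (our notation):
% Q_s: ground-truth score of utterance s; \hat{Q}_s: predicted utterance score;
% \hat{q}_{s,t}: frame-level prediction; T_s: number of frames; \alpha: frame weight.
O = \frac{1}{S}\sum_{s=1}^{S}\left[\left(\hat{Q}_s - Q_s\right)^2
    + \frac{\alpha}{T_s}\sum_{t=1}^{T_s}\left(\hat{q}_{s,t} - Q_s\right)^2\right]
```

The frame-level term acts as a regularizer, pushing every frame prediction toward the utterance-level label.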
Considerable research has also been devoted to mean opinion score (MOS) prediction. Recent efforts include MOS-Net [8], a CNN-BiLSTM-based model designed to estimate the quality of speech. MB-Net [9] uses two separate networks to predict the mean quality score of an utterance and the difference between the mean score and an individual listener's score. QUAL-Net [10] adopts the same overall architecture and features as MTI-Net [7] but uses a simpler CNN for feature extraction.
Further work has been done in the medical field, where DNN-based models are used in hearing aids (HA) to predict evaluation metrics such as the Hearing Aid Speech Quality Index (HASQI) [11] and the Hearing Aid Speech Perception Index (HASPI) [12]. MBI-Net [13] resembles MTI-Net [7] and takes spectral features along with a hearing-loss pattern as inputs. It has two branches that use different input channels; each is fed into a feature extractor that combines spectral, learnable filter bank (LFB), and SSL features to estimate subjective intelligibility scores. MBI-Net+ [14], an enhanced version, incorporates HASPI in its objective function to improve intelligibility prediction. It uses Whisper model embeddings and speech metadata as inputs and employs a classifier to identify speech signals enhanced by various methods.
In this study, we propose a model for STOI prediction that combines a convolution block (conv block), a bottleneck transformer [15], and dense layers. The conv block extracts and refines the input features. The bottleneck transformer captures short- and long-term contexts while removing redundant information. The dense layers predict the STOI score. Experimental results show that the predicted scores correlate more strongly with the ground-truth STOI scores in both seen conditions (as explained in Section V) and unseen conditions (where test speakers and utterances are not involved in training). These results confirm that the proposed model performs better than the baseline model.
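A minimal PyTorch sketch of this pipeline is given below. It reflects our reading of the description above rather than the authors' released code: all layer sizes are illustrative, and the positional encodings used in [15] are omitted for brevity.

```python
# Sketch of the proposed pipeline (illustrative, not the authors' code):
# a conv block extracts frame-level features, a bottleneck-transformer-style
# block (1x1 conv -> MHSA -> 1x1 conv with a residual connection, after [15])
# aggregates context, and dense layers regress the utterance-level STOI score.
import torch
import torch.nn as nn

class BottleneckTransformerBlock(nn.Module):
    def __init__(self, channels, bottleneck=64, heads=4):
        super().__init__()
        self.reduce = nn.Conv1d(channels, bottleneck, kernel_size=1)
        self.mhsa = nn.MultiheadAttention(bottleneck, heads, batch_first=True)
        self.expand = nn.Conv1d(bottleneck, channels, kernel_size=1)
        self.norm = nn.BatchNorm1d(channels)

    def forward(self, x):                      # x: (batch, channels, time)
        h = self.reduce(x).transpose(1, 2)     # -> (batch, time, bottleneck)
        h, _ = self.mhsa(h, h, h)              # global self-attention over frames
        h = self.expand(h.transpose(1, 2))     # -> (batch, channels, time)
        return torch.relu(self.norm(h) + x)    # residual connection

class STOIPredictor(nn.Module):
    def __init__(self, n_features=257, channels=128):
        super().__init__()
        self.conv_block = nn.Sequential(       # frame-level feature extraction
            nn.Conv1d(n_features, channels, kernel_size=3, padding=1),
            nn.BatchNorm1d(channels),
            nn.ReLU(),
        )
        self.bot = BottleneckTransformerBlock(channels)
        self.head = nn.Sequential(             # dense layers -> STOI in [0, 1]
            nn.Linear(channels, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid()
        )

    def forward(self, spec):                   # spec: (batch, n_features, time)
        h = self.bot(self.conv_block(spec))
        return self.head(h.mean(dim=2))        # average over time, then predict

model = STOIPredictor()
scores = model(torch.randn(2, 257, 300))       # two spectrograms, 300 frames each
```

The 1x1 convolutions around the MHSA layer shrink and restore the channel dimension, which is what keeps the attention cheap while the residual path preserves the refined conv features.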
The remainder of the paper is organized as follows. Section II reviews the datasets used, Section III presents the related works, Section IV presents the proposed method, Section V presents the experiments and results, and Section VI concludes with a discussion of future prospects of this work.