Exploring Perceptual Audio Quality Measurement on Stereo Processing Using the Open Dataset of Audio Quality


ODAQ (Open Dataset of Audio Quality) provides a comprehensive framework for exploring both monaural and binaural audio quality degradations across a range of distortion classes and signals, accompanied by subjective quality ratings. A recent update of ODAQ, focusing on the impact of stereo processing methods such as Mid/Side (MS) and Left/Right (LR), provides test signals and subjective ratings for the in-depth investigation of state-of-the-art objective audio quality metrics. Our evaluation results suggest that, while timbre-focused metrics often yield robust results under simpler conditions, their prediction performance tends to suffer under the conditions with a more complex presentation context. Our findings underscore the importance of modeling the interplay of bottom-up psychoacoustic processes and top-down contextual factors, guiding future research toward models that more effectively integrate both timbral and spatial dimensions of perceived audio quality.


💡 Research Summary

This paper investigates how well objective audio quality metrics predict degradations introduced by common stereo processing techniques, using the recently expanded Open Dataset of Audio Quality (ODAQ). The study centers on the distinction between Left/Right (LR) and Mid/Side (MS) processing. LR processing applies distortions independently to the left and right channels, while MS processing first encodes the signal into sum (mid) and difference (side) components, applies the degradation to those components, and then decodes back to stereo. This mimics artifacts from stereo coding techniques used in real-world audio codecs, such as parametric stereo.
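The two processing paths described above can be sketched in a few lines. This is a minimal illustration, not the ODAQ processing chain: the `quantize` function is a hypothetical stand-in for a real distortion, and the 1/2 scaling in the mid/side encoder is one common convention that is assumed here rather than taken from the paper.

```python
from typing import Callable, List, Tuple

Signal = List[float]
Degradation = Callable[[Signal], Signal]


def quantize(x: Signal, step: float = 0.25) -> Signal:
    """Hypothetical stand-in degradation: uniform quantization with a fixed step."""
    return [step * round(v / step) for v in x]


def ms_encode(left: Signal, right: Signal) -> Tuple[Signal, Signal]:
    """Encode L/R into mid (sum) and side (difference); the 1/2 scaling is an assumption."""
    mid = [(l + r) / 2 for l, r in zip(left, right)]
    side = [(l - r) / 2 for l, r in zip(left, right)]
    return mid, side


def ms_decode(mid: Signal, side: Signal) -> Tuple[Signal, Signal]:
    """Decode mid/side back to L/R (exact inverse of ms_encode)."""
    left = [m + s for m, s in zip(mid, side)]
    right = [m - s for m, s in zip(mid, side)]
    return left, right


def process_lr(left: Signal, right: Signal, degrade: Degradation) -> Tuple[Signal, Signal]:
    """LR processing: degrade each channel independently."""
    return degrade(left), degrade(right)


def process_ms(left: Signal, right: Signal, degrade: Degradation) -> Tuple[Signal, Signal]:
    """MS processing: encode, degrade mid and side, then decode back to stereo."""
    mid, side = ms_encode(left, right)
    return ms_decode(degrade(mid), degrade(side))
```

Because the degradation is applied before decoding in the MS path, an error in the side component ends up in both output channels with opposite sign, which is one way such processing can alter the stereo image rather than only the timbre of each channel.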

The researchers generated a wide range of test stimuli by applying two types of perceptual distortions, Quantization Noise (QN) and Spectral Holes (SH), at multiple intensity levels to various audio excerpts (including solo instruments, music mixes, and dialogue with music) using both LR and MS methods. Subjective quality ratings were collected via a MUSHRA (MUltiple Stimuli with Hidden Reference and Anchor) listening test with trained listeners. A key design aspect was the inclusion of different “presentation contexts”: tests where only LR- or only MS-processed versions were presented, and “mixed” tests where both types were presented together for direct comparison.

Against this benchmark, the study evaluated a suite of state-of-the-art objective quality metrics. These included timbre-focused models like PEAQ (and its variants 2f-model, PEAQ-CSM), PEMO-Q, and HAAQI, which typically process channels independently. It also included models with explicit binaural processing for spatial quality assessment, such as MoBi-Q and its efficient derivative eMoBi-Q, which extract interaural level differences (ILD), time differences (ITD), and coherence. Furthermore, the authors created binaural extensions of the standard PEAQ model by integrating external binaural cue extraction models.
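To make the three binaural cues concrete, the following is a minimal broadband sketch of how ILD, ITD, and coherence can be estimated from a pair of channel signals. It is not the MoBi-Q or eMoBi-Q implementation: auditory models of this kind typically compute such cues per frequency band after a peripheral filterbank, whereas this sketch works on the full-band signal, and all function names are hypothetical.

```python
import math
from typing import List, Tuple

Signal = List[float]


def ild_db(left: Signal, right: Signal, eps: float = 1e-12) -> float:
    """Level difference in dB: energy of left relative to right."""
    e_left = sum(v * v for v in left)
    e_right = sum(v * v for v in right)
    return 10.0 * math.log10((e_left + eps) / (e_right + eps))


def cross_correlation(left: Signal, right: Signal, lag: int) -> float:
    """Sum over n of left[n] * right[n + lag], for equal-length signals."""
    n = len(left)
    start, stop = max(0, -lag), min(n, n - lag)
    return sum(left[i] * right[i + lag] for i in range(start, stop))


def itd_and_coherence(left: Signal, right: Signal, max_lag: int) -> Tuple[int, float]:
    """Return (lag in samples, coherence).

    Coherence is taken as the maximum absolute normalized cross-correlation
    within +/- max_lag; the lag at that maximum is the time-difference estimate.
    A positive lag means the right channel is delayed relative to the left.
    """
    norm = math.sqrt(sum(v * v for v in left) * sum(v * v for v in right))
    if norm == 0.0:
        return 0, 0.0
    best_lag, best_value = 0, -1.0
    for lag in range(-max_lag, max_lag + 1):
        value = abs(cross_correlation(left, right, lag)) / norm
        if value > best_value:
            best_lag, best_value = lag, value
    return best_lag, best_value
```

Dividing the returned lag by the sampling rate gives the time difference in seconds. A metric that processes each channel independently never forms these cross-channel quantities, which is the distinction the paragraph above draws between the two families of models.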

The evaluation showed that timbre-focused metrics performed robustly under simpler conditions (e.g., LR-only contexts), but their prediction accuracy tended to degrade in scenarios involving MS processing or the mixed presentation context. This performance drop is attributed to their inability to adequately model the inter-channel dependencies and spatial image distortions introduced by MS processing, which affects the side channel carrying stereo width information. In contrast, metrics incorporating binaural models (MoBi-Q, eMoBi-Q) generally showed better and more consistent performance across all test conditions, as they could account for both timbral and spatial aspects of degradation.
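A per-context comparison of this kind can be sketched as follows. The summary does not state which performance measure the authors used, so Pearson correlation between subjective scores and metric predictions is an assumption here, and the record layout `(context, subjective_score, predicted_score)` is hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Record = Tuple[str, float, float]  # (context, subjective_score, predicted_score)


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def correlation_by_context(records: Iterable[Record]) -> Dict[str, float]:
    """Group records by presentation context and correlate within each group."""
    grouped: Dict[str, Tuple[List[float], List[float]]] = defaultdict(lambda: ([], []))
    for context, subjective, predicted in records:
        grouped[context][0].append(subjective)
        grouped[context][1].append(predicted)
    return {ctx: pearson(subj, pred) for ctx, (subj, pred) in grouped.items()}
```

Reporting one coefficient per context (for example LR-only, MS-only, and mixed) rather than a single pooled value is what makes a context-dependent drop in prediction accuracy visible.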

The paper concludes that accurate perceptual quality assessment for stereo audio requires moving beyond channel-independent timbre analysis. Future objective models must effectively integrate bottom-up psychoacoustic processes for timbre with binaural auditory models for spatial integrity. Moreover, presentation context influenced the subjective scores: the same stimulus can be rated differently depending on what it is compared to, which highlights the potential role of top-down cognitive factors in quality judgment. This suggests that perceptually aligned models may need to consider not just the signal properties but also the evaluative context, guiding the field toward more holistic models of auditory perception.

