Geometry-Aware Optimization for Respiratory Sound Classification: Enhancing Sensitivity with SAM-Optimized Audio Spectrogram Transformers

Reading time: 5 minutes

📝 Original Info

  • Title: Geometry-Aware Optimization for Respiratory Sound Classification: Enhancing Sensitivity with SAM-Optimized Audio Spectrogram Transformers
  • ArXiv ID: 2512.22564
  • Date: 2025-12-27
  • Authors: Atakan Işık, Selin Vulga Işık, Ahmet Feridun Işık, Mahşuk Taylan

📝 Abstract

Respiratory sound classification is hindered by the limited size, high noise levels, and severe class imbalance of benchmark datasets like ICBHI 2017. While Transformer-based models offer powerful feature extraction capabilities, they are prone to overfitting and often converge to sharp minima in the loss landscape when trained on such constrained medical data. To address this, we introduce a framework that enhances the Audio Spectrogram Transformer (AST) using Sharpness-Aware Minimization (SAM). Instead of merely minimizing the training loss, our approach optimizes the geometry of the loss surface, guiding the model toward flatter minima that generalize better to unseen patients. We also implement a weighted sampling strategy to handle class imbalance effectively. Our method achieves a state-of-the-art score of 68.10% on the ICBHI 2017 dataset, outperforming existing CNN and hybrid baselines. More importantly, it reaches a sensitivity of 68.31%, a crucial improvement for reliable clinical screening. Further analysis using t-SNE and attention maps confirms that the model learns robust, discriminative features rather than memorizing background noise.
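If the reported score follows the standard ICBHI challenge convention (the arithmetic mean of sensitivity and specificity, which is an assumption here since the abstract does not define the metric), the reported Score of 68.10% and sensitivity of 68.31% jointly pin down the specificity. A quick back-of-the-envelope check:

```python
# ICBHI score under the conventional definition: Score = (Se + Sp) / 2.
# Assumption: the paper uses this standard challenge metric; the abstract
# reports Score and Se but not Sp, so the Sp below is derived, not quoted.
def icbhi_score(se, sp):
    return (se + sp) / 2.0

se = 68.31      # sensitivity reported in the abstract (%)
score = 68.10   # ICBHI score reported in the abstract (%)

sp_implied = 2.0 * score - se   # solve Score = (Se + Sp) / 2 for Sp
print(round(sp_implied, 2))     # 67.89
```

This suggests sensitivity and specificity are nearly balanced, consistent with the abstract's emphasis on improving sensitivity without sacrificing the overall score.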

📄 Full Content

Geometry-Aware Optimization for Respiratory Sound Classification: Enhancing Sensitivity with SAM-Optimized Audio Spectrogram Transformers

Atakan Işık1, Selin Vulga Işık1, Ahmet Feridun Işık2, Mahşuk Taylan3
1 Biomedical Engineering Department, Başkent University, Turkey
2 Thoracic Surgery Department, Gaziantep University, Turkey
3 Chest Diseases Department, Gaziantep University, Turkey

Keywords: Lung Sound Analysis, Audio Spectrogram Transformer, SAM, Imbalanced Learning, ICBHI 2017.

INTRODUCTION

Respiratory diseases remain a leading cause of morbidity and mortality worldwide, necessitating accurate and timely diagnosis for effective patient care.
While modern imaging techniques like X-ray and CT scans provide detailed anatomical information, pulmonary auscultation retains its status as the most fundamental and accessible diagnostic method [1, 2]. In clinical practice, distinguishing conditions such as pneumonia from heart failure often relies on the subtle characteristics of breath sounds, even when radiological findings are ambiguous. However, the efficacy of auscultation is heavily dependent on the clinician’s hearing sensitivity and experience [1]. Sound transmission through the thorax encounters complex biological barriers, including tracheobronchial cartilages and varying tissue densities. Consequently, factors such as muscular hypertrophy or obesity can dampen acoustic signals, making manual detection of pathological sounds (specifically crackles and wheezes) highly subjective and prone to inter-observer variability.

To standardize interpretation, Computer-Aided Diagnosis (CAD) systems using Deep Learning (DL) have emerged as a powerful tool. Early approaches largely relied on Convolutional Neural Networks (CNNs), such as ResNet or VGG, trained on spectrogram representations of lung sounds [3, 4]. While CNNs are effective at extracting local features, they often struggle to capture the long-range temporal dependencies that are crucial for distinguishing continuous respiratory cycles from transient artifacts. Recently, Transformer-based architectures, particularly the Audio Spectrogram Transformer (AST), have demonstrated superior performance in audio tasks by leveraging self-attention mechanisms to model global context [5].

Despite their potential, applying Transformers to respiratory sound analysis presents a significant engineering challenge. Benchmark datasets, such as the ICBHI 2017 Challenge dataset, are characterized by limited sample sizes, severe class imbalance, and high levels of background noise, including stethoscope friction, heartbeats, and speech [2].
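The paper does not spell out its weighted sampling scheme in the text shown here; a common realization is inverse-class-frequency sampling, where each recording is drawn with probability inversely proportional to its class count so that rare classes appear as often as the dominant "normal" class. A minimal sketch, using hypothetical class counts rather than the actual ICBHI statistics:

```python
import random
from collections import Counter

# Hypothetical class distribution for illustration only; these are NOT
# the real ICBHI 2017 counts. "normal" dominates, mimicking the severe
# imbalance described in the text.
labels = ["normal"] * 800 + ["crackle"] * 120 + ["wheeze"] * 60 + ["both"] * 20

counts = Counter(labels)
# Inverse-frequency weight: each sample's draw probability is proportional
# to 1 / (count of its class), so every class has equal total probability.
weights = [1.0 / counts[y] for y in labels]

random.seed(0)
batch = random.choices(range(len(labels)), weights=weights, k=10_000)
drawn = Counter(labels[i] for i in batch)
# Each of the four classes is now drawn roughly 2,500 times,
# despite the 40:1 raw imbalance between "normal" and "both".
```

In a PyTorch training loop the same idea is usually expressed with `torch.utils.data.WeightedRandomSampler` fed with these per-sample weights; the pure-Python version above just makes the mechanism explicit.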
Transformers, which typically lack the inductive biases of CNNs, require large amounts of data to generalize well. When trained on such constrained medical datasets, they are prone to overfitting and tend to converge to "sharp minima" in the loss landscape [6]. A model residing in a sharp minimum may perform well on training data but fail drastically when the input varies slightly, a common occurrence in clinical settings due to different recording devices or patient physiology.

In this study, we bridge the gap between clinical acoustic complexity and deep learning robustness. We propose a novel framework that integrates the Audio Spectrogram Transformer (AST) with Sharpness-Aware Minimization (SAM) [7]. Unlike standard optimizers such as SGD or Adam, which solely minimize the training loss value, SAM simultaneously minimizes the loss value and the sharpness of the loss landscape. This geometry-aware approach guides the model toward "flat minima," ensuring that the solution remains robust even in the presence of noisy or perturbed inputs. By combining this optimization strategy with a weighted sampling technique to handle class imbalance, our framework significantly improves diagnostic performance.
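SAM's two-step update, as described by Foret et al. [7], first climbs to the worst-case nearby point within a radius rho, then descends using the gradient evaluated at that perturbed point. The mechanics can be sketched on a toy quadratic loss; this is an illustration of the update rule only, not the paper's AST training code, and the loss, learning rate, and rho values are all illustrative:

```python
import numpy as np

# Toy loss L(w) = ||w||^2 / 2, chosen only so gradients are trivial.
def loss(w):
    return 0.5 * np.sum(w ** 2)

def grad(w):
    return w  # dL/dw for the toy loss

def sam_step(w, lr=0.1, rho=0.05):
    g = grad(w)
    # Step 1 (ascent): move to the approximate worst-case point within
    # an L2 ball of radius rho around the current weights.
    eps = rho * g / (np.linalg.norm(g) + 1e-12)
    # Step 2 (descent): apply the gradient taken at the perturbed point
    # to the ORIGINAL weights, penalizing sharp directions.
    g_sharp = grad(w + eps)
    return w - lr * g_sharp

w = np.array([3.0, -4.0])
for _ in range(100):
    w = sam_step(w)
# The iterate settles near the minimum; loss shrinks toward 0.
```

In practice this costs two forward/backward passes per update, which is the usual price quoted for SAM over plain SGD or Adam.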

