Synthetic Data Augmentation for Medical Audio Classification: A Preliminary Evaluation
Medical audio classification remains challenging due to low signal-to-noise ratios, subtle discriminative features, and substantial intra-class variability, often compounded by class imbalance and limited training data. Synthetic data augmentation has been proposed as a strategy to mitigate these constraints; however, prior studies report inconsistent methodological approaches and mixed empirical results. In this preliminary study, we explore the impact of synthetic augmentation on respiratory sound classification using a baseline deep convolutional neural network trained on a moderately imbalanced dataset (approximately 77%:23%). Three generative augmentation strategies (variational autoencoders, generative adversarial networks, and diffusion models) were assessed under controlled experimental conditions. The baseline model without augmentation achieved an F1-score of 0.645. No individual augmentation strategy produced performance gains; several configurations showed neutral or degraded classification performance. Only an ensemble of augmented models yielded a modest improvement in F1-score (0.664). These findings suggest that, for medical audio classification, synthetic augmentation may not consistently enhance performance when applied to a standard CNN classifier. Future work should focus on delineating the task-specific data characteristics, model-augmentation compatibility, and evaluation frameworks necessary for synthetic augmentation to be effective in medical audio applications.
💡 Research Summary
This paper investigates whether synthetic data augmentation can improve the classification of respiratory sounds, specifically COVID‑19 related cough recordings, using a standard convolutional neural network (CNN). The authors selected the publicly available Coswara dataset, extracting 4,963 cough samples labeled as healthy (3,847) or COVID‑19 positive (1,116), yielding an approximate 3.4:1 class imbalance. Audio recordings were resampled to 16 kHz, trimmed or zero‑padded to a fixed 3‑second duration, and transformed into 128‑band mel‑spectrograms with standard STFT parameters. Spectrograms were normalized per channel using z‑score statistics computed on the training set.
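The fixed-length and normalization steps described above can be sketched with NumPy. This is a minimal illustration under the stated settings (16 kHz, 3 s clips, training-set z-score statistics); function names are ours, not the authors':

```python
import numpy as np

SR = 16_000               # target sample rate (Hz)
DURATION_S = 3            # fixed clip length in seconds
TARGET_LEN = SR * DURATION_S

def fix_length(y: np.ndarray) -> np.ndarray:
    """Trim or zero-pad a waveform to exactly 3 s at 16 kHz."""
    if len(y) >= TARGET_LEN:
        return y[:TARGET_LEN]
    return np.pad(y, (0, TARGET_LEN - len(y)))

def zscore_normalize(spec: np.ndarray, mean: float, std: float) -> np.ndarray:
    """Normalize a mel-spectrogram with statistics computed on the training set."""
    return (spec - mean) / (std + 1e-8)

# Example: a 2 s clip is zero-padded out to the fixed 3 s length.
clip = np.random.randn(SR * 2)
fixed = fix_length(clip)
assert fixed.shape == (TARGET_LEN,)
```

The mel-spectrogram extraction itself would typically use a standard audio library on top of these helpers; the key point is that normalization statistics come from the training split only, avoiding leakage into validation and test sets.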
The baseline classifier is a four‑block CNN (32, 64, 128, 256 filters) with batch normalization, ReLU activation, and 2×2 max‑pooling. It was trained from random initialization for up to 100 epochs using cross‑entropy loss and the Adam optimizer (learning rate = 0.001 with cosine annealing). Early stopping was triggered when the validation macro‑averaged F1 score failed to improve for 15 epochs.
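The early-stopping rule described (halt after 15 epochs without validation macro-F1 improvement) amounts to a simple patience counter. A minimal sketch, not the authors' code:

```python
class EarlyStopping:
    """Stop training when the monitored metric fails to improve for `patience` epochs."""

    def __init__(self, patience: int = 15):
        self.patience = patience
        self.best = float("-inf")
        self.counter = 0

    def step(self, metric: float) -> bool:
        """Record one epoch's validation metric; return True when training should stop."""
        if metric > self.best:
            self.best = metric
            self.counter = 0
        else:
            self.counter += 1
        return self.counter >= self.patience
```

In a training loop, `step()` would be called once per epoch with the validation macro-F1, and the loop would break as soon as it returns `True`.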
To generate synthetic data, three contemporary generative models were trained exclusively on the minority class (COVID‑positive) recordings:
- Variational Auto‑Encoder (VAE) – a convolutional encoder‑decoder architecture learning a 128‑dimensional latent space. Training employed a composite loss of mean‑squared reconstruction error plus a KL‑divergence term (β = 0.1) for 200 epochs with Adam (lr = 0.0001). Synthetic samples were produced by sampling from a standard normal prior and decoding.
- Wasserstein GAN with Gradient Penalty (WGAN‑GP) – a generator built from dense layers followed by four transposed convolutional up‑sampling blocks, and a discriminator with four convolutional blocks and spectral normalization. Training ran for 300 epochs, with a critic‑to‑generator update ratio of 5:1, RMSprop optimizer (lr = 5e‑5), and gradient penalty λ = 10.
- Denoising Diffusion Probabilistic Model (DDPM) – a U‑Net‑style architecture incorporating multi‑head self‑attention at the bottleneck. The forward diffusion added Gaussian noise over 1,000 timesteps following a linear variance schedule (β₁ = 0.0001, β_T = 0.02). The model was trained for 400 epochs using AdamW (lr = 0.0001, weight decay = 0.01). During inference, Denoising Diffusion Implicit Model (DDIM) sampling with 50 steps generated the synthetic spectrograms.
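The DDPM's linear variance schedule and closed-form forward noising can be sketched in a few lines of NumPy, using the stated settings (T = 1,000, β₁ = 0.0001, β_T = 0.02). This is our illustration of the standard DDPM forward process, not the authors' implementation:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)      # linear variance schedule beta_1 .. beta_T
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)         # cumulative products, alpha-bar_t

def q_sample(x0: np.ndarray, t: int, rng: np.random.Generator) -> np.ndarray:
    """Closed-form forward noising: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
```

By the final timestep, `alpha_bars[-1]` is near zero, so `x_T` is essentially pure Gaussian noise; the U-Net is trained to predict the added noise at a randomly sampled `t`, and DDIM sampling then reverses the process in far fewer steps (here, 50) at generation time.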
Each generative model produced enough synthetic samples to increase the minority class size by 50 % (adding 558 synthetic recordings). Synthetic data were only added to the training split; validation and test sets remained purely real.
Performance was evaluated on the held‑out test set using macro‑averaged F1 score as the primary metric and AUROC as a secondary metric. The baseline CNN achieved an F1 of 0.645 and AUROC of 0.745. Augmentation results were as follows:
- VAE‑augmented training: F1 = 0.646 (+0.001), AUROC = 0.748 (+0.003).
- GAN‑augmented training: F1 = 0.609 (‑0.036), AUROC = 0.726 (‑0.019).
- Diffusion‑augmented training: F1 = 0.644 (‑0.001), AUROC = 0.746 (‑0.001).
Thus, individual synthetic augmentation did not yield meaningful gains; the GAN even degraded performance.
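For reference, the macro-averaged F1 used as the primary metric is the unweighted mean of per-class F1 scores, which gives the minority (COVID-positive) class equal weight. A minimal sketch, not the authors' evaluation code:

```python
import numpy as np

def macro_f1(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Macro-averaged F1: unweighted mean of per-class F1 scores."""
    scores = []
    for cls in np.unique(y_true):
        tp = np.sum((y_pred == cls) & (y_true == cls))
        fp = np.sum((y_pred == cls) & (y_true != cls))
        fn = np.sum((y_pred != cls) & (y_true == cls))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return float(np.mean(scores))
```

Unlike accuracy, this metric cannot be inflated by always predicting the majority (healthy) class, which is why it is a sensible primary metric for a 3.4:1 imbalanced task.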
The authors then constructed an ensemble by averaging the COVID‑positive class probabilities from four independently trained models (baseline, VAE‑augmented, GAN‑augmented, Diffusion‑augmented). The ensemble achieved an F1 of 0.664 (+0.019) and AUROC of 0.761 (+0.016), the best results among all configurations. This suggests that while synthetic data may not improve a single model’s learning, the diversity introduced by different generative approaches can be leveraged through ensembling to obtain modest performance improvements.
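The ensemble described above amounts to averaging the positive-class probabilities of the four models before thresholding. A minimal sketch (the probability values below are illustrative, not from the paper):

```python
import numpy as np

def ensemble_predict(prob_lists, threshold: float = 0.5):
    """Average COVID-positive probabilities across models, then threshold."""
    probs = np.mean(np.stack(prob_lists), axis=0)
    return (probs >= threshold).astype(int), probs

# Four models' positive-class probabilities for three test clips (illustrative values).
p_baseline  = np.array([0.40, 0.70, 0.55])
p_vae       = np.array([0.45, 0.65, 0.60])
p_gan       = np.array([0.35, 0.80, 0.40])
p_diffusion = np.array([0.50, 0.75, 0.55])

labels, probs = ensemble_predict([p_baseline, p_vae, p_gan, p_diffusion])
```

Averaging helps only when the models' errors are at least partly uncorrelated, which is consistent with the authors' explanation that the differently augmented models make different mistakes.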
In the discussion, the authors attribute the limited benefit of direct augmentation to several factors: (1) respiratory sounds have subtle, overlapping acoustic signatures, making it difficult for generative models trained on a small, noisy minority set to capture clinically relevant variability; (2) synthetic samples may replicate dominant patterns or noise present in the source recordings rather than novel disease‑specific features; (3) the modest size of the minority class restricts the diversity that a generative model can learn. They also note that the ensemble’s success likely stems from uncorrelated error patterns across the differently augmented models.
Limitations highlighted include reliance on a single dataset and task, evaluation of only one baseline CNN architecture, and lack of repeated runs or statistical significance testing. The authors propose future directions such as using synthetic data for pre‑training or self‑supervised representation learning, incorporating physiologically informed priors into generative models, improving source data quality (e.g., noise reduction, standardized recording protocols), and exploring the interaction between synthetic data and adversarial robustness.
In conclusion, the study finds that current generative models do not reliably improve respiratory sound classification when synthetic samples are naively mixed into training data for a standard CNN. However, ensembles that exploit the diversity introduced by multiple augmentation strategies can achieve modest gains, albeit with increased computational cost. The work underscores the need for rigorous evaluation frameworks and task‑specific design when considering synthetic data augmentation for medical audio applications.