TLDiffGAN: A Latent Diffusion-GAN Framework with Temporal Information Fusion for Anomalous Sound Detection

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

Existing generative models for unsupervised anomalous sound detection are limited by their inability to fully capture the complex feature distribution of normal sounds, while the potential of powerful diffusion models in this domain remains largely unexplored. To address this challenge, we propose a novel framework, TLDiffGAN, which consists of two complementary branches. One branch incorporates a latent diffusion model into the GAN generator for adversarial training, thereby making the discriminator’s task more challenging and improving the quality of generated samples. The other branch leverages pretrained audio model encoders to extract features directly from raw audio waveforms for auxiliary discrimination. This framework effectively captures feature representations of normal sounds from both raw audio and Mel spectrograms. Moreover, we introduce a TMixup spectrogram augmentation technique to enhance sensitivity to subtle and localized temporal patterns that are often overlooked. Extensive experiments on the DCASE 2020 Challenge Task 2 dataset demonstrate the superior detection performance of TLDiffGAN, as well as its strong capability in anomalous time-frequency localization.


💡 Research Summary

The paper introduces TLDiffGAN, a novel unsupervised anomalous sound detection (ASD) framework that combines a latent diffusion‑GAN (LDGAN) backbone with a parallel raw‑waveform encoder branch and a temporal‑mixup (TMixup) augmentation module. Existing generative ASD methods either rely solely on mel‑spectrograms—losing fine‑grained waveform information—or suffer from reconstruction blur (autoencoders), training instability and mode collapse (GANs), or over‑generalization (diffusion models). TLDiffGAN addresses these issues through three complementary components.

  1. LDGAN Backbone – The generator operates in a low‑dimensional latent space and performs stepwise denoising, as in latent diffusion models, but is guided by an adversarial discriminator at each diffusion step. The generator loss combines the standard diffusion noise‑prediction loss (L_noise) with a statistical matching loss (L_stat) that aligns deep discriminator features of real and generated spectrograms. The discriminator is regularized with spectral normalization and a gradient‑penalty term, stabilizing adversarial training while encouraging sharper, more realistic spectrogram synthesis.

  2. Parallel Raw‑Waveform Encoder – To recover information discarded by the time‑frequency transform, the second branch extracts high‑level embeddings directly from raw audio using pretrained self‑supervised audio models (AST, ATST, BEATs, EAT). These models, originally trained on large‑scale audio corpora, capture long‑range dependencies and subtle transient events typical of industrial machinery sounds. The embeddings (Z_wave) are concatenated with LDGAN’s mel‑spectrogram embeddings (Z_mel) to form a joint feature space Z.

  3. TMixup Temporal Augmentation – TMixup enhances sensitivity to localized temporal anomalies. Three pooling operations (max, average, power‑average) are weighted by learnable parameters (softmax‑normalized) to produce a pooled representation. A sigmoid‑activated temporal attention map is thresholded (τ sampled from U(0.2, 0.5)) to generate a binary mask that selects high‑attention time frames. Within these regions, a Mixup operation blends the original spectrogram with a masked version using a mixing coefficient λ drawn from a Beta(α, α) distribution. This localized augmentation forces the model to focus on the boundary of the normal data distribution, improving discrimination of subtle, transient anomalies.
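The generator objective described in the first component can be sketched as below. This is a minimal NumPy illustration, assuming L_stat matches the first two moments of discriminator features; the exact statistics used in the paper, and the gradient-penalty term (which requires autograd), are not shown:

```python
import numpy as np

def generator_loss(eps, eps_pred, feat_real, feat_fake, lambda_stat=1.0):
    """Sketch of the LDGAN generator objective: L_noise + lambda_stat * L_stat.

    eps, eps_pred    : true and predicted diffusion noise (same shape).
    feat_real/fake   : discriminator features of real / generated spectrograms
                       (batch x dim). Matching their mean and standard deviation
                       is one plausible form of the statistical matching loss.
    """
    # Standard diffusion noise-prediction loss (MSE between noises).
    l_noise = np.mean((eps - eps_pred) ** 2)
    # Moment-matching version of L_stat over discriminator features.
    mu_gap = np.mean(feat_real, axis=0) - np.mean(feat_fake, axis=0)
    sd_gap = np.std(feat_real, axis=0) - np.std(feat_fake, axis=0)
    l_stat = np.mean(mu_gap ** 2) + np.mean(sd_gap ** 2)
    return l_noise + lambda_stat * l_stat
```

The paper reports λ_stat = 1.0, which the default above mirrors.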
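The TMixup steps in the third component (weighted pooling, thresholded attention, localized Mixup) can be sketched roughly as follows. All function and parameter names here are illustrative, not taken from the paper's code, and the attention map is reduced to a simple sigmoid of the pooled frames:

```python
import numpy as np

def tmixup(spec, w_logits, alpha=0.2, rng=None):
    """Sketch of TMixup on a log-mel spectrogram `spec` (mels x frames)."""
    if rng is None:
        rng = np.random.default_rng(0)

    # 1. Three pooling ops over the mel axis: max, average, power-average (RMS).
    pools = np.stack([
        spec.max(axis=0),
        spec.mean(axis=0),
        np.sqrt(np.mean(spec ** 2, axis=0)),
    ])
    w = np.exp(w_logits) / np.exp(w_logits).sum()  # softmax over the 3 learnable weights
    pooled = w @ pools                              # weighted pooled representation, (frames,)

    # 2. Sigmoid temporal attention, thresholded by tau ~ U(0.2, 0.5).
    attn = 1.0 / (1.0 + np.exp(-pooled))
    tau = rng.uniform(0.2, 0.5)
    mask = (attn > tau).astype(spec.dtype)          # binary mask over time frames

    # 3. Mixup inside high-attention frames, lam ~ Beta(alpha, alpha);
    #    frames outside the mask are passed through unchanged.
    lam = rng.beta(alpha, alpha)
    masked = spec * (1.0 - mask)                    # spectrogram with selected frames zeroed
    return mask * (lam * spec + (1.0 - lam) * masked) + (1.0 - mask) * spec
```

With the paper's 128 × 313 log-mel input, `spec` would be a `(128, 313)` array and the mask selects whole time frames.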

The detection stage combines two complementary scores: (i) a reconstruction‑based score (s_r), computed as the Euclidean distance between the latent representations of the input and its LDGAN reconstruction, and (ii) an embedding‑based score derived from classical outlier detectors (K‑NN, LOF, GMM, SOS) applied to the joint feature space. For each machine type, the best‑performing detector (based on validation AUROC and partial AUROC) is selected, and the final anomaly score is taken from that detector.
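The two-part scoring described above might look like the following sketch, using a plain k-nearest-neighbor distance as a stand-in for whichever outlier detector is selected (function names are illustrative):

```python
import numpy as np

def reconstruction_score(z_in, z_rec):
    """s_r: Euclidean distance between the latent representations of the
    input and its LDGAN reconstruction."""
    return np.linalg.norm(z_in - z_rec)

def knn_score(z, z_train, k=5):
    """Embedding-based score: mean distance from a test embedding z to its
    k nearest normal training embeddings in the joint space Z. A stand-in
    for the K-NN / LOF / GMM / SOS detectors mentioned above."""
    d = np.linalg.norm(z_train - z, axis=1)
    return np.sort(d)[:k].mean()
```

In this sketch a test clip far from the cloud of normal training embeddings receives a larger score, which is the behavior the embedding branch relies on.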

Experiments were conducted on the DCASE 2020 Task 2 dataset (MIMII + ToyADMOS), covering six machine types (Fan, Pump, Slider, Valve, ToyCar, ToyConveyor). Input log‑mel spectrograms have dimensions 128 × 313; training uses Adam (lr = 1e‑4), batch size 512, 150 epochs, with λ_stat = 1.0 and λ_GP = 10. Evaluation metrics are AUROC and pAUROC computed over the low‑false‑positive range (FPR ∈ [0, 0.1]).

