TIGER: Time-frequency Interleaved Gain Extraction and Reconstruction for Efficient Speech Separation
In recent years, much speech separation research has focused primarily on improving model performance. However, for low-latency speech processing systems, high efficiency is equally important. Therefore, we propose a speech separation model with significantly reduced parameters and computational costs: the Time-frequency Interleaved Gain Extraction and Reconstruction network (TIGER). TIGER leverages prior knowledge to divide frequency bands and compresses frequency information. We employ a multi-scale selective attention module to extract contextual features, and introduce a full-frequency-frame attention module to capture both temporal and frequency contextual information. Additionally, to evaluate speech separation models more realistically in complex acoustic environments, we introduce a dataset called EchoSet. This dataset includes noise and more realistic reverberation (e.g., accounting for object occlusions and material properties), with speech from two speakers overlapping at random proportions. Experiments show that models trained on EchoSet generalize better to speech recorded in the physical world than models trained on other datasets, validating EchoSet's practical value. On EchoSet and real-world data, TIGER reduces the number of parameters by 94.3% and MACs by 95.3% while surpassing the state-of-the-art (SOTA) model TF-GridNet in performance.
💡 Research Summary
The paper introduces TIGER (Time‑frequency Interleaved Gain Extraction and Reconstruction), a lightweight speech‑separation network designed for low‑latency, low‑power applications, and a new dataset called EchoSet that more faithfully reproduces real‑world acoustic conditions.
Motivation. Recent speech‑separation research has largely prioritized performance, often at the cost of massive model size and computational demand. This makes deployment on edge devices (smartphones, hearing aids, IoT sensors) impractical. Moreover, most benchmark corpora (WSJ0‑2mix, WHAMR!) lack realistic reverberation and noise diversity, leading to poor generalization in the field.
Model Architecture. TIGER operates in the time‑frequency domain. An STFT encoder converts the mixture into a complex spectrogram X∈ℂ^{F×T}. The core novelty is a band‑split front‑end: the full frequency axis is divided into K sub‑bands of non‑uniform width G_k, reflecting prior knowledge about the perceptual importance of different bands. Each sub‑band’s real and imaginary parts are concatenated, normalized, and projected via a 1‑D convolution (kernel = 1) to a common channel dimension N, yielding Z∈ℝ^{N×K×T}. This reduces the spectral resolution while preserving the most informative components.
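The band-split step above can be sketched in NumPy. This is a minimal illustration, not the paper's actual configuration: the band widths, toy sizes, and random matrices (standing in for the learned kernel-1 convolutions and normalization) are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

F, T, N = 65, 40, 16                  # freq bins, frames, channel dim (toy sizes)
band_widths = [5, 10, 20, 30]         # non-uniform sub-band widths G_k, summing to F
assert sum(band_widths) == F

# complex mixture spectrogram X in C^{F x T}
X = rng.standard_normal((F, T)) + 1j * rng.standard_normal((F, T))

def band_split(X, band_widths, N, rng):
    """Split X into sub-bands, stack real/imag parts, normalize, and
    project each band to a common channel dimension N."""
    Z, start = [], 0
    for G_k in band_widths:
        band = X[start:start + G_k]                    # (G_k, T), complex
        feat = np.concatenate([band.real, band.imag])  # (2*G_k, T)
        feat = (feat - feat.mean()) / (feat.std() + 1e-8)  # stand-in for layer norm
        # a kernel-1 convolution over T is just a matrix multiply per band
        W_k = rng.standard_normal((N, 2 * G_k)) / np.sqrt(2 * G_k)
        Z.append(W_k @ feat)                           # (N, T)
        start += G_k
    return np.stack(Z, axis=1)                         # (N, K, T)

Z = band_split(X, band_widths, N, rng)
print(Z.shape)   # (16, 4, 40)
```

Note how the output Z has K = 4 sub-band slots regardless of each band's original width, which is where the spectral compression comes from.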
The separator consists of several Frequency‑Frame Interleaved (FFI) blocks that share parameters. Each FFI block contains two parallel paths: a Frequency Path and a Frame Path. Both paths apply the same two modules in sequence:
- Multi‑Scale Selective Attention (MSA). MSA first down‑samples the processing dimension (frequency for the Frequency Path, time for the Frame Path) through a stack of 1‑D convolutions with stride 2, generating multi‑scale feature maps E_d. After average‑pooling all scales to a common resolution, they are summed into a global feature G. A selective‑attention (SA) sub‑module then fuses local and global information: sigmoid‑scaled global features weight the local maps, and the weighted local maps are added back to the global context. This design captures both fine‑grained details and broad contextual cues while keeping the parameter count low.
- Full‑Frequency‑Frame Attention (F³A). F³A performs self‑attention across the sub‑band dimension. Queries, keys, and values are generated by 1×1 2‑D convolutions; the time dimension T and the projected channel dimension E are then flattened so that each sub‑band attends to every other sub‑band across the full temporal span. The resulting K×K attention matrix redistributes information among bands, effectively compensating for the earlier spectral compression.
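The MSA fusion logic described above can be sketched in NumPy. This is a toy version under stated assumptions: average pooling stands in for the strided 1-D convolutions, and all shapes are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def msa_fuse(x, depth=3):
    """Toy multi-scale selective attention over the last axis.
    Assumes the length is divisible by 2**(depth - 1)."""
    scales = [x]                                   # multi-scale maps E_d
    for _ in range(depth - 1):
        cur = scales[-1]
        # stride-2 average pooling stands in for a strided 1-D convolution
        scales.append(0.5 * (cur[..., ::2] + cur[..., 1::2]))
    L = scales[-1].shape[-1]
    # average-pool every scale to the coarsest resolution, then sum into G
    pooled = [s.reshape(*s.shape[:-1], L, -1).mean(-1) for s in scales]
    G = sum(pooled)
    # selective attention: sigmoid-scaled global features weight each local
    # map, and the weighted maps are added back to the global context
    return G + sum(sigmoid(G) * p for p in pooled)

x = np.arange(16, dtype=float).reshape(2, 8)       # (channels, time)
y = msa_fuse(x, depth=3)
print(y.shape)   # (2, 2)
```

The gating means each scale contributes in proportion to how strongly the global context activates, which is the "selective" part of the module.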
After the two paths, their outputs are summed (with residual connections) and passed to the next FFI block. Because the blocks share weights, the overall model remains extremely compact.
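The F³A sub-band attention can be sketched similarly. Here random projections stand in for the 1×1 convolutions, and the embedding width `E` is an assumed toy value; the point is the K×K attention matrix over sub-bands.

```python
import numpy as np

rng = np.random.default_rng(1)

def f3a(Z, E=8, rng=rng):
    """Toy full-frequency-frame attention: each sub-band attends to every
    other sub-band across the full time span."""
    N, K, T = Z.shape
    # random matrices stand in for the 1x1 convolutions producing Q, K, V
    Wq, Wk, Wv = (rng.standard_normal((E, N)) / np.sqrt(N) for _ in range(3))
    # project channels N -> E, then flatten channel and time dims: (K, E*T)
    Q  = np.einsum('en,nkt->ket', Wq, Z).reshape(K, -1)
    Kf = np.einsum('en,nkt->ket', Wk, Z).reshape(K, -1)
    V  = np.einsum('en,nkt->ket', Wv, Z).reshape(K, -1)
    scores = Q @ Kf.T / np.sqrt(Q.shape[-1])       # (K, K) scores over sub-bands
    A = np.exp(scores - scores.max(-1, keepdims=True))
    A /= A.sum(-1, keepdims=True)                  # row-wise softmax
    out = A @ V                                    # redistribute info among bands
    return out.reshape(K, E, T).transpose(1, 0, 2)  # back to (E, K, T)

Z = rng.standard_normal((16, 4, 40))               # (N, K, T) from the band split
Y = f3a(Z)
print(Y.shape)   # (8, 4, 40)
```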
Mask Generation and Reconstruction. The separator outputs a set of masks M_i∈ℝ^{F×T} for each speaker. Element‑wise multiplication with the original spectrogram X yields separated complex spectra H_i, which are transformed back to waveforms via inverse STFT.
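The encode-mask-decode pipeline can be demonstrated end to end with SciPy's STFT utilities. In this sketch, random masks stand in for a trained separator's output, and the sample rate and frame size are arbitrary choices.

```python
import numpy as np
from scipy.signal import stft, istft

rng = np.random.default_rng(2)
fs, nperseg = 8000, 256
mixture = rng.standard_normal(fs)                  # 1 s of toy "mixture"

# encoder: STFT to a complex spectrogram X in C^{F x T}
_, _, X = stft(mixture, fs=fs, nperseg=nperseg)

# stand-in for the separator: one random mask M_i per speaker
# (a trained model would predict these from the mixture)
masks = [rng.uniform(0, 1, X.shape) for _ in range(2)]

# decoder: apply each mask element-wise, then inverse STFT to a waveform
estimates = []
for M in masks:
    H = M * X                                      # separated complex spectrum H_i
    _, est = istft(H, fs=fs, nperseg=nperseg)
    estimates.append(est)

print(len(estimates), estimates[0].shape)
```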
EchoSet Dataset. To address the gap between synthetic benchmarks and real environments, the authors built EchoSet. It uses physics‑based room‑acoustic simulation that accounts for object occlusion, material absorption, and diverse room geometries. Two speakers are mixed with random overlap ratios, and a variety of background noises are added. This results in a corpus that exhibits a much broader distribution of reverberation times, signal‑to‑noise ratios, and spatial characteristics than existing datasets.
Experiments. TIGER was evaluated on Libri2Mix, LRS2‑2Mix, and EchoSet, and compared against TF‑GridNet (the current SOTA), TDANet, SepFormer, and other baselines. Results show:
- Parameter reduction: 94.3% fewer parameters than TF‑GridNet.
- Computational reduction: 95.3% fewer multiply‑accumulate operations (MACs).
- Performance: On EchoSet, TIGER improves SI‑SDR by ~5 dB over TF‑GridNet; on real‑world recordings, it achieves the highest PESQ and STOI scores among all tested models.
The gains are especially pronounced as the acoustic conditions become more challenging, confirming that the band‑split + FFI design preserves essential information while discarding redundancy.
Implications and Future Work. TIGER demonstrates that careful exploitation of prior knowledge (frequency importance) and a dual‑path attention mechanism can yield a model suitable for on‑device, real‑time speech separation without sacrificing quality. The EchoSet dataset also provides a valuable benchmark for future research aiming at robust, generalizable separation. Potential extensions include handling more than two speakers, integrating multi‑mic arrays, and applying quantization or neural‑architecture‑search techniques to further tailor the model for specific hardware constraints.
In summary, the paper makes two major contributions: a novel, highly efficient TF‑domain separation architecture (TIGER) and a realistic, richly varied dataset (EchoSet) that together push the field toward practical, deployable speech‑separation solutions.