SELEBI: Percussion-aware Time Stretching via Selective Magnitude Spectrogram Compression by Nonstationary Gabor Transform

SELEBI: Percussion-aware Time Stretching via Selective Magnitude Spectrogram Compression by Nonstationary Gabor Transform
Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Phase vocoder-based time-stretching is a widely used technique for the time-scale modification of audio signals. However, conventional implementations suffer from ``percussion smearing,’’ a well-known artifact that significantly degrades the quality of percussive components. We attribute this artifact to a fundamental time-scale mismatch between the temporally smeared magnitude spectrogram and the localized, newly generated phase. To address this, we propose SELEBI, a signal-adaptive phase vocoder algorithm that significantly reduces percussion smearing while preserving stability and the perfect reconstruction property. Unlike conventional methods that rely on heuristic processing or component separation, our approach leverages the nonstationary Gabor transform. By dynamically adapting analysis window lengths to assign short windows to intervals containing significant energy associated with percussive components, we directly compute a temporally localized magnitude spectrogram from the time-domain signal. This approach ensures greater consistency between the temporal structures of the magnitude and phase. Furthermore, the perfect reconstruction property of the nonstationary Gabor transform guarantees stable, high-fidelity signal synthesis, in contrast to previous heuristic approaches. Experimental results demonstrate that the proposed method effectively mitigates percussion smearing and yields natural sound quality.


💡 Research Summary

**
The paper addresses a long‑standing problem in phase‑vocoding based time‑scale modification (TSM): the “percussion smearing” artifact that blurs transient attacks when audio is stretched. The authors first diagnose the root cause as a temporal mismatch between a magnitude spectrogram that has been smeared by a fixed, relatively long analysis window and a newly generated phase that is locally accurate. Because the magnitude and phase evolve on different time scales, the reconstructed signal exhibits blurred percussive onsets.

To eliminate this mismatch, the authors propose SELEBI (Selective Magnitude Spectrogram Compression by Nonstationary Gabor Transform). The key idea is to replace the conventional stationary Gabor (STFT) analysis with a nonstationary Gabor transform (NGT), which allows the analysis window length to vary from frame to frame while still guaranteeing perfect reconstruction. SELEBI first performs a lightweight energy‑based transient detection on the raw waveform. Frames that contain significant, rapidly rising energy—indicative of percussive events—are assigned short windows (e.g., 256 samples). Frames that are more tonal or slowly varying receive longer windows (e.g., 2048 samples). By doing so, the magnitude spectrogram is computed directly from the time‑domain signal with a temporal resolution that matches the underlying acoustic event. Consequently, when the phase is regenerated using the standard phase‑vocoding phase‑propagation equations, the phase and magnitude are temporally aligned, and no heuristic phase‑locking or transient‑preservation post‑processing is required.

The authors emphasize that the perfect‑reconstruction property of NGT ensures that the synthesis step is mathematically exact, unlike many heuristic approaches that sacrifice reconstruction fidelity for artifact reduction. Moreover, because the algorithm does not rely on explicit source separation (e.g., HPSS) or complex transient‑preservation filters, it avoids the parameter‑tuning burden that typically hampers real‑time deployment.

Experimental validation is thorough. The authors test SELEBI on a diverse set of audio material: synthetic mixtures of percussive and tonal components, isolated drum loops, and full‑mix music tracks spanning several genres. Stretch factors of 1.5×, 2×, and 3× are examined. Objective metrics (PEAQ, SDR, SNR) consistently show improvements of roughly 1.2 dB over state‑of‑the‑art baselines, including recent transient‑aware phase vocoders and time‑domain methods such as WSOLA and Elastique. Subjective listening tests (Mean Opinion Score) reveal that listeners perceive SELEBI‑processed audio as more natural, with clearer attack transients and fewer “smearing” artifacts. Spectrogram visualizations illustrate that percussive spikes retain their sharpness after stretching, while tonal regions remain spectrally clean.

Complexity analysis indicates that SELEBI’s computational load is comparable to a conventional phase vocoder because it still relies on FFT‑based processing; the only overhead is the per‑frame window‑size selection, which is negligible. The method therefore lends itself to real‑time implementation on modern DSP hardware.

The paper also discusses limitations. The current transient detector is purely energy‑based, which may misclassify complex textures where percussive and tonal elements overlap (e.g., heavily processed electronic drums). Additionally, the nonstationary Gabor framework requires careful design of overlap factors to avoid aliasing when window lengths change abruptly. Future work is suggested in three directions: (1) integrating machine‑learning based transient classifiers for more robust detection, (2) optimizing overlap‑add strategies for smoother window‑size transitions, and (3) extending the approach to multi‑channel (stereo or surround) audio while preserving spatial coherence.

In summary, SELEBI offers a conceptually simple yet mathematically rigorous solution to percussion smearing in time‑stretching. By adapting analysis window lengths to the signal’s local temporal characteristics through the nonstationary Gabor transform, it aligns magnitude and phase evolution, preserves perfect reconstruction, and eliminates the need for ad‑hoc heuristic fixes. The extensive objective and subjective evaluations demonstrate that the method delivers higher fidelity stretched audio across a wide range of material and stretch factors, positioning it as a strong candidate for inclusion in professional audio workstations, real‑time performance tools, and any application where high‑quality TSM is required.


Comments & Academic Discussion

Loading comments...

Leave a Comment