Phase vocoder-based time-stretching is a widely used technique for the time-scale modification of audio signals. However, conventional implementations suffer from “percussion smearing,” a well-known artifact that significantly degrades the quality of percussive components. We attribute this artifact to a fundamental time-scale mismatch between the temporally smeared magnitude spectrogram and the localized, newly generated phase. To address this, we propose SELEBI, a signal-adaptive phase vocoder algorithm that significantly reduces percussion smearing while preserving stability and the perfect reconstruction property. Unlike conventional methods that rely on heuristic processing or component separation, our approach leverages the nonstationary Gabor transform. By dynamically adapting analysis window lengths to assign short windows to intervals containing significant energy associated with percussive components, we directly compute a temporally localized magnitude spectrogram from the time-domain signal. This approach ensures greater consistency between the temporal structures of the magnitude and phase. Furthermore, the perfect reconstruction property of the nonstationary Gabor transform guarantees stable, high-fidelity signal synthesis, in contrast to previous heuristic approaches. Experimental results demonstrate that the proposed method effectively mitigates percussion smearing and yields natural sound quality.
Time stretching, a process that modifies the time-scale of a signal without altering its pitch, is a fundamental tool in modern music production, with applications ranging from audio remixing to transcription [1], [2]. The ideal goal is to obtain a time-stretched signal that perfectly preserves the musical characteristics of the original, such as its timbre, clarity, and dynamics. To achieve high-quality results, a wide variety of time-stretching methods have been proposed [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14].
Natsuki Akaishi and Kohei Yatabe are with Tokyo University of Agriculture and Technology, Tokyo 184-8588, Japan (e-mail: natsu61aka@gmail.com; yatabe@go.tuat.ac.jp). Nicki Holighaus is with Acoustics Research Institute, Austrian Academy of Sciences, Wohllebengasse 12-14, 1040 Vienna, Austria (e-mail: nicki.holighaus@oeaw.ac.at).
Fig. 1. Block diagrams of the basic method, PV with identity phase locking [15] (left), and the proposed method (right). By leveraging NSDGT, the proposed method synthesizes the target signal from T-F representations in which both the magnitude and phase spectrograms of percussive components are highly concentrated in the time direction.
The phase vocoder (PV) [3], [4] is one of the most widely used techniques for time-stretching. This approach operates in the time-frequency (T-F) domain: it computes the short-time Fourier transform (STFT), generates a new phase appropriate for the modified time-scale, and resynthesizes the signal using the overlap-add (OLA) technique with an extended hop size. Although the PV is a well-established technique that has been continuously refined [5], [10], its fundamental logic builds on the sinusoidal signal model, i.e., it assumes that the signal is a sum of sinusoids. This model is effective for tonal content because a time shift of a sinusoid is represented by a well-defined, predictable phase shift. However, it is ill-suited for percussive components, which cannot be represented by a sum of a few sinusoids. Therefore, time-shifting of percussive components cannot be modeled as such a phase shift. As a result, the PV often introduces a well-known artifact called percussion smearing. As illustrated in Fig. 1 (top-right), the stretched percussive components suffer significant degradation, losing their original temporal characteristics due to undesired leakage.
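The three PV steps named above (STFT analysis, phase regeneration for the new time-scale, and OLA synthesis with an extended hop size) can be sketched as follows. This is a minimal illustration of the textbook PV without phase locking, not the implementation evaluated in this paper; the function names `stft` and `phase_vocoder`, the Hann window, and the default parameters are assumptions made for the example.

```python
import numpy as np

def stft(x, win, hop):
    """Frame-wise FFT of x; returns an array of shape (frames, bins)."""
    n = len(win)
    n_frames = 1 + (len(x) - n) // hop
    frames = np.stack([x[i * hop : i * hop + n] * win for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def phase_vocoder(x, stretch, n_fft=1024, hop_a=256):
    """Basic PV (assumed parameters): analysis hop hop_a, synthesis hop stretch * hop_a."""
    win = np.hanning(n_fft)
    hop_s = int(round(stretch * hop_a))          # extended synthesis hop
    X = stft(x, win, hop_a)
    mag, phase = np.abs(X), np.angle(X)
    omega = 2 * np.pi * np.arange(X.shape[1]) / n_fft   # bin centre frequencies (rad/sample)

    new_phase = np.empty_like(phase)
    new_phase[0] = phase[0]
    for m in range(1, X.shape[0]):
        # Deviation from the phase advance expected at the bin centre, wrapped to [-pi, pi)
        dphi = phase[m] - phase[m - 1] - omega * hop_a
        dphi = (dphi + np.pi) % (2 * np.pi) - np.pi
        inst_freq = omega + dphi / hop_a          # sinusoidal-model frequency estimate
        new_phase[m] = new_phase[m - 1] + inst_freq * hop_s

    Y = mag * np.exp(1j * new_phase)              # original magnitude, regenerated phase

    # Overlap-add synthesis with the extended hop, normalised by the summed squared window
    out = np.zeros(hop_s * (X.shape[0] - 1) + n_fft)
    norm = np.zeros_like(out)
    for m in range(X.shape[0]):
        s = m * hop_s
        out[s : s + n_fft] += np.fft.irfft(Y[m], n=n_fft) * win
        norm[s : s + n_fft] += win ** 2
    return out / np.maximum(norm, 1e-8)
```

Note that the magnitude of every frame is reused unchanged while only the phase is recomputed, which is exactly where the sinusoidal model enters: the phase update is meaningful for stationary sinusoids but has no counterpart for a transient.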
To alleviate this, various “percussion-aware” methods have been proposed. However, they either rely on an a priori separation of the signal into tonal and percussive components for individual processing [10], [11] (Class A), which invariably introduces artifacts due to imperfect separation, or they suffer, to various degrees, from magnitude-phase mismatch [9], [12], [13], [14] (Class B), inhibiting the elimination of percussion smearing.
In this paper, we address the fundamental limitation of Class B methods: the inconsistency between magnitude and phase. While these methods often succeed in preserving the phase relationships required for transients (i.e., vertical phase coherence), the corresponding magnitude spectrograms are inevitably smeared due to the time-stretching process. This results in percussive components being spread across a dilated time interval, which contradicts the localized phase information. We posit that this mismatch is the primary cause of percussion smearing, as illustrated in Fig. 1 (left). To address this, we propose computing spectrograms with improved temporal localization specifically where percussive components are present, thereby aligning the magnitude representation with the localized phase (Fig. 1, right). While our previous work [16] attempted to mitigate this issue through heuristic time-frequency bin shifting, the proposed method provides a theoretical foundation for energy preservation and stable synthesis.
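The link between window length and the temporal spread of the magnitude can be checked numerically. The sketch below is an illustration only: the helper `active_span`, the Hann window, the hop size, and the relative threshold are all assumptions introduced here. It measures over how many samples the STFT frames of a single click carry non-negligible energy.

```python
import numpy as np

def active_span(x, win_len, hop=64, rel_thresh=1e-3):
    """Span (in samples) over which STFT frames exceed rel_thresh of the peak frame energy."""
    win = np.hanning(win_len)
    padded = np.pad(x, win_len)                   # zero-pad so frames cover the signal edges
    starts = np.arange(0, len(padded) - win_len + 1, hop)
    energy = np.array([
        np.sum(np.abs(np.fft.rfft(padded[s : s + win_len] * win)) ** 2)
        for s in starts
    ])
    active = np.flatnonzero(energy > rel_thresh * energy.max())
    return (active[-1] - active[0] + 1) * hop

# A single click: its magnitude is spread over roughly one window length.
click = np.zeros(8192)
click[4000] = 1.0
span_long = active_span(click, win_len=2048)
span_short = active_span(click, win_len=256)
```

Under these assumptions the click occupies a span on the order of the window length in either case, so the short window yields a much more localized magnitude. Stretching the hop size at synthesis then dilates whatever span the analysis produced, which is the smearing discussed above.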
To achieve this magnitude squeezing in a mathematically rigorous manner, this paper proposes a method named SELEBI (SELEctive window compression with staBle Inversion). Our method leverages the Nonstationary Gabor Transform (NSDGT) [17], a time-frequency representation framework that allows for adaptive, non-uniform windowing and sampling while retaining perfect reconstruction properties. We exploit this flexibility by adaptively assigning shorter windows to percussive regions, thereby directly obtaining a magnitude spectrogram with a desired temporal resolution. Crucially, the underlying mathematical framework of the NSDGT guarantees that this adaptive processing results in a stable and high-fidelity synthesis.
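As a rough illustration of signal-adaptive window selection, the following sketch assigns a short window to hops where the short-time energy jumps sharply and a long window elsewhere. The energy-jump rule, the function name `assign_window_lengths`, and all thresholds are hypothetical stand-ins chosen for the example; they are not the selection criterion of SELEBI, and the sketch only picks window lengths without constructing the NSDGT or its dual frame.

```python
import numpy as np

def assign_window_lengths(x, hop=256, long_len=2048, short_len=256, jump_db=9.0):
    """Hypothetical rule: short window where block energy rises by more than jump_db dB."""
    n_hops = len(x) // hop
    blocks = x[: n_hops * hop].reshape(n_hops, hop)
    energy_db = 10.0 * np.log10(np.sum(blocks ** 2, axis=1) + 1e-12)
    # Energy change relative to the previous block (first block has no predecessor: jump 0)
    jump = np.diff(energy_db, prepend=energy_db[0])
    return np.where(jump > jump_db, short_len, long_len)

# Quiet tone with one click: only the block containing the click gets the short window.
sr = 16000
n = np.arange(8192)
signal = 0.01 * np.sin(2 * np.pi * 440.0 * n / sr)
signal[4000:4010] += 1.0
lengths = assign_window_lengths(signal)
```

In an NSDGT-based pipeline, a per-position list of window lengths such as `lengths` would determine the analysis windows, and the perfect reconstruction property would follow from the corresponding dual windows rather than from any step shown here.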
The rest of the paper is organized as follows. Section II reviews the fundamentals of the NSDGT and PV-based time stretching. Section III discusses the core component of the proposed method: Sharply resolving percussive events to reduce p