Musical Metamerism with Time–Frequency Scattering
The concept of metamerism originates from colorimetry, where it describes a sensation of visual similarity between two colored lights despite significant differences in their spectral content. Likewise, we propose to call "musical metamerism" the sensation of auditory similarity elicited by two music fragments that differ in terms of their underlying waveforms. In this technical report, we describe a method to generate musical metamers from any audio recording. Our method is based on joint time–frequency scattering (JTFS) in Kymatio, an open-source Python library which enables GPU computing and automatic differentiation. The advantage of our method is that it does not require any manual preprocessing, such as transcription, beat tracking, or source separation. We provide a mathematical description of JTFS as well as some excerpts from the Kymatio source code. Lastly, we review prior work on JTFS and draw connections with closely related algorithms, such as spectrotemporal receptive fields (STRF), the modulation power spectrum (MPS), and the Gabor filterbank (GBFB).
💡 Research Summary
The paper introduces the concept of “musical metamerism,” drawing an analogy to visual metamerism in colorimetry, and proposes a systematic method to synthesize auditory stimuli that are perceptually indistinguishable despite having markedly different waveforms. The authors leverage Joint Time‑Frequency Scattering (JTFS), a non‑linear, multi‑resolution transform implemented in the open‑source Python library Kymatio, to achieve this goal without any manual preprocessing such as transcription, beat tracking, or source separation.
The technical core consists of two scattering stages. In the first stage, a bank of Morlet wavelets ψλ(t) (center frequency λ, quality factor Q) is convolved with the input signal x(t); the complex modulus yields the first-order tensor U₁(t, λ) = |x ∗ ψλ|(t). The second stage applies a second set of Morlet wavelets to U₁: one along the time axis, with modulation rates α (in Hz), and another along the log-frequency axis, with modulation scales β (in cycles per octave). This joint two-dimensional convolution over time and log-frequency produces the second-order tensor U₂(t, λ, α, β) = ||x ∗ ψλ| ∗ ψα ∗ ψβ|(t, λ). Both tensors are then locally averaged with Gaussian low-pass filters φ_T (temporal scale T) and φ_F (frequential scale F), yielding S₁ and S₂. This averaging guarantees invariance to time shifts up to T seconds and to pitch transpositions up to F octaves, in line with the "contour hypothesis" that listeners rely on coarse spectro-temporal energy distributions rather than on precise time-frequency details.
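The first stage can be illustrated with a minimal NumPy sketch; this is not Kymatio's implementation, and the `morlet` window below is a simplified Gaussian band-pass (omitting the DC-correction term of a true Morlet wavelet) chosen for brevity:

```python
import numpy as np

def morlet(n, xi, sigma):
    """Frequency-domain band-pass filter centered at normalized
    frequency xi with bandwidth sigma (Gaussian window; simplified
    stand-in for a Morlet wavelet)."""
    omega = np.fft.fftfreq(n)
    return np.exp(-((omega - xi) ** 2) / (2 * sigma ** 2))

def first_order(x, xis, sigmas):
    """U1(t, lambda) = |x * psi_lambda|(t): convolve in the Fourier
    domain with each band-pass filter, then take the complex modulus."""
    x_hat = np.fft.fft(x)
    rows = []
    for xi, sigma in zip(xis, sigmas):
        y = np.fft.ifft(x_hat * morlet(len(x), xi, sigma))
        rows.append(np.abs(y))
    return np.stack(rows)  # shape: (num_lambdas, num_samples)

# Toy usage: a pure tone concentrates energy in the matching band.
n = 1024
t = np.arange(n)
x = np.cos(2 * np.pi * 0.125 * t)
xis = [0.0625, 0.125, 0.25]
sigmas = [xi / 8 for xi in xis]  # constant-Q: bandwidth proportional to xi
U1 = first_order(x, xis, sigmas)
```

The second stage would then apply the same convolve-and-modulus pattern twice more to U₁: once along the rows (time) and once along the columns (log-frequency).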
Reconstruction is cast as an optimization problem: starting from a colored Gaussian noise y₀ whose power spectrum matches S₁(x), the algorithm iteratively updates yₙ₊₁ = yₙ − μₙ ∇E(yₙ), thus descending the gradient of the loss E(y) = ‖S₁(x) − S₁(y)‖² + ‖S₂(x) − S₂(y)‖². The gradient is computed via the reverse-mode automatic differentiation provided by Kymatio: back-propagation proceeds from the second-order coefficients to the first-order coefficients and finally to the waveform domain, using the Hermitian adjoints of the linear operators (FFT-based convolutions) together with the subgradient of the complex modulus non-linearity. The authors adopt a "bold driver" schedule for the learning rate μₙ and set the momentum term to zero in their experiments.
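The update rule and the bold-driver schedule can be sketched as follows. This is a toy stand-in: the real loss compares JTFS coefficients, whereas here S is the identity so that E(y) = ‖y − target‖², and the boost/backoff factors (1.1 and 0.5) are illustrative assumptions rather than the paper's values:

```python
import numpy as np

def bold_driver_descent(grad, y0, lr=0.1, n_iter=100,
                        boost=1.1, backoff=0.5, loss=None):
    """Gradient descent y_{n+1} = y_n - mu_n * grad E(y_n) with a
    'bold driver' schedule: grow the step size when the loss decreases,
    shrink it (and reject the step) when the loss increases."""
    y = y0.copy()
    prev = loss(y)
    for _ in range(n_iter):
        candidate = y - lr * grad(y)
        e = loss(candidate)
        if e <= prev:          # improvement: accept and accelerate
            y, prev, lr = candidate, e, lr * boost
        else:                  # overshoot: reject and slow down
            lr *= backoff
    return y

# Toy stand-in for E(y) = ||S(x) - S(y)||^2 with S = identity:
target = np.array([1.0, -2.0, 3.0])
loss = lambda y: np.sum((y - target) ** 2)
grad = lambda y: 2 * (y - target)
y = bold_driver_descent(grad, np.zeros(3), loss=loss)
```

In the paper's setting, `grad` would be supplied by Kymatio's automatic differentiation rather than written by hand.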
Implementation details are thoroughly described. The first filterbank mixes a constant‑Q region (high frequencies) with a constant‑bandwidth region (low frequencies) using a geometric progression of center frequencies until an “elbow” point, after which the progression becomes arithmetic. The second filterbank is split into a temporal scattering bank (Q₂) and a frequency scattering bank (Q_f), each with its own number of octaves J and quality factor Q. The code snippets illustrate how xi and sigma are computed, how the “elbow” is handled, and how filters are yielded. Convolution is performed in the Fourier domain using complex‑valued element‑wise multiplication (cdgmm) followed by subsampling, which is implemented as reshaping and summation in the frequency domain.
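The hybrid constant-Q/constant-bandwidth grid of center frequencies can be sketched as below. This is an illustrative reconstruction, not an excerpt from Kymatio; the parameter names `Q`, `J`, and the stopping criterion for the arithmetic tail are assumptions:

```python
import numpy as np

def center_frequencies(Q, J, r=0.5):
    """Sketch of a hybrid filterbank grid: a geometric progression
    (constant-Q) descending from near Nyquist, switching at the 'elbow'
    to an arithmetic progression (constant bandwidth) that reuses the
    bandwidth of the lowest constant-Q filter.  Frequencies xi are in
    normalized units (Nyquist = 0.5)."""
    xi_max = r * 2 ** (-1 / Q)                      # highest center frequency
    xis = [xi_max * 2 ** (-j / Q) for j in range(J * Q)]
    elbow = xis[-1]                                 # lowest constant-Q center
    bw = elbow / Q                                  # bandwidth frozen below elbow
    xi = elbow - bw
    while xi > bw:                                  # arithmetic ("linear") tail
        xis.append(xi)
        xi -= bw
    return np.array(xis)

xis = center_frequencies(Q=8, J=4)
```

Above the elbow, consecutive frequencies share a constant ratio (constant Q); below it, they share a constant difference (constant bandwidth), which keeps the low-frequency filters from becoming vanishingly narrow.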
A notable design choice is the width-first traversal of scattering paths in the second layer. The outer loop iterates over second-order wavelet pairs (n₂) while the inner loop processes first-order bands (n₁), ensuring that all first-order outputs needed for the convolution along the log-frequency axis are resident in memory. Additionally, the authors implement "wavelet spinning": each frequential wavelet is duplicated at negative center frequencies so as to distinguish upward from downward chirps, whereas a single spin suffices when the temporal filter is the low-pass (α = 0), since the low-pass-filtered scalogram is real-valued.
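The traversal order can be sketched as a path-scheduling routine. This is a schematic illustration, not Kymatio code; the function name and parameters (`n1_count`, `n2_count`, `nf_count`) are invented for the example, and index 0 of the temporal axis is taken to stand for the low-pass (α = 0):

```python
def jtfs_paths(n1_count, n2_count, nf_count):
    """Sketch of width-first path scheduling for the second layer: the
    outer loops walk the second-order (temporal x frequential) wavelet
    pairs, and the inner comprehension walks all first-order bands n1,
    so every U1 band is available when convolving along the n1 axis.
    spin = -1 adds the negatively-spun copy of each frequential
    wavelet; it is skipped when the temporal filter is the low-pass
    (n2 == 0 here), since the filtered scalogram is then real-valued."""
    paths = []
    for n2 in range(n2_count):            # temporal modulation rate
        for nf in range(nf_count):        # frequential modulation scale
            for spin in (+1, -1):
                if n2 == 0 and spin == -1:   # alpha = 0: one spin suffices
                    continue
                batch = [(n1, n2, nf, spin) for n1 in range(n1_count)]
                paths.append(batch)       # all n1 processed together
    return paths

paths = jtfs_paths(n1_count=3, n2_count=3, nf_count=2)
```

Each inner batch corresponds to one joint convolution over the whole scalogram, which is what makes the width-first ordering memory-friendly.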
The paper situates JTFS among related auditory representations. Compared to spectro-temporal receptive fields (STRF), JTFS provides a richer set of statistics by combining multi-scale wavelet analysis with non-linear modulus operations. Unlike the modulation power spectrum (MPS), which averages modulation energy at a single fixed resolution, JTFS performs a multi-resolution wavelet analysis whose coefficients remain localized in time and log-frequency. The Gabor filterbank (GBFB) shares a similar filter design, but JTFS augments it with hierarchical scattering paths and explicit low-pass averaging to enforce invariance.
Experimental validation consists of visual inspection of spectrograms and quantitative loss values, demonstrating that synthesized metamers match the target JTFS coefficients within numerical tolerance. By varying T and F, the authors show controllable trade‑offs between perceptual fidelity and allowable transformations.
In conclusion, the authors provide a fully automated pipeline for generating musical metamers that respect perceptual invariances without any handcrafted preprocessing. This capability opens the door to creating minimal auditory pairs for psychophysical studies of music cognition, enabling systematic probing of how listeners encode contour, timbre, rhythm, and pitch. Future work is suggested in the form of human listening experiments, real‑time synthesis optimization, and comparative studies with deep learning based audio encoders.