Spectral analysis for nonstationary audio

1 Spectral analysis for nonstationary audio Adrien Meynard and Bruno T orr ´ esani Abstract A new approach for the analysis of nonstationary signals is proposed, with a focus on audio applications. Follo wing earlier contributions, nonstationarity is modeled via station arity- breaking operators acting on Gaussian stationary random signals. The f ocus is on time warping and amplitude modulation, and an approximate max imum-likelihood approach based on suitable approximations in the wave le t transform domain is developed. This paper provides theoretical analysis of the a pproximations, and introduces JEF AS, a corresponding estimation algorithm. The latter is tested and valida ted on synthetic as well as real audio signal. Index T erms Nonstationary signals, deformation, wavelet analysis, time warping, local spectrum , Doppler effect I . I N T R O D U C T I O N Nonstationarit y is a key feature of acoustic signals, in particular audio signals. For example, a lar ge part of informat ion carried by musical and spe e ch signals is e ncoded by t heir nonst ationary nat ur e. This is also the case for environment sounds (for e xample, nonstationarit y of car noises or wind informs about speed variations), and many animal s (e.g. bats and dolphins) use nonstationary signals for localization and communication. Beyond acoustics, amplitude and frequency modulation are of p rime importance in many domains such a s telecommunication. While stationarity can be given rigor ous deﬁnitions, nonstationarity is a very wide concept, as there are inﬁnitely many ways to de part from stationarity . The theory of random signals and pr ocesses (see [1], [2] and r efer ences therein) gives a clear meaning to the notion of statio narity . In the context of time series anal ysis, Priestley [2], [3] was A. Me y nard and B. T orr ´ esani are with Aix Marsei l le Univ , CNRS, Centrale Marsei lle, I2M, M arseille, Fr ance. 2 one of the ﬁrst to develop a systematic theory of nonstationary pr ocesses, intro ducing the class of locally stationary pro cesses and the notion of evolutionary spe ctr um. A similar appro ach was followed in [4], who pr oposed a wa vele t-based appr oach to covariance estimation for locally stationary proc esses (see also [5]). An a lternate theory of locally stationary time series was d e veloped by Dahlha us [6] (see also [7] for a corr e sponding stationarity test). In a diff erent context, frequency-mo dulated stationary signal wer e considered in [8], [9], and time warping models wer e analyzed in [10]. I n several of these approaches, wavelet, time-fr equency and similar repr esentations happ e n to play a key role for the characterization of nonstationarit y . In a de terministic setting, a popular nonstationarity model expresses the signal as a sum of K sinusoidal components y ( t ) = ∑ K k = 1 A k ( t ) cos ( 2 π φ k ( t ) ) . This model has been lar gely used in speech processing since early works by McAulay and Quatieri [11] (see [12] and refer ences therein for more recent developments, and [13], [14] for prob- abilistic approaches). The instantaneous frequencies φ ′ k of each mode give important information about the physical phenomenon. Under smoothness assumptions on func- tions A k and φ ′ k , techniques such as ridge/multiridge detection (see [1 5 ] and refer ence s ther ein), synchro squeezing or reassignment have been developed to extract theses quan- tities fr om a single signal observation (see [16], [17], [18] for recent accounts). In sound processing, signals often possess a harmonic str uctur e , which corresponds to a special case of the above model where each instantaneous frequency φ ′ k is multiple of a fundamental frequency φ ′ 0 : φ ′ k ( t ) = ( k + 1 ) φ ′ 0 ( t ) . I n the special case A k ( t ) = α k A 0 ( t ) , we can de scribe such signals as a stationary signal x ( t ) = ∑ K k = 1 α k cos ( 2 π k t + ϕ k ) modiﬁed by time warping and amplitude modulation: y ( t ) = A 0 ( t ) x ( φ 0 ( t ) ) . A major limit of this model is that e a ch component is pur ely sinusoidal while audio signals often contain br oadband information. However , sounds originating fr om physical phenomena can of- ten be modele d as stationary signals, deformed by a stationarit y-br eaking operator (e . g. time warping, amplitude modulation). For example , sounds generated by a variable- speed engine or any stationary sound deformed by Doppler e f fect can be described as such. A stochastic time warping model has been intr oduced in [19], [20], whe r e wavelet-based approximation and estimation techniques were developed. In [9], [21 ], an approximat e maximum-likelihood approach was propo sed for the joint estimation of the time warping and p ower spectrum of the unde rlying Gaussian stationary signal, 3 exploiting similar ap p r oximations. In this paper , we build on results of [9], [21] which we extend and improv e in several ways. W e develop an appro ximate maximum likelihood method for e stimating jointly time warping and amplitude modulation (not present in [9], [21]) fr om a single r ealization. While the overall structur e of the algorithm is similar , we formulate the pr oblem as a continuo us parameter estimatio n problem, which avoids quantization effect s present in the earlier a p p roaches, and allows computing a Cram ´ er- Rao bound for assessing the precision of the estimate. After completing the estimation, the in verse deformation can be a pplied to the input signal, which yields a n estimate for the power spectrum. The outline of the paper is as follows. A fter giving some deﬁnitions and notations in Section II, we detail in Section III the nonstationary signal models we consider , and specify the assumptions made on underlying stationary signals. W e also ana lyze the effect of time warping and amplitude modulation in the wavelet domain, which we exploit in designing the estimation p rocedur e. W e ﬁnally propose an a lternate estima- tion algorithm and analyze the expected performances of the corresponding estimator . Section IV is devoted to numerical r esults on synthetic and real signals. W e also shortly describe in this section an extension published in [22] involving simultaneously time warping and fr equency modulation. Mathematical developments are given as supple- mentary material, together with additional examples. I I . N O TA T I O N S A N D B A C K G R O U N D A. Random signals and stationarity Thr oughout this pap er , we will work in the framework of the theory of random signals. Signals of inter est will be modeled as realizations of random processes t ∈ R → X t ∈ C . Signals of interest are r eal-valued, however we will work with complex- valued functions since complex-valued wavelet transforms will be used. In this pape r , the random processes will be de noted by uppercase le tters while their realizations will be denoted by lowerc ase letters. The random processes will be assumed to have null mean ( E { X t = 0 } for all t ) and be second-or der , i.e. they have a well-deﬁned covariance kernel E  X t X t ′  . A particularly inter esting class of such stochastic pro cesses is the class of second order (or weakly) stationary pro cesses, for which C X ( t − t ′ ) ∆ = E  X t X t ′  4 is a function of t − t ′ only . U nder these assumptions, the W iener-Khinchin theorem states that the covariance kernel may be expr essed as the in verse Fourier transf orm of a nonnegative measure d η X , which we will assume to be continuous with r espect to the Lebesgue measure: d η X ( ν ) = S X ( ν ) d ν , for some nonnegative L 1 function S X called the power spectrum. W e then write C X ( t ) = Z S X ( ν ) e 2i π ν t d ν . W e refer to textbooks such as [1], [2] for a mor e complete mathematical account of the theory , and to [21] for an extension to the setting of distribution theory . B. Elementary operators Our approach rests on nonstationary models obtained by deformatio ns of stationary random signals. W e will mainly use as elementary operators the pointwise multiplica- tion A α , translation T τ , dilation D s , a nd frequency modulation M ν deﬁned as follows: A α x ( t ) = α x ( t ) , T τ x ( t ) = x ( t − τ ) , D s x ( t ) = q − s / 2 x ( q − s t ) , M ν x ( t ) = e 2i π ν t x ( t ) . wher e α , τ , s , ν ∈ R and q > 1 is a ﬁxed number . The amplitude modulation commutes with the other thr ee operators, which satisfy the commutatio n rules T τ D s = D s T q − s τ , T τ M ν = e − 2i π ν τ M ν T τ , M ν D s = D s M ν q s . C. Wavelet transform Our analysis relies heavily on transforms such as the continuous wavelet transform (and discretized versions). In particular , the wavelet transform of a signal X : t ∈ R → X t is deﬁned as: W X ( s , τ ) ∆ = h X , ψ s τ i , with ψ s τ = T τ D s ψ . (1) wher e ψ is the analysis wavelet, i.e. a smooth function with fast decay away from the origin. It may be shown that, for suitable choices of ψ , the wa vele t transform is invertible (see [15 ]), but we will not use that pr operty here. Notice that r ealizations of a continuous time random process generally do not decay at inﬁnity . H owever , for 5 a suitably smooth and localized wavelet ψ , the wavelet transform can still be well deﬁned (see [15], [21] for mor e details). In such a situation the wavelet transform of X is a two-dimensional random ﬁeld, which we analyze in the next section. Besides, in this pape r the analysis wavelet ψ is complex-valued and be longs to the space H 2 ( R ) =  ψ ∈ L 2 ( R ) : supp ( ˆ ψ ) ⊂ R +  . In that framework, a useful property is that, if X is a r eal, zero-mean, Gaussian proc ess, then W X is a complex, zero-mean, cir cular , Gaussian random ﬁeld. Classical choices of wavele ts in H 2 ( R ) are (analytic) derivative of Gaussian ψ k (which has k vanishing moments), and the sharp wavelet ψ ♯ (with inﬁnitely many vanishing moments) intr oduced in [22]. These can be deﬁned in the p ositive Fourier domain by ˆ ψ k ( ν ) = ν k e − k ν 2 /2 ν 2 0 , ˆ ψ ♯ ( ν ) = ǫ δ ( ν , ν 0 ) δ ( ν 1 , ν 0 ) , ν > 0 (2) and vanish on the negative Fourier half axis. Here ν 0 is the mode of ˆ ψ and ν 1 is a cutof f frequency . ǫ is a prescr ibed numerical tolerance chosen so that ˆ ψ ♯ ( ν 1 ) = ǫ , and the diverg ence δ is deﬁned by δ ( a , b ) = 1 2  a b + b a  − 1. The q uality factor of ψ ♯ , i.e. the center fr equency to bandwidth ratio, can be expressed as Q = 1/ p C ( C + 4 ) , where C = − δ ( ν 1 , ν 0 ) ln 2/ ln ǫ > 0 (see supplementary material for mor e details). Remark 1: In (1), the scale constant q > 1 acts a unit selector for the scale s . For example, in musical terminology , q = 2 me a n s that s is measured in octaves, whe r eas for q = 2 1/12 , s is measured in semitones. D. Am plitude modulation, tim e warping The nonstationary signals under consideration a re obtained as linear d e formations of stationary random signals. Deformations of inter est here are amplitude modulations and time warpings. Amplitude modulations are pointwise multiplications by smooth functions, A a : A a x ( t ) = a ( t ) x ( t ) , (3) wher e a ∈ C 1 is a r e al valued function, such that 0 < c a ≤ a ( t ) ≤ C a < ∞ , ∀ t , (4) for some constants c a , C a ∈ R ∗ + . T ime warpings are compositions with smooth and monotonic functions, D γ : D γ x ( t ) = q γ ′ ( t ) x ( γ ( t )) , (5) 6 wher e γ ∈ C 2 is a strictly increasing smooth function, satisfying the contr ol condi- tion [21] 0 < c γ ≤ γ ′ ( t ) ≤ C γ < ∞ , ∀ t , (6) for some const ants c γ , C γ ∈ R ∗ + . Amplitude modulations constitute a simple model for nonstationarity . While there exists a well established state of the art for demodulation algorithms in deterministic settings (in particular in telecommunications), the sto chastic case has attract ed less attention. In the r ecent literature, one may mention [10], where the so-called DEMON spectrum is proposed for amplitude modulation estimation. This problem will be tackled here using wavelet transfor m. T ime warping is an important transformation which has been exploited in various contexts, starting fr om Doppler effect , which we brieﬂy address at the end of this paper , but also speech processing , bioacoustics (see [23] and refer ences ther ein) and mor e generally in diverse ﬁelds such as chemistr y or bioinformatics. The refer ence algorithm for time warping estimation is DWT (Dynamic T ime W arping, see [24] for a r eview), which has been successfully applied to speech processing , in particular speech r ecognition. However , DTW is essentially a template matching algorithm, and does not address the problem considered here, where no templa te is available. Closer to our point of view are a p p roaches based upon transforms such as the Harmonic transform [25] or the Fan-Chirp transform [26] (see [27] for an application to speech analysis/synthesis). These mainly involve computing a time-fr eque ncy r epr esentation of a warped copy of the input signal. Even though this does not seem to be strictly necessary , the warping function generally belongs to a parametrized family (for e xample quadratic), and has to be estimated. The app lication domain seems to be limited so far to locally harmonic, deterministic signal models, and the estimation of the warping function str ongly r elies on these assumptions. W avelet transform and scalogram ar e also natural repr esentations for estimating warping. A ctually , interpreting (normalized) time slices of the signal scalogram as probability distributions naturally suggest s to compute a time-depende nt average scale, dir ectly related to the value of the warping function. This is the approach we will use as baseline approach. However , our approximatio ns below allow us to give a mor e precise meaning to this remark; in addition, scalogram lacks the phase information 7 which turns out to be quite r elevant and yield more pr ecise estimations. The approach we develop below exploits complex valued wavelet transform, and combines amplitude modulation a n d time warping in the framework of a generic stochas- tic signal model, without any harmonicity assumption. This allows us to set the cor - r esponding estimation pr oblems as statistical inference pr oblems, and use tools fr om estimation theory . I I I . J O I N T E S T I M AT I O N O F T I M E W A R P I N G A N D A M P L I T U D E M O D U L AT I O N A. Model and approximations Let us ﬁrst describe the deformation model we will mainly be using in the following. Assume one is given a (unique) r ealization of a random signal of the form Y = A a D γ X (7) wher e X is a stationary zero -mean r eal random process with (unknown) power spectrum S X . The goal is to estimate the deformation functions a and γ from this r ealization of Y , exploiting the stationarit y of X . Remark 2: The stationarity assumption is not sufﬁc ient to yield unambiguous esti- mates, as afﬁ ne functions γ ( t ) = λ t + µ do not break stationarity: for any stationary X , D γ X is stationary too. Therefor e, the warping function γ can only be estimated up to an afﬁne function, as analyzed in [20], [21]. Similarly , the amplitude function a can only be e stimated up to a constant factor . Key ingr edients here are the smoothness of the functions a and γ , a nd their slow variations. This allows us to perform a local analysis using smooth and localized test functions, on which the a ction of A a and D γ can be approx imated by their so-called tangent operators g A τ a and f D τ γ (see [20], [9], [21 ], [28]). Given a test function g located near t = τ (i.e. decaying fast enough as a function of | t − τ | ), T aylor expansions near t = τ yield A a g ( t ) ≈ g A τ a g ( t ) , with f A τ a ∆ = A a ( τ ) , (8) D γ g ( t ) ≈ f D τ γ g ( t ) , with f D τ γ ∆ = T τ D − log q ( γ ′ ( τ ) ) T − γ ( τ ) . (9) Ther efor e, we appr oximate the wavele t transf orm of Y by W Y ( s , τ ) ≈ f W Y ( s , τ ) ∆ = D g A τ a f D τ γ X , T τ D s ψ E , i.e. f W Y ( s , τ ) = a ( τ ) W X  s + log q ( γ ′ ( τ ) ) , γ ( τ )  . (10) 8 Here, we have used the standard commutation rules of translation and dilation opera- tors given in Section II-B. The result below prov ides a quantitative assessment of the quality of the approx ima- tion. Her eafter , we denote by k f k ∞ = ess sup t | f ( t ) | the essential absolute supr e mum of a function f . Theorem 1: Let X be a second order z e r o-mean stationary random process, let Y be the nonstationary process deﬁned in (7). Let ψ be a smooth test function, localized in such a wa y that | ψ ( t ) | ≤ 1 / ( 1 + | t | β ) for some β > 2. Let W Y be the wavelet transform of Y , f W Y its appr oximation given in (10), and let ε = W Y − f W Y denote the approximatio n err or . Assume ψ and S X ar e such that I ( ρ ) X ∆ = r Z ∞ 0 ξ 2 ρ S X ( ξ ) d ξ < ∞ , where ρ = β − 1 β + 2 . Then the approximation erro r ε is a second or der , two-dimensional complex random ﬁeld, and E n | ε ( s , τ ) | 2 o ≤ C 2 a q 3 s  K 1 k γ ′′ k ∞ + K 2 q µ s k γ ′′ k ρ ∞ + K 3   a ′   ∞  2 wher e K 1 = βσ X 2 ( β − 2 ) √ c γ , K 2 = I ( ρ ) X  π 2  ρ q C γ 4 3 ρ , K 3 = p C γ βσ X ( β − 2 ) c a , µ = β − 4 β + 2 , σ 2 X being the variance of X . The proof, which is a n e xtension of the one given in [21], rest s on T aylor approxima- tions of γ ′ and a in the neighbor hood of t = τ and subsequent integral majorizations, is given as supplementary material. Remark 3: The assumption on β ensures that the parameters belong to the following intervals: 1 /4 < ρ < 1 and − 1 /2 < µ < 1. Therefor e, the variance of the appro ximation err or tends to zero when the scales ar e small (i.e. s → − ∞ ). Besides, the err or is inversely pr oportional to the speed of variations of γ ′ and a . This is consistent with the appr oximations of the deformation operators by their tangent operators made in equations (8) and (9). 9 Fr om now on, we will assume the above appro ximations ar e valid, and work on the appro ximate random ﬁelds. The problem is then to estimate jointly a , γ from f W Y , which is a zer o-mean random ﬁeld with covariance E n f W Y ( s , τ ) f W Y ( s ′ , τ ′ ) o = C ( s , s ′ , τ , τ ′ ) (11) wher e the kernel C reads C ( s , s ′ , τ , τ ′ ) = a ( τ ) a ( τ ′ ) q s + s ′ 2 q γ ′ ( τ ) γ ′ ( τ ′ ) Z ∞ 0 S X ( ξ ) × ˆ ψ  q s γ ′ ( τ ) ξ  ˆ ψ  q s ′ γ ′ ( τ ′ ) ξ  e 2i π ξ ( γ ( τ ) − γ ( τ ′ ) ) d ξ . (12) B. Estimation 1) Estim ation procedure: Our goal is to estimate both deformation functions γ a nd a fr om the approximated wavelet transform f W y of a realization y of Y , assuming the latter is a reliable approx imation of the true wavelet tr ansform. From now on, we additionally assume that X is a Gaussian random pr ocess. Ther efor e, f W Y is a z er o-mean cir cular Gaussian random ﬁeld and its pr obability density function is characterized by the covariance matrix. However , equa tion (12) shows that besides deformation functions the covariance also depends on the power spectr um S X of the underlying stationary signal X , which is unknown too. Ther efor e, the evaluation of the maximum likelihood estimate for a and γ requir es a guess for S X . This constraint naturally brings the estimation strategy to an alternate algorithm. In [22], a n estimate for S X was obtained at each iteration using a W elch periodogram on a “stationarized” signal A − 1 ˜ a D − 1 ˜ γ Y , ˜ a and ˜ γ being the current estimates for the deformation functions a and γ . W e use her e a simpler estimate, computed directly from the wavele t coefﬁcient s. The two steps of the estimation algorit hm are detailed below . Remark 4: The a lternate likelihood maximization strategy is r eminiscent of the Expec- tation-Maximizatio n (EM) algorithm, the power spectrum be ing the nuisance parameter . However , while it would be desirable to apply directly the EM paradigm (whose conver- gence is pr oven) to our pr oblem, the dimensionality of the latter (and the corresponding size of covariance matrices) forces us to make additional simpliﬁcations that depart fr om the EM scheme. Therefo r e we turn to a simpler appr oach with several dime nsion r eduction steps. 10 (a) Deformation estim ation. Assume that the power spectrum S X is known (in fact, only an estimate ˜ S X is known). Thus, we are able to write the likelihood corr esponding to the observations of the wavelet coefﬁc ients. Then the maximum likelihood estimator is implemented to determine the unknown functions γ an d a . The wavele t transform (1) is computed on a reg ular time-scale grid Λ = s × τ , δ s being the scale sampling step and F s the time sampling fr equency . The sizes of s and τ ar e r espectively denoted by M s and N τ . Considering the covariance expr ession (12) we want to estimate the vector of pa- rameters Θ = ( θ 1 , θ 2 , θ 3 ) ∆ = ( a ( τ ) 2 , log q ( γ ′ ( τ ) ) , γ ( τ ) ) . Let W y = f W y ( Λ ) denote the discr etized transform and let C W ( Θ ) be the corr esponding covariance matrix. The re- lated log-likelihoo d is L ( Θ ) = − 1 2 ln | det ( C W ( Θ )) | − 1 2 C W ( Θ ) − 1 W y · W y . (13) The matrix C W ( Θ ) is a matrix of size M s N τ × M s N τ , which is generally huge. For instance, for a 5 seconds long signal, sampled at fr equency F s = 44.1 kHz, when the wavelet transform is computed on 8 scales, the matrix C W ( Θ ) has about 3.1 trillion elements which makes it numerically intractable. In addition, due to the redundancy of the wavelet transform, C W ( Θ ) turns out to be singular , and likelihood evaluation is impossible. T o overco me these issues, we use a block-diagonal regularization of the covariance matrix, obtained by forc ing to zeros entries corresponding to differ ent time indices. In other wor ds, we disregar d time correlatio ns in the wavelet domain, which amounts to considering ﬁxed time vector w y , τ n = f W y ( s , τ n ) as independe nt cir cular Gaussian vectors with zer o-mean a n d covariance matrix C ( Θ n ) i j = θ n ,1 C 0 ( θ n ,2 ) i j , 1 ≤ i , j ≤ M s , (14) wher e C 0 ( θ n ,2 ) i j = q ( s i + s j ) / 2 Z ∞ 0 S X ( q − θ n ,2 ξ ) ˆ ψ ( q s i ξ ) ˆ ψ ( q s j ξ ) d ξ . (15) In this situation, the regularized log likelihood L r splits into a sum of inde pendent terms L r ( Θ ) = ∑ n L ( Θ n ) , 11 wher e Θ n = ( θ n ,1 , θ n ,2 ) ∆ = ( θ 1 ( n ) , θ 2 ( n ) ) corr esponds to the amplitude and warping parameters at ﬁxed time τ n = τ ( n ) . Notice that, in such a formalism, θ n ,3 = γ ( τ n ) does not appear a nymore in the covariance expressio n. Thus, we are led to maximize independently for e ach n L ( Θ n ) = − 1 2 ln | det ( C ( Θ n )) | − 1 2 C ( Θ n ) − 1 w y , τ n · w y , τ n . (16) For simplicity , the estimation p rocedur e is d one by a n iterative algorithm (given in mor e details in part II I-B2), which rests on two main steps. First, the log-likelihood is maximized with respect to θ n ,2 using a gradient ascent method, for a ﬁxed value of θ n ,1 . Second, for a ﬁxed θ n ,2 , an estimate for θ n ,1 is dir ectly obtained which reads ˜ θ n ,1 = 1 M s C − 1 0 ( θ n ,2 ) w y , τ n · w y , τ n . (17) (b) Spec trum estimation. Assume the amplitude modulation and time-warping param- eters θ 1 and θ 2 ar e known (in fact, only estimates ˜ θ 1 and ˜ θ 2 ar e known) . For any n we can compute the wa vele t transfo rm 1 θ 1/2 n ,1 f W y ( s − θ n ,2 , τ n ) = W x ( s , γ ( τ n ) ) , (18) For ﬁxed scale s m , w x , s m ∆ = W x ( s m , γ ( τ )) ∈ C N τ is a zero-mean random circular Gaussian vector with time independent variance (as a realization of the wavelet transform of a stationary process) . Hence, the empirical variance is an unbiased estimator of the variance. W e then obtain the so-called wavele t spectr um S X , ψ ( q − s m ω 0 ) ∆ = E ( 1 N τ k ψ k 2 2 k w x , s m k 2 ) (19) = 1 k ψ k 2 2 Z ∞ 0 S X ( ξ ) q s m   ˆ ψ ( q s m ξ )   2 d ξ , (20) wher e ω 0 is the central frequency of | ˆ ψ | 2 . S X , ψ is a narro wband version of S X center ed ar ound fr equency ν m = q − s m ω 0 . Be side s, the bandwidth of the ﬁlter is proportio nal to the frequency ν m . This motivates the introduct ion of the following e stimator ˜ S X of S X ˜ S X ( q − s m ω 0 ) ∆ = 1 N τ k ψ k 2 2 k w x , s m k 2 . (21) Finally , the e stimate ˜ S X is extended to all ξ ∈ [ 0, F s /2 ] by linear interpolat ion. 12 2) Algori thm: The estimation pro cedur e is implemented in an iterative alternate op- timization algorithm. This algorithm whose pseudo-code is given as Algorithm 1 is named Joint Estimation of Frequency , Am plitude, and Spectrum (JEF AS). The initialization needs a n initial guess for the power spectrum S X of X . W e use the spectrum estima- tor (21) applied to the observation Y . After k iterations of the algorit hm, estimates ˜ Θ ( k ) n and ˜ S ( k ) X for Θ n and S X ar e available. Hence we can only evaluate the plug-in estimate ˜ C ( k ) 0 of C 0 , obtained by r eplacing the power spectrum with its estimate in the covariance matrix (15). This yields an approximate expression L ( k ) for the log-likelihood, which is used in place of L in (16) for maximum likelihood estimation. The inﬂuence of such approximations on the performances of the algorithm ar e discussed in section III-C. T o assess the conver gence of the algorithm, the relative update of the parameters is chosen as stopping criterion:    ˜ θ ( k ) j − ˜ θ ( k − 1 ) j    2 2 ,    ˜ θ ( k − 1 ) j    2 2 < Λ , for j = 1, 2 , (22) wher e 0 < Λ < 1 is a user deﬁned thr eshold. Finally , after conver gence of the algor ithm to the estimated value ˜ Θ ( k ) , log q ( γ ′ ) and a 2 ar e estimated thr ough time by cubic spline interpolation. B e sides, γ is given by numerical integratio n assuming that γ ( 0 ) = 0. Remark 5: T o contr ol the variances of the estimators, and the comput ational cost, two diff er ent discret izations of the scale axis are used for ˜ θ 1 or ˜ θ 2 . Indeed, the computation of the log-likelihood involves the evaluation of the inverse covariance ma trix. In [21], a sufﬁcient condition for invertibility was given in the presence of noise. The major consequence induced by this condition is that when δ s is close to ze ro (i.e . the sampling period of scales is small), the covariance matrix could not be numerically invertible. The scale discretization must then be suf ﬁciently coarse to ensur e good conditioning for the matrix. While this condition can be reasonably fulﬁlled to estimate θ n ,2 without impairing the p e rformances of the estimator , it cannot be applied to the estimation of θ n ,1 because of the inﬂuence of M s on its Cram ´ er -Rao bound (see section III -C below). The choice we made is to ma ximize L ( Θ n ) for θ n ,2 with w y , τ n corr e sponding to a coarse sampling s p which is a subsampled version of the original vector s , the scale sampling step and the size of s p being respect ively p δ s and ⌊ M s / p ⌋ for some p ∈ N ∗ . 13 Algorithm 1 JEF A S (Joint Estimation of Fr equency , Amplitude and Spectr um) Initialization: Compute an e stimate ˜ S Y of the power spectru m of Y as an initial guess ˜ S ( 0 ) X for S X . Initialize the estimator of the squared amplitude modulation with ˜ θ ( 0 ) n ,1 = 1, ∀ n . Compute the wavelet transform W y of y . k : = 1 while criterion (22) is false and k ≤ k m a x do • For each n , subsample w y , τ n on scales s p , and estimate ˜ θ ( k + 1 ) n ,2 by maximizing the appro ximate log-likelihood L ( k )  ˜ θ ( k ) n ,1 , θ n ,2  in (16). • For each n , e stimate ˜ θ ( k + 1 ) n ,1 by ma ximizing the approximat e log-likelihood L ( k )  θ n ,1 , ˜ θ ( k + 1 ) n ,1  with r espect to θ n ,1 in (16). Or , in absence of noise, dir ectly app ly equation (17) using the regularized covarianc e matrix given by (23). • Construct the estimated wa vele t transform W x of the underlying stationary signal by interpolation fr om W y and ˜ θ ( k ) with equation (18). Estimate the corr esponding power spectrum ˜ S ( k + 1 ) X with (21). • k : = k + 1 end while • Compute ˜ a and ˜ γ by interpolation fr om ˜ Θ ( k ) . While L ( Θ n ) is ma ximize d for θ n ,1 on the original ﬁne sampling s , a regularization of the covariance matrix has to be done to ensure invertibility . The reg ularized matrix is constr ucted by replacing covariance matrix C 0 ( θ n ,2 ) in (15) by its r egularized version C 0, r ( θ n ,2 ) , given by C 0, r ( θ n ,2 ) = ( 1 − r ) C 0, r ( θ n ,2 ) + r I , (23) for some r egularization parameter 0 ≤ r ≤ 1. Remark 6: After conver gence of the estimation algorithm, the estimated functions ˜ a and ˜ γ allow constr ucting a “stationarized” signal ˜ x = D ˜ γ − 1 A ˜ a − 1 y . ˜ x is an e stimation of the original underlying stationary signal x . Furthermor e, the W elch periodogram [29] may be computed from ˜ x to obtain an estimator of S X whose bias does not depend on fr equency (unlike the estimator used within the iterative algorithm). 14 Remark 7: In order to accelerate the spe e d of the algorithm, the estimation can be done only on a subsampled time grid. The main effect of this choice on the algorithm concerns the ﬁnal estimation of a and γ which is mor e sensitive to the interpolation operation. In the follo wing section, we analyze quantities that enable the evaluation of the expected performances of the e stimators, and their inﬂuence on the a lgorithm. The r eader who is not directly inter ested in the statistical backgro und may skip this section and jump dir ectly to the numerical results in part IV. C. Pe rformances of the es timators and the algorithm (a) Bi as. For θ n ,1 , the e stimator is unbiased when the actual values of θ n ,2 and S X ar e known. In our case, the bias b ( k ) n ,1 ( θ n ,1 ) = E n ˜ θ ( k ) n ,1 o − θ n ,1 is written as b ( k ) n ,1 ( θ n ,1 ) = θ n ,1 M s T race  ˜ C ( k ) 0  ˜ θ ( k ) n ,2  − 1 C 0 ( θ n ,2 ) − I  . (24) As expected, the better the covariance matrix estimation, the lower the bias    b ( k ) n ,1    . For θ n ,2 , as we do not have a closed-form expr ession for the estimator we a re not able to give a n e xpression of the bia s. Nevertheless, if we assume that the two other true variables are known, as a maximum likelihood estimator we make sure that ˜ θ n ,2 is asymptotically unbiased (i.e. ˜ θ n ,2 → θ n ,2 when M s → ∞ ). Regarding S X , equation (19) shows that the estimator yields a smoothed, thus biased version of the spectrum. Propo sition 1 below shows that the estimated spectrum con- ver ges to this biased version when the deformation parameters converge to their actual values. Proposition 1: Le t ψ ∈ H 2 ( R ) be an analytic wa vele t such that ˆ ψ is bounded and   ˆ ψ ( u )   = O u → ∞ ( u − η ) with η > 2. Let ϕ 1 and ϕ 2 be bounded functions deﬁned on R + by ϕ 1 ( u ) = u   ˆ ψ ( u )   2 and ϕ 2 ( u ) = u 2   ˆ ψ ( u )   . Assume S X is such that J X = Z ∞ 0 ξ − 1 S X ( ξ ) d ξ < ∞ . Let S ( k ) X denote the estimation of the spectru m after k iterations of the algorithm. Le t b ( k ) S X denote the bia s deﬁned for all m ∈ [ [ 1, M s ] ] by b ( k ) S X ( m ) = E n ˜ S ( k ) X ( q − s m ω 0 ) o − S X , ψ ( q − s m ω 0 ) . 15 Assume ther e exists a constant c θ 1 > 0 such that θ ( k ) n ,1 > c θ 1 , ∀ n , k . Then    b ( k ) S X    ∞ ≤ J X k ψ k 2 2  K ′ 1    θ 1 − ˜ θ ( k ) 1    ∞ + K ′ 2    ˜ θ ( k ) 2 − θ 2    ∞  , (25) wher e K ′ 1 = k ϕ 1 k ∞ c θ 1 < ∞ , K ′ 2 = ln ( q )  k ϕ 1 k ∞ + 2 k ˆ ψ ′ k ∞ k ϕ 2 k ∞  < ∞ . The p r oof of the Propositio n is given in supplementary materials. Remark 8: If θ ( k ) 1 → θ 1 and θ ( k ) 2 → θ 2 as k → ∞ , we have E n ˜ S ( k ) X ( ν m ) o → k S X , ψ ( ν m ) , as expected. Formula (25) enables the contr ol of the spectrum bias at frequencies ν m = q − s m ω 0 only . Notice also that the requir eme nt J X < ∞ forces S X to vanish at zero fr equency . (b) V ariance. The Cram ´ er-Rao lower bound (CRLB) gives the minimum variance that can be attained by unbiased estimators. The Slepian-Bangs formula (see [30]) directly gives the following C R LB for component θ n , i CRLB ( θ n , i ) = 2 T race (  C ( Θ n ) − 1 ∂ C ( Θ n ) ∂ θ n , i  2 ) ! − 1 . This bound gives information a bout the variance of the estimator at convergence of the algorithm, i.e. when both S X and the other parameters are well estimated. Applying this formula to θ n ,1 gives E n  ˜ θ n ,1 − E  ˜ θ n ,1   2 o ≥ C RLB ( θ n ,1 ) = 2 θ 2 n ,1 M s . This implies that the number of scales M s of the wavelet transform must be la r ge enough to yield an estimator with sufﬁ ciently small variance. For θ n ,2 , no closed-form expr ession is available for the CRLB. Therefor e, the evaluation of this bound and its comparison with the variance of the estimator ˜ θ n ,2 can only be based on numerical results, see section IV. (c) Robustness to noise. Assume now observations are corrupted by a random Gaussian white noise W with variance σ 2 W (supposed to be known): Y = A a D γ X + W . (26) 16 The estimator ˜ θ n ,1 is not robus t to noise. I nd e ed, if the ma ximum likelihood estimator of model (7) in the pr esence of such white noise, a new term b ( k ) n ,1 | W ( θ n ,1 ) must be adde d to the bias e xpression (24), which becomes b ( k ) n ,1 | W ( θ n ,1 ) = 1 M s T race  ˜ C ( k ) 0  ˜ θ ( k ) n ,2  − 1 C wn  , wher e ( C wn ) i j = σ 2 W q ( s i + s j ) / 2 R ∞ 0 ˆ ψ ( q s i ξ ) ˆ ψ ( q s j ξ ) d ξ . In practice, this term can take lar ge values, ther efor e noise has to be taken into account. T o do so, the covariance matrix is now writt en as C ( Θ n ) i j = q s i + s j 2 Z ∞ 0 ( θ n ,2 S X ( q − θ n ,1 ξ ) + σ 2 W ) ˆ ψ ( q s i ξ ) ˆ ψ ( q s j ξ ) d ξ (27) and the likelihood is modiﬁed accor dingly . Formula (17) is no longer true and no closed- form expr ession can be derived anymore, the maximum likelihood estimate ˜ θ n ,1 must be computed by a numerical scheme (here we use a simple gradient ascent). The estimator ˜ θ n ,2 is very r obust to noise. Inde ed, equation (27) shows that the only change in the covariance matrix formula is to r eplace the p ower spectrum S X by S Z = S X + σ 2 W θ n ,2 . The additive constant term does not impair the estimator as long as it is small in comparison with the maximum values of S X . Mor eover , the estimator ˜ S X is modiﬁed because when computing 1 θ 1/2 n ,1 f W y ( s − θ n ,2 , τ n ) on scale s m , we compute: w z , s m = w x , s m + w w ∗ , s m , wher e w w ∗ , s m = 1 θ 1/2 1 f W w ( s m − θ 2 , τ ) is the wavelet transform of a white noise modulated in a m p litude by a − 1 . Thus a constant term ˜ σ W independent of frequency is added to the new spectr um estimator ˜ S Z , so that E  ˜ S Z  = S X , ψ + ˜ σ 2 W wher e ˜ σ 2 W = σ 2 W 1 N τ N τ ∑ n = 1 1 θ n ,1 . D. Ex tension: estimation of other deformations T o describe other nonstationary behaviors of audio signals, other operators can be investigated. For example, combination of time warping and frequency modulation can be consider ed, as was done in [22], we shortly account for this case her e for the sake of completeness. Let α ∈ C 2 be a smooth function, and set M α : M α x ( t ) = e 2i π α ( t ) x ( t ) , (28) 17 The deformation model in [22] is of the form Y = A a M α D γ X . (29) T o perform joint estimation of amplitude and frequency modulation and time warping for ea ch time, a suitable time-scale-frequency transform V is intr oduced, deﬁne d as V X ( s , ν , τ ) = h X , ψ s ν τ i , with ψ s ν τ = T τ M ν D s ψ . I n that case, appr oximation r esults similar to The orem 1 can be obtained from which the corresponding log-likelihood can be written. At ﬁxed time τ , the estimation strategy is the same as befor e, but the p a rameter space is of higher dimension, and the extra parameter θ 3 = α ′ ( τ ) complicates the log- likelihood maximization. In p a rticular , the choice of the discretization of the two scale and frequency variables s and ν inﬂuences pe rformances of the estimator , in particular the Cram ´ er -R ao bound. I V . N U M E R I C A L R E S U LT S W e now turn to numerical simulations and a p p lications. A ma in ingr edient is the choice of the wavele t transform. Here we shall always use the sharp wavelet ψ ♯ deﬁned in (2) and set the scale constant to q = 2. W e syst ematically compar e our appro ach to simple estimators for amplitude modu- lation and time warping, commonly used in applications, d e ﬁned below . The approach of [20] was also impleme nted, but we couldn’t get satisfactor y r esults with that ap- pr oach. • Amplitude modulation: we use a s baseline estimator of a ( τ n ) 2 the average ener gy ˜ θ ( B ) n ,1 deﬁned as follows: ˜ θ ( B ) n ,1 = 1 M s k w y , τ n k 2 . This amounts to replace the estimated covariance matrix in (17) by the identity matrix. Notice that ˜ θ ( B ) n ,1 does not depend on the time warping e stimator , and can be computed d irectly on the observat ion. • T ime warping: the baseline estimator ˜ θ ( B ) n ,2 is the scalogram scale center of mass deﬁned as follows: ˜ θ ( B ) n ,2 = C 0 + 1 k w y , τ n k 2 M s ∑ m = 1 s [ m ] | w y , τ n [ m ] | 2 . C 0 is chosen such that ˜ θ ( B ) 2 is a zer o-mean vecto r . 18 Estimation Amplitude T ime method modulation warping Baseline 0.2015 0.0232 JEF AS 0.0701 0.0005 T AB LE I E S T I M AT I O N M E A N S Q U A R E E R R O R S F O R B O T H D E F O R M A T I O N S Numerical evaluation is performed on both synthetic signals and deformations and real audio signals. A. Synthetic signal W e ﬁrst evaluate the performances of the algorithm on a synthetic signal. This allows us to compar e variance and bias with their theor etical values. The simulated signal has length N τ = 2 16 samples, sampled at F s = 8 kHz (mean- ing the signal duration is t F = ( N τ − 1 ) / F s ≈ 8.2 s). The spe ctrum S X is written as S X = S 1 + S 2 wher e S l ( ν ) = 1 + cos  2 π ( ν − ν ( l ) 0 ) / ∆ ( l ) ν  if | ν − ν ( l ) 0 | < ∆ ( l ) ν /2 and vanishes elsewhere (for l ∈ { 1, 2 } ). The amplitude modulation a is a sine wave a ( t ) = a 0 ( 1 + a 1 cos ( 2 π t / T 1 ) ) , whe re a 0 is chosen such that t − 1 F R t F 0 a 2 ( t ) d t = 1. The time warping function γ is such that log q ( γ ′ ( t ) ) = Γ + cos ( 2 π t / T 2 ) e − t / T 3 , whe r e Γ is chosen such that t − 1 F R t F 0 γ ′ ( t ) d t = 1 . JEF AS is implemented in the MA TLAB/Octave scientiﬁc envir onment. Dimensions wer e set as M s = 106 and p = 7. The wavelet transform is computed using the sharp wavelet with ln ( ǫ ) = − 25 corr esponding to a quality factor Q = 6. In this prob lem, the algorithm took 67 seconds to converge on a standard de sktop computer (CPU Intel Cor e @ 3.20 GHz × 4, 7.7 GB R AM). Results ar e shown in Fig. 1 and compared with baseline e stimations. For the sake of visibility , the baseline estimator of the amplitude modulation (which is very oscillato ry) is not displayed, but nume rical assessments are pr ovided in T able I , which gives MSEs for the differ ent estimations. JEF AS is clearly more pr ecise than the baseline algorithm, furthermor e its precision is well accounted for by the Cram ´ er -Rao bound: in Fig. 1, the estimate is essentially contained within the 9 5 % conﬁdence interval pr ovided by the CR L B (assuming Gaussianity and unbiasedness). 19 0 1 2 3 4 5 6 7 8 Time (s) 0 0.5 1 1.5 2 2.5 3 Amplitude modulation 0 1 2 3 4 5 6 7 8 Time (s) -0.5 0 0.5 1 Time warping Fig. 1. Joint amplitude modulation/time warping estimation on a synthetic si gnal. T op: amplitude mo d ulation estimation ( a 1 = 0.4 and T 1 = t F /3). Bottom: warping estimation ( T 2 = t F /2 and T 3 = t F /2). The le ft hand side of Fig. 2 displa ys the estimated spectrum given by formula (21). The a gr eement with the actual spectrum is very good, with a slight enlargement effect due to ﬁltering by | ˆ ψ | 2 . The right hand side of Fig. 2 gives the evolution of the stopping criterion (22) with iterations. Numerical results show that time warping estimation conver ges faster than amplitude modulation estimation. N e vertheless, when ﬁxing a stopping criterio n to 0.1 % only 7 iterations are necessary for JEF AS to conver ge. B. Application to d olphin s ound spectral analysis After studying the inﬂuence of the various parameters, we now turn to real-world audio examples. First, we analyze a recor ding of a two seconds long dolphin vocalization sound, described in [23]. The wavelet transform of this signal in Fig. 3 shows that the warping model (7) ﬁts well this kind of signal, except for transient clicks that are not 20 400 600 800 1000 1200 1400 1600 Frequency (Hz) 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2 Spectrum 1 2 3 4 5 6 7 It e r a t i o n 1 0 -4 1 0 -3 1 0 -2 1 0 -1 1 0 0 R e l a t i v e u p d a t e T = 0 . 1 % Ti m e w a r p i n g Am p l i t u d e m o d u l a t i o n S p e ct r u m Fig. 2. Left: Spectrum estimation ( ν ( 1 ) 0 = 600 Hz, ∆ ( 1 ) ν = 200 Hz, ν ( 2 ) 0 = 1.2 kHz, ∆ ( 2 ) ν = 400 Hz): actual (dash-dot blue line) and estimated (solid red line) s pectra. Right: Relative update evolution. accounted for . JEF AS allows the estimation of the spectrum of the underlying stationary signal. On the top-right of Fig. 3, we display the wavelet transform of the signal obtained by application of the inverse de formations estimated by JEF AS. Notice that the pr esence of clicks slightly disturbs the stationarization p rocess. Nonetheless, it makes sense to estimate a power spectrum from this signal, since the time de pendence of its wa vele t transform is ne gligible with respect to its scale dependence. The estimated spectra from the original signal an d fro m the estimated underlying stationary signal are displayed on the midd le and the bottom of Fig. 3. Thanks to JEF AS, the harmonic str uctur e clearly appears (bottom plot). W e believe the application of JEF AS to these types of sounds can potentially bring new insights in bioacoustic applications. C. Ap plication to Doppler estim ation Finally , we analyze a sound which is a r ecor d ing (fr om a ﬁxed location) fr om a racing car , moving with constant speed. The car engine sound is then deformed by the Doppler effect , which r esults in time warping, as explained below . Besides, as the car is moving, 21 0 0.5 1 1.5 2 Time (s) 5 4 3 2 1 0.5 0.2 Frequency (kHz) 0 0.5 1 1.5 2 Time (s) 5 4 3 2 1 0.5 0.2 10 -6 10 -4 10 -2 10 0 Estimated spectrum 0 5 10 15 Frequency (kHz) 10 -6 10 -4 10 -2 10 0 Estimated spectrum Fig. 3. Dol p hin sound spectral analysis. T op le ft: log-scalogram of the original signal. T op ri ght: log - scalograms of unwarped and unmodulated sig nals. Middle: estimated spectrum from the or iginal signal. Bo ttom: spectrum from the es timated underly ing s tationary s ignal. 22 the closer the car to the micr ophone, the larger the amplitude of the recor ded sound. Thus, our model ﬁts well this signal. The wavele t transforms of the original signal and the two estimations of the un- derlying stationary signal ar e shown in Fig. 4. W hile the e stimation of time warping only correct s the displaceme nt of wavelet coef ﬁcients in the time-scale domain, the joint estimation of time warping a nd amplitude modulation also approximately corr ects nonstationary variations of the amplitudes. The p hysical relevance of the estimated time warping function can be veriﬁed. Indee d, denote by V the (constant) speed of the car and by c the sound velocity . Fixing the time origin to the time at which the car passes in front of the observer at distance d , the time warping functio n due to Doppler ef fect can be shown to be γ ′ ( t ) = c 2 c 2 − V 2 1 − V 2 t p d 2 ( c 2 − V 2 ) + ( c V t ) 2 ! . (30) W e p lot in Fig. 4 (botto m right) the estimation ˜ γ ′ compar ed with its theor etical value wher e d = 5 m and V = 54 m/s. Clearly the estimate is close to the corr e sponding theor etical curve obtained with these data, which ar e ther e for e r ealistic values. Nevertheless, a closer look at scalograms in Fig. 4 shows that the amplitude correctio n is still not perfect, due to the presence of noise, and the fact that the model r emains too simple : the amplitude modulation actually dep e nds on frequency , which is not accounted for . V . C O N C L U S I O N S W e have d iscussed in this paper e xtensions of methods a nd algorithms described earlier in [9], [28], [21], [22] for the joint estimation of deformation operator and power spectrum for deformed stationary signals, a pr oblem a lr eady address ed in [20] with a diff er ent approach. B esides some impr ovements on the estimation algorithm itself, the main impr ovements described in this paper concern the follow ing two points 1) the extension of the algorithm to the joint estimation of deformations including amplitude modulation to the model and its estimation ([22] was limited to time warping and combinations of time warping with frequency modulation, and in - vestigated generalized wa vele t transfo rms); 23 0 0.5 1 1.5 2 Time (s) 5 4 3 2 1 0.5 0.2 Frequency (kHz) 0 0.5 1 1.5 2 Time (s) 5 4 3 2 1 0.5 0.2 Frequency (kHz) 0.2 0.4 0.6 0.8 1 Time (s) 0.85 0.9 0.95 1 1.05 1.1 1.15 Time warping function Fig. 4. Doppler estimation. T op: lo g -scalograms of the or iginal (left) and unwarped and unmodulated (rig ht) signals. Bottom left: log-scalog ram of the unwarped signal. Bottom right: Estimated time warping com p ared with the theoretical value given in (30). 2) a statistical study of the estimators and of the performances of JEF AS algorithm, with pr ecise mathematical statements . The propo sed approach was validated on numerical simulations and applications to two case studies: spectral estimation fr om non-stationar y d olphin vocalization, and Doppler estimation. The r esults presented here show that the pro posed extensions yield a signiﬁcant impr ovement in terms of precision, and a better theoretical contro l. In particular , the con- tinuous parameter estimation procedur e avoids quantization effect s that were pr esent in [21] where the pa rame ter space was discr ete and the estimation based on e xhaustive sear ch. It also allows the derivation of precisio n estimates, in particular a Cram ´ er-Rao bound. Numerical r esults show that the introductio n of amplitude modulation also impr oves results. Finally , regar ding the approach of [20], its domain of validity see ms to be limited to small-scale (i.e. high-fr equency) signals, which is not the case her e. Contrary to [20], our approach is based on (appro ximate) maximum likelihood estima- 24 tion in the Gaussian framework. Because of our choice to disregar d time corr ela tions, the estimates obtained her e generally pr esent spurious ﬂuctuations, which can be smoothed out by appropriat e ﬁltering. A natural extension of our approach would be to intro duce a smoothness p rior that should avoid such ﬁltering steps when necessary . W e believe that being able to estimate precisely warping functions can be valuable in a variety of audio a p plications. Speech applications have alr eady been studied, we ma y also mention bioacoustic signals, for example to r eﬁne the fr equency excursio n indices used to assess vocal pe rformances of songbir ds (see [31] and r e fer ences ther ein). Quite obviously , contro lling warping and spectrum opens new perspectives in sound design, for example for cro ss-synthesis . Futur e work will apply the aforementio ned methods to a task of blind source separation of nonstationary signals. The code and da tasets used to pr oduce the numerical r esults of this paper , and other audio examples (female voice, wind, etc.) are available at the web site https://gith ub.com/AdMey nard/JEFAS Mor e details on the sharp wavele t and proo fs of The or em 1 and Pr oposition 1 are given as supplemen tary material, that also includes another case study (application to wind sound). A C K N O W L E D G M E N T W e wish to thank M. Kowalski and R. Kronland-Martinet for fruitful discussions. W e also thank the anonymous reviewers for many valuable comments and suggestions, and for bringing very interesting r efer ences to our attention. W e also thank the Centre de Recherches Math ´ ematiques of U niversit ´ e de Montr ´ e al, where part of this r esearc h has been done, for kind hospitality . R E F E R E N C E S [1] L. H. Koopmans, The spectral analysis of time series , 2nd ed., ser . Probability and mathematical statistics, Z. W . Birnbaum and E. Lukacs, Eds . Elsevier , 1995, vol. 22. [2] M. B. Priestley , Spectral anal ysis and time series , ser . Probability and mathematical statistics. Aca demic Press, 1982, no. v . 1-2. [3] ——, Non-linear and n on-stationary time series an alysis . Academic Press, 1988. [4] S. Mallat, G. Papanicolaou, and Z. Zhang, “Adaptive covariance estimation of l ocally stationary processes, ” Ann. S tatist. , vol. 26, no. 1, pp. 1–47, 02 1998. 25 [5] G. P . Nason, R. von Sachs, and G. Kroisandt, “W avelet processes and adaptive es timation of the evolutionary wavelet spectrum,” Journal of the Royal Statistical Society . Series B (Statistical Methodology) , vol. 62, no. 2, pp. 271–29 2, 2000. [6] R. Dahlha us, “A lik elihood approximation for locally stationary processes,” An n. St atist. , vol. 28, no. 6, pp. 1762–1 794, 12 2000. [7] R. von Sachs and M. H. Neumann, “A wavelet-based test for stationarity ,” Jou rnal of T ime Series Analysis , vol . 21, no. 5, pp. 597–613, 2000. [8] S. Wisdom, L. Atlas, and J. Pitton, “Extending coherence time for analysis of modulated random processes , ” in 2014 I EEE Intern ational Conference on Acoustics, Speech and Sign al Processing (ICASSP ) , M ay 2014, pp. 340–344. [9] H. Ome r and B. T orr ´ esani, “Estimation of frequency modul ations on wideband sig nals; applications to audio signal analysis ,” in Proceedings of the 10th International Con ference on Sampling Theory and Applications (SampT A) , G. Pfander , Ed. Eurasip Open Library , 2013, pp. 29–32. [10] S. T . W isdom, “Improved statistical signal processing of nonstationary r andom processes using time-warping,” Master ’s thesis, School of E lectrical Engineeri ng, University of Washington, USA, March 2014. [11] R. J. McAulay and T . F . Quatieri, “Speech analysis/sy nthesis based on a sinusoidal representation,” IEEE T ransactions on Acoustics, Speech, Signal Processing , vol. 34, pp. 744–754, August 1986. [12] M. Lagrange, S. Marchand, and J. B. Rault, “Enhancing the tracking of partials for the sinusoidal modeling of polyphonic sounds,” IEE E T ransactions on Audio, Speech, and L anguage Processing , vol . 15, no. 5, pp. 1625–1634, July 2007. [13] P . Depalle, G. Garc´ ıa, and X. Rod e t, “T racking of partials for additive so und synthesis using hidden markov models,” in Proceedings of the IEEE International C onference on Acoustics, S peech, and Signal Proc essing (ICASSP-93) , vol. 1, April 1993, pp. 225–22 8. [14] S. Godsill and M. Davy , “Bayesian harmonic models for musical pitch estimation and analysis ,” in The IE EE International C onference on Acoustics, Speech, and Signal Processing (ICAS SP 02) , 2002, p p. 1769–177 2. [15] R. Carmona, W . L. Hwang, and B. T orr ´ esani, Practical time-frequency analysis: Gabor and Wavelet T ransforms With an Implementation in S , C. K. Chui, Ed. Academic Press, 1998. [16] F . Auger , P . Flandrin, Y .-T . Lin, S. McLaughlin, S. Mei g nen, T . Ober lin, and H.-T . W u, “T ime - frequency reassignment and synchrosqueezing: An overview ,” IEEE Signal P rocessing M agazine , vol. 30, no. 6, pp. 32–41, 2013. [17] H.-T . W u, “Instanta neous frequency and wave shape function (i),” Applied and Computati onal Harmonic Analysis , vol. 35, pp. 181–1 99, 2013. [18] C. - Y . Lin, L. Su, and H.-T . Wu, “W ave-shape f unction analysis, ” Journal of Fou rier An alysis and Applications , vol. 24, no. 2, pp. 451–5 05, Apr 2018. [19] M. Clerc and S. Mallat, “The texture gr adient equation for recovering shape from texture,” IEEE T ransactions on Pattern Analysis and Machine Intel ligence , vol. 24, no. 4, pp. 536–549 , 2002. [20] —— , “Estimating deformations of stationary process es,” Ann. Stati st. , vol. 31, no. 6, pp. 1772–182 1, 12 2003. [21] H. Omer and B. T orr ´ esani, “T ime-frequency and time-scale analysis of deformed s tationary processes , with application to non-stationary so und modeling,” A pplied and Computational Harmonic An alysis , vol. 43, no. 1, pp. 1 – 22, 2017. [22] A. M e ynard and B. T or r ´ esani, “Spectral estimation for non-stationary s ignal classes ,” in Sampling Theory and Applications , ser . Proceedings of SampT A17, T allinn, Estonia, Jul. 2017. 26 [23] D. Stowell, Computational Analysis of Soun d Scenes and Events . Springer , 2018, ch. Computational Bioacoustic Scene Analysi s, pp. 303–333 . [24] P . Senin, “Dynamic time warping alg orithm review ,” Information a nd Computer Science Department, University of Hawaii at Manoa, Honolulu, USA, T ech. Re p., 2008. [25] F . Zhang, G. Bi, and Y .-Q. Chen, “Harmonic transform,” IEE P roceed ings - V ision, Image an d Signal Processing , vol. 151, no. 4, pp. 257–263 , Aug 2004. [26] L . W eruaga and M. K ´ epe s i, “The fan-chirp transform for non-stationary harmonic signals,” S ignal P rocessing , vol. 87, p. 15041 522, 2007. [27] R. Dunn and T . F . Quatieri, “Sinewave analysis /synthesis based on the f an-chirp transform,” i n Proceedings if the 2007 IEEE Workshop on Applications of Sign al Processing to Audio and Acoustics , 2007, pp. 247–2 50. [28] H. Omer , “Mod ` ele s de d ´ e formation de processus stochastiques g ´ en ´ er alis ´ es. application ` a l’estimation des non stationnarit ´ es dans les signaux audi o.” Ph.D. disser tation, Aix -Marseille Universi t ´ e, 2015. [29] P . W elch, “The use of fast Fourier transform for the es timation of p o wer spectra: A method based on time averaging over short, modiﬁed periodog rams,” IEEE T ransactions on Audio and Electroac oustics , vol. 15, no. 2, pp. 70–73, Jun 1967. [30] P . Stoica and R. Mos e s, Introduction to Spectral Analysis . Prentice Hall, 1997. [31] J. Podos, D. L. Moseley , S. E. Goo d win, J. McClure, B. N. T aft, A. V . H. Strauss, C. Rega-Brodsky , and D . C. Lahti, “A ﬁne-scale, broadly applicable index of vocal perfor mance: frequency excursion,” Animal Behaviour , vol. 116, pp. 203–212, 2016.

Spectral analysis for nonstationary audio

Original Paper

Comments & Academic Discussion

Leave a Comment

Original Paper

Related Papers

Comments & Academic Discussion

Leave a Comment