Rectified binaural ratio: A complex T-distributed feature for robust sound localization

Most existing methods in binaural sound source localization rely on some kind of aggregation of phase- and level-difference cues in the time-frequency plane. While different aggregation schemes exist, they are often heuristic and suffer in adverse no…

Authors: Antoine Deleforge (PANAMA), Florence Forbes (MISTIS)

Antoine Deleforge* and Florence Forbes†
* Inria Rennes - Bretagne Atlantique
† Inria Grenoble - Rhône-Alpes
(firstname.lastname@inria.fr)

ABSTRACT

Most existing methods in binaural sound source localization rely on some kind of aggregation of phase- and level-difference cues in the time-frequency plane. While different aggregation schemes exist, they are often heuristic and suffer in adverse noise conditions. In this paper, we introduce the rectified binaural ratio as a new feature for sound source localization. We show that for Gaussian-process point source signals corrupted by stationary Gaussian noise, this ratio follows a complex t-distribution with explicit parameters. This new formulation provides a principled and statistically sound way to aggregate binaural features in the presence of noise. We subsequently derive two simple and efficient methods for robust relative transfer function and time-delay estimation. Experiments on heavily corrupted simulated and speech signals demonstrate the robustness of the proposed scheme.

Index Terms: Complex Gaussian ratio; t-distribution; relative transfer function; binaural; sound localization

1. INTRODUCTION

The most widely used features for binaural (two-microphone) sound source localization are the measured time delays and level differences between the two microphones. For a single source signal in the absence of noise, these features correspond in the frequency domain to the ratio of the Fourier transforms of the right- and the left-microphone signals. This ratio is called the relative transfer function (RTF) [1], and only depends on the source's spatial characteristics, e.g., its position relative to the microphones.
The log-amplitudes and phases of the RTF are referred to as interaural level differences (ILD) and interaural phase differences (IPD) in the binaural literature. Many binaural sound source localization methods rely on some kind of aggregation of these cues over the time-frequency plane [2-8]. The generalized cross-correlation (GCC) method [2] consists of weighting the cross-power spectral density (CPSD) of two signals in order to estimate their delay in the time domain (CPSD phases and IPD are the same). A successful GCC method is the phase transform (PHAT), in which IPD cues are equally weighted. The popular sound localization method PHAT-histogram aggregates these cues using histograms [3]. In [5], a heuristic binaural cue weighting scheme based on signals' onsets is proposed. In [4], both ILD and IPD cues are modeled as real Gaussians and their frequency-dependent variances are estimated through an expectation-maximization (EM) procedure referred to as MESSL. A number of extensions of MESSL have later been developed [6-8], including one using t-distributions for ILD and IPD cues instead of Gaussian distributions [6].

While all these methods rely on a weighting scheme of binaural cues, none of these schemes is based on the statistical properties of the source and noise signals. Intuitively, however, a low signal-to-noise ratio (SNR) at the microphones means that a specific cue is less reliable, while a high SNR means that this cue should be given more weight. In this paper, we prove that the ratio of two complex circular-symmetric Gaussian variables follows a complex t-distribution with explicit parameter expressions. In particular, for the binaural recording of a Gaussian-process source corrupted by stationary Gaussian noise, we show that the mean of the microphone signals' ratio depends not only on the clean ratio but also on the source and noise statistics.
This observation naturally leads to the definition of a new binaural feature referred to as the rectified binaural ratio (RBR). The explicit distribution of RBR features provides a principled and statistically sound way of weighting and aggregating them. Based on this, we derive two simple and efficient methods for relative transfer function and time-delay estimation, and test their robustness on heavily corrupted binaural signals.

2. A COMPLEX-T MODEL FOR BINAURAL CUES

In the complex short-time Fourier domain, we consider the following model for a binaural setup recording a static point sound source in the presence of noise:

$$\begin{cases} m_1(f,t) = h_1(f,\theta)\,s(f,t) + n_1(f,t) \\ m_2(f,t) = h_2(f,\theta)\,s(f,t) + n_2(f,t) \end{cases} \quad \text{or equivalently} \quad \mathbf{m}(f,t) = \mathbf{h}(f,\theta)\,s(f,t) + \mathbf{n}(f,t). \tag{1}$$

Here, $(f,t)$ is the frequency-time indexing, $\theta$ is a vector of source spatial parameters, e.g., the source position, $\mathbf{m}(f,t) = [m_1(f,t), m_2(f,t)]^\top \in \mathbb{C}^2$ denotes the microphone signals, $s(f,t) \in \mathbb{C}$ denotes the source signal of interest, $\mathbf{n}(f,t) = [n_1(f,t), n_2(f,t)]^\top \in \mathbb{C}^2$ denotes the noise signals and $\mathbf{h}(f,\theta) = [h_1(f,\theta), h_2(f,\theta)]^\top \in \mathbb{C}^2$ denotes the acoustic transfer function from the source to the microphones. The function $\mathbf{h}(f,\theta)$ is of particular interest because it depends on the source position $\theta$ but does not depend on the time-varying source and noise signals. Under noise-free and non-vanishing source assumptions, i.e., $\mathbf{n}(f,t) = \mathbf{0}$ and $s(f,t) \neq 0$, it is easily seen that the binaural ratio $m_2(f,t)/m_1(f,t)$ is equal to $h_2(f,\theta)/h_1(f,\theta) = r(f,\theta)$, which only depends on the source position. This ratio can hence be used for sound source localization. The quantity $r(f,\theta)$ is called the relative transfer function (RTF) [1].
Its log-amplitudes and phases are respectively referred to as interaural level and phase differences (ILD and IPD).

In practical situations including noise, the ratio $m_2(f,t)/m_1(f,t)$ no longer depends on $\theta$ only, but also on the source and noise signals $s(f,t)$ and $\mathbf{n}(f,t)$. These signals are assumed independent, and we consider the following probabilistic models:

$$P(s(f,t)) = \mathcal{CN}_1(s(f,t);\, 0,\, \sigma_s^2(f,t)), \tag{2}$$

$$P(\mathbf{n}(f,t)) = \mathcal{CN}_2(\mathbf{n}(f,t);\, \mathbf{0},\, \mathbf{R}_{nn}(f)), \tag{3}$$

where $\mathcal{CN}_p$ denotes the $p$-variate complex circular-symmetric normal distribution, or complex-normal. Its density is [9]:

$$\mathcal{CN}_p(\mathbf{x};\, \mathbf{c},\, \boldsymbol{\Sigma}) = \frac{1}{\pi^p |\boldsymbol{\Sigma}|} \exp\left(-(\mathbf{x}-\mathbf{c})^H \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\mathbf{c})\right),$$

where $\{\cdot\}^H$ denotes the Hermitian transpose. We assume that $\mathbf{R}_{nn}(f)$ is known and constant over time, i.e., noise signals are stationary. However, they are not necessarily pairwise independent and may thus include other point sources. On the other hand, the source signal is a Gaussian process with time-varying variance $\sigma_s^2(f,t)$. This general model is widely used in audio signal processing, in particular for sound source separation, e.g., [10].

We now introduce the univariate complex t-distribution, denoted $\mathcal{CT}_1$:

$$\mathcal{CT}_1(y;\, \mu,\, \lambda^2,\, \nu) = \frac{1}{\pi\lambda^2} \left(1 + \frac{|y-\mu|^2}{\nu\lambda^2}\right)^{-(1+\nu)}, \tag{4}$$

where $\mu \in \mathbb{C}$, $\lambda^2 \in \mathbb{R}^+$ and $\nu \in \mathbb{R}^+$ are respectively referred to as the mean, spread and degrees of freedom parameters. This definition follows a construction of multivariate extensions for the t-distribution [11] applied to the complex plane. In the real case, the t-distribution arises from the ratio of a Gaussian over the square root of a Chi-square distribution. In the complex case, we alternatively show the following result:

Theorem 1. Let $\mathbf{m} = [m_1, m_2]^\top$ be a vector in $\mathbb{C}^2$ following a complex-normal distribution such that

$$P(\mathbf{m}) = \mathcal{CN}_2\left(\mathbf{m};\, \mathbf{0},\, \begin{bmatrix} \sigma_{m_1}^2 & \rho\,\sigma_{m_1}\sigma_{m_2} \\ \rho^*\sigma_{m_1}\sigma_{m_2} & \sigma_{m_2}^2 \end{bmatrix}\right).$$
Then the ratio variable $y = m_2/m_1$ follows a complex-t distribution such that

$$P(y) = \mathcal{CT}_1\left(y;\, \frac{\sigma_{m_2}}{\sigma_{m_1}}\rho^*,\, \frac{\sigma_{m_2}^2}{\sigma_{m_1}^2}\left(1-|\rho|^2\right),\, 1\right). \tag{5}$$

Here, $\rho = E\{m_1 m_2^*\}/(\sigma_{m_1}\sigma_{m_2})$ is the correlation coefficient between $m_1$ and $m_2$, and $(\cdot)^*$ denotes the complex conjugate. This result is consistent with that in [12], but we provide a simpler proof with better insight in Appendix A.2.

Theorem 1 can be directly applied to obtain an explicit distribution for the binaural ratio $m_2(f,t)/m_1(f,t)$ under the model defined by (1), (2) and (3). However, both the mean and the spread of this distribution depend on the noise correlation and variances as well as on the transfer functions in a way which is difficult to handle. We will therefore design a more convenient and somewhat more natural binaural feature by first whitening the noise signals in each observed vector $\mathbf{m}(f,t)$, i.e., making them independent and of unit variance. Since $\mathbf{R}_{nn}(f)$ is positive semi-definite, it has a unique positive semi-definite square root $\mathbf{R}_{nn}(f)^{1/2}$. If $\mathbf{R}_{nn}(f)$ is further invertible (see footnote 1), we can define:

$$\mathbf{Q}(f) = \mathbf{R}_{nn}(f)^{-1/2}. \tag{6}$$

By left-multiplication of (1) by $\mathbf{Q}(f)$ we obtain

$$\mathbf{Q}(f)\,\mathbf{m}(f,t) = \mathbf{Q}(f)\,\mathbf{h}(f,\theta)\,s(f,t) + \mathbf{Q}(f)\,\mathbf{n}(f,t), \tag{7}$$

$$\mathbf{m}'(f,t) = \mathbf{h}'(f,\theta)\,s(f,t) + \mathbf{n}'(f,t), \tag{8}$$

where $\mathbf{n}'(f,t)$ follows the standard bivariate complex-normal $\mathcal{CN}_2(\mathbf{0}, \mathbf{I}_2)$. Note that $\mathbf{h}'(f,\theta)$ can only be identified up to a multiplicative complex scalar constant, because the same observations are obtained by dividing the corresponding source signals by this constant. Hence, we can assume without loss of generality that $h'_1(f,\theta) = 1$ and $h'_2(f,\theta) = r'(f,\theta)$, where $r'(f,\theta)$ is the relative transfer function (RTF) after whitening.
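As an illustration of the whitening step (6)-(8), the following minimal sketch computes $\mathbf{Q}(f) = \mathbf{R}_{nn}(f)^{-1/2}$ for a single frequency through an eigendecomposition, and maps an RTF to and from the whitened domain. The function names (`whitening_matrix`, `whitened_rtf`, `unwhitened_rtf`) and all numerical values are hypothetical and chosen for illustration only; the sketch assumes an invertible noise covariance and is not the authors' implementation.

```python
import numpy as np

def whitening_matrix(R_nn):
    """Q = R_nn^{-1/2} for a Hermitian positive-definite noise covariance, as in (6)."""
    eigvals, eigvecs = np.linalg.eigh(R_nn)           # R_nn = V diag(w) V^H
    if np.min(eigvals) <= 0:
        raise ValueError("R_nn is not invertible; see Appendix A.1 for that case")
    # V diag(w^{-1/2}) V^H: scale each eigenvector column, then project back.
    return (eigvecs / np.sqrt(eigvals)) @ eigvecs.conj().T

def whitened_rtf(Q, r):
    """RTF after whitening: ratio of the two entries of Q [1, r]^T."""
    h = Q @ np.array([1.0, r], dtype=complex)
    return h[1] / h[0]

def unwhitened_rtf(Q, r_w):
    """Original RTF: ratio of the two entries of Q^{-1} [1, r']^T."""
    h = np.linalg.solve(Q, np.array([1.0, r_w], dtype=complex))
    return h[1] / h[0]

# Hypothetical correlated noise covariance at one frequency.
R_nn = np.array([[2.0, 0.5 + 0.3j],
                 [0.5 - 0.3j, 1.0]])
Q = whitening_matrix(R_nn)
r = 0.7 - 0.2j                                        # hypothetical RTF value
r_w = whitened_rtf(Q, r)

print(np.round(Q @ R_nn @ Q.conj().T, 6))             # whitened noise covariance
print(r_w, unwhitened_rtf(Q, r_w))                    # round trip back to r
```

The eigendecomposition is used here because it directly yields the Hermitian square root; the whitened noise covariance $\mathbf{Q}\mathbf{R}_{nn}\mathbf{Q}^H$ should then be the identity up to rounding.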
Under this normalization, $m'_1(f,t) = s(f,t) + n'_1(f,t)$, $\sigma^2_{m'_1}(f,t) = \sigma_s^2(f,t) + 1$ and $\sigma^2_{m'_2}(f,t) = |r'|^2\sigma_s^2(f,t) + 1$. Moreover, since $\mathbf{Q}(f)$ is invertible, the original RTF can be obtained from $r'(f,\theta)$ as the ratio of the second to the first entry of the vector $\mathbf{Q}(f)^{-1}[1, r'(f,\theta)]^\top$. We can now use Theorem 1 to obtain that $y'(f,t) = m'_2(f,t)/m'_1(f,t)$ follows the complex-t distribution:

$$\mathcal{CT}_1\left(\frac{\sigma_s^2(f,t)}{1+\sigma_s^2(f,t)}\, r'(f,\theta),\; \frac{\sigma^2_{m'_2}(f,t) + \sigma_s^2(f,t)}{\left(1+\sigma_s^2(f,t)\right)^2},\; 1\right). \tag{9}$$

Interestingly, it turns out that the distribution of a binaural ratio under white Gaussian noise is not centered on the actual RTF $r'(f,\theta)$, but rather on a scaled version of it which depends on the instantaneous source variance. This suggests using the following more natural feature, which we refer to as the rectified binaural ratio (RBR):

$$y(f,t) = \frac{1+\sigma_s^2(f,t)}{\sigma_s^2(f,t)} \cdot \frac{m'_2(f,t)}{m'_1(f,t)}. \tag{10}$$

This feature has the following distribution, which is centered on the RTF $r'(f,\theta)$:

$$P(y(f,t)) = \mathcal{CT}_1\left(y(f,t);\, r'(f,\theta),\, \lambda^2(f,t),\, 1\right), \tag{11}$$

where

$$\lambda^2(f,t) = \frac{\sigma^2_{m'_2}(f,t) + \sigma_s^2(f,t)}{\sigma_s^4(f,t)}. \tag{12}$$

The spread parameter $\lambda^2(f,t)$ is also important because it models the uncertainty or "reliability" associated with each RBR feature: the larger $\lambda^2(f,t)$ is, the less reliable $y(f,t)$ is. Since the noise variance is fixed to 1, we see in (12) that $\lambda^2(f,t)$ tends to 0 when the SNR at $(f,t)$ tends to infinity, while $\lambda^2(f,t)$ tends to infinity when the SNR approaches 0, which matches intuition.

Footnote 1: For the case where $\mathbf{R}_{nn}(f)$ is non-invertible, see Appendix A.1.

3. PARAMETER ESTIMATION

3.1. Spread parameter

We consider the general case of time-varying source variances $\sigma_s^2(f,t)$.
This is more challenging than a stationary model but also more realistic, since typical audio signals such as speech or music are often sparse and impulsive in the time-frequency plane. In this case, the calculation of RBR features (10) and of their spread parameter (12) requires knowledge of the instantaneous source and microphone variances at each $(f,t)$. A number of ways can be envisioned to estimate them. In this paper, we use perhaps the most straightforward approach: the instantaneous microphone variances $\sigma^2_{m'_1}(f,t)$ and $\sigma^2_{m'_2}(f,t)$ are approximated by the observed squared magnitudes $|m'_1(f,t)|^2$ and $|m'_2(f,t)|^2$. More accurate estimates could be obtained using, e.g., a sliding averaging window in the time-frequency plane as in [10]. However, this simple scheme showed good performance in practice. It leads to the following straightforward estimate for $\sigma_s^2(f,t)$:

$$\hat{\sigma}_s^2(f,t) = \begin{cases} |m'_1(f,t)|^2 - 1 & \text{if } |m'_1(f,t)|^2 > 1, \\ 0 & \text{otherwise}, \end{cases} \tag{13}$$

from which we deduce $\hat{\lambda}^2(f,t)$ using (12). An estimate $\hat{\sigma}_s^2(f,t) = 0$ leads to $\hat{\lambda}^2(f,t) = +\infty$, corresponding to a missing data point at $(f,t)$.

3.2. Unconstrained RTF

Once the spread parameter is estimated, we are left with the estimation of $r'(f,\theta)$, which is the mean of the complex t-distribution (11). The equivalent characterization of the t-distribution as a Gaussian scale mixture leads naturally to an EM algorithm that converges under mild conditions to the maximum-likelihood estimate [13]. Introducing an additional set of latent variables $\mathbf{u} = \{u(f,t),\, f = 1{:}F,\, t = 1{:}T\}$, we can write (11) equivalently as:

$$P(y(f,t) \mid u(f,t)) = \mathcal{CN}_1\left(y(f,t);\, r'(f,\theta),\, \frac{\lambda^2(f,t)}{u(f,t)}\right), \tag{14}$$

$$P(u(f,t)) = \mathcal{G}(1, 1), \tag{15}$$

where $\mathcal{G}$ denotes the Gamma distribution.
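The equivalence between (11) and the scale-mixture form (14)-(15) can be checked numerically. The sketch below draws samples from (14)-(15) and compares the empirical distribution of $|y - r'|^2$ with the value $x/(x+\lambda^2)$ obtained by integrating the density (4) with $\nu = 1$ over a disc of squared radius $x$. It assumes that $\mathcal{G}(1,1)$ is parameterized by shape 1 and rate 1; the values of `r` and `lam2` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
r, lam2 = 0.5 - 1.0j, 0.3          # hypothetical RTF and spread parameter

# (15): u ~ Gamma(shape=1, rate=1), i.e. a unit-mean exponential variable.
u = rng.gamma(shape=1.0, scale=1.0, size=n)

# (14): y | u ~ CN(r, lam2 / u). A complex-normal of variance v has
# independent real and imaginary parts of variance v / 2 each.
std = np.sqrt(lam2 / (2.0 * u))
y = r + std * (rng.standard_normal(n) + 1j * rng.standard_normal(n))

# Integrating (4) with nu = 1 over a disc gives P(|y - r|^2 <= x) = x / (x + lam2).
d2 = np.abs(y - r) ** 2
xs = np.array([0.1, 0.3, 1.0, 3.0])
empirical = np.array([np.mean(d2 <= x) for x in xs])
theoretical = xs / (xs + lam2)
for x, e, t in zip(xs, empirical, theoretical):
    print(f"x={x:4.1f}  empirical={e:.4f}  complex-t={t:.4f}")
```

Under the same parameterization, a standard conjugacy argument gives $u \mid y$ as a Gamma variable with shape 2 and rate $1 + |y - r'|^2/\lambda^2$, which is consistent with EM weights of the form $(\lambda^2 + |y - r'|^2)^{-1}$.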
At each iteration $(q)$, the M-step updates $r'(f,\theta)$ as a weighted sum of the $y(f,t)$'s, while the E-step consists of updating the weights, defined as $\omega^{(q)}_{ft} = \frac{1}{2}\hat{\lambda}^{-2}(f,t) \cdot E[u(f,t) \mid y(f,t);\, r'^{(q)}(f,\theta)]$:

M-step: $$r'^{(q+1)}(f,\theta) = \frac{\sum_{t=1}^{T} \omega^{(q)}_{ft}\, y(f,t)}{\sum_{t=1}^{T} \omega^{(q)}_{ft}},$$

E-step: $$\omega^{(q+1)}_{ft} = \left(\hat{\lambda}^2(f,t) + |y(f,t) - r'^{(q+1)}(f,\theta)|^2\right)^{-1}.$$

The initial weights $\omega^{(0)}_{ft}$ can be set to 1, although our experiments showed that random initializations usually converged to the same solution. Convergence is assumed to be reached when $r'(f,\theta)$ varies by less than 0.1% at a given iteration. In practice, the algorithm converged in fewer than 100 iterations in nearly all of our experiments. Once an estimate $\hat{r}'(f,\theta)$ is obtained, the non-whitened RTF $\hat{r}(f,\theta)$ is calculated as the ratio of the second to the first entry of the vector $\mathbf{Q}(f)^{-1}[1, \hat{r}'(f,\theta)]^\top$.

3.3. Acoustic space prior on the RTF

In practice, when a sound source emits in a real room, the RTF can only take a restricted set of values belonging to the so-called acoustic space manifold of the system [8]. Hence, a common approach is to search for the optimal $r'$ among a finite set of $K$ possibilities corresponding to different locations of the source, namely $r' \in \mathcal{R}' = \{r'_1, \dots, r'_K\}$ where $r'_k(f) = r'(f,\theta_k)$. From a Bayesian perspective, this corresponds to a mixture-of-Dirac prior on $r'(f,\theta)$. Considering the observed features $\mathbf{y}$, we then look for the $r'_{\hat{k}}$ that maximizes the log-likelihood of $\mathbf{y}$ as induced by (11). Taking the logarithm of (4), this amounts to minimizing:

$$\hat{k} = \underset{k=1:K}{\operatorname{argmin}} \sum_{t=1}^{T} \sum_{f=1}^{F} \log\left(\hat{\lambda}^2(f,t) + |y(f,t) - r'_k(f)|^2\right).$$

We recover the robustness property that a data point with high spread has less impact on the estimation of $r'$.

4. EXPERIMENTAL RESULTS

4.1. RTF estimation

We first evaluate the RTF estimation method described in Section 3.2 through extensive simulations. 160,000 binaural test signals are generated according to model (1), (2) and (3), under a wide range of noise and source statistics. Each generated complex signal corresponds to $T = 20$ time samples in a given frequency. The variances of source signals are time-varying and uniformly drawn at random. Sparse source signals are simulated by setting their variance to 0 with a 50% probability at each sample. For each test signal, the noise variances and correlation are uniformly drawn at random, and the RTF $r$ is drawn from a standard complex-normal distribution.

[Fig. 1. Mean squared error of different RTF estimation methods for various SNRs. Panels: dense source signals, sparse source signals. Horizontal axis: input SNR (dB). Vertical axis: mean squared error on the RTF (logarithmic scale). Curves: RBR (proposed), Mean Ratio, Mean IPD/ILD, Random.]

The proposed method is compared to two baseline methods. The first one (Mean ratio) takes the mean of the complex microphone ratios $m_2(f,t)/m_1(f,t)$ over the $T$ samples of each signal. The second one (Mean ILD/IPD) calculates the mean ILD and IPD as follows:

$$\begin{cases} \mathrm{ILD} = \dfrac{1}{T} \sum_{t=1}^{T} \log\left|\dfrac{m_2(f,t)}{m_1(f,t)}\right|, \\[2ex] \mathrm{IPD} = \dfrac{1}{T} \sum_{t=1}^{T} \dfrac{m_2(f,t)/|m_2(f,t)|}{m_1(f,t)/|m_1(f,t)|}. \end{cases} \tag{16}$$

The RTF is then estimated as $\exp(\mathrm{ILD}) \cdot \mathrm{IPD}$. This latter type of binaural cue aggregation is common to many methods, including [3, 5, 8]. For fairness of comparison, the samples identified as missing by our method according to (13) are ignored by all 3 methods. Mean squared errors for various signal-to-noise ratios (SNR) and for both dense (left) and sparse (right) source signals are shown in Fig. 1.
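The three estimators compared in this experiment can be sketched as follows for a single frequency. This is a minimal illustration under simplifying assumptions that are not those of the experiment above: the noise is already white with unit variance (so $\mathbf{Q} = \mathbf{I}$), a single long signal is used instead of many short ones, and the source variance range, RTF value and function names are hypothetical. It is not the authors' Matlab implementation.

```python
import numpy as np

def rbr_features(m1, m2):
    """RBR features (10) and spreads (12) from whitened signals, using estimate (13)."""
    p1, p2 = np.abs(m1) ** 2, np.abs(m2) ** 2
    var_s = np.maximum(p1 - 1.0, 0.0)               # (13)
    valid = var_s > 0                               # var_s = 0 means missing data
    vs = var_s[valid]
    y = (1.0 + vs) / vs * m2[valid] / m1[valid]     # (10)
    lam2 = (p2[valid] + vs) / vs ** 2               # (12), with sigma^2_{m'2} ~ |m'2|^2
    return y, lam2, valid

def em_rtf(y, lam2, tol=1e-3, max_iter=100):
    """EM estimate of the mean of the complex-t features (Section 3.2)."""
    w = np.ones_like(lam2)                          # initial weights set to 1
    r = np.sum(w * y) / np.sum(w)                   # first M-step
    for _ in range(max_iter):
        w = 1.0 / (lam2 + np.abs(y - r) ** 2)       # E-step
        r_new = np.sum(w * y) / np.sum(w)           # M-step
        converged = np.abs(r_new - r) <= tol * np.abs(r)   # change below 0.1%
        r = r_new
        if converged:
            break
    return r

def mean_ratio(m1, m2):
    """Baseline 1: mean of the complex microphone ratios."""
    return np.mean(m2 / m1)

def mean_ild_ipd(m1, m2):
    """Baseline 2: exp(ILD) * IPD with ILD and IPD as in (16)."""
    ild = np.mean(np.log(np.abs(m2 / m1)))
    ipd = np.mean((m2 / np.abs(m2)) / (m1 / np.abs(m1)))
    return np.exp(ild) * ipd

rng = np.random.default_rng(1)
T, r_true = 5000, 0.8 + 0.6j                        # hypothetical setup

def complex_normal(var):
    """Circular complex Gaussian samples with (possibly per-sample) variance var."""
    return np.sqrt(var / 2.0) * (rng.standard_normal(T) + 1j * rng.standard_normal(T))

var_s = rng.uniform(50.0, 150.0, T)                 # time-varying source variance
s = complex_normal(var_s)
m1 = s + complex_normal(1.0)                        # whitened model (8), h'_1 = 1
m2 = r_true * s + complex_normal(1.0)               # h'_2 = r'

y, lam2, valid = rbr_features(m1, m2)
r_em = em_rtf(y, lam2)
print("RBR + EM     :", abs(r_em - r_true))
print("Mean ratio   :", abs(mean_ratio(m1[valid], m2[valid]) - r_true))
print("Mean ILD/IPD :", abs(mean_ild_ipd(m1[valid], m2[valid]) - r_true))
```

The baselines are evaluated on the same non-missing samples as the EM estimate, mirroring the fairness convention stated above.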
As an indicator of the error upper bound, the results of a method generating random RTF estimates (Random) are also shown in Fig. 1. Except at low SNRs (≤ -5 dB), where all 3 methods yield estimates close to randomness, the proposed method outperforms both the others. In particular, for SNRs larger than 15 dB, the mean squared error is decreased by several orders of magnitude and the RBR features performed best in 92% of the tests. Two facts may explain these results. First, as shown in (9), the microphone ratio is a biased estimate of the RTF under white noise conditions. This bias is further amplified for arbitrary noise statistics. Second, the baseline methods, like many existing methods in the literature, aggregate binaural cues with binary weights: each sample is classified as either missing or not. In contrast, the explicit spread parameter (12) available for rectified binaural ratios makes it possible to weight observations in a statistically sound way.

[Fig. 2. Comparing time-delay estimation results of RBR and PHAT using 1 second noisy speech signals (200 test signals per SNR value). Panels: mean time-delay error (samples), percentage of correct delays. Horizontal axis: input SNR (dB). Curves: RBR (proposed), PHAT-histogram, Random.]

4.2. Time difference of arrival estimation

Under free-field conditions, i.e., direct single-path propagation from the sound source to the microphones, localizing the source is equivalent to estimating the time difference of arrival (TDOA) between microphones. Indeed, for far enough sources, we have the relation $\tau \approx d\cos(\theta)F_s/C$, where $\tau$ is the delay in samples, $d$ the inter-microphone distance, $\theta$ the source's azimuth angle, $F_s$ the sampling frequency, and $C$ the speed of sound.
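To make the free-field relation concrete, the sketch below first converts an azimuth into a delay in samples, and then runs the discrete search of Section 3.3 over integer delays, using the fact that a pure delay of $\tau$ samples corresponds to a linear phase ramp $\exp(-2\pi i \tau (f-1)/F)$ across the $F$ frequencies. Several choices are assumptions made for illustration: the noise is taken as already white ($\mathbf{Q} = \mathbf{I}$), the features are drawn directly from (11) through the scale mixture (14)-(15) rather than computed from audio, and the function names, microphone spacing and speed of sound are hypothetical.

```python
import numpy as np

def delay_from_azimuth(d, azimuth, fs, c=343.0):
    """Far-field delay in samples: tau ~ d * cos(theta) * Fs / C."""
    return d * np.cos(azimuth) * fs / c

def rtf_from_delay(tau, F):
    """Free-field RTF for a delay of tau samples, for frequency indices f = 1..F."""
    f = np.arange(1, F + 1)
    return np.exp(-2j * np.pi * tau * (f - 1) / F)

def select_delay(y, lam2, delays):
    """Delay minimizing the Section 3.3 cost; y and lam2 have shape (F, T)."""
    F = y.shape[0]
    valid = np.isfinite(lam2)                      # infinite spread = missing data
    costs = []
    for tau in delays:
        resid2 = np.abs(y - rtf_from_delay(tau, F)[:, None]) ** 2
        costs.append(np.sum(np.log(lam2[valid] + resid2[valid])))
    return delays[int(np.argmin(costs))]

# Delay for a hypothetical 20 cm microphone spacing and a 60 degree azimuth.
print(delay_from_azimuth(0.2, np.deg2rad(60.0), 16000.0))

# Synthetic features drawn from (11) around the RTF of a known delay.
rng = np.random.default_rng(0)
F, T, true_tau = 64, 16, 7
lam2 = rng.uniform(0.05, 2.0, size=(F, T))
u = rng.gamma(shape=1.0, scale=1.0, size=(F, T))
noise = np.sqrt(lam2 / (2.0 * u)) * (
    rng.standard_normal((F, T)) + 1j * rng.standard_normal((F, T)))
y = rtf_from_delay(true_tau, F)[:, None] + noise

delays = np.arange(-20, 21)
tau_hat = select_delay(y, lam2, delays)
print("estimated delay:", tau_hat)
```

Because the cost is a sum of logarithms, time-frequency points with a large spread contribute almost the same value for every candidate delay, which is how unreliable features are discounted in the search.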
In the frequency domain, the RTF then has the explicit expression $r(f,\tau) = \exp(-2\pi i \tau (f-1)/F)$, where $F$ is the number of positive frequencies and $f = 1{:}F$ is the frequency index. Let $\mathcal{R}$ be the discrete set of RTFs corresponding to delays of $-\tau_{\max}$ to $+\tau_{\max}$ samples, and $\mathcal{R}'$ the corresponding set after whitening, i.e., containing ratios of $\mathbf{Q}(f)[1, r(f,\tau)]^\top$. Given a noisy binaural signal, the method of Section 3.3 can be applied to select the most likely RTF $r'$ in $\mathcal{R}'$ and deduce the corresponding TDOA.

4,000 test signals are generated using random 1-second speech utterances from the TIMIT dataset [14] sampled at $F_s = 16{,}000$ Hz. A binaural signal with a random delay of -20 to +20 samples between microphones is generated, before applying the short-time Fourier transform (64 ms windows with 50% overlap). This yields $F = 512$ positive frequencies and $T = 32$ time samples. These signals are finally corrupted by random additive stationary noise of known statistics in the frequency domain, using the same procedure as in Section 4.1. The proposed RBR-based approach is compared to the sound source localization method PHAT-histogram (see footnote 2) [3]. Results are displayed in Fig. 2. For SNRs higher than -6 dB, the proposed RBR method yields less than 0.4% incorrect delays, versus 10.1% for PHAT-histogram on the same signals. RBR's average computational time is 80 ± 6 ms per second of signal on a common laptop, which is about 3 times faster than PHAT-histogram using our Matlab implementations.

5. CONCLUSION

We explicitly expressed the probability density function of the ratio of two microphone signals in the frequency domain in the presence of a Gaussian-process point source corrupted by stationary Gaussian noise. This statistical framework enabled us to model the uncertainty of binaural cues and was efficiently applied to robust RTF and TDOA estimation.
Future work will include extensions to multiple sound source separation and localization following ideas in [4], and to more than two microphones following ideas in [15]. The flexibility of the proposed framework may also allow the inclusion of a variety of priors on the RTFs such as Gaussian mixtures, as well as the handling of various types of noise and source statistics.

A. APPENDIX

A.1. Non-invertible noise covariance

If the noise signals $n_1(f,t)$ and $n_2(f,t)$ in (1) have a deterministic dependency, $\mathbf{R}_{nn}(f)$ is rank-1 and non-invertible. This is an important special case which may occur in practice when, e.g., the noise is a point source. Since $\mathbf{R}_{nn}(f)^{-1/2}$ is then not defined, we replace the whitening matrix in (6) by

$$\mathbf{Q}(f) = \begin{bmatrix} 1/\sigma^2_{n_1}(f) & 0 \\ 1/\sigma^2_{n_2}(f) & -1/\sigma^2_{n_2}(f) \end{bmatrix},$$

where $\sigma^2_{n_1}(f)$ and $\sigma^2_{n_2}(f)$ denote the variances of $n_1(f,t)$ and $n_2(f,t)$. It then follows that $n'_2(f,t) = 0$ and that $n'_1(f,t)$ follows the standard complex-normal distribution $\mathcal{CN}(0, 1)$. All subsequent derivations in the paper remain unchanged, with the exception of (12), which becomes $\lambda^2(f,t) = \sigma^2_{m'_2}/\sigma_s^4$.

A.2. Proof of Theorem 1

We first prove the result for $\rho = 0$, i.e., when $m_1$ and $m_2$ are independent. Since $[m_1, m_2]^\top$ is jointly circular-symmetric complex Gaussian, it follows that $m_1$ and $m_2$ are also complex Gaussian with $m_1 \sim \mathcal{CN}_1(0, \sigma^2_{m_1})$ and $m_2 \sim \mathcal{CN}_1(0, \sigma^2_{m_2})$ [16], and $S^2 = 2|m_1|^2/\sigma^2_{m_1}$ follows a Chi-square distribution with 2 degrees of freedom [9]. These properties generalize their counterparts in the real case and can be easily checked by using the characterization of complex Gaussians as real Gaussians on the real and imaginary parts [16]. We can now use the property of circular-symmetric Gaussians which states that if $Y$ is $\mathcal{CN}(0, \Sigma)$ then $Y$ and $Y e^{i\phi}$ have the same distribution for all $\phi$.
We deduce from this property that $y = m_2/m_1$ and $z = m_2/|m_1|$ have the same distribution. Then, $\sigma_{m_1} z = m_2\sqrt{2/S^2}$ is distributed as a complex Gaussian over the square root of an independent scaled Chi-square distribution, which is one of the characterizations of the complex t-distribution [11, Section 5.12]. It follows that $\sigma_{m_1} z \sim \mathcal{CT}_1(0, \sigma^2_{m_2}, 1)$. Therefore $y$ follows $\mathcal{CT}_1(0, \sigma^2_{m_2}/\sigma^2_{m_1}, 1)$, which corresponds to Theorem 1 for $\rho = 0$.

For the general case, we multiply $\mathbf{m}$ by the matrix

$$\mathbf{A} = \begin{bmatrix} 1 & 0 \\ -\rho^* & \sigma_{m_1}/\sigma_{m_2} \end{bmatrix}$$

so that $\tilde{\mathbf{m}} = \mathbf{A}\mathbf{m}$ is complex Gaussian with covariance matrix

$$\mathbf{A}\boldsymbol{\Sigma}\mathbf{A}^H = \sigma^2_{m_1} \begin{bmatrix} 1 & 0 \\ 0 & 1 - |\rho|^2 \end{bmatrix}.$$

We deduce from the previous case that $\tilde{y} = \tilde{m}_2/\tilde{m}_1$ follows $\mathcal{CT}_1(0, 1 - |\rho|^2, 1)$. We finally obtain Theorem 1's result by noting that $\tilde{y} = (\sigma_{m_1}/\sigma_{m_2})\,y - \rho^*$.

Footnote 2: We used the PHAT-histogram implementation of Michael Mandel, available at http://blog.mr-pc.org/2011/09/14/messl-code-online/.

REFERENCES

[1] S. Gannot, D. Burshtein, and E. Weinstein, "Signal enhancement using beamforming and nonstationarity with applications to speech," IEEE Trans. Signal Process., vol. 49, no. 8, pp. 1614-1626, 2001.
[2] C. Knapp and G. C. Carter, "The generalized correlation method for estimation of time delay," IEEE Trans. Acoust., Speech, Signal Process., vol. 24, no. 4, pp. 320-327, 1976.
[3] P. Aarabi, "Self-localizing dynamic microphone arrays," IEEE Trans. Syst., Man, Cybern. C, vol. 32, no. 4, pp. 474-484, 2002.
[4] M. I. Mandel, R. J. Weiss, and D. P. Ellis, "Model-based expectation-maximization source separation and localization," IEEE Trans. Acoust., Speech, Signal Process., vol. 18, no. 2, pp. 382-394, 2010.
[5] J. Woodruff and D. Wang, "Binaural localization of multiple sources in reverberant and noisy environments," IEEE Trans. Acoust., Speech, Signal Process., vol. 20, no. 5, pp. 1503-1512, 2012.
[6] Z. Zohny and J. Chambers, "Modelling interaural level and phase cues with Student's t-distribution for robust clustering in MESSL," in International Conference on Digital Signal Processing (DSP). IEEE, 2014, pp. 59-62.
[7] M. I. Mandel and N. Roman, "Enforcing consistency in spectral masks using Markov random fields," in EUSIPCO. IEEE, 2015, pp. 2028-2032.
[8] A. Deleforge, F. Forbes, and R. Horaud, "Acoustic space learning for sound-source separation and localization on binaural manifolds," International Journal of Neural Systems, vol. 25, no. 01, pp. 1440003, 2015.
[9] D. R. Fuhrmann, "Complex random variables and stochastic processes," The Digital Signal Processing Handbook, pp. 60-1, 1997.
[10] E. Vincent, S. Arberet, and R. Gribonval, "Underdetermined instantaneous audio source separation via local Gaussian modeling," in Independent Component Analysis and Signal Separation, pp. 775-782. Springer, 2009.
[11] S. Kotz and S. Nadarajah, Multivariate t Distributions and their Applications, Cambridge, 2004.
[12] R. J. Baxley, B. T. Walkenhorst, and G. Acosta-Marum, "Complex Gaussian ratio distribution with applications for error rate calculation in fading channels with imperfect CSI," in Global Telecommunications Conference (GLOBECOM). IEEE, 2010, pp. 1-5.
[13] G. McLachlan and D. Peel, "Robust mixture modelling using the t distribution," Statistics and Computing, vol. 10, pp. 339-348, 2000.
[14] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, and D. S. Pallett, "The DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM," Tech. Rep. NISTIR 4930, National Institute of Standards and Technology, Gaithersburg, MD, 1993.
[15] A. Deleforge, S. Gannot, and W. Kellermann, "Towards a generalization of relative transfer functions to more than one source," in EUSIPCO. IEEE, 2015, pp. 419-423.
[16] R. Gallager, "Circularly-symmetric Gaussian random vectors," preprint, 2008.
