Speakers Localization Using Batch EM In Unfolding Neural Network


Authors: Rina Veler, Sharon Gannot

UNFOLDED EXPECTATION-MAXIMIZATION NEURAL NETWORK FOR SPEAKER LOCALIZATION

Rina Veler and Sharon Gannot
Faculty of Engineering, Bar-Ilan University, Ramat-Gan, Israel
{velerri,sharon.gannot}@biu.ac.il

ABSTRACT

We propose an interpretable Unfolded Expectation-Maximization (EM) Network for robust speaker localization. By embedding the iterative EM procedure within an encoder-EM-decoder architecture, the method mitigates initialization sensitivity and improves convergence. Experiments show superior accuracy and robustness over the classical Batch-EM in reverberant conditions.

Index Terms: Sound source localization; Unfolding neural network; Expectation Maximization; Pair-wise relative phase ratio.

1. INTRODUCTION

Sound Source Localization (SSL) is a fundamental task in modern audio applications, from autonomous robots to consumer electronics. Estimating the Direction of Arrival (DOA) in realistic environments is challenging due to reverberation and noise, and traditional methods, e.g., Steered Response Power with Phase Transform (SRP-PHAT), Multiple Signal Classification (MUSIC), and Maximum Likelihood Estimation (MLE), often struggle, especially with multiple sources. This has motivated a shift toward Deep Neural Network (DNN)-based approaches, now widely used for robust SSL in adverse acoustic conditions [1, 2]. However, purely data-driven DNNs lack interpretability, whereas classical iterative algorithms remain more transparent. Algorithm unrolling (or unfolding) bridges this gap [3, 4] by casting algorithmic iterations as differentiable network layers, yielding parameter-efficient, interpretable models that require less training data. We adopt unfolded networks to estimate static multi-speaker positions using an MLE formulation based on a Complex GMM (CGMM). The observations are Pair-Wise Relative Phase Ratios (PRPs), with each Gaussian mean representing a location-dependent PRP vector.
We integrate the classical EM procedure, previously used for PRP clustering [5], into an unfolded architecture, yielding a robust and interpretable estimation framework. This follows prior success with EM-based unrolling in image clustering and segmentation [6, 7].

2. PROBLEM FORMULATION

Consider an array of M microphone pairs acquiring S speakers with overlapping activities in a reverberant enclosure. The analysis is carried out in the Short-Time Fourier Transform (STFT) domain. We operate under the W-Disjoint Orthogonality (WDO) assumption, where each Time-Frequency (TF) bin is dominated by a single source [8]. The number of speakers S >= 1 is assumed a priori known. Accordingly, the signal measured at the j-th microphone of pair m, where j = 1, 2 and m = 1, ..., M, is modeled by:

    z_{m,j}(t,k) = a^j_{sm}(t,k) \cdot v_s(t,k) + n^j_m(t,k)    (1)

where t = 0, ..., T-1 is the time-frame index and k = 0, ..., K-1 denotes the frequency bin. The Acoustic Transfer Functions (ATFs) describing the propagation from the position of speaker s to microphone j of pair m are denoted by a^j_{sm}(t,k), and n^j_m(t,k) denotes additive noise. In low-reverberation environments, the ATF can be approximated by the direct-path component, where the speaker and microphone positions, p_s and p^j_m, respectively, determine both the amplitude attenuation and the phase shift. Rather than using the raw measurements z_{m,j}(t,k) directly, we employ the PRP as the localization feature. The PRP vector is formed by concatenating the normalized complex-valued ratios from all M microphone pairs:

    \phi(t,k) = [\phi_1(t,k), \ldots, \phi_M(t,k)]^\top,    (2)

where the individual PRP for pair m is defined as

    \phi_m(t,k) \triangleq \frac{z_{m,2}(t,k)}{z_{m,1}(t,k)} \cdot \frac{|z_{m,1}(t,k)|}{|z_{m,2}(t,k)|}.    (3)

The resulting vector \phi(t,k) constitutes the observed data in the subsequent EM-based formulation.
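As a concrete illustration of (2)-(3), the PRP feature extraction admits a short NumPy sketch. The array shape convention and the small numerical guard below are our own assumptions, not specified in the paper:

```python
import numpy as np

def prp_features(Z):
    """Pair-wise Relative Phase Ratios, per eqs. (2)-(3).

    Z : complex STFT observations of shape (M, 2, T, K):
        M microphone pairs, 2 mics per pair, T frames, K bins.
        (This shape convention is ours, not the paper's.)
    Returns phi of shape (M, T, K); each entry has unit modulus,
    so only the inter-microphone phase difference is retained.
    """
    z1, z2 = Z[:, 0], Z[:, 1]
    eps = 1e-12  # hypothetical guard against silent TF bins
    # phi_m = (z_2 / z_1) * (|z_1| / |z_2|)
    return (z2 / (z1 + eps)) * (np.abs(z1) / (np.abs(z2) + eps))
```

Since |phi_m(t,k)| = 1, the feature discards inter-microphone level differences and keeps only the relative phase, which is what makes it a purely spatial cue.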
We adopt the CGMM probability model from [5], with several modifications. In our formulation, the number of Gaussian components is set to the number of speakers S, whereas in [5] it corresponds to the number of candidate source positions in the environment, typically a much larger set. In that formulation, speaker positions are inferred as the means of the Gaussians with the highest mixture weights. Under the WDO assumption in the STFT domain, each TF bin of \phi(t,k) is associated with a single active source, leading to the following probabilistic model:

    \phi(t,k) \sim \sum_s \psi_s \cdot \mathcal{N}_c\big(\phi(t,k); \tilde{\phi}^k(p_s), \Sigma_s\big),    (4)

where \psi_s denotes the prior probability of speaker s being active and \Sigma_s is the covariance matrix. The mean of each Gaussian, \tilde{\phi}^k(p_s), represents the expected PRP generated by a speaker at position p_s. This expected mean is derived analytically from the Time Difference of Arrival (TDOA). For pair m, the predicted phase ratio is:

    \tilde{\phi}^k_m(p_s) \triangleq \exp\left(-j \cdot 2\pi \frac{k}{K} \cdot \frac{|p_s - p^2_m| - |p_s - p^1_m|}{c \cdot T_s}\right).    (5)

Following [5], we assume independence of the PRP features across microphone pairs. We further simplify each covariance matrix to reflect spatially-white noise, i.e., \Sigma_s = \sigma^2_s I_M. Finally, the Probability Density Function (p.d.f.) of the entire observation set can be written as

    f(\phi) = \prod_{t,k} \left[ \sum_s \psi_s \prod_m \mathcal{N}_c\big(\phi_m(t,k); \tilde{\phi}^k_m(p_s), \sigma^2_s\big) \right],    (6)

where we assume independence of the PRP measurements across all time frames and frequency bins. Let \theta = \big[\psi^\top, (\sigma^2)^\top, \tilde{\phi}^\top\big] denote the set of unknown parameters. The goal is to solve the MLE problem, typically addressed via the EM algorithm:

    \{\hat{\psi}, \hat{\sigma}^2, \hat{\tilde{\phi}}\} = \arg\max_{\tilde{\phi}, \psi, \sigma^2} \log f\big(\phi; \tilde{\phi}, \psi, \sigma^2\big).    (7)

3. PROPOSED METHOD

In this section, we describe the proposed method, from the conventional EM iterations to the unfolded network.
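As a reference point for the steps that follow, the analytic Gaussian mean of (5) can be evaluated directly for any candidate position. The sketch below assumes a sampling period T_s = 1/fs; the argument names and array shapes are our convention, not the paper's:

```python
import numpy as np

def expected_prp(p, mic_pos, K, fs, c=343.0):
    """Expected PRP vector at a candidate position p, per eq. (5).

    p       : (3,) candidate source position in meters
    mic_pos : (M, 2, 3) positions of the two mics of each pair
    K       : number of STFT frequency bins (k = 0..K-1)
    fs      : sampling rate in Hz, so T_s = 1/fs
    Returns an (M, K) complex array of unit-modulus phase ratios.
    """
    d1 = np.linalg.norm(p - mic_pos[:, 0], axis=-1)  # |p - p_m^1|
    d2 = np.linalg.norm(p - mic_pos[:, 1], axis=-1)  # |p - p_m^2|
    tau = (d2 - d1) / c                              # TDOA in seconds
    k = np.arange(K)
    # exp(-j 2*pi * (k/K) * tau / T_s) = exp(-j 2*pi * k * fs * tau / K)
    return np.exp(-2j * np.pi * np.outer(tau * fs, k) / K)
```

A position equidistant from both microphones of a pair gives tau = 0 and hence an all-ones PRP row for that pair, as expected from (5).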
The EM Algorithm: The EM algorithm requires specifying the observed data, the hidden variables, and the parameters to be estimated. We define the hidden variable x(t,k,s) as an indicator assigning each TF bin to a single source at a given position. Given the hidden data, the likelihood of the observations is

    f(x, \phi; \theta) = \prod_{t,k} \sum_s \psi_s \, x(t,k,s) \prod_m \mathcal{N}_c\big(\phi_m(t,k); \tilde{\phi}^k_m(p_s), \sigma^2_s\big).    (8)

For implementing the E-step it is sufficient to evaluate \mu^{(\ell-1)}(t,k,s) \triangleq E\{x(t,k,s) \mid \phi(t,k); \theta^{(\ell-1)}\}, given by:

    \mu^{(\ell-1)}(t,k,s) = \frac{\psi^{(\ell-1)}_s \prod_m \mathcal{N}_c\big(\phi_m(t,k); \tilde{\phi}^k_m(p_s)^{(\ell-1)}, (\sigma^2_s)^{(\ell-1)}\big)}{\sum_{s'} \psi^{(\ell-1)}_{s'} \prod_m \mathcal{N}_c\big(\phi_m(t,k); \tilde{\phi}^k_m(p_{s'})^{(\ell-1)}, (\sigma^2_{s'})^{(\ell-1)}\big)}.    (9)

Maximizing the expected log-likelihood w.r.t. the parameters \theta^{(\ell)} constitutes the M-step:

    \psi^{(\ell)}_s = \frac{\sum_{t,k} \mu^{(\ell-1)}(t,k,s)}{T \cdot K}    (10a)

    (\sigma^2_s)^{(\ell)} = \frac{\sum_{t,k,m} \mu^{(\ell-1)}(t,k,s) \, \big|\phi_m(t,k) - \tilde{\phi}^k_m(p_s)^{(\ell-1)}\big|^2}{M \cdot \sum_{t,k} \mu^{(\ell-1)}(t,k,s)}    (10b)

    \tilde{\phi}^k_m(p_s)^{(\ell)} = \frac{\sum_t \mu^{(\ell-1)}(t,k,s) \cdot \phi_m(t,k)}{\sum_t \mu^{(\ell-1)}(t,k,s)}.    (10c)

The final position estimate is obtained by searching for the position p in the room (whose dimensions are assumed known) that minimizes:

    \hat{p}_s = \arg\min_p \sum_{k,m} \big| \tilde{\phi}^k_m(p_s)^{(\ell)} - \tilde{\phi}^k_m(p) \big|^2, \quad \forall s,    (11)

where \tilde{\phi}^k_m(p) is computed for each candidate location p using (5). To enhance EM robustness in multiple-speaker scenarios, we set the total number of CGMM clusters to S + 1, i.e., the number of speakers plus one. The additional cluster serves as an outlier cluster, absorbing TF bins that do not correspond to any active speaker, thereby preventing corruption of the true speaker position estimates. After convergence, this outlier cluster is removed by identifying the cluster with the highest variance.

Fig. 1: Batch EM unfolding neural network architecture.
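To make the E- and M-steps concrete, here is a minimal NumPy sketch of one batch iteration of (9)-(10a-c). The log-domain normalization and the shape conventions are our additions for numerical stability, not part of the paper's derivation, and the per-bin density is the scalar complex Gaussian N_c(x; m, s2) = exp(-|x-m|^2/s2)/(pi*s2):

```python
import numpy as np

def em_iteration(phi, means, psi, sigma2):
    """One Batch-EM iteration for the CGMM of eqs. (9)-(10).

    phi    : (M, T, K) observed PRPs
    means  : (S, M, K) current Gaussian means (expected PRPs)
    psi    : (S,) mixture weights, sigma2 : (S,) cluster variances
    A sketch under our shape conventions, not the authors' code.
    """
    M, T, K = phi.shape
    # --- E-step, eq. (9): responsibilities mu(t,k,s) ---
    d2 = np.abs(phi[None] - means[:, :, None, :]) ** 2        # (S, M, T, K)
    log_lik = -d2.sum(axis=1) / sigma2[:, None, None] \
              - M * np.log(np.pi * sigma2)[:, None, None]     # (S, T, K)
    log_post = np.log(psi)[:, None, None] + log_lik
    log_post -= log_post.max(axis=0, keepdims=True)           # stabilize exp
    resp = np.exp(log_post)
    resp /= resp.sum(axis=0, keepdims=True)                   # (S, T, K)
    # --- M-step, eqs. (10a)-(10c) ---
    Ns = resp.sum(axis=(1, 2))                                # (S,)
    psi_new = Ns / (T * K)                                    # (10a)
    sigma2_new = (resp[:, None] * d2).sum(axis=(1, 2, 3)) / (M * Ns)  # (10b)
    num = (resp[:, None] * phi[None]).sum(axis=2)             # sum over t
    means_new = num / (resp.sum(axis=1)[:, None] + 1e-12)     # (10c)
    return resp, psi_new, sigma2_new, means_new
```

Iterating this routine until the responsibilities stabilize, and then scanning candidate positions with (11) via the analytic means of (5), reproduces the Batch-EM baseline that the unfolded layers later mimic.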
Unfolding Batch EM Neural Network: We propose an unfolded Neural Network (NN) architecture to mitigate the EM algorithm's sensitivity to initialization and local optima. The key idea is to embed the iterative EM procedure between an encoder and a decoder. In the Batch-EM Unfolded Network, a Fully Connected (FC) encoder maps a candidate room position to an initial PRP vector that initializes the CGMM means. The unfolded EM layers iteratively refine this estimate, and the final PRP is passed to an FC decoder that outputs the speaker positions (Figure 1). The model is trained using a combined loss consisting of a position-error term and a PRP cosine-distance term, enforcing consistency between the refined PRP and the theoretical PRP at the true positions. These two losses are weighted as

    L = (1 - \lambda) \, \mathrm{MSE}(\hat{p}, p_{\mathrm{true}}) + \lambda \big(1 - \mathrm{CosSim}(\hat{\phi}, \phi_{\mathrm{true}})\big),    (12)

where \hat{p} and p_{\mathrm{true}} denote the estimated and true speaker positions, and \hat{\phi} and \phi_{\mathrm{true}} the corresponding PRP vectors.

4. EXPERIMENTAL STUDY

Data Generation: We used a synthetic dataset derived from the Wall Street Journal (WSJ) corpus [9], simulating two static speakers in random rectangular rooms (height: 2.2-2.6 m; width/length: 5-7 m). The recording setup included eight microphone pairs (intra-pair distance: 0.2 m). The network's robustness was evaluated under various interference conditions: anechoic and reverberant environments with T60 = 0.2 s, temporal overlaps of 25%, 50%, and 75%, Signal-to-Interference Ratio (SIR) levels of 0 and 5 dB, and additive white noise at SNR = 30 dB. The dataset consisted of 8,000 training and 2,000 validation samples.

Network Architecture and Training: We employ a Batch-EM Unfolded Network in which the EM procedure is unrolled into 70 layers. Initialization is performed by mapping random room locations to an initial PRP vector using an FC encoder, with \psi^{(0)}_s = 1/S and (\sigma^2_s)^{(0)} = 1 for all s.
The unfolded EM layers iteratively refine this estimate, and the final PRP is mapped to the speakers' positions by an FC decoder. All FC layers use Rectified Linear Unit (ReLU) activation. Because the PRP is complex-valued, the FC layers operate on concatenated real and imaginary parts and then reshape the output back to a complex vector. The model is trained using the loss in (12) with \lambda = 0.25.

Results and Discussion: We evaluated the proposed approach against the Batch-EM algorithm with exhaustive spatial grid search, using 100 unseen speech mixtures. Performance was assessed in terms of localization accuracy and robustness across different acoustic and interference conditions. Table 1 summarizes the results, averaged over all SIR and overlap settings for each environment. Metrics include the Root Mean Square Error (RMSE) between estimated and actual speaker positions and the percentage of samples with positional error above 0.5 m.

Table 1: Average Localization Performance: Unfolded Network vs. Batch EM Baseline, Averaged over all SIR and Overlap Conditions.

    Rever. Level    Unfolded EM Network            Batch EM (Baseline)
                    RMSE (m)  Error > 0.5 m (%)    RMSE (m)  Error > 0.5 m (%)
    T60 = 0 s       0.31      15.5                 0.25      12.5
    T60 = 0.2 s     0.37      22                   0.66      56

As shown in Table 1, the Batch-EM method performs slightly better in clean conditions, since the deterministic scan-to-locate procedure yields an almost exact solution when reflections are limited. The unfolded network introduces small approximation errors due to the FC mappings. In reverberant settings, however, the proposed Unfolded EM Network offers substantially better generalization and robustness. It reduces the RMSE by about 39% relative to the Batch-EM baseline and significantly lowers the rate of large localization errors. This improvement results from the network's learned mapping, which compensates for reverberation-induced distortion in the PRP features.

5. CONCLUSION

We developed a novel Unfolded EM Network architecture for speaker localization by embedding the iterative EM procedure within a learnable network. Our results confirmed that the network achieved significantly superior robustness in reverberant environments, showing an approximate 40% reduction in RMSE compared to the Batch-EM baseline.

6. REFERENCES

[1] Pierre-Amaury Grumiaux, Srđan Kitić, Laurent Girin, and Alexandre Guérin, "A survey of sound source localization with deep learning methods," J. Acoust. Soc. Am., vol. 152, no. 1, pp. 107-151, July 2022.
[2] Soumitro Chakrabarty and Emanuel A. Habets, "Multi-speaker DOA estimation using deep convolutional networks trained with noise signals," IEEE Journal of Selected Topics in Signal Processing, vol. 13, no. 1, pp. 8-21, 2019.
[3] Karol Gregor and Yann LeCun, "Learning fast approximations of sparse coding," in Proceedings of the International Conference on Machine Learning (ICML), 2010, pp. 399-406.
[4] V. Monga, Y. Li, and Y. C. Eldar, "Algorithm unrolling: Interpretable, efficient deep learning for signal and image processing," IEEE Signal Processing Magazine, vol. 38, no. 2, pp. 18-44, 2021.
[5] Ofer Schwartz and Sharon Gannot, "Speaker tracking using recursive EM algorithms," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 2, pp. 392-402, Feb. 2014.
[6] Klaus Greff, Sjoerd van Steenkiste, and Juergen Schmidhuber, "Neural expectation maximization," in Proceedings of the 31st Conference on Neural Information Processing Systems (NeurIPS), 2017, pp. 6694-6704.
[7] Yannan Pu, Jian Sun, Niansheng Tang, and Zongben Xu, "Deep expectation-maximization network for unsupervised image segmentation and clustering," Image and Vision Computing, vol. 135, pp. 104717, 2023.
[8] Scott Rickard and Özgür Yılmaz, "On the approximate W-disjoint orthogonality of speech," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2002, vol. 1, pp. 529-532.
[9] Douglas B. Paul and Janet M. Baker, "The design for the Wall Street Journal-based CSR corpus," in Proceedings of the Workshop on Speech and Natural Language, Association for Computational Linguistics, 1992, pp. 357-362.
