Tracking Multiple Audio Sources with the von Mises Distribution and Variational EM


In this paper we address the problem of simultaneously tracking several moving audio sources, namely the problem of estimating source trajectories from a sequence of observed features. We propose to use the von Mises distribution to model audio-source directions of arrival with circular random variables. This leads to a Bayesian filtering formulation which is intractable because of the combinatorial explosion of associating observed variables with latent variables, over time. We propose a variational approximation of the filtering distribution. We infer a variational expectation-maximization algorithm that is both computationally tractable and time efficient. We propose an audio-source birth method that favors smooth source trajectories and which is used both to initialize the number of active sources and to detect new sources. We perform experiments with the recently released LOCATA dataset comprising two moving sources and a moving microphone array mounted onto a robot.


💡 Research Summary

The paper addresses the simultaneous tracking of several moving audio sources, that is, estimating their trajectories from a sequence of observed direction‑of‑arrival (DOA) measurements. Because a DOA is a circular variable, the authors model both the observations and the latent source directions with the von Mises distribution, the natural analogue of a Gaussian on the unit circle. The observation model assigns a von Mises likelihood to measurements that originate from a source and a uniform distribution to clutter. Source dynamics are modeled as a first‑order Markov process with von Mises transition densities, characterized by a concentration parameter κ_d.
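For reference, the von Mises density with mean μ and concentration κ is p(y) = exp(κ cos(y − μ)) / (2π I₀(κ)), where I₀ is the modified Bessel function of the first kind of order zero. The following is a minimal Python sketch of an observation model of this kind, not the authors' code. The function names, the mixture weights `pis` (index 0 for clutter), and the series-based `bessel_i0` helper are illustrative assumptions.

```python
import math

def bessel_i0(x, terms=80):
    """I0(x) via its power series; adequate for moderate x (roughly below 60)."""
    total, term = 1.0, 1.0
    q = (x * x) / 4.0
    for k in range(1, terms):
        term *= q / (k * k)  # term_k = (x^2/4)^k / (k!)^2
        total += term
    return total

def von_mises_pdf(y, mu, kappa):
    """Von Mises density on the circle; angles in radians."""
    return math.exp(kappa * math.cos(y - mu)) / (2.0 * math.pi * bessel_i0(kappa))

def observation_likelihood(y, mus, kappa_y, pis):
    """Mixture of a uniform clutter term and one von Mises term per source.

    pis[0] is the (assumed) clutter weight, pis[n] the weight of source n >= 1.
    """
    lik = pis[0] / (2.0 * math.pi)  # clutter: uniform on the circle
    for pi_n, mu_n in zip(pis[1:], mus):
        lik += pi_n * von_mises_pdf(y, mu_n, kappa_y)
    return lik
```

Because the density depends on the angles only through cos(y − μ), it handles the wrap-around at ±π without any special casing, which is the practical advantage over a Gaussian on raw angle values.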

A naïve Bayesian filtering approach quickly becomes intractable because the number of possible data‑association hypotheses grows exponentially with time. To overcome this, the authors introduce a variational approximation that factorizes the joint posterior over latent source states sₜ and association variables zₜ as q(sₜ) q(zₜ). This factorization yields a variational Expectation‑Maximization (VEM) algorithm consisting of three steps repeated at each time frame:

  1. E‑S step – updates q(sₜₙ) for each source n. Because of the conjugacy of the von Mises distribution, q(sₜₙ) remains a von Mises distribution with updated mean μₜₙ and concentration κₜₙ. These updates combine the predicted prior (derived from the previous posterior) with a weighted sum of current observations, where the weights αₜₘₙ are the posterior probabilities that observation m originates from source n.

  2. E‑Z step – updates the association posterior q(zₜₘ) for each observation m. The posterior probability αₜₘₙ is proportional to the prior source weight πₙ multiplied by a term βₜₘₙ that captures the compatibility between the observation and the current source estimate, taking into account the observation confidence ωₜₘ and the von Mises concentration κ_y.

  3. M step – re‑estimates the model parameters. The source prior probabilities πₙ are updated as normalized sums of the αₜₘₙ, while the concentration parameters κ_y (observation noise) and κ_d (state dynamics) are optimized via gradient ascent on the expected complete‑data log‑likelihood.
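The three steps above can be sketched for a single frame as follows. This is a simplified illustration under stated assumptions, not the paper's exact update equations: the compatibility term is taken to be a von Mises likelihood evaluated at the current source mean with concentration ω·κ_y, the predicted prior (`mu_pred`, `kappa_pred`) is passed in rather than derived from κ_d, and the gradient-based update of κ_y and κ_d is omitted. All function names are hypothetical.

```python
import cmath
import math

def bessel_i0(x, terms=80):
    """I0(x) via its power series; adequate for moderate x."""
    total, term = 1.0, 1.0
    q = (x * x) / 4.0
    for k in range(1, terms):
        term *= q / (k * k)
        total += term
    return total

def e_z_step(ys, omegas, mus, pis, kappa_y):
    """Association posteriors alpha[m][n]; column 0 is clutter.

    Assumption: the compatibility term is a von Mises likelihood at the
    current source mean, with concentration omega * kappa_y.
    """
    alphas = []
    for y, w in zip(ys, omegas):
        k = w * kappa_y
        scores = [pis[0] / (2.0 * math.pi)]  # uniform clutter
        for pi_n, mu_n in zip(pis[1:], mus):
            lik = math.exp(k * math.cos(y - mu_n)) / (2.0 * math.pi * bessel_i0(k))
            scores.append(pi_n * lik)
        total = sum(scores)
        alphas.append([s / total for s in scores])
    return alphas

def e_s_step(ys, omegas, alphas, n, mu_pred, kappa_pred, kappa_y):
    """Von Mises posterior (mean, concentration) for source n (column n >= 1).

    A product of von Mises factors is again von Mises: its parameters are
    the angle and modulus of the sum of the complex vectors kappa * exp(i*mu).
    """
    z = kappa_pred * cmath.exp(1j * mu_pred)
    for y, w, a in zip(ys, omegas, alphas):
        z += kappa_y * w * a[n] * cmath.exp(1j * y)
    return cmath.phase(z), abs(z)

def update_pis(alphas):
    """M-step for the prior weights: normalized column sums of alpha."""
    sums = [sum(row[n] for row in alphas) for n in range(len(alphas[0]))]
    total = sum(sums)
    return [s / total for s in sums]
```

Representing each factor as a complex number makes the conjugacy argument concrete: the posterior stays a single von Mises distribution no matter how many weighted observations are folded in, so the per-frame cost grows only with the number of sources times the number of observations.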

Because each q(sₜₙ) stays within the von Mises family, the algorithm avoids the exponential growth of mixture components that plagues exact Bayesian filtering. The computational complexity per frame is linear in the number of sources and observations, making the method suitable for real‑time applications.

To handle a varying number of sources, the authors propose a “source birth” mechanism. Observations that are currently assigned to clutter (z = 0) are collected over the last L = 2 frames to form candidate DOA sequences. For each candidate, a marginal likelihood τ_j is computed analytically using von Mises integrals. If the maximum τ_j exceeds a predefined threshold τ₀, a new source is instantiated with its posterior initialized as a von Mises distribution derived from the candidate sequence. This probabilistic birth process enables the tracker to automatically detect and initialize new speakers while maintaining smooth trajectories for existing ones.
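The idea behind the birth test can be illustrated with a deliberately simplified marginal likelihood. The summary does not give the exact form of τ_j, so the sketch below assumes a static source over the candidate window, a uniform prior on its direction, and no dynamics term; under those assumptions the integral has the closed form τ = I₀(κ_y |Σ_l exp(i y_l)|) / (2π I₀(κ_y))^L. The names `birth_score` and `propose_birth` are hypothetical.

```python
import cmath
import math

def bessel_i0(x, terms=80):
    """I0(x) via its power series; adequate for moderate x."""
    total, term = 1.0, 1.0
    q = (x * x) / 4.0
    for k in range(1, terms):
        term *= q / (k * k)
        total += term
    return total

def birth_score(candidate, kappa_y):
    """Marginal likelihood of a DOA sequence under one static source.

    Assumes a uniform prior on the source direction and ignores dynamics.
    """
    resultant = kappa_y * abs(sum(cmath.exp(1j * y) for y in candidate))
    norm = (2.0 * math.pi * bessel_i0(kappa_y)) ** len(candidate)
    return bessel_i0(resultant) / norm

def propose_birth(candidates, kappa_y, tau_0):
    """Return (mean, concentration) for a new source, or None if no birth."""
    if not candidates:
        return None
    best = max(candidates, key=lambda c: birth_score(c, kappa_y))
    if birth_score(best, kappa_y) <= tau_0:
        return None
    # Initialize the posterior from the best sequence (von Mises product rule).
    z = kappa_y * sum(cmath.exp(1j * y) for y in best)
    return cmath.phase(z), abs(z)
```

For example, with `kappa_y = 20.0` the nearly constant pair `[1.0, 1.05]` scores many orders of magnitude higher than the scattered pair `[1.0, -2.0]`, so a threshold on the score favors sequences that look like a smooth trajectory.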

The experimental evaluation uses the LOCATA Challenge Task 6 dataset, which contains recordings of two moving speakers captured by a moving four‑microphone array mounted on a humanoid robot. DOA estimates are supplied by an existing online localization method, and each estimate is accompanied by a confidence weight. Performance is measured in terms of Miss Detection (MD) rate, False Alarm (FA) rate, and Mean Absolute Error (MAE) in degrees, after solving the permutation ambiguity between estimated and ground‑truth tracks.
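A minimal sketch of the error computation described above, assuming equal numbers of estimated and ground-truth tracks, every track defined on every frame, and DOAs given in degrees. It covers only the MAE and the permutation search; MD and FA rates are not computed, and the official LOCATA evaluation protocol may differ in detail. The function names are hypothetical.

```python
import itertools

def angular_error_deg(a, b):
    """Absolute angular difference in degrees, wrapped into [0, 180]."""
    return abs((a - b + 180.0) % 360.0 - 180.0)

def best_permutation_mae(est_tracks, ref_tracks):
    """Try every track-to-reference assignment and keep the lowest MAE.

    est_tracks, ref_tracks: lists of equal-length lists of DOAs in degrees.
    Returns (mae, perm), where est_tracks[i] is matched to ref_tracks[perm[i]].
    """
    best = None
    for perm in itertools.permutations(range(len(ref_tracks))):
        errors = [
            angular_error_deg(e, r)
            for i, j in enumerate(perm)
            for e, r in zip(est_tracks[i], ref_tracks[j])
        ]
        mae = sum(errors) / len(errors)
        if best is None or mae < best[0]:
            best = (mae, perm)
    return best
```

The wrap-around in `angular_error_deg` matters for circular data: an estimate of 179° against a reference of −179° is a 2° error, not 358°. Exhaustive permutation search is fine for the two-speaker case here; for many tracks an assignment solver would be the usual replacement.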

Three baselines are compared: a von Mises Probability Hypothesis Density (PHD) filter (vM‑PHD) and two Gaussian‑mixture trackers (GM‑ZO with zero‑order dynamics and GM‑FO with first‑order dynamics). The proposed vM‑VEM method achieves 23.9 % MD, 5.9 % FA, and an MAE of 2.6°. It outperforms vM‑PHD (33.4 % MD, 9.5 % FA, 4.5°) and GM‑ZO (27.0 % MD, 10.8 % FA, 4.7°) on all three metrics. Against GM‑FO (22.3 % MD, 6.3 % FA, 3.2°) it has a slightly higher MD rate but a lower FA rate and MAE. Notably, the von Mises tracker reaches this level with a zero‑order motion model, highlighting the suitability of the von Mises distribution for circular data.

In summary, the paper makes four key contributions: (1) a principled von Mises Bayesian model for circular DOA data, (2) a tractable variational EM algorithm that jointly estimates source states and data associations, (3) a probabilistic source‑birth scheme for handling a time‑varying number of speakers, and (4) a thorough experimental validation on a realistic robot‑mounted microphone array scenario. The approach offers a solid foundation for future work in multi‑modal sensor fusion, real‑time speech processing, and any application where circular measurements and dynamic data association are central challenges.

