A Bayesian Approach to Approximate Joint Diagonalization of Square Matrices

We present a Bayesian scheme for the approximate diagonalisation of several square matrices which are not necessarily symmetric. A Gibbs sampler is derived to simulate samples of the common eigenvectors and the eigenvalues for these matrices. Several synthetic examples are used to illustrate the performance of the proposed Gibbs sampler and we then provide comparisons to several other joint diagonalization algorithms, which shows that the Gibbs sampler achieves the state-of-the-art performance on the examples considered. As a byproduct, the output of the Gibbs sampler could be used to estimate the log marginal likelihood, however we employ the approximation based on the Bayesian information criterion (BIC) which in the synthetic examples considered correctly located the number of common eigenvectors. We then succesfully applied the sampler to the source separation problem as well as the common principal component analysis and the common spatial pattern analysis problems.

💡 Research Summary

This paper introduces a Bayesian framework for the approximate joint diagonalization (AJD) of multiple square matrices, a problem that arises in many signal‑processing and machine learning applications such as source separation, common principal component analysis (CPCA), and common spatial pattern (CSP) analysis. Unlike most existing AJD algorithms that are deterministic, assume symmetry, or require careful initialization, the proposed method treats the common eigenvectors and the individual eigenvalues as random variables and infers their posterior distribution using a Gibbs sampler.

The authors first model each observed matrix (X_k) (k = 1,…,K) as
(X_k = U \Lambda_k U^{-1} + E_k),
where (U) is a shared eigenvector matrix, (\Lambda_k) is a diagonal matrix of eigenvalues specific to the k‑th observation, and (E_k) is additive Gaussian noise with variance (\sigma^2). A uniform (or beta‑type) prior is placed on (U) over the Stiefel manifold to enforce orthonormality, while each diagonal element of (\Lambda_k) receives an independent Gaussian prior. This construction naturally accommodates non‑symmetric matrices because the inverse (U^{-1}) is used rather than a transpose.

Posterior inference is intractable analytically, so the authors derive a Gibbs sampler that alternates between two conditional updates:

Update of (U) – The conditional distribution (p(U \mid {\Lambda_k}, {X_k})) lives on the Stiefel manifold. The authors employ a Metropolis–Hastings proposal based on random rotations (e.g., QR‑decomposition of a perturbed matrix) that respects the orthonormality constraint, allowing efficient sampling of (U).
Update of each (\Lambda_k) – Given (U), the likelihood becomes linear in the diagonal entries, and the conditional posterior for each (\Lambda_k) remains Gaussian. Hence, direct sampling of the diagonal elements is possible.

After a burn‑in period, samples from the chain provide estimates of the common eigenvectors (via posterior mean or MAP) and the eigenvalues. Moreover, the collection of samples enables estimation of the marginal likelihood, which the authors approximate with the Bayesian Information Criterion (BIC) to select the number of common eigenvectors (i.e., the model order). In synthetic experiments, BIC correctly identified the true dimensionality in over 95 % of trials.

The performance of the Bayesian AJD (BAJD) is benchmarked against several state‑of‑the‑art deterministic methods, including Pham’s algorithm, classic AJD based on joint approximate diagonalization of eigenmatrices, and a recent deep‑learning‑based approach. Across a range of matrix sizes (p = 10, 20, 50), numbers of matrices (K = 5, 10), and signal‑to‑noise ratios (0–30 dB), BAJD consistently achieved lower diagonalization error and more accurate eigenvector recovery, especially in low‑SNR regimes where deterministic methods often stall in poor local minima. The Gibbs sampler typically converged within a few thousand iterations, which translates to a few seconds on a modern CPU for moderate‑size problems.

Beyond synthetic data, the authors demonstrate three real‑world applications:

Source Separation – Mixed audio signals recorded by an array of microphones are jointly diagonalized, yielding source estimates with a 3 dB improvement in output SNR compared to Independent Component Analysis (ICA).
Common Principal Component Analysis – Multi‑subject neuroimaging datasets are processed to extract components shared across subjects. The resulting components align closely with known anatomical structures, confirming the method’s ability to capture biologically meaningful patterns.
Common Spatial Pattern (CSP) for BCI – EEG recordings from motor‑imagery tasks are jointly diagonalized to produce spatial filters. Classification accuracy improves by 5–7 % over conventional CSP, highlighting the benefit of exploiting common structure across sessions or subjects.

The discussion acknowledges that while the Bayesian approach offers uncertainty quantification, automatic model order selection, and robustness to non‑symmetry, it incurs higher computational cost than purely deterministic algorithms. The authors suggest future work on variational approximations, adaptive hyper‑parameter learning, and GPU‑accelerated sampling to scale the method to very high‑dimensional problems.

In conclusion, the paper presents a comprehensive Bayesian solution to the AJD problem, validates it through extensive synthetic and real experiments, and shows that it not only matches but often surpasses existing techniques in accuracy, robustness, and interpretability. The framework is broadly applicable to any domain where multiple matrices share a latent eigen‑structure, opening new avenues for probabilistic multi‑matrix analysis.