Online EM Algorithm for Latent Data Models


In this contribution, we propose a generic online (also sometimes called adaptive or recursive) version of the Expectation-Maximisation (EM) algorithm applicable to latent variable models of independent observations. Compared to the algorithm of Titterington (1984), this approach is more directly connected to the usual EM algorithm and does not rely on integration with respect to the complete data distribution. The resulting algorithm is usually simpler and is shown to achieve convergence to the stationary points of the Kullback-Leibler divergence between the marginal distribution of the observation and the model distribution at the optimal rate, i.e., that of the maximum likelihood estimator. In addition, the proposed approach is also suitable for conditional (or regression) models, as illustrated in the case of the mixture of linear regressions model.


💡 Research Summary

The paper introduces a generic online (also called adaptive or recursive) version of the Expectation‑Maximisation (EM) algorithm that can be applied to latent‑variable models with independent observations. The authors start by highlighting the limitations of the classic batch EM algorithm in large‑scale or streaming settings, where repeatedly processing the entire dataset is computationally prohibitive. They also point out that the earlier online EM approach by Titterington (1984) requires integration with respect to the complete‑data distribution, which becomes cumbersome for high‑dimensional or non‑linear latent structures.

To overcome these issues, the authors propose an algorithm that stays as close as possible to the standard EM framework while eliminating the need for explicit complete‑data integration. The key idea is to work with sufficient statistics that can be updated incrementally. At each time step \(t\), given a new observation \(Y_t\) and the current parameter estimate \(\theta_{t-1}\), the algorithm performs:

  1. E‑step – Compute the conditional expectation of the complete‑data log‑likelihood, i.e., the usual EM Q‑function, using the posterior distribution of the latent variable given the new data point and \(\theta_{t-1}\).
  2. Statistic Update – Express this expectation in terms of a set of sufficient statistics \(S_t\). These statistics are then merged with the previously accumulated statistics \(\bar S_{t-1}\) using a Robbins‑Monro step size \(\gamma_t\): \(\bar S_t = (1-\gamma_t)\bar S_{t-1} + \gamma_t S_t\).
  3. M‑step – Update the model parameters by maximising the expected complete‑data log‑likelihood expressed solely in terms of the accumulated sufficient statistics: \(\theta_t = \arg\max_\theta \langle \bar S_t, \phi(\theta) \rangle - A(\theta)\), where \(\phi(\theta)\) and \(A(\theta)\) are the natural‑parameter and log‑partition functions of the exponential‑family representation.

Because only the sufficient statistics are retained, the algorithm does not need to recompute expectations over the whole dataset, and it mirrors the classic EM’s “E‑step → M‑step” cycle in a truly online fashion.
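To make the loop concrete, here is a minimal sketch (not code from the paper; the model, step size, and all numerical settings are illustrative choices) for a two‑component Gaussian mixture with known unit variances, where the sufficient statistics are the per‑component responsibility mass and the responsibility‑weighted first moment:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stream: two-component Gaussian mixture with unit variances.
# All settings here are illustrative, not taken from the paper.
true_mu = np.array([-2.0, 3.0])
true_w = np.array([0.4, 0.6])

def sample():
    k = rng.choice(2, p=true_w)
    return rng.normal(true_mu[k], 1.0)

# Initial parameters and accumulated sufficient statistics:
# s0[k] = responsibility mass, s1[k] = responsibility-weighted first moment.
mu = np.array([-1.0, 1.0])
w = np.array([0.5, 0.5])
s0, s1 = w.copy(), w * mu

for t in range(1, 50_001):
    y = sample()
    # E-step: posterior responsibilities under the current parameters.
    logp = np.log(w) - 0.5 * (y - mu) ** 2
    r = np.exp(logp - logp.max())
    r /= r.sum()
    # Stochastic-approximation update of the sufficient statistics.
    gamma = t ** -0.6  # satisfies sum gamma = inf, sum gamma^2 < inf
    s0 = (1 - gamma) * s0 + gamma * r
    s1 = (1 - gamma) * s1 + gamma * r * y
    # M-step: parameters as explicit functions of the statistics.
    if t > 20:  # short burn-in before the first M-step
        w = s0 / s0.sum()
        mu = s1 / s0

print(w, mu)  # estimates should drift toward the true values
```

Note that no past observation is ever stored: the entire state of the algorithm is the pair of accumulated statistics plus the current parameters.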

The authors provide a rigorous convergence analysis. Assuming a diminishing step‑size schedule that satisfies \(\sum_t \gamma_t = \infty\) and \(\sum_t \gamma_t^2 < \infty\), and under standard stochastic approximation conditions (bounded second moments, martingale difference noise), they prove that the sequence \(\{\theta_t\}\) converges almost surely to a stationary point of the Kullback‑Leibler (KL) divergence between the true marginal distribution of the observations and the model distribution. Moreover, by linearising the update around a regular stationary point \(\theta^*\) and invoking the Fisher information matrix, they show that the asymptotic covariance of \(\sqrt{t}\,(\theta_t-\theta^*)\) matches that of the maximum‑likelihood estimator (MLE). Consequently, the online EM attains the optimal \(\mathcal{O}(1/\sqrt{t})\) convergence rate, a property that was not guaranteed for earlier online EM schemes.
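The step‑size conditions are easy to check numerically for the common polynomial schedule \(\gamma_t = t^{-\alpha}\) with \(1/2 < \alpha \le 1\); the short sketch below (illustrative, not from the paper) compares partial sums at two horizons:

```python
import numpy as np

# Partial sums for the polynomial schedule gamma_t = t**(-alpha); with
# 1/2 < alpha <= 1 the first sum diverges and the second stays bounded.
def partial_sums(alpha, n):
    t = np.arange(1, n + 1, dtype=float)
    g = t ** -alpha
    return g.sum(), (g ** 2).sum()

for n in (10**4, 10**6):
    print(n, partial_sums(0.6, n))
# The first component keeps growing with the horizon n, while the second
# settles near a finite limit, so gamma_t = t**-0.6 is an admissible
# Robbins-Monro schedule.
```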

A major contribution of the paper is the extension of the method to conditional (regression) models. The authors illustrate this with a mixture of linear regressions. In this setting each observation consists of a covariate vector \(X_t\) and a response \(Y_t\), and a latent class indicator \(Z_t\) determines which regression component generated the data. The sufficient statistics involve first‑ and second‑order moments such as \(X_t Y_t\) and \(X_t X_t^\top\) weighted by the posterior class responsibilities. The online algorithm updates these moments and the regression coefficients \(\beta_k\) and variances \(\sigma_k^2\) for each component in a single pass through the data stream.
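A minimal sketch of this update (illustrative choices of dimensions, components, step size, and synthetic data; not code from the paper) might look as follows, with per‑component statistics for the responsibility mass and the weighted moments of \(x y\), \(x x^\top\), and \(y^2\):

```python
import numpy as np

rng = np.random.default_rng(1)
d, K = 2, 2  # covariate dimension and number of components (illustrative)

# Ground truth for a synthetic stream; not data from the paper.
true_beta = np.array([[1.0, -1.0], [-2.0, 0.5]])
true_sigma = np.array([0.3, 0.5])

def sample():
    x = rng.normal(size=d)
    k = rng.choice(K)
    y = x @ true_beta[k] + rng.normal(0.0, true_sigma[k])
    return x, y

# Parameters and accumulated per-component sufficient statistics.
w = np.full(K, 1.0 / K)
beta = rng.normal(size=(K, d))
sigma2 = np.ones(K)
s0 = w.copy()                    # responsibility mass
Sxy = np.zeros((K, d))           # responsibility-weighted x*y moments
Sxx = np.stack([np.eye(d)] * K)  # responsibility-weighted x x^T moments
Syy = np.ones(K)                 # responsibility-weighted y^2 moments

for t in range(1, 30_001):
    x, y = sample()
    # E-step: posterior class responsibilities for the pair (x, y).
    resid = y - beta @ x
    logp = np.log(w) - 0.5 * np.log(sigma2) - 0.5 * resid**2 / sigma2
    r = np.exp(logp - logp.max())
    r /= r.sum()
    # Stochastic-approximation update of the statistics.
    g = t ** -0.6
    s0 = (1 - g) * s0 + g * r
    Sxy = (1 - g) * Sxy + g * r[:, None] * (x * y)
    Sxx = (1 - g) * Sxx + g * r[:, None, None] * np.outer(x, x)
    Syy = (1 - g) * Syy + g * r * y**2
    # M-step: closed-form updates from the accumulated statistics.
    if t > 50:  # short burn-in so Sxx is well conditioned
        w = s0 / s0.sum()
        beta = np.stack([np.linalg.solve(Sxx[k], Sxy[k]) for k in range(K)])
        sigma2 = np.maximum((Syy - np.einsum("kd,kd->k", beta, Sxy)) / s0, 1e-6)

print(beta)  # rows should approach true_beta, up to a label permutation
```

Each component's M‑step is an ordinary weighted least‑squares solve on the accumulated moments, which is what makes the single‑pass update possible.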

Empirical evaluations on synthetic data and real‑world datasets (e.g., housing price prediction, medical record analysis) demonstrate several practical advantages. Compared with batch EM, the online version reduces memory consumption dramatically (often by an order of magnitude) because it never stores the full dataset or the complete set of latent posteriors. In terms of speed, the online EM reaches comparable log‑likelihood values in far fewer passes, typically achieving a 2–3× faster convergence in wall‑clock time. The experiments also show that the algorithm can track slowly drifting parameters when the data distribution changes over time, confirming its suitability for non‑stationary streaming environments.

The paper concludes with a discussion of limitations and future work. The current theory assumes independent observations, so extending the framework to time‑dependent models such as hidden Markov models or state‑space models will require additional technical development. Moreover, while the authors adopt a deterministic diminishing step‑size, they suggest that adaptive step‑size schemes (e.g., AdaGrad, Adam) could further improve practical performance and warrant investigation.

In summary, this work provides a clean, theoretically sound, and computationally efficient online EM algorithm that aligns closely with the traditional EM paradigm, achieves the optimal statistical convergence rate, and is readily applicable to both unconditional latent‑variable models and conditional regression mixtures. It represents a significant step forward for scalable inference in latent‑data problems.

