Learning Hidden Markov Models using Non-Negative Matrix Factorization

Notice: This research summary and analysis were generated automatically using AI. For full accuracy, please refer to the original arXiv source.

The Baum-Welch algorithm, together with its derivatives and variations, has been the main technique for learning Hidden Markov Models (HMMs) from observational data. We present an HMM learning algorithm, based on the non-negative matrix factorization (NMF) of higher-order Markovian statistics, that is structurally different from Baum-Welch and its associated approaches. The described algorithm supports estimation of the number of recurrent states of an HMM and iterates the NMF algorithm to improve the learned HMM parameters. Numerical examples are provided as well.


💡 Research Summary

The paper introduces a novel learning algorithm for Hidden Markov Models (HMMs) that departs fundamentally from the traditional Baum‑Welch (EM) approach. Instead of operating directly on raw observation sequences, the method first constructs high‑order statistics in the form of a prefix‑suffix count matrix Rₚ,ₛ, where p denotes the prefix length and s the suffix length. Normalizing each row of Rₚ,ₛ yields a stochastic matrix Fₚ,ₛ whose entry Fₚ,ₛ(u,v) estimates the conditional probability of observing suffix v immediately after having observed prefix u.
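As an illustrative sketch (not the authors' code), the empirical matrix Fₚ,ₛ can be accumulated from a single symbol sequence as follows; the alphabet {0, …, M−1} and the base‑M indexing of prefix/suffix words are conventions assumed here:

```python
import numpy as np

def prefix_suffix_matrix(obs, M, p, s):
    """Empirical prefix-suffix statistics F_{p,s}.

    obs : sequence of symbols drawn from {0, ..., M-1}
    Returns an (M**p, M**s) matrix whose row u is the empirical
    distribution of the s-symbol suffix following the p-symbol
    prefix encoded by u (rows of unseen prefixes stay zero).
    """
    def index(word):
        # base-M positional encoding of a word
        k = 0
        for c in word:
            k = k * M + c
        return k

    R = np.zeros((M ** p, M ** s))
    # slide a window of length p + s over the sequence and count
    for t in range(len(obs) - p - s + 1):
        R[index(obs[t:t + p]), index(obs[t + p:t + p + s])] += 1.0
    # normalize each non-empty row to a probability distribution
    row_sums = R.sum(axis=1, keepdims=True)
    return np.divide(R, row_sums, out=np.zeros_like(R), where=row_sums > 0)
```

Rows whose prefix never occurs in the data are left as zero rows rather than being normalized.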

The key insight is that Fₚ,ₛ can be factorized as a product of two non‑negative, row‑stochastic matrices C (size Mᵖ × N) and D (size N × Mˢ):

 Fₚ,ₛ ≈ C · D

Here N is the (unknown) number of hidden states. The rows of D correspond to the probability distributions over s‑length suffixes generated by each hidden state (P(· | Sᵢ, s, λ)), while the rows of C give the posterior probabilities of being in each state given a particular prefix (P(Sᵢ | U, p, λ)). This decomposition is closely related to the concept of positive rank (prank) of a non‑negative matrix; theoretically, prank(Fₚ,ₛ) equals the minimal number of states required to reproduce the observed statistics. Because computing prank is NP‑hard, the authors propose a practical surrogate: they compute the singular value decomposition (SVD) of Fₚ,ₛ (or of a related matrix) and look for a pronounced gap between the N‑th and (N+1)-th singular values. This provides a lower bound on the true model order.
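A minimal version of this singular‑value‑gap heuristic might look like the following; the gap threshold `tol_ratio` is an assumed tuning parameter, not a value prescribed by the paper:

```python
import numpy as np

def estimate_order(F, tol_ratio=10.0):
    """Heuristic model-order estimate from the singular spectrum of F.

    Looks for the largest ratio between consecutive singular values;
    the number of values before that gap serves as a lower bound on
    the number of hidden states. tol_ratio decides when a ratio is
    large enough to count as a genuine gap.
    """
    sv = np.linalg.svd(F, compute_uv=False)   # descending order
    sv = sv[sv > 1e-12 * sv[0]]               # drop numerical zeros
    if sv.size <= 1:
        return 1
    ratios = sv[:-1] / sv[1:]
    k = int(np.argmax(ratios))
    # if no pronounced gap exists, fall back to the numerical rank
    return k + 1 if ratios[k] >= tol_ratio else sv.size
```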

Given an estimate of N, the algorithm proceeds with Non‑Negative Matrix Factorization (NMF) of Fₚ,ₛ using the I‑divergence (a generalization of Kullback‑Leibler divergence) as the loss function. Iterative multiplicative updates produce locally optimal factors C and D. From D, the state‑specific suffix distributions are directly available. To recover the transition matrices A^{(k)} (k = 1,…,M), the authors marginalize over the first symbol of each suffix, forming a matrix H from D by aggregating columns that correspond to (s‑1)-length suffixes. The relationship

 D^{(1)} = A^{(1)} · H, where D^{(1)} collects the columns of D indexed by the suffixes whose first observable is symbol 1

holds, allowing the entries of A^{(1)} to be obtained by solving a set of linear equations (typically via L₁‑norm minimization or linear programming to cope with finite‑sample noise). The remaining matrices A^{(2)}, …, A^{(M)} are derived analogously.
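The NMF step itself can be sketched with the standard Lee–Seung multiplicative updates for the I‑divergence (KL) objective. In this sketch, `nmf_idiv` is a hypothetical helper name, and rescaling D to be row‑stochastic at the end (folding the scale into C) is a simplification assumed here rather than the paper's exact procedure:

```python
import numpy as np

def nmf_idiv(F, N, iters=2000, seed=0, eps=1e-12):
    """Factor F (m x n, non-negative) as C @ D with C (m x N) and
    D (N x n) non-negative, using the Lee-Seung multiplicative
    updates that locally minimize the I-divergence D_I(F || C @ D).
    """
    rng = np.random.default_rng(seed)
    m, n = F.shape
    C = rng.random((m, N)) + eps
    D = rng.random((N, n)) + eps
    for _ in range(iters):
        ratio = F / (C @ D + eps)
        # standard KL-NMF update for D, then for C
        D *= (C.T @ ratio) / (C.sum(axis=0)[:, None] + eps)
        ratio = F / (C @ D + eps)
        C *= (ratio @ D.T) / (D.sum(axis=1)[None, :] + eps)
    # rescale so each row of D is a distribution; fold scale into C
    scale = D.sum(axis=1)
    D /= scale[:, None]
    C *= scale[None, :]
    return C, D
```

Because the updates only reach local optima, restarting from several random seeds and keeping the factorization with the smallest divergence is a common safeguard.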

The algorithm is iterative: after obtaining an initial HMM λ = {A^{(k)}}, the model is used to recompute the factors C and D implied by λ (via the same formulas that relate them to the model), and NMF is run again starting from these refined factors. Repeating the factorization and parameter‑recovery steps progressively reduces the I‑divergence between the empirical Fₚ,ₛ and its factorization, thereby improving the parameter estimates.

The authors demonstrate the method on deterministic HMMs (from each state, a given observable permits at most one transition) and compare the results with those from Baum‑Welch. Experiments show that (i) the log‑likelihood of the model learned via NMF is comparable to that of Baum‑Welch, (ii) the iterative refinement consistently lowers the divergence, and (iii) the approach requires only the summary statistics Fₚ,ₛ, not the full observation sequence, leading to lower memory consumption.

Key advantages highlighted include:

  1. Direct estimation of the number of recurrent states through rank analysis of Fₚ,ₛ.
  2. Exploitation of sparse, high‑order statistics, which can be stored efficiently even for long sequences.
  3. Transparent interpretation of the factors C and D, giving immediate insight into state‑specific observation distributions.

The paper also acknowledges limitations: NMF converges only to local optima and is sensitive to initialization; the size of Fₚ,ₛ grows exponentially with p and s, necessitating careful choice of these parameters and efficient sparse matrix handling; and the gap between positive rank and actual state count may cause over‑ or under‑estimation of N in noisy settings. The authors suggest future work on robust initialization schemes, scalable sparse‑matrix algorithms, and Bayesian or cross‑validation techniques for model order selection.

Overall, the work presents a compelling alternative to EM‑based HMM learning, leveraging non‑negative matrix factorization to uncover hidden state structure from high‑order observation statistics, offering both theoretical insight and practical benefits.

