Tech Report: A Variational HEM Algorithm for Clustering Hidden Markov Models
The hidden Markov model (HMM) is a generative model that treats sequential data under the assumption that each observation is conditioned on the state of a discrete hidden variable that evolves in time as a Markov chain. In this paper, we derive a novel algorithm to cluster HMMs through their probability distributions. We propose a hierarchical EM algorithm that i) clusters a given collection of HMMs into groups of HMMs that are similar, in terms of the distributions they represent, and ii) characterizes each group by a “cluster center”, i.e., a novel HMM that is representative for the group. We present several empirical studies that illustrate the benefits of the proposed algorithm.
💡 Research Summary
The paper introduces a novel Variational Hierarchical Expectation‑Maximization (VHEM) algorithm for clustering Hidden Markov Models (HMMs) by directly comparing their probability distributions rather than their raw parameters. Traditional approaches either cluster HMMs in parameter space—ignoring the non‑Euclidean geometry of the model manifold—or construct similarity matrices (e.g., using Bhattacharyya affinity) and apply spectral clustering. While these methods can group similar HMMs, they cannot generate new “cluster‑center” HMMs; each cluster is represented by one of the original models, which may be sub‑optimal.
To overcome this limitation, the authors extend the hierarchical EM (HEM) framework, originally devised for Gaussian mixture models (GMMs) and dynamic texture models, to HMMs. Exact HEM for HMMs is intractable because the E‑step would require summing over all possible hidden state sequences. The paper therefore adopts a variational approximation, following Hershey (2014), that restricts the posterior over hidden states to a factored Markov chain and introduces responsibility matrices for the Gaussian emission components.
Key components of the derivation:
- Base and Reduced Models – A base H3M (mixture of HMMs) with K⁽ᵇ⁾ components and a reduced H3M with K⁽ʳ⁾ < K⁽ᵇ⁾ components are defined. Virtual samples are imagined to be drawn from the base mixture; the number of samples from component i is proportional to its weight ω⁽ᵇ⁾ᵢ.
- Variational Assignments (zᵢⱼ) – Instead of assigning each virtual sample individually, the whole set from component i is assigned collectively to a reduced component j, yielding a soft assignment matrix zᵢⱼ that respects Σⱼ zᵢⱼ = 1.
- State‑Sequence Approximation (φᵢⱼ) – The true posterior over hidden state sequences in the reduced model is approximated by a Markov chain with transition factors φᵢⱼ(ρₜ|βₜ), where βₜ and ρₜ denote states of the base and reduced HMMs respectively.
- Emission‑Component Responsibilities (η) – For each pair of base state β and reduced state ρ, a matrix η⁽ᵢ,β⁾₍ⱼ,ρ₎(ℓ|m) captures the probability that a Gaussian component m of the base emission corresponds to component ℓ of the reduced emission. This yields a tractable lower bound on the expected log‑likelihood of the emissions.
- Lower‑Bound Objective – Combining the above approximations gives a bound J(M⁽ʳ⁾,z,φ,η) on the log‑likelihood of the virtual data. Maximizing this bound yields an EM‑like iterative scheme.
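The bound has the familiar variational-mixture shape; the following is a sketch in the notation of this summary (the generic form of such bounds, not necessarily the paper's exact equation):

```latex
% Sketch: lower bound on the virtual-sample log-likelihood.
% N_i = number of virtual samples drawn from base component i,
% L_ij = a (phi, eta)-dependent lower bound on the expected
%        log-likelihood of base component i under reduced component j.
J\big(\mathcal{M}^{(r)}, z, \phi, \eta\big)
  = \sum_{i=1}^{K^{(b)}} \sum_{j=1}^{K^{(r)}}
    z_{ij} \Big[ \log \omega^{(r)}_j - \log z_{ij}
                 + N_i \, \mathcal{L}_{ij}(\phi, \eta) \Big]
```

The `log ω⁽ʳ⁾ⱼ − log zᵢⱼ` terms form the usual entropy-regularized assignment cost, which is what makes the optimal zᵢⱼ a softmax (as described in the E‑step below).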
E‑step: With the reduced model fixed, η is updated analytically using the KL‑divergence between Gaussian components, φ is updated by normalizing expected counts of state transitions, and zᵢⱼ is recomputed as a softmax of the expected log‑likelihoods weighted by the reduced component weights.
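The two main E‑step ingredients can be sketched in a few lines: the closed‑form KL divergence between Gaussians (shown univariate for brevity; the multivariate case is analogous) used for the η update, and the numerically stable softmax that produces the soft assignments zᵢⱼ. This is a minimal illustration with made-up inputs, not the paper's implementation:

```python
import math

def kl_gauss(mu0, var0, mu1, var1):
    """KL( N(mu0, var0) || N(mu1, var1) ) for univariate Gaussians,
    in closed form."""
    return 0.5 * (math.log(var1 / var0)
                  + (var0 + (mu0 - mu1) ** 2) / var1
                  - 1.0)

def assignments(log_omega_r, expected_ll, n_virtual):
    """Soft assignments z_ij ∝ omega_r_j * exp(N_i * E[log-lik]_ij),
    computed row-wise with max-subtraction for numerical stability.
    expected_ll[i][j] stands in for the bound L_ij of the derivation."""
    z = []
    for i, row in enumerate(expected_ll):
        scores = [lw + n_virtual[i] * ll
                  for lw, ll in zip(log_omega_r, row)]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        z.append([e / total for e in exps])
    return z

# Toy usage: one base component, two reduced components with equal
# weight; the first reduced component fits the base model much better.
z = assignments([math.log(0.5), math.log(0.5)], [[-1.0, -3.0]], [10])
```

Scaling the expected log‑likelihood by the virtual sample count Nᵢ is what sharpens the assignments: as Nᵢ grows, zᵢⱼ approaches a hard assignment to the best‑matching reduced component.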
M‑step: Using the expectations from the E‑step, the reduced mixture weights ω⁽ʳ⁾ⱼ, transition matrices A⁽ʳ⁾ⱼ, initial state distributions π⁽ʳ⁾ⱼ, and the Gaussian emission parameters (means μ, covariances Σ, mixture weights c) are updated as weighted averages of the base parameters, where the weights are given by z, φ, and η.
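The M‑step updates described above are all weighted averages of base parameters. A minimal sketch of two of them, assuming the aggregate responsibilities have already been computed in the E‑step (function names and inputs are illustrative, not from the paper):

```python
def reduced_weights(omega_b, z):
    """Reduced mixture weights: omega_r_j ∝ sum_i omega_b_i * z_ij,
    normalized over j."""
    k_r = len(z[0])
    raw = [sum(w * zi[j] for w, zi in zip(omega_b, z))
           for j in range(k_r)]
    total = sum(raw)
    return [r / total for r in raw]

def weighted_mean(base_means, weights):
    """Reduced emission mean as a convex combination of base means;
    `weights` stands in for the aggregated z, phi, eta responsibilities."""
    total = sum(weights)
    return sum(w * m for w, m in zip(weights, base_means)) / total
```

For example, `reduced_weights([0.6, 0.4], [[1, 0], [0, 1]])` (hard assignments) simply passes the base weights through, while soft assignments blend them; the transition and covariance updates follow the same weighted‑average pattern.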
The variational formulation dramatically reduces computational complexity. Whereas exact EM would require summations over all possible hidden state sequences (exponential in sequence length), the VHEM algorithm’s complexity scales linearly with the number of base components, reduced components, number of hidden states, and number of Gaussian mixture components per state.
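An order‑of‑magnitude comparison makes the gap concrete. The counts below are a rough sketch (S states, M Gaussian components per state, sequence length τ), not the paper's exact operation counts:

```python
def exact_e_step_terms(S, tau):
    """Naive exact expectation: one term per hidden state
    sequence, i.e. S**tau terms -- exponential in length."""
    return S ** tau

def vhem_e_step_terms(K_b, K_r, S, M, tau):
    """Rough per-iteration VHEM cost: linear in the numbers of base
    and reduced components and in sequence length, quadratic in the
    states and emission components paired across the two models."""
    return K_b * K_r * tau * S ** 2 * M ** 2

# E.g. K_b=100, K_r=10, S=4, M=2, tau=50:
# vhem_e_step_terms -> 3_200_000, while exact_e_step_terms(4, 50)
# is 4**50, about 1e30.
```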
Experimental validation is performed on three domains:
- Hierarchical motion clustering – A large collection of motion capture sequences is first modeled by many HMMs, then reduced to a few cluster‑center HMMs. The resulting hierarchy enables fast retrieval and yields lower reconstruction error than spectral clustering.
- Semantic music annotation – HMMs trained on melodic and rhythmic features are clustered; the learned cluster centers improve genre and mood classification accuracy compared with baseline methods.
- Online handwriting recognition – Character‑level HMMs are grouped, producing representative models that increase recognition accuracy and reduce inference time by a factor of 5–10.
Across all experiments, VHEM‑H3M outperforms prior approaches in terms of cluster cohesion, inter‑cluster separation, and computational efficiency. Moreover, the ability to synthesize novel HMMs as cluster centers is highlighted as a key advantage for downstream tasks such as hierarchical indexing, model compression, and semantic labeling.
In conclusion, the paper delivers a principled, scalable algorithm for clustering HMMs directly in distribution space, extending the HEM framework through a carefully crafted variational bound. This work opens the door to hierarchical modeling of large sequential datasets and provides a practical tool for applications that require both compact model representations and high‑quality probabilistic summaries.