Clustering hidden Markov models with variational HEM
The hidden Markov model (HMM) is a widely-used generative model that copes with sequential data, assuming that each observation is conditioned on the state of a hidden Markov chain. In this paper, we derive a novel algorithm to cluster HMMs based on the hierarchical EM (HEM) algorithm. The proposed algorithm i) clusters a given collection of HMMs into groups of HMMs that are similar, in terms of the distributions they represent, and ii) characterizes each group by a “cluster center”, i.e., a novel HMM that is representative for the group, in a manner that is consistent with the underlying generative model of the HMM. To cope with intractable inference in the E-step, the HEM algorithm is formulated as a variational optimization problem, and efficiently solved for the HMM case by leveraging an appropriate variational approximation. The benefits of the proposed algorithm, which we call variational HEM (VHEM), are demonstrated on several tasks involving time-series data, such as hierarchical clustering of motion capture sequences, and automatic annotation and retrieval of music and of online hand-writing data, showing improvements over current methods. In particular, our variational HEM algorithm effectively leverages large amounts of data when learning annotation models by using an efficient hierarchical estimation procedure, which reduces learning times and memory requirements, while improving model robustness through better regularization.
💡 Research Summary
The paper introduces a novel algorithm for clustering a collection of Hidden Markov Models (HMMs) by extending the Hierarchical Expectation‑Maximization (HEM) framework with a variational approximation, termed Variational HEM (VHEM). Traditional clustering of sequential data operates on raw time‑series, whereas this work treats each pre‑trained HMM as a data point in model space and seeks to group similar generative models while simultaneously learning a representative “cluster‑center” HMM for each group.
The main technical challenge lies in the E‑step of HEM: computing the expected log‑likelihood of the lower‑level HMMs under the current upper‑level (cluster‑center) model requires summing over all possible hidden state sequences, which is intractable for HMMs with realistic state spaces. To overcome this, the authors formulate the E‑step as a variational optimization problem. They introduce a variational distribution q(z) over the hidden state trajectories of each lower‑level HMM, assume q(z) retains the Markov structure, and derive a closed‑form lower bound on the log‑likelihood using Jensen’s inequality. This bound decomposes into sufficient statistics (expected transition counts, emission statistics, and initial state probabilities) that can be computed efficiently by forward‑backward recursions under q(z).
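The sufficient statistics mentioned above (expected transition counts, state posteriors, initial-state probabilities) are the same quantities the classical forward-backward recursion produces. A minimal sketch for a toy discrete-emission HMM follows; note this is the standard single-sequence recursion that underlies the E-step, not the paper's exact inter-HMM variational recursion, and the model parameters are invented for illustration:

```python
import numpy as np

def forward_backward(pi, A, B, obs):
    """Standard forward-backward for a discrete-emission HMM.

    pi: (S,) initial state probabilities; A: (S,S) transition matrix;
    B: (S,V) emission probabilities; obs: length-T observation sequence.
    Returns gamma (T,S), the state posteriors, and xi (T-1,S,S), the
    expected transition counts -- the sufficient statistics that the
    variational E-step aggregates across lower-level HMMs.
    """
    S, T = len(pi), len(obs)
    alpha = np.zeros((T, S))
    beta = np.zeros((T, S))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):                      # forward pass
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):             # backward pass
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    Z = alpha[-1].sum()                        # sequence likelihood
    gamma = alpha * beta / Z                   # P(z_t = s | obs)
    xi = np.zeros((T - 1, S, S))
    for t in range(T - 1):
        xi[t] = (alpha[t][:, None] * A
                 * (B[:, obs[t + 1]] * beta[t + 1])[None, :]) / Z
    return gamma, xi

# Toy two-state model (hypothetical parameters, for illustration only).
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.2, 0.8]])
B = np.array([[0.9, 0.1], [0.3, 0.7]])
obs = [0, 0, 1, 1]
gamma, xi = forward_backward(pi, A, B, obs)
```

Each row of `gamma` and each slice of `xi` is a proper probability distribution, which is what makes aggregating them across cluster members in the M-step straightforward.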
In the M‑step, the algorithm maximizes the variational lower bound with respect to the parameters of the cluster‑center HMM. Transition probabilities are updated by normalizing the expected transition counts aggregated across all members of the cluster, weighted by their variational responsibilities. Emission distributions, modeled as Gaussian mixture models (GMMs), are updated by responsibility‑weighted averaging of component means, covariances, and mixture weights. The initial state distribution is updated analogously. Because each iteration is guaranteed not to decrease the variational lower bound, the algorithm converges, typically after a modest number of iterations (10‑15 in the authors' experiments).
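The transition update described above can be sketched as responsibility-weighted aggregation of expected counts followed by row normalization. The inputs here (per-member expected transition counts and cluster responsibilities) are hypothetical toy values, not taken from the paper:

```python
import numpy as np

def m_step_transitions(xi_counts, resp):
    """M-step transition update for one cluster-center HMM.

    xi_counts: list of (S,S) expected transition-count matrices,
    one per lower-level HMM in the collection; resp: (n,) variational
    responsibilities of those HMMs for this cluster.
    """
    S = xi_counts[0].shape[0]
    agg = np.zeros((S, S))
    for r, counts in zip(resp, xi_counts):
        agg += r * counts                          # weighted aggregation
    return agg / agg.sum(axis=1, keepdims=True)    # row-normalize

# Two member HMMs with toy expected counts and responsibilities.
counts = [np.array([[4.0, 1.0], [2.0, 3.0]]),
          np.array([[1.0, 1.0], [1.0, 5.0]])]
resp = np.array([0.8, 0.2])
A_new = m_step_transitions(counts, resp)
```

The GMM emission and initial-state updates follow the same pattern: weight each member's sufficient statistics by its responsibility, sum, and normalize.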
Computationally, VHEM reduces the per‑iteration complexity from O(K·S³·T) (the naïve HEM cost) to O(K·S²·T), where K is the number of clusters, S the number of hidden states, and T the average sequence length. Memory requirements are also modest: only the parameters of each lower‑level HMM and the aggregated sufficient statistics need to be stored, enabling the method to scale to thousands of models on a single GPU.
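The S² factor comes from the fact that each forward or backward step is a single S×S matrix-vector product. A toy cost model (an assumption for illustration, not the paper's exact operation count) makes the scaling explicit:

```python
def forward_step_ops(S):
    """Multiplies in one forward step alpha' = (alpha @ A) * b:
    S*S for the matrix-vector product plus S elementwise products."""
    return S * S + S

def per_iteration_ops(K, S, T):
    """Toy per-iteration cost: K cluster-centers, T time steps each.
    The dominant term is K * S^2 * T, matching the quoted complexity."""
    return K * T * forward_step_ops(S)
```

Doubling the number of hidden states roughly quadruples the per-step work, which is why keeping the recursion at O(S²) rather than O(S³) matters for models with larger state spaces.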
The authors validate VHEM on three diverse time‑series domains:
- Motion Capture – Joint‑angle trajectories from the CMU MoCap database are modeled with HMMs. VHEM produces clusters that correspond to semantically meaningful motion categories (e.g., walking, running, jumping). Compared with K‑means on model parameters, VHEM reduces intra‑cluster KL divergence by ~15 % and yields clearer visual separation of motion types.
- Music Annotation – Mel‑frequency cepstral coefficient (MFCC) sequences of music tracks are encoded by HMMs. Using VHEM‑derived cluster‑centers as annotation models improves tag prediction accuracy by 3‑5 % over baseline EM and cuts training time by roughly 40 %. The hierarchical estimation also halves memory consumption, making it feasible to process large music libraries.
- Online Handwriting – Stroke sequences for handwritten characters are modeled with HMMs. VHEM clusters characters into groups that share structural stroke patterns, and the resulting cluster‑center HMMs serve as robust prototypes for unseen samples. The approach achieves higher recognition rates under noisy input and reduces the total number of parameters by about 45 % relative to a flat ensemble of individual HMMs.
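Evaluation metrics such as the intra-cluster KL divergence used in the motion-capture experiment have no closed form for HMMs, so they are typically estimated by Monte Carlo: sample sequences from one model and average the log-likelihood ratio under the two models. A sketch with toy discrete-emission HMMs (hypothetical parameters, not the paper's models):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(pi, A, B, T):
    """Draw one length-T observation sequence from a discrete HMM."""
    z = rng.choice(len(pi), p=pi)
    obs = []
    for _ in range(T):
        obs.append(rng.choice(B.shape[1], p=B[z]))
        z = rng.choice(len(pi), p=A[z])
    return obs

def loglik(pi, A, B, obs):
    """Log-likelihood via the (unscaled) forward recursion."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return np.log(alpha.sum())

def mc_kl(hmm_p, hmm_q, T=25, n=200):
    """Monte-Carlo estimate of the KL divergence rate D(p || q)
    between two HMMs, in nats per time step."""
    total = 0.0
    for _ in range(n):
        obs = sample(*hmm_p, T)
        total += loglik(*hmm_p, obs) - loglik(*hmm_q, obs)
    return total / (n * T)

# Two clearly different toy HMMs for demonstration.
hmm_p = (np.array([0.9, 0.1]),
         np.array([[0.8, 0.2], [0.2, 0.8]]),
         np.array([[0.9, 0.1], [0.1, 0.9]]))
hmm_q = (np.array([0.5, 0.5]),
         np.array([[0.5, 0.5], [0.5, 0.5]]),
         np.array([[0.6, 0.4], [0.4, 0.6]]))
```

For long sequences the forward recursion should be computed in log space or with per-step scaling to avoid underflow; the unscaled version above is kept minimal for readability.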
Across all experiments, VHEM demonstrates that clustering in model space can preserve the expressive power of the original HMMs while providing regularization through averaging, leading to better generalization.
The paper concludes with several forward‑looking observations. The variational HEM framework is not limited to HMMs; it can be extended to more complex dynamic probabilistic models such as hidden semi‑Markov models, dynamic Bayesian networks, or even deep generative sequence models, provided an appropriate variational family is chosen. Moreover, integrating Bayesian model selection to automatically determine the number of clusters, or developing online variants for streaming data, are promising research directions.
In summary, this work delivers a theoretically sound and practically efficient method for hierarchical clustering of HMMs. By marrying variational inference with the HEM algorithm, it overcomes the intractability of traditional EM in the context of hidden state sequences, achieves substantial computational savings, and delivers superior performance on real‑world sequential data tasks.