Gene Expression Time Course Clustering with Countably Infinite Hidden Markov Models
Most existing approaches to clustering gene expression time course data treat the different time points as independent dimensions and are invariant to permutations, such as reversal, of the experimental time course. Approaches utilizing HMMs have been shown to be helpful in this regard, but are hampered by having to choose model architectures with appropriate complexities. Here we propose for a clustering application an HMM with a countably infinite state space; inference in this model is possible by recasting it in the hierarchical Dirichlet process (HDP) framework (Teh et al. 2006), and hence we call it the HDP-HMM. We show that the infinite model outperforms model selection methods over finite models, and traditional time-independent methods, as measured by a variety of external and internal indices for clustering on two large publicly available data sets. Moreover, we show that the infinite models utilize more hidden states and employ richer architectures (e.g. state-to-state transitions) without the damaging effects of overfitting.
💡 Research Summary
The paper addresses a fundamental challenge in clustering gene‑expression time‑course data: how to respect the temporal ordering of measurements while avoiding the need to manually select the complexity of the underlying model. Traditional clustering approaches treat each time point as an independent dimension, rendering them invariant to permutations such as reversal of the time series. Consequently, they discard valuable information about the dynamics of gene regulation. Hidden Markov models (HMMs) naturally incorporate temporal dependencies through state‑to‑state transitions, but conventional HMMs require the analyst to pre‑specify the number of hidden states (K). Choosing K is non‑trivial; a small K leads to under‑fitting, whereas a large K risks severe over‑fitting and inflated computational cost.
To overcome these limitations, the authors propose an infinite‑state HMM built on the hierarchical Dirichlet process (HDP), commonly referred to as the HDP‑HMM. In this non‑parametric Bayesian framework each hidden state’s transition distribution is drawn from a Dirichlet process that shares a global base measure across all states. The DP concentration parameters (α for transition smoothness and γ for the probability of creating a new state) are treated as random variables and inferred from the data. As a result, the model can automatically instantiate as many states as the data demand, without any explicit model‑selection step.
Inference is performed using beam sampling, a slice‑sampling variant of Gibbs sampling that jointly samples the hidden state sequence and the transition matrix while efficiently truncating the infinite state space. The authors also place weak Gamma priors on α and γ, allowing the data to drive the effective number of states. This approach yields a fully Bayesian posterior over both the clustering assignment of genes and the underlying temporal dynamics.
The methodology is evaluated on two publicly available, large‑scale gene‑expression time‑course datasets. The first is the classic yeast cell‑cycle dataset (≈6,000 genes, 18 time points), and the second comprises human cell‑line measurements with several thousand genes across dozens of time points. For each dataset the authors compare three families of methods: (1) time‑independent clustering algorithms such as k‑means and hierarchical clustering; (2) finite‑state HMMs with various fixed K values, selected via standard model‑selection criteria (BIC, AIC); and (3) the proposed HDP‑HMM. Clustering performance is quantified using external indices (Adjusted Rand Index, Adjusted Mutual Information) that compare the inferred clusters to known biological annotations, as well as internal indices (Silhouette Score, Davies‑Bouldin Index) that assess cohesion and separation without reference labels.
Across both datasets the HDP‑HMM consistently outperforms the alternatives. On the yeast data it achieves an Adjusted Rand Index of ~0.62, substantially higher than the best finite‑K HMM (~0.48) and far above time‑independent methods (~0.30). Silhouette scores show a similar pattern, with the infinite model reaching 0.34 versus 0.21 for k‑means. Importantly, the HDP‑HMM discovers on average 15–20 hidden states, a number that aligns well with known biological phases (e.g., G1, S, G2, M) and yields transition matrices that capture asymmetric and cyclic dynamics impossible to represent with static clustering. The authors demonstrate that increasing the number of states in a finite HMM improves training likelihood but degrades validation performance—a classic over‑fitting symptom—whereas the HDP‑HMM maintains stable validation scores even as the effective state count grows, thanks to the sharing induced by the hierarchical Dirichlet prior.
Computationally, the infinite model incurs a modest overhead relative to a fixed‑K HMM because the number of active states fluctuates during sampling. Nevertheless, with a parallelized implementation of beam sampling the authors report convergence within two hours for the full yeast dataset on a standard multi‑core workstation, indicating practical scalability for typical transcriptomic studies.
The paper also discusses limitations and future directions. Sensitivity to the initialization of α and γ, as well as reduced sampling efficiency for very long time series (hundreds of time points), are acknowledged. The authors suggest that variational inference or stochastic variational Bayes could alleviate these issues and enable application to even larger, multi‑omics time‑course experiments.
In summary, this work introduces a principled, fully Bayesian clustering framework that respects temporal ordering, automatically determines model complexity, and avoids over‑fitting. Empirical results on two large biological datasets demonstrate that the HDP‑HMM not only yields superior clustering quality compared with both traditional static methods and manually tuned finite HMMs, but also provides richer, biologically interpretable state transition structures. The approach represents a significant step forward for the analysis of dynamic gene‑expression data and offers a flexible foundation for future extensions to more complex, multi‑modal time‑course studies.
Comments & Academic Discussion
Loading comments...
Leave a Comment