Discovering shared and individual latent structure in multiple time series

This paper proposes a nonparametric Bayesian method for exploratory data analysis and feature construction in continuous time series. Our method focuses on understanding shared features in a set of time series that exhibit significant individual variability. Our method builds on the framework of latent Dirichlet allocation (LDA) and its extension to hierarchical Dirichlet processes, which allows us to characterize each series as switching between latent “topics”, where each topic is characterized as a distribution over “words” that specify the series dynamics. However, unlike standard applications of LDA, we discover the words as we learn the model. We apply this model to the task of tracking the physiological signals of premature infants; our model obtains clinically significant insights as well as useful features for supervised learning tasks.


💡 Research Summary

The paper introduces a non‑parametric Bayesian framework for exploratory analysis and feature construction in collections of continuous‑time series that exhibit both shared dynamics and substantial individual variability. Building on the latent Dirichlet allocation (LDA) paradigm and its hierarchical Dirichlet process (HDP) extension, the authors reinterpret each time series as a “document”, short overlapping windows of the series as “words”, and the underlying dynamical patterns of those windows as “topics”. Unlike conventional LDA, the vocabulary (the set of words) is not predefined; instead, it is discovered jointly with the model parameters by clustering window‑level statistical descriptors (e.g., multivariate Gaussian or Bayesian regression coefficients) obtained from the raw signal.
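The windowing-and-clustering step above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the window width, step size, descriptor set (mean, standard deviation, slope), and the use of K-means with a fixed vocabulary size are all assumptions made for brevity.

```python
import numpy as np
from sklearn.cluster import KMeans

def extract_windows(series, width, step):
    """Slice a 1-D signal into short overlapping windows."""
    return np.array([series[i:i + width]
                     for i in range(0, len(series) - width + 1, step)])

def window_descriptors(windows):
    """Summarize each window by simple statistics (mean, std, linear slope)."""
    t = np.arange(windows.shape[1])
    slopes = np.polyfit(t, windows.T, 1)[0]  # one slope per window
    return np.column_stack([windows.mean(axis=1),
                            windows.std(axis=1),
                            slopes])

def discover_vocabulary(series_list, width=50, step=25, n_words=20, seed=0):
    """Cluster window descriptors pooled across all series;
    each cluster id plays the role of a 'word'."""
    feats = np.vstack([window_descriptors(extract_windows(s, width, step))
                       for s in series_list])
    return KMeans(n_clusters=n_words, n_init=10, random_state=seed).fit(feats)

# Toy usage on synthetic signals.
rng = np.random.default_rng(0)
series_list = [np.sin(np.linspace(0, 8 * np.pi, 500))
               + 0.1 * rng.standard_normal(500) for _ in range(3)]
vocab = discover_vocabulary(series_list, n_words=5)
words = vocab.predict(window_descriptors(extract_windows(series_list[0], 50, 25)))
```

In the full model the vocabulary is refined jointly with the topics rather than fixed up front; a one-shot clustering like this is only a reasonable initialization.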

At the top level, a global Dirichlet process draws a distribution over topics that is shared across all series. For each series, a child Dirichlet process draws a series‑specific mixture over the same set of topics, allowing individual series to emphasize different topics while still borrowing statistical strength from the global pool. The concentration parameters of the global (α) and series‑specific (γ) Dirichlet processes are given vague beta priors and are inferred during sampling, automatically controlling the number of active topics and the degree of sharing.
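The two-level generative structure can be approximated with a truncated stick-breaking construction. The sketch below is an assumption-laden simplification: it truncates the global DP to a fixed number of topics and replaces each child DP with a finite Dirichlet centered on the global weights, which is a standard finite approximation rather than the paper's exact construction.

```python
import numpy as np

def gem(concentration, trunc, rng):
    """Truncated stick-breaking draw of a distribution over `trunc` topics."""
    betas = rng.beta(1.0, concentration, size=trunc)
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - betas)[:-1]])
    w = betas * remaining
    return w / w.sum()  # renormalize the truncated tail

def hdp_topic_mixtures(n_series, alpha=1.0, gamma=1.0, trunc=15, seed=0):
    """Global topic weights shared by all series; each series then draws its
    own mixture from a Dirichlet centered on the global weights (a finite
    approximation of the child DP)."""
    rng = np.random.default_rng(seed)
    global_weights = gem(alpha, trunc, rng)  # shared across all series
    per_series = rng.dirichlet(gamma * trunc * global_weights, size=n_series)
    return global_weights, per_series
```

Topics with tiny global weight receive correspondingly tiny Dirichlet mass in every series, which is how the hierarchy enforces sharing while leaving room for per-series emphasis.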

Inference is performed using a Gibbs sampler that alternates between (i) assigning each window to a word cluster, (ii) assigning each word to a topic, and (iii) updating the topic‑mixing proportions for each series. The authors also outline a variational Bayes alternative for larger data sets. Model hyper‑parameters such as window length and the distance metric for word clustering are treated as design choices; the paper demonstrates that reasonable defaults yield robust results.
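The word-to-topic resampling step (ii) can be sketched with a collapsed Gibbs sampler for a plain LDA-style model. This omits steps (i) and (iii) of the full scheme (word-cluster assignment and hyper-parameter updates) and uses symmetric `alpha`/`eta` priors, so it is a sketch of the sampling pattern, not the paper's sampler.

```python
import numpy as np

def gibbs_lda(docs, n_topics, n_words, alpha=0.5, eta=0.5, n_iter=50, seed=0):
    """Collapsed Gibbs sampling: each 'document' is one series' word sequence.
    Returns smoothed per-series topic proportions."""
    rng = np.random.default_rng(seed)
    z = [rng.integers(n_topics, size=len(d)) for d in docs]
    ndk = np.zeros((len(docs), n_topics))  # topic counts per series
    nkw = np.zeros((n_topics, n_words))    # word counts per topic
    for d, (doc, zd) in enumerate(zip(docs, z)):
        for w, k in zip(doc, zd):
            ndk[d, k] += 1
            nkw[k, w] += 1
    for _ in range(n_iter):
        for d, (doc, zd) in enumerate(zip(docs, z)):
            for i, w in enumerate(doc):
                k = zd[i]
                ndk[d, k] -= 1; nkw[k, w] -= 1
                # p(k) ∝ (n_dk + α) (n_kw + η) / (n_k· + W η)
                p = (ndk[d] + alpha) * (nkw[:, w] + eta) \
                    / (nkw.sum(axis=1) + n_words * eta)
                k = rng.choice(n_topics, p=p / p.sum())
                zd[i] = k
                ndk[d, k] += 1; nkw[k, w] += 1
    return (ndk + alpha) / (ndk + alpha).sum(axis=1, keepdims=True)
```

Each token's topic is resampled from its conditional given all other assignments; the count matrices are the sufficient statistics, so no per-iteration pass over the parameters is needed.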

The methodology is evaluated on a clinically important dataset: continuous recordings of heart rate, respiratory rate, and oxygen saturation from 30 premature infants over 24‑hour periods. Compared with baseline methods—including K‑means clustering, Gaussian mixture models, and hidden Markov models—the proposed HDP‑LDA model achieves higher log‑likelihoods and lower perplexities, indicating a better fit to the data’s complex structure. Importantly, the learned topics correspond to interpretable physiological states; for example, one topic spikes in prevalence shortly before episodes of hypoxemia, providing a potential early‑warning signal for clinicians.
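Perplexity, the held-out metric mentioned above, is the exponential of the negative mean per-token log-likelihood. A minimal computation, assuming the model is summarized by per-series topic proportions `theta` and topic-word distributions `phi` (hypothetical variable names):

```python
import numpy as np

def perplexity(docs, theta, phi):
    """exp(-mean log-likelihood per token), given per-series topic
    proportions theta[d] and topic-word distributions phi[k, w]."""
    loglik, n_tokens = 0.0, 0
    for d, doc in enumerate(docs):
        for w in doc:
            loglik += np.log(theta[d] @ phi[:, w])  # p(w | series d)
            n_tokens += 1
    return np.exp(-loglik / n_tokens)
```

A uniform model over a vocabulary of size W scores perplexity exactly W, which makes the metric easy to sanity-check; lower values indicate a better fit.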

Beyond unsupervised discovery, the authors use the inferred topic proportions as features for supervised tasks such as predicting imminent apnea or classifying infants at risk for developmental delays. In these experiments, classifiers built on topic‑based features outperform those using traditional time‑domain or frequency‑domain descriptors, improving the area under the ROC curve by up to 0.12.
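Using inferred topic proportions as classifier inputs looks like the sketch below. The data here is entirely synthetic (the labels are a hypothetical risk indicator tied to one topic's prevalence), so it illustrates only the feature-construction pattern, not the paper's experiments or results.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
# Hypothetical: inferred topic proportions for 200 series, 6 topics each.
theta = rng.dirichlet(np.ones(6), size=200)
# Hypothetical labels loosely tied to topic 0's prevalence (e.g., a risk flag).
y = (theta[:, 0] + 0.1 * rng.standard_normal(200) > 0.2).astype(int)

# Train on the first half, evaluate AUC on the second half.
clf = LogisticRegression(max_iter=1000).fit(theta[:100], y[:100])
auc = roc_auc_score(y[100:], clf.predict_proba(theta[100:])[:, 1])
```

Because topic proportions live on the simplex and summarize whole stretches of signal, they are compact, low-dimensional features compared with raw time-domain or frequency-domain descriptors.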

The paper’s contributions are threefold: (1) a novel way to discover a data‑driven vocabulary of dynamical patterns in continuous signals, (2) a hierarchical Bayesian model that simultaneously captures shared structure and individual heterogeneity across multiple series, and (3) empirical evidence that the resulting latent topics are both statistically useful and clinically meaningful. Limitations include the computational burden of MCMC for long recordings and sensitivity to the chosen window length; the authors suggest future work on online variational inference and adaptive window selection to enable real‑time monitoring. Overall, the study demonstrates that borrowing ideas from topic modeling can substantially enrich the analysis of multivariate time‑series data, opening avenues for both scientific insight and practical decision support in healthcare and beyond.

