Estimation of time-delayed mutual information and bias for irregularly and sparsely sampled time-series

Reading time: 5 minutes
...

📝 Original Info

  • Title: Estimation of time-delayed mutual information and bias for irregularly and sparsely sampled time-series
  • ArXiv ID: 1110.1615
  • Date: 2016-01-20
  • Authors: DJ Albers and George Hripcsak

📝 Abstract

A method to estimate the time-dependent correlation via an empirical bias estimate of the time-delayed mutual information for a time-series is proposed. In particular, the bias of the time-delayed mutual information is shown to often be equivalent to the mutual information between two distributions of points from the same system separated by infinite time. Thus intuitively, estimation of the bias is reduced to estimation of the mutual information between distributions of data points separated by large time intervals. The proposed bias estimation techniques are shown to work for Lorenz equations data and glucose time series data of three patients from the Columbia University Medical Center database.

💡 Deep Analysis


📄 Full Content

In many experimental and computational contexts, experiments are designed so that the collected data are uniformly and densely sampled in time, statistically stationary, and of sufficient length that standard time-series analysis methods can be applied directly. Nevertheless, there is an increasing number of scientific situations where data collection is expensive, dangerous, or difficult, and where little can be done to control the sampling rates and lengths of the series. Examples can be found in astronomy [1], atmospheric science [2] [3], geology [4], paleoclimatology [5], seismology [6], geography [7], neuroscience [8] [9] [10], epidemiology [11], genetics [12], and medical data in general [13] [14].

This paper primarily focuses on establishing a methodology for confirming the existence of time-dependent correlation in data sets that may have highly non-uniform sampling rates and small sample sizes. To achieve this, we use an information-theoretic framework as an alternative to a more standard signal-processing framework, because signal-processing tools are difficult to apply to time series with large gaps between measurements (e.g., spectral analysis for irregularly sampled points can be delicate to implement [15] [16]).

More specifically, to circumvent the problem of erratic measurement times, we use the time-delayed mutual information [17] (TDMI), which does not depend on the sampling rate but rather on the overall number of data points (explained in detail in section II D) used to estimate the TDMI at a given time separation. At a fundamental level, estimating the TDMI requires estimating probability density functions (PDFs), and any such estimate has two components: the PDF estimate itself and the bias (or noise floor) associated with that estimation. Therefore, within the information-theoretic context, the problem of establishing time-dependent correlation reduces to quantifying and estimating the bias of the statistical estimator used for the probability density or mass functions (PDF/PMF) entering the information-theoretic (mutual information) calculation. (For a more detailed review of nonlinear time-series analysis methods, see Ref. [18]; for more standard time-series analysis techniques, see [19] or [17].)
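To make this concrete, the following is a minimal sketch of a plain histogram (plug-in) mutual-information estimator used to compute the TDMI at a given sequence-time lag, together with a crude empirical bias floor obtained by randomly pairing points as a proxy for points separated by very long times. The function names (mutual_information, tdmi, bias_floor), the bin count, and the random-pairing baseline are illustrative assumptions for this sketch, not the paper's exact estimator.

    import numpy as np

    def mutual_information(x, y, bins=16):
        # Plug-in (histogram) estimate of I(X;Y) in nats.
        joint, _, _ = np.histogram2d(x, y, bins=bins)
        p_xy = joint / joint.sum()
        p_x = p_xy.sum(axis=1, keepdims=True)   # marginal of x
        p_y = p_xy.sum(axis=0, keepdims=True)   # marginal of y
        nz = p_xy > 0                           # empty cells contribute nothing
        return float(np.sum(p_xy[nz] * np.log(p_xy[nz] / (p_x @ p_y)[nz])))

    def tdmi(series, lag, bins=16):
        # TDMI at sequence-time lag `lag`: MI between the series and itself shifted by `lag`.
        x, y = series[:-lag], series[lag:]
        return mutual_information(x, y, bins=bins)

    def bias_floor(series, bins=16, trials=20, seed=None):
        # Crude empirical bias estimate (illustrative): MI between randomly paired
        # points, a stand-in for distributions "separated by infinite time".
        rng = np.random.default_rng(seed)
        return float(np.mean([
            mutual_information(series, rng.permutation(series), bins=bins)
            for _ in range(trials)
        ]))

    # Example: the TDMI of a noisy sine wave exceeds the bias floor at short lags.
    t = np.linspace(0, 40 * np.pi, 4000)
    s = np.sin(t) + 0.1 * np.random.default_rng(0).normal(size=t.size)
    print(tdmi(s, lag=5), bias_floor(s, seed=1))

In this picture, lags at which the estimated TDMI remains clearly above the bias floor are the ones at which a time-dependent correlation can be asserted.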

The primary motivation for this work originates in work with physiologic time series extracted from electronic health records (EHR) [20,21]. EHR data are never collected in a controlled fashion, and thus are pathologically irregularly sampled in time. Nevertheless, EHR data will likely be the only medium- to long-term source of population-scale human health data that will ever exist, because human data can be too expensive, dangerous, and intrusive to collect. Therefore, if we ever wish to understand long-term physiologic trends in human beings, we must find a way to use EHR data, and that means finding a way to cope with time series whose values are mostly missing, yet not missing at random. Here we wish to establish a method to begin to quantify the predictability of a patient based on their physiologic time series. We hope to apply predictability in at least three ways. First, in testing physiological models on retrospective EHR data, such as models of glucose metabolism, we have found that while it is sometimes impossible to distinguish two models by comparing their outputs to raw data, they can sometimes be distinguished by comparing their estimates of derived properties such as predictability [22]. Second, in carrying out retrospective research on EHR data, we often want to include only those patients whose records are sufficiently complete, yet there is no simple definition of completeness. We propose to use predictability as a feature that correlates with completeness: a record that is unpredictable across a wide range of variables may be so because it is incomplete. And third, the predictability of a set of clinical or physiological variables may be used as a feature to define a disease or syndrome (or, for example, to subdivide it). Such a feature may add information beyond simple means, variances, and simple characterizations of joint distributions.

A. Non-uniformly sampled time-series: basic notation

Begin by defining a scalar time series as a sequence of measurements x taken at times t, written x_{t(i)}, where t denotes the real time at which the measurement was taken and i denotes the sequence time, i.e., the sequential numeric index giving the measurement's location in the time series. Note that for a discrete time series t is a natural number (t ∈ N), whereas for a continuous time series t is a real number (t ∈ R). Denote differences in real time by δt = t(i) − t(j), where i ≥ j. Similarly, denote differences in sequence time by τ = i − j, where i ≥ j, noting that, regardless of the δt between sequential measurements, i is always followed by i + 1 (there are no “missing” points in sequence time).
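As a minimal sketch of how pairs at a fixed real-time separation δt can be collected from an irregularly sampled record, the helper below assumes NumPy arrays of measurement times and values; the function name pairs_at_delay, the tolerance parameter, and the pairing rule are illustrative assumptions rather than the paper's implementation.

    import numpy as np

    def pairs_at_delay(t, x, delta_t, tol):
        # Collect all pairs (x_{t(j)}, x_{t(i)}) with i > j whose real-time
        # separation δt = t(i) - t(j) lies within `tol` of the target `delta_t`.
        t, x = np.asarray(t, dtype=float), np.asarray(x, dtype=float)
        dt = t[:, None] - t[None, :]            # dt[i, j] = t(i) - t(j)
        i_idx, j_idx = np.nonzero(np.abs(dt - delta_t) <= tol)
        keep = i_idx > j_idx                    # forward-in-time pairs only
        return x[j_idx[keep]], x[i_idx[keep]]   # (earlier values, later values)

    # Example with irregular sampling: all pairs roughly 1.0 time units apart.
    t = np.array([0.0, 0.3, 1.1, 2.0, 2.9, 5.5])
    x = np.array([1.0, 1.2, 0.8, 1.1, 0.9, 1.3])
    x_early, x_late = pairs_at_delay(t, x, delta_t=1.0, tol=0.15)

The two returned arrays can then be fed to a mutual-information estimator such as the one sketched earlier to obtain a δt-resolved TDMI curve.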


Reference

This content was AI-processed from open-access arXiv data.
