Estimation of time-delayed mutual information and bias for irregularly and sparsely sampled time-series
A method to estimate the time-dependent correlation via an empirical bias estimate of the time-delayed mutual information for a time-series is proposed. In particular, the bias of the time-delayed mutual information is shown to often be equivalent to the mutual information between two distributions of points from the same system separated by infinite time. Thus intuitively, estimation of the bias is reduced to estimation of the mutual information between distributions of data points separated by large time intervals. The proposed bias estimation techniques are shown to work for Lorenz equations data and glucose time series data of three patients from the Columbia University Medical Center database.
💡 Research Summary
The paper addresses a fundamental problem in the analysis of time‑delayed mutual information (TDMI) for irregularly sampled and sparsely populated time‑series. While TDMI is a powerful statistic for uncovering non‑linear temporal dependencies in dynamical systems, its empirical estimation suffers from a systematic bias that becomes especially pronounced when observations are unevenly spaced or when only a few data points are available. Existing bias‑correction techniques—such as surrogate data generation, bootstrap resampling, or analytical bias formulas—typically assume uniformly spaced samples and often demand large data volumes, making them unsuitable for many real‑world applications (e.g., physiological monitoring, environmental sensing, or financial tick data).
The authors propose a conceptually simple yet practically effective bias‑estimation strategy based on the idea of “infinite‑time separation.” The key insight is that, for a stationary stochastic process, two observations drawn from the same system but separated by a sufficiently long time interval become statistically independent. Consequently, the mutual information (MI) between two such far‑apart samples approximates the bias inherent in the raw TDMI estimate. In practice, the method proceeds as follows:
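The independence argument can be illustrated with a toy experiment (ours, not the paper's): a naive plug-in MI estimator applied to two samples that are truly independent still returns a strictly positive value, and that positive offset is exactly the kind of bias the method seeks to estimate. The histogram estimator, bin count, and sample size below are illustrative choices.

```python
import numpy as np

def plugin_mi(x, y, bins=16):
    """Histogram (plug-in) estimate of mutual information, in nats."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy /= pxy.sum()                                  # joint probabilities
    px = pxy.sum(axis=1, keepdims=True)               # marginal of x
    py = pxy.sum(axis=0, keepdims=True)               # marginal of y
    nz = pxy > 0                                      # avoid log(0)
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = rng.normal(size=500)       # independent of x, so the true MI is 0
print(plugin_mi(x, y))         # strictly positive: the estimator's bias
```

For independent continuous data the true MI is zero, so whatever the estimator reports is pure bias; this is the quantity the far-separated-window trick approximates empirically.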
- Window selection – The original series is divided into two windows of equal length w. The second window is shifted by a lag Δt that is a large fraction (typically 50–70 %) of the total series length, ensuring that the two windows are effectively independent.
- Embedding – Each window is embedded in a d‑dimensional space (e.g., using time‑delay coordinates) to capture the underlying dynamics.
- k‑Nearest‑Neighbour entropy estimation – The Kozachenko‑Leonenko k‑NN estimator is applied to compute the marginal entropies H(X) and H(Y) for the two windows and the joint entropy H(X,Y). This estimator is well‑suited for small sample sizes because it relies on local density information rather than global binning.
- Bias calculation – The bias estimate is obtained as MI_bias = H(X) + H(Y) − H(X,Y).
- Bias correction – The final corrected TDMI is TDMI_corrected = TDMI_raw − MI_bias.
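The steps above can be sketched as follows. This is a minimal illustration under our own assumptions, not the authors' implementation: it uses a brute-force Kozachenko-Leonenko entropy estimator, the function names, the choice k=3, and the demo data are ours, and the window-selection step is omitted.

```python
import numpy as np
from scipy.special import digamma, gammaln

def delay_embed(x, d, tau):
    """Time-delay embedding: rows are (x[i], x[i+tau], ..., x[i+(d-1)tau])."""
    n = len(x) - (d - 1) * tau
    return np.column_stack([x[i * tau: i * tau + n] for i in range(d)])

def kl_entropy(pts, k=3):
    """Kozachenko-Leonenko k-NN differential entropy estimate, in nats.

    pts: (N, d) array; brute-force neighbor search (fine for small N).
    """
    pts = np.asarray(pts, float)
    if pts.ndim == 1:
        pts = pts[:, None]
    n, d = pts.shape
    dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)                    # exclude self-distance
    rho = np.sort(dist, axis=1)[:, k - 1]             # distance to k-th neighbor
    log_vd = (d / 2) * np.log(np.pi) - gammaln(d / 2 + 1)  # log volume, unit d-ball
    return digamma(n) - digamma(k) + log_vd + d * np.mean(np.log(rho))

def mi_bias(x_win, y_win, k=3):
    """MI between two windows: MI = H(X) + H(Y) - H(X, Y)."""
    x = np.asarray(x_win, float)
    y = np.asarray(y_win, float)
    if x.ndim == 1:
        x = x[:, None]
    if y.ndim == 1:
        y = y[:, None]
    joint = np.hstack([x, y])
    return kl_entropy(x, k) + kl_entropy(y, k) - kl_entropy(joint, k)

rng = np.random.default_rng(0)
x = rng.normal(size=400)
y = x + 0.5 * rng.normal(size=400)     # dependent pair: true MI > 0
z = rng.normal(size=400)               # independent of x: true MI = 0
emb = delay_embed(x, d=2, tau=3)       # (397, 2) delay vectors
print(mi_bias(x, y))                   # clearly positive
print(mi_bias(x, z))                   # near zero: the residual bias
```

Applied to two windows separated by a large lag Δt, `mi_bias` plays the role of the "infinite-time" bias estimate that is subtracted from the raw TDMI.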
The approach has three major advantages. First, it does not require uniform sampling; the windows can be constructed from any subset of timestamps, making it directly applicable to irregular data. Second, the k‑NN estimator scales as O(N log N) and remains stable even when N is modest, which is crucial for sparsely sampled series. Third, the method is computationally inexpensive compared to surrogate‑based techniques, allowing it to be applied to long recordings or large ensembles of subjects.
To validate the methodology, the authors conduct two sets of experiments.
Synthetic Lorenz data – Time series generated from the classic Lorenz chaotic system are deliberately corrupted by random deletion of 10 % to 70 % of the points and by imposing irregular sampling intervals. The raw TDMI exhibits large spurious peaks that shift with the deletion rate, reflecting the growing bias. After applying the proposed bias correction, the TDMI curve recovers the true dynamical time scales of the Lorenz attractor (e.g., peaks near τ ≈ 0.8 and τ ≈ 1.6) even at the highest deletion rates. This demonstrates that the bias estimate reliably tracks the systematic error introduced by sparsity.
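A sketch of how such test data might be generated (our reconstruction of the setup, not the authors' exact protocol): integrate the Lorenz system with a fixed-step RK4 scheme, then randomly delete a fraction of the samples to mimic sparse, irregular observation. Step size, initial condition, and deletion scheme are illustrative assumptions.

```python
import numpy as np

def lorenz_series(n_steps=5000, dt=0.01, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Integrate the Lorenz system with RK4; return sample times and x(t)."""
    def f(s):
        x, y, z = s
        return np.array([sigma * (y - x), x * (rho - y) - z, x * y - beta * z])
    s = np.array([1.0, 1.0, 1.0])
    xs = np.empty(n_steps)
    for i in range(n_steps):
        k1 = f(s)
        k2 = f(s + 0.5 * dt * k1)
        k3 = f(s + 0.5 * dt * k2)
        k4 = f(s + dt * k3)
        s = s + (dt / 6) * (k1 + 2 * k2 + 2 * k3 + k4)
        xs[i] = s[0]
    return np.arange(n_steps) * dt, xs

def random_delete(t, x, frac, rng):
    """Irregular sampling: drop a random fraction `frac` of the points."""
    keep = rng.random(t.size) >= frac
    return t[keep], x[keep]

rng = np.random.default_rng(1)
t, x = lorenz_series()
t_sparse, x_sparse = random_delete(t, x, 0.5, rng)   # ~50 % deletion
```

Sweeping the deletion fraction over the 10 %–70 % range described above, and comparing raw versus bias-corrected TDMI on the surviving points, reproduces the kind of experiment the paper reports.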
Clinical glucose monitoring – The method is then applied to continuous glucose monitoring (CGM) records from three patients in the Columbia University Medical Center (CUMC) database. CGM data are inherently irregular because sensor calibrations, dropouts, and patient behavior produce variable inter‑measurement intervals ranging from minutes to hours. The uncorrected TDMI shows inflated information at long lags, which could be mistakenly interpreted as long‑range physiological coupling. After bias correction, only biologically plausible peaks remain, notably a ~24 hour circadian component and shorter intra‑day rhythms that align with known metabolic cycles. Moreover, the corrected TDMI values differ across patients in a manner consistent with their clinical profiles, suggesting that the metric could serve as a quantitative marker for metabolic stability or risk of hypoglycemia.
The authors also discuss limitations and avenues for future work. The “infinite‑time” assumption requires a lag Δt that is large relative to the system’s correlation time; for very short recordings this may be impossible, leading to under‑estimation of the bias. The k‑NN entropy estimator, while robust for low‑dimensional embeddings, suffers from the curse of dimensionality; extending the method to multivariate streams (e.g., multi‑channel EEG or multi‑sensor wearable data) will likely require dimensionality reduction or adaptive k‑selection. Finally, the framework assumes stationarity over the entire record; if the underlying process undergoes regime shifts, the two windows may represent different statistical states, contaminating the bias estimate. The authors propose adaptive lag selection, change‑point detection, and hierarchical windowing as possible remedies.
In summary, the paper introduces a novel, data‑driven bias‑estimation technique for TDMI that leverages the statistical independence of widely separated samples. By coupling this insight with efficient k‑nearest‑neighbour entropy estimation, the authors provide a practical tool for researchers dealing with irregular, sparse time‑series across disciplines. The method’s successful application to both a benchmark chaotic system and real‑world clinical glucose data underscores its versatility and potential impact on fields where reliable quantification of non‑linear temporal dependencies is essential.