Introducing Feature-Based Trajectory Clustering, a clustering algorithm for longitudinal data

Notice: This research summary and analysis were automatically generated using AI. For full accuracy, please refer to the original arXiv source.

We present a new algorithm for clustering longitudinal data. Data of this type can be conceptualized as consisting of individuals and, for each individual, observations of a time-dependent variable made at various times. Generically, the specific way in which this variable evolves with time differs from one individual to the next. However, there may also be commonalities: characteristic features of the time evolution shared by many individuals. The purpose of the method we put forward is to find clusters of individuals whose underlying time-dependent variables share such characteristic features. This is done in two steps. The first step maps each individual to a point in Euclidean space whose coordinates are determined by specific mathematical formulae meant to capture a variety of characteristic features. The second step finds the clusters by applying the spectral clustering algorithm to the resulting point cloud.


💡 Research Summary

The paper introduces Feature‑Based Trajectory Clustering (FBTC), a novel two‑stage method for clustering longitudinal data, where each subject is represented by a trajectory—a set of observations of a time‑dependent variable at irregular time points. The authors assume an underlying smooth function f(t) exists for each subject, sampled at the observed times, and that measurement error is negligible. The core idea is to capture a broad range of functional characteristics of f(t) using twenty mathematically defined trajectory measures, then cluster the resulting feature vectors with spectral clustering, which can uncover non‑convex cluster structures that traditional methods miss.

The twenty measures fall into several thematic groups.

- Basic statistics: maximum (m₁), minimum (m₂), range (m₃), mean (m₄) computed via trapezoidal integration, and standard deviation (m₅).
- Best affine approximation: the slope (m₆) and intercept (m₇) of the line that minimizes the L₂ distance to f(t), obtained by solving a least-squares problem on the observed points, together with the proportion of variance explained by this line (m₈), analogous to an R² statistic.
- Oscillation and curvature: the rate of intersection with the affine line (m₉) quantifies how often the trajectory crosses its linear trend, reflecting periodic or wavy behavior. First and second derivatives are approximated via central differences adapted to irregular spacing, yielding the average, maximum, minimum, and range of the derivative (m₁₀–m₁₃) and of the curvature (m₁₄–m₁₇), the latter computed from the formula |f″(t)|/(1 + f′(t)²)^(3/2).
- Global shape descriptors: such as the mean absolute value (m₁₈) and a normalized absolute-value dispersion (m₁₉).

Each measure is defined at the functional level and then approximated from the discrete trajectory, typically using the trapezoidal rule for integrals or finite-difference formulas for derivatives.
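As a concrete illustration, a few of these measures can be approximated from an irregularly sampled trajectory as sketched below. The function name `trajectory_features` and the exact estimators are assumptions chosen for illustration, not the paper's definitions:

```python
import numpy as np

def trajectory_features(t, y):
    """Minimal sketch of the feature-extraction step on an irregularly
    sampled trajectory (t, y); formulas are illustrative, not the
    paper's exact definitions."""
    t, y = np.asarray(t, dtype=float), np.asarray(y, dtype=float)
    T = t[-1] - t[0]
    # m4: mean via the trapezoidal rule on an irregular grid
    mean = np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(t)) / T
    # m6, m7: least-squares affine fit y ~ a*t + b
    a, b = np.polyfit(t, y, 1)
    resid = y - (a * t + b)
    # m9 (sketch): how often the trajectory crosses its linear trend
    crossings = np.sum(np.sign(resid[:-1]) * np.sign(resid[1:]) < 0)
    # m10-m17 ingredients: central differences that account for
    # unequal spacing (np.gradient handles non-uniform grids)
    dy = np.gradient(y, t)
    d2y = np.gradient(dy, t)
    curvature = np.abs(d2y) / (1.0 + dy**2) ** 1.5
    return {
        "max": y.max(), "min": y.min(), "range": float(np.ptp(y)),
        "mean": mean, "slope": a, "intercept": b,
        "crossing_rate": crossings / T,
        "max_curvature": curvature.max(),
    }
```

On a purely linear trajectory, for example, the trapezoidal mean and least-squares slope are recovered exactly and the curvature measures vanish, which is one sanity check for implementations of this kind.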

After computing all twenty measures for every subject, each trajectory becomes a point in a 20‑dimensional Euclidean space. The authors construct a similarity matrix (e.g., Gaussian kernel) on these points, compute the graph Laplacian, and extract the leading eigenvectors to embed the data into a low‑dimensional spectral space. A standard k‑means algorithm is then applied to the embedded points to obtain the final clusters. Spectral clustering is chosen because it can separate clusters that are not linearly separable in the original feature space, a limitation of methods such as k‑means or latent class analysis that rely solely on distance.
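The second stage can be sketched as follows, using a generic Gaussian-kernel spectral clustering with the symmetric normalized Laplacian. The bandwidth `sigma` and the particular Laplacian variant are assumptions for illustration; the paper's exact choices may differ:

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_cluster(X, k, sigma=1.0, seed=0):
    """Generic sketch of spectral clustering on an n x d feature
    matrix X: Gaussian-kernel similarity, normalized graph Laplacian,
    leading eigenvectors, then k-means on the embedding."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-d2 / (2.0 * sigma**2))            # similarity matrix
    np.fill_diagonal(W, 0.0)
    deg = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1e-12))
    # symmetric normalized Laplacian L = I - D^{-1/2} W D^{-1/2}
    L = np.eye(len(X)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    vals, vecs = np.linalg.eigh(L)                # ascending eigenvalues
    U = vecs[:, :k]                               # k leading eigenvectors
    U = U / np.maximum(np.linalg.norm(U, axis=1, keepdims=True), 1e-12)
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(U)
```

Row-normalizing the eigenvector embedding before k-means follows the common Ng-Jordan-Weiss recipe; it is this eigenvector step that lets the method separate clusters that are not linearly separable in the raw 20-dimensional feature space.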

The method is evaluated on synthetic data, where ground‑truth clusters are defined by distinct underlying functions, and on real‑world datasets, including hemoglobin measurements from blood donors and environmental monitoring records. In synthetic experiments FBTC perfectly recovers the known clusters, demonstrating robustness to irregular sampling and to functions with similar means but different shapes. In the hemoglobin case, subjects with comparable average levels are split into groups that differ in the speed of decline, recovery, and curvature of their trajectories—information that would be lost using only mean‑based clustering. Environmental data show similar benefits: clusters capture distinct seasonal patterns and abrupt pollution spikes that are invisible to simple distance‑based methods.

The authors conclude that (1) the twenty carefully designed measures provide a rich, interpretable summary of functional behavior; (2) mapping trajectories to a high‑dimensional feature space enables the use of powerful graph‑based clustering techniques; and (3) spectral clustering successfully identifies meaningful subpopulations in both simulated and real longitudinal datasets. They suggest future extensions such as weighting or selecting subsets of measures according to domain knowledge, incorporating more flexible non‑linear approximations (e.g., splines or kernel regressions), and handling measurement error through probabilistic extensions. Overall, FBTC offers a systematic, mathematically grounded framework for uncovering hidden structure in longitudinal data.

