Modeling of longitudinal cohort data typically involves complex temporal dependencies between multiple variables. In this setting, the transformer architecture, which has been highly successful in language and vision applications, allows us to account for the fact that the most recently observed time points in an individual's history may not always be the most important ones for the immediate future. This is achieved by assigning attention weights to the observations of an individual based on a transformation of their values. One reason why these ideas have not yet been fully leveraged for longitudinal cohort data is that large datasets are typically required. We therefore present a simplified transformer architecture that retains the core attention mechanism while reducing the number of parameters to be estimated, making it more suitable for small datasets with few time points. Guided by a statistical perspective on transformers, we use an autoregressive model as a starting point and incorporate attention as a kernel-based operation with temporal decay, where the aggregation of multiple transformer heads, i.e. different candidate weighting schemes, is expressed as accumulating evidence on different types of underlying characteristics of individuals. This also enables a permutation-based statistical testing procedure for identifying contextual patterns. In a simulation study, the approach is shown to recover contextual dependencies even with a small number of individuals and time points. In an application to data from a resilience study, we identify temporal patterns in the dynamics of stress and mental health. This indicates that properly adapted transformers can not only achieve competitive predictive performance, but also uncover complex context dependencies in small data settings.
Transformer neural network architectures [1] have been successful for modeling data with sequence structure, in particular in large language models [2]. Key to this is the ability to take context effects into account, i.e. potentially complex patterns of elements that jointly occur in a sequence, when predicting the next element. Transformers have since been increasingly adopted for time series [3] and for modeling longitudinal data [4,5]. However, performance typically hinges on the availability of large amounts of training data [6]. In small data settings, transformers might overfit and generalize poorly [7], and can thus be outperformed by much simpler models [8].
To unlock the potential of transformers for longitudinal cohort data with few individuals and time points, we propose a minimal transformer model that relies on a considerably smaller number of parameters that need to be estimated. Instead of starting from the full transformer architecture and simplifying it, we use an established statistical model as a starting point, specifically a linear vector autoregressive model, and add in the minimally necessary ingredients of the multi-head attention mechanism that is at the core of the transformer architecture.
Vector autoregressive (VAR) models, comprehensively covered by Lütkepohl [9], were originally designed to model multivariate time series by expressing the current time point as a linear combination of a fixed number of the most recent points, observed on an equidistant time grid. This does not allow for flexible-length patterns or a strong influence of non-recent time points. To allow for a larger number of characteristics, Bayesian VAR uses priors for regularization [10], and factor-augmented VAR projects variables into a lower-dimensional space [11]. Several approaches have been developed to allow parameters or regimes to change over time [12,13,14], some of which also allow for irregular sampling. Extensions such as those proposed by Cai et al. [15] can model complex and nonlinear multivariate structures. Still, the number of past time points that are considered and needed for subsequent prediction has to be decided on beforehand. As an alternative, several artificial neural network architectures have been suggested, such as recurrent neural networks [16] or LSTNet [17], but these still struggle with long-range dependencies across the history of an observational unit.
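For reference, a VAR model of order $k$ in the usual notation [9] takes the form
$$
x_t = \nu + A_1 x_{t-1} + A_2 x_{t-2} + \dots + A_k x_{t-k} + \varepsilon_t,
$$
where the $A_j \in \mathbb{R}^{p \times p}$ are coefficient matrices, $\nu$ is an intercept term, and $\varepsilon_t$ is a noise term. The order $k$, i.e. the number of past time points entering the prediction, has to be fixed in advance.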
Transformer neural network architectures, which have been developed to address these limitations, can be seen as an extension of dynamic VARs [18] that deals with a varying number of past time points. Importance does not have to decrease with temporal distance, but is determined by an attention mechanism that assigns weights depending on a transformation of the characteristics of an observational unit observed at a time point. As a single attention pattern might be too limited [19], transformers typically use multi-head attention, i.e. several weighting patterns. This can be interpreted as a nonparametric Bayesian ensemble, where each head approximates a sample from a posterior over attention distributions [20]. Yet, the interpretation of a fitted transformer model remains challenging. While attention scores have been proposed as indicators of feature relevance, their validity as explanations has been questioned, as attention weights may not reliably reflect the true importance of input features [21]. Recent work has focused on developing formal statistical frameworks for complex models [22,23]. Notably, selective inference methods have been introduced for vision transformers [24]. However, these approaches do not specifically address the statistical significance of patterns in longitudinal cohort data, i.e. an approach that facilitates subsequent interpretation in this setting is missing so far. Enabled by the proposed minimal transformer architecture, we introduce a permutation-testing approach to fill this gap.
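As background, the standard scaled dot-product attention of [1] computes
$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K'}{\sqrt{d_k}}\right) V,
$$
where $Q$, $K$ and $V$ denote query, key and value matrices obtained from learned linear transformations of the inputs and $d_k$ is the key dimension. Multi-head attention applies $H$ such operations in parallel with separate learned projections, $\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_H) W^O$ with $\mathrm{head}_h = \mathrm{Attention}(Q W^Q_h, K W^K_h, V W^V_h)$, so that each head corresponds to one candidate weighting scheme over the observed time points.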
In Section 2, we introduce the minimal transformer architecture, MiniTransformer, also outlining the main differences between the proposed architecture and the classical transformer, and providing a statistical testing framework. We evaluate our approach in a simulation study and in an application to data from a cohort study on psychological resilience in Section 3. Section 4 provides concluding remarks and a more general perspective on the potential of transformers for modeling small longitudinal data in biomedicine.
Consider the longitudinal data of an individual, where for each of $T$ successive time points $t_i$, $i = 1, \ldots, T$, the vector $x_{t_i} \in \mathbb{R}^{p+1}$ comprises $p$ measurements of characteristics of the individual and a constant term $1$. The aim is to predict the values in $y = (y_1, \ldots, y_q)'$, observed at some future time $t_{T+1}$, which might be future measurements of the same variables, with $q = p$, or some other future characteristics.
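As a minimal sketch of this data layout (the variable names below are illustrative and not part of the notation used here), the history of one individual can be arranged as a $T \times (p+1)$ matrix whose rows are the $x_{t_i}$, with the constant term appended as an additional column:

```python
import numpy as np

# Hypothetical example: one individual with T = 5 time points and p = 3 measured
# characteristics; the constant term 1 is appended as an extra column, so each
# row corresponds to one x_{t_i} in R^{p+1}.
T, p = 5, 3
rng = np.random.default_rng(0)

measurements = rng.normal(size=(T, p))          # p characteristics per time point
X = np.hstack([measurements, np.ones((T, 1))])  # shape (T, p+1), rows are x_{t_i}

# Target at the future time point t_{T+1}; here q = p, i.e. the same variables.
y = rng.normal(size=p)

print(X.shape, y.shape)  # (5, 4) (3,)
```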
In a first step, we want to transform each $x_{t_i}$ into scalar values $x^{(h)}_{t_i}$, $h = 1, \ldots, H$, one for each attention head.