A statistical perspective on transformers for small longitudinal cohort data
Kiana Farhadyar 1,2∗, Maren Hackenberg 1,2, Kira Ahrens 3, Charlotte Schenk 3, Bianca Kollmann 4,5, Oliver Tüscher 5,6,7,8, Klaus Lieb 5,6, Michael M. Plichta 3, Andreas Reif 3, Raffael Kalisch 5,9, Martin Wolkewitz 1,2, Moritz Hess 1,2, Harald Binder 1,2

1 Institute of Medical Biometry and Statistics (IMBI), University of Freiburg, Freiburg, Germany
2 Freiburg Center for Data Analysis, Modeling and AI, University of Freiburg, Freiburg, Germany
3 Department of Psychiatry, Psychosomatic Medicine and Psychotherapy, University Hospital Frankfurt, Frankfurt, Germany
4 Department of Neuropsychology and Psychological Resilience Research, Central Institute of Mental Health (ZI), Mannheim, Germany
5 Leibniz Institute for Resilience Research (LIR), Mainz, Germany
6 Department of Psychiatry and Psychotherapy, Johannes Gutenberg University Medical Center Mainz, Mainz, Germany
7 Department of Psychiatry, Psychotherapy and Psychosomatic Medicine, University Medicine Halle, Martin-Luther University Halle-Wittenberg, Halle, Germany
8 German Center for Mental Health (DZPG), partner site Halle-Jena-Magdeburg, Halle, Germany
9 Neuroimaging Center (NIC), Focus Program Translational Neuroscience (FTN), Johannes Gutenberg University Medical Center, Mainz, Germany

Abstract

Modeling of longitudinal cohort data typically involves complex temporal dependencies between multiple variables. There, the transformer architecture, which has been highly successful in language and vision applications, allows us to account for the fact that the most recently observed time points in an individual's history may not always be the most important for the immediate future. This is achieved by assigning attention weights to observations of an individual based on a transformation of their values.
One reason why these ideas have not yet been fully leveraged for longitudinal cohort data is that typically, large datasets are required. Therefore, we present a simplified transformer architecture that retains the core attention mechanism while reducing the number of parameters to be estimated, making it more suitable for small datasets with few time points. Guided by a statistical perspective on transformers, we use an autoregressive model as a starting point and incorporate attention as a kernel-based operation with temporal decay, where aggregation of multiple transformer heads, i.e., different candidate weighting schemes, is expressed as accumulating evidence on different types of underlying characteristics of individuals. This also enables a permutation-based statistical testing procedure for identifying contextual patterns. In a simulation study, the approach is shown to recover contextual dependencies even with a small number of individuals and time points. In an application to data from a resilience study, we identify temporal patterns in the dynamics of stress and mental health. This indicates that properly adapted transformers can not only achieve competitive predictive performance, but also uncover complex context dependencies in small data settings.

Keywords: Cohort data · small data · temporal patterns · transformers · attention mechanism · permutation testing

1 Introduction

Transformer neural network architectures [1] have been successful for modeling of data with sequence structure, in particular for large language models [2]. Key to this is the ability to take context effects into account, i.e., potentially complex patterns of elements that jointly occur in a sequence, when predicting the next element. Transformers have since been increasingly adopted for time series [3] and modeling of longitudinal data [4, 5]. However, performance typically hinges on the availability of large amounts of training data [6].
In small data settings, transformers might overfit and generalize poorly [7], and thus can be outperformed by much simpler models [8].

∗ Corresponding author: Kiana Farhadyar (kiana.farhadyar@uniklinik-freiburg.de).

To unlock the potential of transformers for longitudinal cohort data with few individuals and time points, we propose a minimal transformer model that relies on a considerably smaller number of parameters that need to be estimated. Instead of starting from the full transformer architecture and simplifying it, we use an established statistical model as a starting point, specifically a linear vector autoregressive model, and add the minimally necessary ingredients of the multi-head attention mechanism that is at the core of the transformer architecture. Vector autoregressive (VAR) models, comprehensively covered by Lütkepohl [9], were originally designed to model multivariate time series by expressing the current time point as a linear combination of a fixed number of the most recent points, observed on an equidistant time grid. This does not allow for flexible-length patterns or a strong influence of non-recent time points. To allow for a larger number of characteristics, Bayesian VAR uses priors for regularization [10], and factor-augmented VAR projects variables into a lower-dimensional space [11]. Several approaches have been developed to allow parameters or regimes to change over time [12, 13, 14], some of which also allow for irregular sampling. Extensions such as those proposed by Cai et al. [15] can model complex and nonlinear multivariate structures. Still, the number of past time points that are considered and needed for subsequent prediction has to be decided on beforehand.
As an alternative, several artificial neural network architectures have been suggested, such as recurrent neural networks [16] or LSTNet [17], but these still struggle with long-range dependencies across the history of an observational unit. Transformer neural network architectures, which have been developed to address this, can be seen as an extension of dynamic VARs [18] to deal with a varying number of past time points. Importance does not have to be inversely proportional in time, but is determined by an attention mechanism that assigns weights depending on a transformation of the characteristics of an observational unit observed at a time point. As a single attention pattern might be too limited [19], transformers typically use multi-head attention, i.e., several weighting patterns. This can be interpreted as a nonparametric Bayesian ensemble, where each head approximates a sample from a posterior over attention distributions [20]. Yet, the interpretation of a fitted transformer model remains challenging. While attention scores have been proposed as indicators of feature relevance, their validity as explanations has been questioned, as attention weights may not reliably reflect the true importance of input features [21]. Recent work has focused on developing formal statistical frameworks for complex models [22, 23]. Notably, [24] introduced selective inference methods for vision transformers. However, these approaches do not specifically address the statistical significance of patterns in longitudinal cohort data, i.e., an approach that facilitates subsequent interpretation in this setting is missing so far. Enabled by the proposed minimal transformer architecture, we introduce a permutation testing approach to fill this gap.
In Section 2, we introduce the minimal transformer architecture, MiniTransformer, also outlining the main differences between the proposed architecture and the classical transformer, and providing a statistical testing framework. We evaluate our approach through a simulation design and an application to data from a cohort study on psychological resilience in Section 3. Section 4 provides concluding remarks and a more general perspective on the potential of transformers for modeling small longitudinal data in biomedicine.

2 Methods

2.1 The MiniTransformer approach

Consider the longitudinal data of an individual, where for each of T successive time points t_i, i = 1, ..., T, the vector x_{t_i} ∈ R^{p+1} comprises p measurements of characteristics of the individual and a constant term 1. The aim is to predict the values in y = (y_1, ..., y_q)', observed at some future time t_{T+1}, which might be future measurements of the same variables, with q = p, or some other future characteristics. In a first step, we want to transform each x_{t_i} into scalar values \tilde{x}^{(h)}_{t_i}, h = 1, ..., H, thereby implementing a lightweight version of multi-head attention to potentially reflect up to H different patterns in the data. Specifically, the transformation of x_{t_i} should be able to take the information x_{t_l}, l = 1, ..., T, from another time point into account. To achieve this, we use a kernel function

g(x_{t_i}, x_{t_l}; w^{(h)}_{\mathrm{query}}, w^{(h)}_{\mathrm{key}}) = \exp\left( x'_{t_i} w^{(h)}_{\mathrm{query}} \cdot x'_{t_l} w^{(h)}_{\mathrm{key}} \right) \cdot \exp\left( -\left( w_{\mathrm{dist}} \cdot |t_i - t_l| \right)^{\gamma} \right)   (1)

that encapsulates the idea of pairwise attention, parameterized by query parameters w^{(h)}_{\mathrm{query}} ∈ R^{p+1}, key parameters w^{(h)}_{\mathrm{key}} ∈ R^{p+1}, and value parameters w^{(h)}_{\mathrm{value}} ∈ R^{p+1}, for each head h = 1, ..., H.
The second term in the product reflects the idea of decay of information as the temporal distance |t_i - t_l| between the pair of observations increases, where w_dist ≥ 0 is a parameter that should be learned from the data, and γ is a tuning parameter that we have set to γ = 5 in our applications. Based on the kernel function (1), we obtain transformed values

\tilde{x}^{(h)}_{t_i} = \frac{\sum_{l=1}^{i} g(x_{t_i}, x_{t_l}; w^{(h)}_{\mathrm{query}}, w^{(h)}_{\mathrm{key}}) \cdot x'_{t_l} w^{(h)}_{\mathrm{value}}}{\sum_{l=1}^{i} g(x_{t_i}, x_{t_l}; w^{(h)}_{\mathrm{query}}, w^{(h)}_{\mathrm{key}})}, \quad i = 1, \ldots, T, \; h = 1, \ldots, H,   (2)

that comprise weighted averages of the observations up to the current time point. While y ∈ R^q could be predicted directly based on these transformed values, it will often be the case that there is some correlation structure between the elements, which can be explained by a lower-dimensional representation z = (z_1, ..., z_C) with C < q. Therefore, we suggest to cumulate the transformed values \tilde{x}^{(h)}_{t_i} into

z_c = \sum_{i=1}^{T} \tilde{x}'_{t_i} w^{(c)}_{\mathrm{cum}} \cdot \exp\left( -\left( w_{\mathrm{horizon}} \cdot |t_{T+1} - t_i| \right)^{\gamma} \right), \quad c = 1, \ldots, C,   (3)

where each cumulant z_c can receive input from the H different heads via the vector of transformed values \tilde{x}_{t_i} = (\tilde{x}^{(1)}_{t_i}, ..., \tilde{x}^{(H)}_{t_i})', as parameterized via w^{(c)}_{\mathrm{cum}} = (w^{(c)}_{\mathrm{cum},1}, ..., w^{(c)}_{\mathrm{cum},H})'. There, each contribution receives a decay according to the distance |t_{T+1} - t_i| to the prediction horizon time t_{T+1}, at which y is observed, parameterized by w_horizon ≥ 0. Predictions are then obtained via a standard regression model

\hat{y}_r = \beta^{(r)}_0 + z' \beta^{(r)}, \quad r = 1, \ldots, q,   (4)

with intercept parameters \beta^{(r)}_0 and parameter vectors \beta^{(r)} = (\beta^{(r)}_1, ..., \beta^{(r)}_C)' for transforming the input of the cumulant vector z = (z_1, ..., z_C)'.
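To make the pipeline in (1)-(4) concrete, the following is a minimal NumPy sketch of the forward pass for a single individual. The function name, the parameter dictionary layout, and the demo parameter values are hypothetical illustrations, not the authors' implementation.

```python
import numpy as np

def minitransformer_forward(X, times, t_pred, params, gamma=5.0):
    """Sketch of the MiniTransformer forward pass, Eqs. (1)-(4).
    X: (T, p+1) observations with a leading constant 1; times: (T,) time points.
    params: hypothetical dict with per-head w_query, w_key, w_value of shape
    (H, p+1), w_cum of shape (C, H), scalars w_dist and w_horizon,
    beta0 of shape (q,), and beta of shape (q, C)."""
    T = X.shape[0]
    H = params["w_query"].shape[0]
    C = params["w_cum"].shape[0]

    # Eq. (1): pairwise attention kernel with temporal decay
    def g(i, l, h):
        content = np.exp((X[i] @ params["w_query"][h]) * (X[l] @ params["w_key"][h]))
        decay = np.exp(-((params["w_dist"] * abs(times[i] - times[l])) ** gamma))
        return content * decay

    # Eq. (2): weighted averages of value projections up to the current time point
    x_tilde = np.zeros((T, H))
    for h in range(H):
        for i in range(T):
            w = np.array([g(i, l, h) for l in range(i + 1)])
            v = X[: i + 1] @ params["w_value"][h]
            x_tilde[i, h] = (w @ v) / w.sum()

    # Eq. (3): cumulants with decay toward the prediction horizon t_pred
    horizon_decay = np.exp(-((params["w_horizon"] * np.abs(t_pred - times)) ** gamma))
    z = np.array([(x_tilde @ params["w_cum"][c]) @ horizon_decay for c in range(C)])

    # Eq. (4): linear readout
    return params["beta0"] + params["beta"] @ z

# Demo with arbitrary random parameters (shapes are the point, not the values)
rng = np.random.default_rng(0)
T, p, H, C, q = 4, 2, 2, 2, 3
X_demo = np.hstack([np.ones((T, 1)), rng.normal(size=(T, p))])
demo_params = {
    "w_query": 0.1 * rng.normal(size=(H, p + 1)),
    "w_key": 0.1 * rng.normal(size=(H, p + 1)),
    "w_value": rng.normal(size=(H, p + 1)),
    "w_cum": rng.normal(size=(C, H)),
    "w_dist": 0.3,
    "w_horizon": 0.2,
    "beta0": np.zeros(q),
    "beta": rng.normal(size=(q, C)),
}
y_hat = minitransformer_forward(X_demo, np.arange(1.0, T + 1), T + 1.0, demo_params)
```

Note how the causal masking of standard transformers appears here simply as the sum over l = 1, ..., i in (2), and positional encoding reduces to the two decay terms.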
Estimates for all parameters can be obtained jointly by minimizing the squared error loss \sum_{r=1}^{q} (y_r - \hat{y}_r)^2. If other types of endpoints are to be considered, e.g., binary or time-to-event endpoints, appropriate alternative regression models and loss functions can be used instead. Assuming that data for several individuals is available, batched stochastic gradient descent [25] is used for parameter estimation, with gradients obtained via differential programming techniques [26].

2.2 Main differences to standard transformers

In a standard decoder-based transformer architecture [1], each element of a sequence, typically called a "token", is mapped into a high-dimensional space via a neural network embedding layer. In text processing applications, where the tokens are (parts of) words, this is necessary to obtain a quantitative representation for subsequent modeling. Such an embedding layer could also be considered in longitudinal cohort settings, where the tokens, i.e., the observations in the history of an individual, are already quantitative, to increase representation capacity. However, such a transformation could make interpretation of the effect of individual characteristics considerably more difficult, particularly in small data settings where overparameterization is also a concern [27]. Therefore, the MiniTransformer approach omits this embedding and just adds a constant element to the input vectors to allow for shifts in representation. As a further simplification, the query and key projections in (1) and the value projection in (2) reduce the representations to scalars, with weight parameter vectors w^{(h)}_{\mathrm{query}}, w^{(h)}_{\mathrm{key}}, and w^{(h)}_{\mathrm{value}} for each attention head h = 1, ..., H. In contrast, standard transformers rely on weight matrices to project into query, key, and value representations, typically of the same dimension q_{QKV} ≥ p.
While the MiniTransformer does not have that much representational flexibility, some of this might be recovered by still allowing for the input of multiple attention heads in (3), to potentially reflect different types of patterns in the data. In standard transformers, the information about the order of observations is incorporated through either absolute positional encoding [1] or relative positional encoding [28], each with flexible influence for each index position. This is combined with a clipping value c, beyond which all distances are treated equally, resulting in at least c + 1 parameters that need to be estimated. The MiniTransformer uses a relative positional encoding scheme in (1) and (3), but reduces complexity to just three parameters w_dist, w_horizon, and γ. This still allows for reflecting patterns where the influence of observations smoothly diminishes over time (e.g., [29, 30, 31]). Finally, the MiniTransformer simplifies the standard practice of transforming the concatenated multi-head output, in our case \tilde{x}_{t_i}, by a large projection layer. Instead of implementing the latter as a fully connected neural network (with ≫ (q_{QKV} · H)^2 parameters), we just use a linear transformation via w^{(.)}_{\mathrm{cum}} to summarize information across attention heads. While the further linear transformation in (4) might seem redundant, we found in our experiments that this provided more flexibility when combined with random parameter initialization and stochastic gradient descent. Non-linear transformations, as often employed in standard transformers, could be easily introduced in (2), (3), and (4) to ease identifiability concerns, but we found no systematic benefit in our experiments. Taken together, the simplifications in the MiniTransformer reduce the number of parameters to be estimated by at least O(p^2 H^2), while still preserving core ideas of attention and temporal decay.
2.3 A permutation test for context effects

Based on the attention kernel (1), the MiniTransformer approach determines the influence of each of the previous observations, which serve as a context to the current observation, on the transformation of the current observation in (2), and thus also its contribution to the subsequent prediction in (4). Therefore, the estimated parameters reflect context patterns, and in particular, which characteristics of individuals contribute the most to such context patterns. Unfortunately, the individual parameters are difficult to interpret, and also not amenable to statistical inference. Therefore, we focus on the effect of characteristics of individuals on the context that is reflected in changes in prediction. For this, we propose a permutation testing approach that compares predictions based on a single observation without context to predictions with context observations. Specifically, we consider just a single previous observation x^{\mathrm{context}} = (1, x^{\mathrm{context}}_1, ..., x^{\mathrm{context}}_p)' as a potential context for each of V potential later observations x^{\mathrm{visit}}_v = (1, x^{\mathrm{visit}}_{v,1}, ..., x^{\mathrm{visit}}_{v,p})', v = 1, ..., V, that might subsequently be used for predicting a future value y_r. We ignore precise time information (essentially setting the time decay factors in (1) and (3) to the value 1), and therefore also omit time indices. The effect of variable j ∈ {1, ..., p} on the difference between the contribution of x^{\mathrm{visit}}_v to the prediction of \hat{y}_r with and without context is

\Delta^{(r)}_{j,v} = \sum_{c=1}^{C} \beta^{(r)}_c \cdot \sum_{h=1}^{H} w^{(c)}_{\mathrm{cum},h} \cdot \frac{\sum_{j=1}^{p} w^{(h)}_{\mathrm{value},j} \left( x^{\mathrm{visit}}_{v,j} - x^{\mathrm{context}}_j \right)}{1 + g(x^{\mathrm{visit}}_v, x^{\mathrm{visit}}_v; \cdot) / g(x^{\mathrm{visit}}_v, x^{\mathrm{context}}; \cdot)},   (5)

which in particular depends on how w^{(h)}_{\mathrm{value}} = (w^{(h)}_{\mathrm{value},0}, w^{(h)}_{\mathrm{value},1}, ..., w^{(h)}_{\mathrm{value},p})' weights the differences between the variable with index j in x^{\mathrm{visit}}_v and x^{\mathrm{context}}.
The difference (5) also increases when the attention kernel g(·) between the two observations becomes larger relative to the self-attention of x^{\mathrm{visit}}_v, which can also be driven by the variable with index j. Assuming that the variables in x^{\mathrm{context}} have similar variance, which could be facilitated by standardization before model fitting, we can now add some value δ either to the variable with index j_1 in x^{\mathrm{context}}, or to the variable with index j_2, to induce a comparable change, and compare the two changes by calculating (5) for each, to obtain \Delta^{(r)}_{j_1,v} and \Delta^{(r)}_{j_2,v}. More generally, this can be performed for all variables and all visit observations x^{\mathrm{visit}}_v, resulting in a matrix

\Delta^{(r)} := \left( \Delta^{(r)}_{j,v} \right)_{j = 1, \ldots, p; \; v = 1, \ldots, V}.   (6)

This matrix depends on the values of x^{\mathrm{context}} and δ, and these should be chosen so that x^{\mathrm{context}} represents some kind of reference observation, and δ a change of reasonable magnitude. For example, when there is only binary data with values 0 and 1, all values of x^{\mathrm{context}} might be set to zero, and δ set to 1. Summary statistics s^{(r)}_j, j = 1, ..., p, can now be calculated across the rows of \Delta^{(r)} to assess the impact of each variable as part of the context. We suggest considering the average row values to capture directional effects, but the average of squared values, or something else, might also be useful. Under the null hypothesis of no distinct context effect of the variable with index j, the rows of \Delta^{(r)} should be exchangeable. Therefore, we calculate summary statistics s^{(r,m)}_j for m = 1, ..., M permutations, and obtain a p-value

\mathrm{pval} = \frac{1}{M} \sum_{m=1}^{M} I\left( s^{(r,m)}_j \ge s^{(r)}_j \right)   (7)

that reflects the context effect on y_r, where I(·) is the indicator function that returns one if the condition is true and zero otherwise.
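The permutation scheme behind (7) can be sketched as follows, using the row mean as the summary statistic s_j and permuting rows of the context-effect matrix to build the null. The function name and array layout are hypothetical illustrations.

```python
import numpy as np

def permutation_pvalues(delta, M=2000, seed=0):
    """Sketch of the row-permutation test: delta is the (p x V) matrix of
    context effects from Eq. (6), rows indexed by variables, columns by
    visit patterns. Rows are exchangeable under the null of no distinct
    context effect, so permuting them yields the null distribution."""
    rng = np.random.default_rng(seed)
    p, V = delta.shape
    s_obs = delta.mean(axis=1)                       # observed s_j per variable
    exceed = np.zeros(p)
    for _ in range(M):
        s_perm = delta[rng.permutation(p)].mean(axis=1)  # exchange rows under H0
        exceed += (s_perm >= s_obs)                  # permuted at least as extreme
    return exceed / M                                # one p-value per variable

# Demo: variable 0 is constructed to have a strong positive context effect
rng = np.random.default_rng(1)
delta_demo = rng.normal(scale=0.1, size=(5, 8))
delta_demo[0] = 5.0
pvals = permutation_pvalues(delta_demo)
```

Note that in this simple sketch the attainable p-values are bounded below at roughly 1/p, since each permuted statistic for a variable is just another row's mean.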
The empirical null distribution may be contaminated by other variables that play a significant role in the context, but this will only lead to a conservative behavior of (7). While we have focused on a single prediction target y_r so far, we can consider all elements of y to obtain the matrix S = (s^{(r)}_j)_{j=1,...,p; r=1,...,q}, e.g., for visualization as a heatmap.

Table 1: Prediction performance of the MiniTransformer for simulated data with p = 10 variables in training datasets with different numbers of sequences n_train, compared to a simple averaging approach, regression models, and an approach informed by the true structure. Mean squared error is considered across all variables (MSE) and specifically for the variable that is affected by the true pattern (MSE_j3), together with standard deviations. Smallest values are indicated by boldface.

| Approach | Metric | n_train = 100 | n_train = 200 | n_train = 500 | n_train = 1000 |
|---|---|---|---|---|---|
| Average | MSE | 0.209 ± 0.002 | 0.210 ± 0.002 | 0.209 ± 0.002 | 0.209 ± 0.002 |
|  | MSE_j3 | 0.218 ± 0.007 | 0.217 ± 0.009 | 0.214 ± 0.010 | 0.211 ± 0.007 |
| Regression | MSE | 0.209 ± 0.003 | 0.206 ± 0.002 | 0.204 ± 0.002 | 0.204 ± 0.002 |
|  | MSE_j3 | 0.155 ± 0.005 | 0.153 ± 0.006 | 0.152 ± 0.005 | 0.151 ± 0.004 |
| Informed | MSE | **0.204 ± 0.002** | 0.204 ± 0.002 | 0.204 ± 0.002 | 0.204 ± 0.002 |
|  | MSE_j3 | 0.163 ± 0.008 | 0.163 ± 0.008 | 0.161 ± 0.008 | 0.159 ± 0.005 |
| MiniTransformer | MSE | 0.210 ± 0.006 | **0.202 ± 0.002** | **0.196 ± 0.001** | **0.195 ± 0.002** |
|  | MSE_j3 | **0.141 ± 0.030** | **0.097 ± 0.013** | **0.068 ± 0.006** | **0.060 ± 0.007** |

3 Empirical evaluation

We evaluated the MiniTransformer both with simulated and real data. To ensure reproducibility, we provide an implementation of the MiniTransformer architecture, the permutation testing approach, and all evaluation procedures at https://github.com/kianaf/MiniTransformer.
This repository includes separate Jupyter notebooks for each experiment, documenting all data preprocessing steps, hyperparameters, and random seeds.

3.1 Simulation study

We generate sequences of binary data x_{t_i} ∈ {0, 1}^p, i = 1, ..., T, where the value of the variable with index j_3 depends on past values of the variables with indices j_1, j_2, and j_3. This dependency structure is governed by an unobserved variable z_{t_i}, which encodes whether a specific pattern of activation has occurred and persisted up to time t_i. Specifically, z_{t_i} is set to 1 if there exists a pair of earlier time points i_1 < i_2 ≤ i such that variable j_1 was active at time t_{i_1}, variable j_2 was active at time t_{i_2}, and variable j_3 has remained inactive from time t_{i_2} onwards. Otherwise, z_{t_i} = 0. This condition can be formally written as

z_{t_i} = \begin{cases} 1, & \text{if } \exists\, i_1 < i_2 \le i \text{ such that } x_{t_{i_1}, j_1} = x_{t_{i_2}, j_2} = 1 \;\wedge\; \forall\, i^* > i_2 : x_{t_{i^*}, j_3} = 0, \\ 0, & \text{otherwise.} \end{cases}

The observed values are iteratively generated as

x_{t_{i+1}, j} = \begin{cases} I(z_{t_i} = 1 \wedge B_{t_i} = 1), \; B_{t_i} \sim \mathrm{Bernoulli}(0.9), & j = j_3, \\ \mathrm{Bernoulli}(0.7), & j \ne j_3, \end{cases}

and for i > 2 the sequence is terminated with probability 0.2 to obtain sequences of different lengths, with at least 3 observations. To assess the impact of the number of sequences on performance, we trained the MiniTransformer on datasets comprising 100, 200, 500, or 1000 sequences, each with p = 10 variables. For estimating the parameters, the task is to predict all p variables of the last observation of a sequence, and the same for all subsequences of at least length 3 that can be obtained by starting at the first time point and gradually adding more time points. We used an architecture with H = 12 heads and C = 2 cumulants, and estimated parameters via stochastic gradient descent with batches of size 1 and a learning rate of 0.001 for 100 epochs.
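As a sketch, the generative mechanism above might be implemented as follows. The function name and the default indices j1, j2, j3 = 0, 1, 2 are illustrative choices, and the condition "inactive from i_2 onwards" is evaluated within the observed window.

```python
import numpy as np

def simulate_sequence(p=10, j1=0, j2=1, j3=2, p_stop=0.2, seed=None):
    """Sketch of the simulation design: binary sequences in which variable j3
    activates with probability 0.9 once a latent indicator z is on; all other
    variables are independent Bernoulli(0.7) draws."""
    rng = np.random.default_rng(seed)

    def latent_z(X, i):
        # z = 1 iff there exist i1 < i2 <= i with X[i1, j1] = 1, X[i2, j2] = 1,
        # and j3 inactive at all observed time points after i2
        for i2 in range(1, i + 1):
            if X[i2, j2] == 1 and not X[i2 + 1:i + 1, j3].any() and X[:i2, j1].any():
                return 1
        return 0

    X = [rng.binomial(1, 0.7, size=p)]            # initial observation
    while True:
        z = latent_z(np.array(X), len(X) - 1)
        new = rng.binomial(1, 0.7, size=p)        # background Bernoulli(0.7)
        new[j3] = 1 if (z == 1 and rng.random() < 0.9) else 0
        X.append(new)
        if len(X) >= 3 and rng.random() < p_stop:  # random termination, min length 3
            break
    return np.array(X)

seq_demo = simulate_sequence(seed=0)
```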
For comparison, we consider three types of approaches: The first approach (Average) uses the average of all observations of a variable on the training data to predict on the test data. As a second approach (Regression), we consider a regression model for each variable that uses the values of all variables from the previous time point to predict the next value of that specific variable. Finally, we consider an approach (Informed) that is partially informed by the true underlying structure. For all variables except x_{t_i, j_3}, it uses the average for prediction, just as the first approach. For x_{t_i, j_3}, it calculates the average of all values conditional on x_{t_i, j_2} being 0 or 1 at the previous time point, and subsequently uses this for prediction.

Table 2: Average p-values (and standard deviations from 10 repetitions) resulting from the permutation testing approach when using sampling, with either V = 8 or V = 12 patterns of x^visit_v sampled, contrasted with the full permutation approach with V = 16, in a setting with p = 4 variables (where variables 1-3 are part of the true pattern) and 50 sequences.

| Variable | V = 8 | V = 12 | V = 16 |
|---|---|---|---|
| 1 | 0.084 ± 0.13 | 0.01 ± 0.02 | < 0.001 |
| 2 | 0.2 ± 0.29 | 0.049 ± 0.06 | 0.002 |
| 3 | 0.084 ± 0.16 | 0.01 ± 0.02 | 0.007 |
| 4 | 0.607 ± 0.24 | 0.633 ± 0.18 | 0.635 |

For evaluation, we consider the mean squared error (MSE) of prediction for 10 repetitions on a test dataset with 1000 sequences. Specifically, we consider the average MSE when predicting all p variables at the last time point of each sequence, and also the MSE just for x_{t_i, j_3}. As seen from Table 1, the MiniTransformer is competitive across different training sample sizes, even with just 100 sequences. It is only outperformed by the informed approach for the smallest sample size when considering the MSE across all variables, but not when considering the MSE just for x_{t_i, j_3}.
This might indicate some overfitting for the other variables by the MiniTransformer due to the small sample size. Next, we assess whether the learned cumulants recover the known underlying structure z_t in the simulation. As shown above, when z_t = 1, the probability of the target x_{t, j_3} increases sharply, so recovering z_t provides a direct test of whether the learned representation captures the intended mechanism rather than merely improving prediction. Figure 1 shows a random subset of simulated sequence trajectories, overlaying the true underlying structure z_t, the observed and predicted value of x_{t, j_3}, and the learned cumulants (3). Across sequences, the cumulant trajectories rise and fall in synchrony with the onset/offset of z_t, indicating that the cumulant representation captures the latent mechanism. Moreover, predicted target trajectories align with the observed target in periods where the latent structure is active, consistent with the simulation design in which z_t modulates the target dynamics. To assess the statistical testing approach, we consider the most challenging setting with just 100 sequences, and the variable with index j_3 as the target for prediction. As a full permutation approach with all potential realizations of x^visit_v is not feasible, we randomly sample 8 patterns. Across the 10 repetitions, the variable with index j_1 then receives an average p-value of 0.146, with index j_2 an average of 0.1285, and with index j_3 an average of 0.0022. The other seven variables receive averages between 0.2242 and 0.6777. This indicates that the approach can uncover the variables involved in the true underlying pattern, and in particular the variables with indices j_1 and j_2 as the driving factors of context effects. We also wanted to evaluate how well the sampling approach for realizations of x^visit_v reflects what would be obtained with a full permutation approach.
For computational feasibility, we consider a setting with just p = 4 variables, three of which carry the true pattern, and a total of 50 sequences. The sampling-based p-values for V = 8 and V = 12 realizations of x^visit_v with 10 repetitions are contrasted with the p-values based on the full permutation approach in Table 2. Similar to the results for 100 or more sequences, the three variables that are part of the underlying true pattern receive the smallest p-values. Yet, the sampling-based approach seems to have less power and more variability, in particular for a smaller number V of visit observation samples. Still, it is notable that even with just 50 sequences, some parts of the true pattern can still be identified.

3.2 Real data: reported stressors and general mental health status

We consider the longitudinal resilience assessment (LORA) study [32] as an application that motivated our development of the MiniTransformer approach. Specifically, we focus on three questionnaires that the participants answered every three months: the Mainz Inventory of Microstressors (MIMIS), a 58-item instrument capturing self-reported daily hassles (microstressors, dh) in the past week [33]; a 27-item life events inventory assessing self-reported life events (macrostressors, le) over the past three months [34]; and the 28-item General Health Questionnaire (GHQ-28), which measures mental health status over the past weeks [35]. The GHQ-28 consists of four subscales, namely somatic symptoms, anxiety/insomnia, social dysfunction, and depression, each containing seven items. Participants reported the number of days they experienced each daily hassle in the past week and the number of major life events that occurred in the past three months. The GHQ-28 items were rated on a 4-point Likert scale ranging from 0 ("Not at all") to 3 ("Much more than usual"), with total and subscale scores calculated by summing the respective item ratings.
To illustrate our approach, we consider two different subsets of items, each with p = 10 variables. In both datasets, we have one variable related to the general mental health status as a target variable. The other variables are stressors selected

Figure 1: Individual simulated and fitted trajectories for a random selection of individuals, with true underlying structure z_t, observed target x_{t, j_3}, predicted target \hat{x}_{t, j_3}, and learned cumulants (3). [Figure: per-sequence trajectory panels over time steps; plot content not recoverable from the extracted text.]
from macro- and microstressors, which we want to investigate for their temporal interaction with the target. Items were randomly selected from the available MIMIS and life events inventories, resulting in a mix of both microstressors (dh) and macrostressors (le) without pre-filtering for theoretical relevance to the target. This random selection allows us to assess whether the statistical testing framework can identify context effects in an unbiased manner. We preprocessed the data to obtain binary variables. Specifically, for the stressors, if a participant reports the presence of an item, we record this as 1, otherwise as 0. The GHQ total is dichotomized with a threshold of > 23 and the GHQ subscales with thresholds of > 6. We refer to the dichotomized GHQ sum score as psychological distress and the dichotomized GHQ-b subscale score as anxiety and sleep issues.

Table 3: Prediction performance of the MiniTransformer for two resilience datasets, each with p = 10, compared to a simple averaging approach, regression models, and a carry-forward approach. Mean squared error is considered across all variables (MSE) and for a specific target variable (MSE_tar), together with standard deviations. Smallest values are indicated by boldface.

| Approach | Metric | Dataset 1 | Dataset 2 |
|---|---|---|---|
| Average | MSE | 0.170 ± 0.008 | 0.149 ± 0.006 |
|  | MSE_tar | 0.229 ± 0.027 | 0.208 ± 0.018 |
| Carry forward | MSE | 0.220 ± 0.013 | 0.212 ± 0.020 |
|  | MSE_tar | 0.304 ± 0.013 | 0.298 ± 0.020 |
| Regression | MSE | 0.147 ± 0.009 | 0.135 ± 0.008 |
|  | MSE_tar | 0.199 ± 0.024 | 0.190 ± 0.023 |
| MiniTransformer | MSE | **0.139 ± 0.011** | **0.127 ± 0.009** |
|  | MSE_tar | **0.184 ± 0.021** | **0.182 ± 0.025** |
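The dichotomization described above can be sketched as follows; the function name and array layout are hypothetical, only the thresholds (> 0 for stressor presence, > 23 for the GHQ total, > 6 for a subscale) are taken from the text.

```python
import numpy as np

def binarize_features(stressor_counts, ghq_total, ghq_subscale):
    """Sketch of the preprocessing: stressor item counts become
    presence/absence indicators, the GHQ-28 sum score is dichotomized
    at > 23 (psychological distress) and a subscale score at > 6
    (e.g., GHQ-b, anxiety and sleep issues)."""
    x_stress = (np.asarray(stressor_counts) > 0).astype(int)  # presence of item
    distress = int(ghq_total > 23)
    anx_sleep = int(ghq_subscale > 6)
    return x_stress, distress, anx_sleep

xs_demo, distress_demo, anx_demo = binarize_features([0, 2, 5], ghq_total=30, ghq_subscale=5)
```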
We emphasize that this binarization was chosen for assessing stressor effects in a presence/absence sense (independent of frequency or magnitude); it is not a requirement of the MiniTransformer architecture, which can, in principle, incorporate continuous-valued variables and missingness-aware representations. The first dataset D_1 includes the following items: nightmares (dh_10), sleep problems (dh_35), paperwork (dh_37), housekeeping (dh_38), noise (dh_45), long work hours (dh_53), financial problems (le_8), arguments with a partner (le_17), serious illness (le_22), and a summary measure for anxiety and sleep issues (subscale GHQ-b). The second dataset D_2 includes commute to work/school (dh_11), unwanted visit (dh_31), paperwork (dh_37), housekeeping (dh_38), bad weather (dh_42), traffic (dh_46), lost job (le_1), breakup (le_16), arguments with a partner (le_17), and a summary measure for psychological distress (GHQ-sum). After selecting the variables of interest for these two datasets, we included only individuals with no missing follow-ups or items in both datasets, purely for simplicity in this real-data demonstration and to keep the showcase focused on the core method. In general, missing follow-ups or items can be handled using standard transformer mechanisms such as attention masking. The first dataset comprises |D_1| = 882 individuals, and the second includes |D_2| = 878 individuals. In both datasets, individuals have sequences of varying lengths, with a median of 13, a minimum of 3, and a maximum of 20 observations. Note that in autoregressive longitudinal modeling, each observation is conditioned on an individual's prior history, and repeated measurements within a person are strongly dependent. Thus, many follow-ups do not translate into a proportional increase in independent information, i.e., the effective sample size is primarily governed by the number of individuals and trajectory diversity, placing LORA in a small-data setting.
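The attention masking mentioned above can be sketched generically. The function name and array shapes below are our own illustrative choices, not the paper's implementation: time steps beyond an individual's observed sequence length are set to minus infinity before the softmax, so padding receives exactly zero attention weight.

```python
import numpy as np

def masked_attention_weights(scores, lengths):
    """Softmax over raw attention scores with a padding mask.
    scores: (batch, T, T) raw attention scores (queries x keys);
    lengths: observed sequence length per individual.
    Key positions at or beyond an individual's length get weight 0."""
    T = scores.shape[-1]
    valid = np.arange(T)[None, :] < np.asarray(lengths)[:, None]   # (batch, T)
    masked = np.where(valid[:, None, :], scores, -np.inf)
    masked = masked - masked.max(axis=-1, keepdims=True)           # numerical stability
    weights = np.exp(masked)                                       # exp(-inf) = 0
    return weights / weights.sum(axis=-1, keepdims=True)

# Two individuals, padded to T = 4; the first has only 2 observations.
scores = np.zeros((2, 4, 4))
w = masked_attention_weights(scores, lengths=[2, 4])
print(w[0, 0])  # weight 0.5 on each observed step, 0.0 on padding
```

With uniform scores, the first individual's weights are spread over its two observed time points only, while the second attends uniformly over all four.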
We compared the MiniTransformer approach to three other approaches. The first and the second are the same as in the simulation study, i.e., an averaging approach and p regression models, each trained to predict a single variable from all variables at the previous time point. The third approach simply carries the last observed values forward as a prediction for the next time point. For the MiniTransformer, we used H = 8 heads and C = 8 cumulants, and estimated parameters by batched stochastic gradient descent, each batch comprising sequences from two individuals, with a learning rate of η = 0.001, and trained the model for 150 epochs. Prediction was assessed across 10 cross-validation folds. Table 3 presents the results. The MiniTransformer approach consistently achieves the lowest prediction error, both when considering the target variable and the average error across all variables.

For the statistical testing approach, we used sampling with V = 8 visit samples. For the first dataset, we consider anxiety and sleep issues (subscale GHQ-b) as the target, and for the second dataset, psychological distress (GHQ-sum). The results are presented in Table 4.

Table 4: Results of the permutation testing approach for the two resilience application datasets. Average p-values are reported together with standard deviations. Variables are ranked in ascending order of p-values, with rank 1 corresponding to the most important variable.

Rank   Dataset 1                      Dataset 2
       Variable     p-value           Variable   p-value
1      dh_38        0.0131 ± 0.02     dh_38      0.0001 ± 0.00
2      dh_53        0.0720 ± 0.08     dh_37      0.0346 ± 0.05
3      le_8         0.1321 ± 0.16     dh_46      0.0850 ± 0.05
4      dh_37        0.1559 ± 0.08     dh_11      0.0996 ± 0.03
5      dh_45        0.2240 ± 0.11     le_17      0.3613 ± 0.10
6      dh_10        0.2631 ± 0.12     le_16      0.3631 ± 0.12
7      ghq_b_sum    0.3326 ± 0.43     le_1       0.3813 ± 0.02
8      le_22        0.3732 ± 0.08     dh_31      0.4589 ± 0.11
9      dh_35        0.3872 ± 0.17     ghq_sum    0.6965 ± 0.42
10     le_17        0.6141 ± 0.15     dh_42      0.7957 ± 0.13

In the first dataset, housekeeping (dh_38) had the lowest p-value, followed by long work hours (dh_53), suggesting that these contexts may play an important role in influencing how other variables in subsequent observations predict the target. In the second dataset, housekeeping (dh_38) was identified as the most important context variable, followed by paperwork (dh_37), traffic (dh_46), and commute to work/school (dh_11). At first, these results may seem surprising because of the high emphasis on the housekeeping stressor. Still, upon closer examination of the data, we observe that in the second dataset, housekeeping is a very commonly reported stressor, i.e., it is reported at 94% of the time points. On the other hand, it is a well-known finding that cessation of housekeeping often leads to a cluttered environment that may trigger or worsen mental distress, whereas keeping up with basic cleaning can support mental well-being [36]. To see whether the results from our approach confirm this pattern, we performed a check for the second dataset, where we detected housekeeping as a significant context for predicting psychological distress (GHQ_sum). For this, we assumed an imaginary individual who had reported all events at the first time point (x_{t_1,j} = 1, ∀ j ∈ {1, ..., p + 1}), and then defined different scenarios for the second time point and compared the predicted values. As the first scenario, we assumed that the same pattern is reported at the second time point, i.e., x_{t_1} = x_{t_2}. Here the model predicts the target as ŷ_{t_3,10} = 0.85. In the second scenario, we kept all the stressors except for housekeeping (x_{t_2,5} = 0). Using this, the model predicted ŷ_{t_3,11} = 0.89. In the third scenario, we switched all the stressors at time point t_2 off except for housekeeping (x_{t_2,5} = 1), and the model predicted ŷ_{t_3,11} = 0.38.
As a final scenario, we defined x_{t_2,j} = 0, ∀ j ∈ {1, ..., p + 1}, and here the model prediction changed to ŷ_{t_3,11} = 0.42. With these examples, we can see that in both the change from scenario one to two and from scenario three to four, stopping housekeeping leads to an increase in the predicted value for psychological distress (GHQ_sum). The heatmaps in Figure 2 illustrate the matrix S of test statistics s^(r)_j for both datasets, to highlight context-target effects. No specific context seems to be important when predicting serious life events, e.g., serious illness (le_22) in the first dataset and lost job (le_1) and breakup (le_16) in the second dataset. This is in line with the expectation that serious life events typically are not triggered by daily hassles. Instead, mental health status-related events reported by the GHQ questionnaire may have a greater influence on minor daily hassles, i.e., individuals are more likely to report daily hassles, such as paperwork. This also complies with mood-congruent memory bias, a well-known challenge in mental health research: an individual's current mental state can influence how they perceive and report stress, and people experiencing depression or anxiety are more likely to recall and emphasize negative or difficult life events [37].

4 Discussion

We introduced the MiniTransformer approach, a simplified decoder-only transformer architecture optimized for longitudinal data settings with small sample sizes and few time points. This was complemented by a permutation testing approach that assesses whether a given context significantly modifies the effect on predictions over time. Due to the flexibility of the patterns that can be picked up by the attention mechanism of a transformer, this statistical testing approach can uncover context effects that reflect complex temporal patterns.
Specifically, it quantifies through which variables past observations influence predictions based on the most recent observations, and how these relationships evolve throughout a sequence. In addition, a visualization based on corresponding test statistics can highlight context-target associations across time, offering an interpretable view of temporal dependencies.

[Figure 2: Context-target effects for resilience dataset 1 (left) and dataset 2 (right). Dark blue indicates smaller values of the visualized test statistic s^(r)_j (i.e., the context has a smaller effect), while dark red indicates larger values.]

We evaluated and illustrated these approaches using a simulation design with different sample sizes and a real-data application based on two subsets of a dataset collected in a resilience study. The results suggested that the MiniTransformer approach can uncover patterns even with a very limited number of sequences. Overfitting did not seem to be a strong concern, despite still having to estimate a considerable number of parameters. Naturally, there could have been alternative ways of simplifying the standard transformer architecture, as it has many components that could be adapted. We did not attempt an evaluation of different kinds of adaptations because of the large number of variants that could potentially be considered. Instead, we chose an approach where we used a rather simple statistical model class, VAR models, as a starting point, and added in only a minimal number of core transformer ideas. This subsequently also enabled an interpretable test statistic for assessing context effects of variables. A limitation of the proposed statistical testing approach is the limited sample size for the empirical distribution in settings with a small number of variables. With a larger number of variables, computational feasibility is a concern.
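To make this trade-off concrete, a minimal sketch of a sampling-based permutation p-value is given below. The statistic function and the toy data are hypothetical placeholders, not the paper's actual test statistic s^(r)_j.

```python
import numpy as np

rng = np.random.default_rng(0)

def sampled_permutation_pvalue(stat_fn, context, data, n_perm=200):
    """Monte Carlo approximation of a permutation p-value: the observed
    statistic is compared to n_perm statistics recomputed after shuffling
    the context labels. Fewer permutations mean a coarser null
    distribution and hence less power to separate similar contexts."""
    observed = stat_fn(context, data)
    null = np.array([stat_fn(rng.permutation(context), data)
                     for _ in range(n_perm)])
    # add-one correction avoids reporting a p-value of exactly zero
    return (1 + np.sum(null >= observed)) / (1 + n_perm)

# Toy example: a context perfectly aligned with the outcome should
# yield a small p-value.
context = np.array([0, 0, 0, 0, 1, 1, 1, 1])
outcome = context.astype(float)
mean_diff = lambda c, d: abs(d[c == 1].mean() - d[c == 0].mean())
p = sampled_permutation_pvalue(mean_diff, context, outcome)
print(p)
```

Note that with n_perm sampled permutations the smallest attainable p-value is 1/(n_perm + 1), which bounds how finely contexts can be distinguished.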
Generating a full null distribution requires a large number of permutations. When having to switch to a sampling-based approach due to computational cost, the statistical power of the test may be affected, particularly in capturing subtle differences between contexts. This limitation may lead to an increased risk of false negatives, where potentially informative contexts are overlooked due to insufficient permutation-based significance estimation. It is worth noting that, although the current implementation focuses on binary variables, the underlying framework can be readily extended to continuous variables. In such cases, contexts could be defined through discretization or binning schemes to capture graded effects. While binary variables simplify interpretation, extending the approach to continuous contexts would also enable a more fine-grained assessment of how varying levels of a variable influence predictions. This generalization would further enhance the applicability of the MiniTransformer approach to diverse clinical and longitudinal datasets. In summary, the proposed MiniTransformer approach demonstrates more generally that transformers can indeed be adapted for small longitudinal datasets with few time points, and can then not only achieve reasonable prediction performance but also offer interpretability. It might therefore be promising to consider simplified transformer architectures for other settings with sequential structure in cohort data.

Author contributions

HB and KF developed the MiniTransformer architecture and the statistical testing approach. KF performed the simulation study and the evaluation on the real data. MHa contributed to the mathematical formulation of the method. MW and MHe provided critical input on the statistical methodology and its interpretation.
The LORA consortium co-authors (KA, CS, BK, OT, KL, MP, AR, and RK) contributed to data provision as part of the LORA study. All authors reviewed and approved the manuscript.

Acknowledgments

The work of Maren Hackenberg, Moritz Hess, and Harald Binder was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – Project-ID 499552394 – SFB 1597. This project has received funding from the European Union's Horizon 2020 research and innovation program under Grant Agreement number 777084 (DynaMORE project) and from the Deutsche Forschungsgemeinschaft (CRC 1193, subproject Z03). This work is part of the LOEWE Center "DYNAMIC", funded by the HMWK Hessen.

Conflict of interest

The authors declare no potential conflict of interest.

References

[1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17, pages 6000–6010, Red Hook, NY, USA, December 2017. Curran Associates Inc.

[2] Alec Radford and Karthik Narasimhan. Improving Language Understanding by Generative Pre-Training. 2018.

[3] Qingsong Wen, Tian Zhou, Chaoli Zhang, Weiqi Chen, Ziqing Ma, Junchi Yan, and Liang Sun. Transformers in time series: a survey. In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, IJCAI '23, pages 6778–6786, Macao, P.R. China, August 2023.

[4] Zhichao Yang, Avijit Mitra, Weisong Liu, Dan Berlowitz, and Hong Yu. TransformEHR: transformer-based encoder-decoder generative model to enhance prediction of disease outcomes using electronic health records. Nature Communications, 14(1):7857, November 2023.

[5] Yikuan Li, Shishir Rao, José Roberto Ayala Solares, Abdelaali Hassaine, Rema Ramakrishnan, Dexter Canoy, Yajie Zhu, Kazem Rahimi, and Gholamreza Salimi-Khorshidi.
BEHRT: Transformer for Electronic Health Records. Scientific Reports, 10(1):7155, April 2020.

[6] Ali Hassani, Steven Walton, Nikhil Shah, Abulikemu Abuduweili, Jiachen Li, and Humphrey Shi. Escaping the Big Data Paradigm with Compact Transformers, June 2022. arXiv:2104.05704 [cs].

[7] Tianyang Lin, Yuxin Wang, Xiangyang Liu, and Xipeng Qiu. A survey of transformers. AI Open, 3:111–132, January 2022.

[8] Ailing Zeng, Muxi Chen, Lei Zhang, and Qiang Xu. Are Transformers Effective for Time Series Forecasting?, August 2022. arXiv:2205.13504 [cs].

[9] Helmut Lütkepohl. New Introduction to Multiple Time Series Analysis. Springer, Berlin, Heidelberg, 2005.

[10] Robert B. Litterman. Forecasting With Bayesian Vector Autoregressions—Five Years of Experience. Journal of Business & Economic Statistics, 4(1):25–38, January 1986.

[11] Ben S. Bernanke, Jean Boivin, and Piotr Eliasz. Measuring the Effects of Monetary Policy: A Factor-Augmented Vector Autoregressive (FAVAR) Approach. The Quarterly Journal of Economics, 120(1):387–422, February 2005.

[12] Howell Tong and Iris Yeung. Threshold Autoregressive Modelling in Continuous Time. Statistica Sinica, 1(2):411–430, 1991.

[13] Hans-Martin Krolzig. The Markov-Switching Vector Autoregressive Model. In Hans-Martin Krolzig, editor, Markov-Switching Vector Autoregressions: Modelling, Statistical Inference, and Application to Business Cycle Analysis, pages 6–28. Springer, Berlin, Heidelberg, 1997.

[14] Ruey S. Tsay. Testing and Modeling Multivariate Threshold Models. Journal of the American Statistical Association, 93(443):1188–1202, 1998.

[15] Zongwu Cai, Jianqing Fan, and Qiwei Yao. Functional-Coefficient Regression Models for Nonlinear Time Series. Journal of the American Statistical Association, 95(451):941–956, 2000.

[16] Zachary C.
Lipton, John Berkowitz, and Charles Elkan. A Critical Review of Recurrent Neural Networks for Sequence Learning, October 2015. arXiv:1506.00019 [cs].

[17] Guokun Lai, Wei-Cheng Chang, Yiming Yang, and Hanxiao Liu. Modeling Long- and Short-Term Temporal Patterns with Deep Neural Networks, April 2018. arXiv:1703.07015 [cs].

[18] Jiecheng Lu and Shihao Yang. Linear Transformers as VAR Models: Aligning Autoregressive Attention Mechanisms with Autoregressive Forecasting, February 2025.

[19] Yingqian Cui, Jie Ren, Pengfei He, Hui Liu, Jiliang Tang, and Yue Xing. Superiority of Multi-Head Attention: A Theoretical Study in Shallow Transformers in In-Context Linear Regression. In Proceedings of The 28th International Conference on Artificial Intelligence and Statistics, pages 937–945. PMLR, April 2025.

[20] Bang An, Jie Lyu, Zhenyi Wang, Chunyuan Li, Changwei Hu, Fei Tan, Ruiyi Zhang, Yifan Hu, and Changyou Chen. Repulsive Attention: Rethinking Multi-head Attention as Bayesian Inference. In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu, editors, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 236–255, Online, November 2020. Association for Computational Linguistics.

[21] Sarthak Jain and Byron C. Wallace. Attention is not Explanation, May 2019. arXiv:1902.10186 [cs].

[22] Tim Coleman, Wei Peng, and Lucas Mentch. Scalable and Efficient Hypothesis Testing with Random Forests. Journal of Machine Learning Research, 23(170):1–35, 2022.

[23] Daiki Miwa, Tomohiro Shiraishi, Vo Nguyen Le Duy, Teruyuki Katsuoka, and Ichiro Takeuchi. Statistical Test for Anomaly Detections by Variational Auto-Encoders, June 2024. arXiv:2402.03724 [stat].

[24] Tomohiro Shiraishi, Daiki Miwa, Teruyuki Katsuoka, Vo Nguyen Le Duy, Kouichi Taji, and Ichiro Takeuchi. Statistical test for attention maps in vision transformers.
In Proceedings of the 41st International Conference on Machine Learning, volume 235 of ICML'24, pages 45079–45096, Vienna, Austria, January 2025. JMLR.org.

[25] Léon Bottou. Large-Scale Machine Learning with Stochastic Gradient Descent. In Yves Lechevallier and Gilbert Saporta, editors, Proceedings of COMPSTAT'2010, pages 177–186, Heidelberg, 2010. Physica-Verlag HD.

[26] Maren Hackenberg, Marlon Grodd, Clemens Kreutz, Martina Fischer, Janina Esins, Linus Grabenhenrich, Christian Karagiannidis, and Harald Binder. Using Differentiable Programming for Flexible Statistical Modeling, December 2020. arXiv:2012.05722.

[27] Pablo Barceló, Mikaël Monet, Jorge Pérez, and Bernardo Subercaseaux. Model interpretability through the lens of computational complexity. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS '20, pages 15487–15498, Red Hook, NY, USA, December 2020. Curran Associates Inc.

[28] Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. Self-Attention with Relative Position Representations, April 2018. arXiv:1803.02155.

[29] Claude M. Setodji, Steven C. Martino, Michael S. Dunbar, and William G. Shadel. An Exponential Effect Persistence Model for Intensive Longitudinal Data. Psychological Methods, 24(5):622–636, October 2019.

[30] Oliver K. Schilling, Denis Gerstorf, Anna J. Lücke, Martin Katzorreck, Hans-Werner Wahl, Manfred Diehl, and Ute Kunzmann. Emotional Reactivity to Daily Stressors: Does Stressor Pile-Up Within a Day Matter for Young-Old and Very Old Adults? Psychology and Aging, 37(2):149–162, March 2022.

[31] Yupeng Li, Wei Dong, Boshu Ru, Adam Black, Xinyuan Zhang, and Yuanfang Guan. Generic medical concept embedding and time decay for diverse patient outcome prediction tasks. iScience, 25(9):104880, August 2022.

[32] A. Chmitorz, R. J. Neumann, B. Kollmann, K. F. Ahrens, S. Öhlschläger, N. Goldbach, D. Weichert, A. Schick, B. Lutz, M. M.
Plichta, C. J. Fiebach, M. Wessa, R. Kalisch, O. Tüscher, K. Lieb, and A. Reif. Longitudinal determination of resilience in humans to identify mechanisms of resilience to modern-life stressors: the longitudinal resilience assessment (LORA) study. European Archives of Psychiatry and Clinical Neuroscience, 271(6):1035–1051, September 2021.

[33] Andrea Chmitorz, Karolina Kurth, Lara K. Mey, Mario Wenzel, Klaus Lieb, Oliver Tüscher, Thomas Kubiak, and Raffael Kalisch. Assessment of Microstressors in Adults: Questionnaire Development and Ecological Validation of the Mainz Inventory of Microstressors. JMIR Mental Health, 7(2):e14566, February 2020.

[34] Turhan Canli, Maolin Qiu, Kazufumi Omura, Eliza Congdon, Brian W. Haas, Zenab Amin, Martin J. Herrmann, R. Todd Constable, and Klaus Peter Lesch. Neural correlates of epigenesis. Proceedings of the National Academy of Sciences of the United States of America, 103(43):16033–16038, October 2006.

[35] D. P. Goldberg, R. Gater, N. Sartorius, T. B. Ustun, M. Piccinelli, O. Gureje, and C. Rutter. The validity of two versions of the GHQ in the WHO study of mental illness in general health care. Psychological Medicine, 27(1):191–197, January 1997.

[36] Catherine A. Roster, Joseph R. Ferrari, and M. Peter Jurkat. The dark side of home: Assessing possession 'clutter' on subjective well-being. Journal of Environmental Psychology, 46:32–41, June 2016.

[37] Władysław Łosiak, Agata Blaut, Joanna Kłosowska, and Julia Łosiak-Pilch. Stressful Life Events, Cognitive Biases, and Symptoms of Depression in Young Adults. Frontiers in Psychology, 10, September 2019.