General Semiparametric Shared Frailty Model Estimation and Simulation with frailtySurv

JSS Journal of Statistical Software A ugust 2018, V olume 86, Issue 4. doi: 10.18637/jss.v086.i04 General Semiparametric Shared F railt y Mo del: Estimation and Sim ulation with frailt ySurv John V. Monaco Na v al Postgraduate Sc ho ol Malka Gorﬁne T el A viv Universit y Li Hsu F red Hutc hinson Cancer Research Cen ter Abstract The R pac kage frailtySurv for sim ulating and ﬁtting semi-parametric shared frailty mo dels is introduced. Pac kage frailt ySurv implements semi-parametric consistent estima- tors for a v ariet y of frailty distributions, including gamma, log-normal, inv erse Gaussian and pow er v ariance function, and pro vides consistent estimators of the standard errors of the parameters’ estimators. The parameters’ estimators are asymptotically normally distributed, and therefore statistical inference based on the results of this package, such as h yp othesis testing and conﬁdence interv als, can b e p erformed using the normal distri- bution. Extensiv e sim ulations demonstrate the ﬂexibility and correct implemen tation of the estimator. T w o case studies p erformed with publicly av ailable datasets demonstrate applicabilit y of the pac kage. In the Diab etic Retinopath y Study , the onset of blindness is clustered by patient, and in a large hard drive failure dataset, failure times are thought to b e clustered b y the hard drive manufacturer and mo del. K eywor ds : shared frailt y mo del, surviv al analysis, clustered data, frailtySurv , R . 1. In tro duction The semi-parametric Co x prop ortional hazards (PH) regression mo del w as developed by Sir Da vid Co x ( 1972 ) and is by far the most p opular mo del for surviv al analysis. The mo del deﬁnes a hazar d function , whic h is the rate of an even t o ccurring at any giv en time, given the observ ation is still at risk, as a function of the observed co v ariates. When data consist of indep enden t and identically distributed observ ations, the parameters of the Cox PH mo del are estimated using the partial likelihoo d ( Co x 1975 ) and the Breslow ( 1974 ) estimator. Often, the assumption of indep enden t and identically distributed observ ations is violated. In clinical data, it is t ypical for surviv al times to b e clustered or dep end on some unobserv ed co- 2 frailt ySurv : General Semiparametric Shared F railt y Mo del in R v ariates. This can b e due to geographical clustering, sub jects sharing common genes, or some other predisp osition that cannot b e observed directly . Surviv al times can also b e clustered b y sub ject when there are multiple observ ations p er sub ject with common baseline hazard. F or example, the Diab etic Retinopathy Study w as conducted to determine the time to the onset of blindness in high risk diab etic patients and to ev aluate the eﬀectiv eness of laser treatmen t. The treatmen t w as administered to one randomly-selected eye in each patient, leaving the other eye untreated. Ob viously , the tw o eyes’ measurements of each patient are clustered by patien t due to unmeasured patient-speciﬁc eﬀects. Clustered surviv al times are not limited to clinical data. Computer comp onen ts often exhibit clustering due to diﬀeren t materials and man ufacturing pro cesses. The failure rate of magnetic storage devices is of particular in terest since comp onen t failure can result in data loss. A large bac kup storage provider may utilize tens of thousands of hard driv es consisting of hundreds of diﬀerent hard drive mo dels. In ev aluating the time until a hard drive b ecomes inop erable, it is imp ortan t to consider op erating conditions as well as the hard drive mo del. Hard driv e surviv al times dep end on the mo del since commercial grade mo dels may b e built out of b etter materials and designed to ha ve longer lifetimes than consumer grade mo dels. The ab o ve t wo examples are used in Section 5 for demonstrating the usage of the frailtySurv package ( Monaco, Gorﬁne, and Hsu 2018 ). Cla yton ( 1978 ) accounted for cluster-sp eciﬁc unobserved eﬀects b y in tro ducing a random eﬀect term into the prop ortional hazards mo del, which later b ecame known as the shar e d fr ailty mo del . A shared frailt y mo del includes a latent random v ariable, the fr ailty , whic h comprises the unobserv able dep endency b et ween members of a cluster. The frailty has a m ultiplicative eﬀect on the hazard, and giv en the observed cov ariates and unobserved frailty , the surviv al times within a cluster are assumed indep enden t. Under the shared frailt y mo del, the hazard function at time t of observ ation j of cluster i is giv en b y λ ij ( t | Z ij , ω i ) = ω i λ 0 ( t ) e β > Z ij , j = 1 , . . . , m i , i = 1 , . . . , n, (1) where ω i is an unobserv able frailty v ariate of cluster i , λ 0 ( t ) is the unknown common baseline hazard function, β is the unknown regression co eﬃcien t v ector, and Z ij is the observed v ector of co v ariates of observ ation j in cluster i . The frailt y v ariates ω 1 , . . . , ω n , are indep endent and identically distributed with kno wn density f ( · ; θ ) and unknown parameter θ . There are curren tly seve ral estimation tec hniques av ailable with a corresp onding R package ( R Core T eam 2018 ) for ﬁtting a shared frailt y mo del, as shown in T able 1 . In a parametric mo del, the baseline hazard function is of kno wn parametric form, with several unknown pa- rameters. Parameter estimation of parametric mo dels is p erformed by the maximum marginal lik eliho o d (MML) approac h ( Duchateau and Janssen 2007 ; Wienk e 2010 ). The parfm pack- age ( Munda, Rotolo, and Legrand 2012 ) implements several parametric frailt y mo dels. In a semi-parametric mo del, the baseline hazard function is left unsp eciﬁed, a highly imp ortan t feature, as often in practice the shape of the baseline hazard function is unkno wn. Un- der the semi-parametric setting, the top do wnloaded packages, surviv al ( Therneau 2018b ) and coxme ( Therneau 2018a ), implement the p enalized partial lik eliho o d (PPL). frailtypac k parameter estimates are obtained b y nonlinear least squares (NLS) with the hazard func- tion and cum ulative hazard function mo deled by a 4th order cubic M-spline and integrated M-spline, resp ectiv ely ( Rondeau, Mazroui, and Gonzalez 2012 ). Since the frailty term is a latent v ariable, exp ectation maximization (EM) is also a natural estimation strategy for Journal of Statistical Softw are 3 pac kage ::function λ 0 Estimation pro cedure F railt y distributions W eekly do wnloads surviv al ::coxph NP PPL Gamma, LN, L T 3905 gss ::sscox NP PPL LN 1120 co xme ::coxme NP PPL LN 260 frailt ypack ::frailtyPenal NP NLS Gamma, LN 98 R2Ba yesX ::bayesx NP PPL LN 58 phmm ::phmm NP EM LN 52 frailt ySurv ::fitfrail NP PFL Gamma, LN, IG, PVF 50 frailt yHL ::frailtyHL NP HL Gamma, LN 50 parfm ::parfm P MML Gamma, PS, IG 49 survBa yes ::survBayes NP Ba yes Gamma, LN 28 T able 1: R functions for ﬁtting shared frailt y mo dels. NP = nonparametric, P = para- metric, PPL = p enalized partial likelihoo d, NLS = nonlinear least squares, EM = exp ec- tation maximization, PFL = pseudo full likelihoo d, HL = h-likelihoo d, MML = maximum marginal likelihoo d, LN = log-normal, L T = log-t, IG = in verse Gaussian, PS = p ositiv e stable. W eekly do wnloads are av erages from the time the package ﬁrst app ears on the RStu- dio CRAN mirror through 2016-06-01, as rep orted by the RStudio CRAN pac kage download logs: http://cran- logs.rstudio.com/ . semi-parametric mo dels, implemen ted by phmm ( Donohue and Xu 2017 ). More recently , a hierarc hical-likelihoo d (h-likelihoo d, or HL) metho d ( Do Ha, Lee, and k ee Song 2001 ) has b een used to ﬁt hierarc hical shared frailty mo dels, implemented b y frailt yHL ( Do Ha, Noh, and Lee 2018 ). Both R pac kages R2Bay esX ( Brezger, Kneib, and Lang 2005 ; Gu 2014 ) an d gss ( Hirsc h and Wienk e 2012 ) can ﬁt a shared frailty mo del and supp ort only Gaussian random eﬀects with the baseline hazard function estimated by p enalized splines. This work introduces the frailt ySurv R package, an implementation of Gorﬁne, Zuck er, and Hsu ( 2006 ) and Zuck er, Gorﬁne, and Hsu ( 2008 ), wherein an estimation pro cedure for semi- parametric shared frailty mo dels with general frailt y distributions was prop osed. Gorﬁne et al. ( 2006 ) addresses some limitations of other existing metho ds. Sp eciﬁcally , all other a v ailable semi-parametric packages can only be applied with gamma, log-normal (LN), and log-t (L T) frailt y distributions. In con trast, the semi-parametric estimation pro cedure used in frailt ySurv supp orts general frailt y distributions with ﬁnite momen ts, and the current version of frailtySurv implemen ts gamma, log-normal, inv erse Gaussian (IG), and p o wer v ariance function (PVF) frailt y distributions. Additionally , the asymptotic properties of most of the semi-parametric estimators in T able 1 are not known. In contrast, the regression co eﬃcien ts’ estimators, the frailt y distribution parameter estimator, and the baseline hazard estimator of frailt ySurv are bac ked b y a rigorous large-sample theory ( Gorﬁne et al. 2006 ; Zuck er et al. 2008 ). In particular, these estimators are consistent and asymptotically normally distributed. A consistent cov ariance-matrix estimator of the regression co eﬃcien ts’ estimators and the frailt y distribution parameter’s estimator is provided b y Gorﬁne et al. ( 2006 ) and Zuck er et al. ( 2008 ), also implemen ted by frailt ySurv . Alternatively , frailt ySurv can p erform v ariance estimation through a w eigh ted b ootstrap pro cedure. Pac kage frailtySurv is a v ailable from the Comprehensive R Arc hive Net work (CRAN) at https://CRAN.R- project.org/package= frailtySurv . 4 frailt ySurv : General Semiparametric Shared F railt y Mo del in R While some of the packages in T able 1 contain syn thetic and/or real-w orld surviv al datasets, none of them contain functions to sim ulate clustered data. There exist several other packages capable of simulating surviv al data, suc h as the rmultime function in the MST package ( Cal- houn, Su, Nunn, and F an 2018 ), the genSurv function in survMisc ( Dardis 2016 ), and survsim ( Moriña and Na v arro 2014 ), an R package dedicated to sim ulating surviv al data. These func- tions sim ulate only several frailty distributions. frailtySurv contains a rich set of simulation functions, describ ed in Section 2 , capable of generating clustered surviv al data under a wide v ariety of conditions. The sim ulation functions in frailt ySurv are used to empirically v erify the implemen tation of un biased (bug-free) estimators through sev eral simulated exp erimen ts. The rest of this pap er is organized as follows. Sections 2 and 3 describ e the data generation and model estimation functions of frailt ySurv , resp ectively . Section 4 demonstrates sim ulation capabilities and results. Section 5 is a case study of tw o publicly a v ailable datasets, including high-risk patien ts from the Diab etic Retinopathy Study and a large hard drive failure dataset. Finally , Section 6 concludes the pap er. The curren tly supported frailt y distributions are describ ed in Appendix A , full sim ulation results are presen ted in App endix B , and Appendix C con tains an empirical analysis of run time and accuracy . 2. Data generation The genfrail function in frailt ySurv can generate clustered surviv al times under a wide v ariety of conditions. The surviv al function at time t of the j th observ ation of cluster i , given time-indep enden t cov ariate Z ij and frailty v ariate ω i , is given by S ij ( t | Z ij , ω i ) = exp n − Λ 0 ( t ) ω i e β > Z ij o , (2) where Λ 0 ( t ) = R t 0 λ 0 ( u ) du is the unsp eciﬁed cumulativ e baseline hazard function. In the follo wing sections we describ e in detail the v arious options for setting each comp onen t of the ab o v e conditional surviv al function. 2.1. Cov ariates Co v ariates can b e sampled marginally from normal, uniform, or discrete uniform distributions, as sp eciﬁed by the covar.distr parameter. The v alue of β is sp eciﬁed to genfrail through the covar.param parameter. User-supplied cov ariates can also b e passed as the covar.matrix parameter. There is no limit to the cov ariates’ v ector dimension. How ever, the estimation pro cedure requires the num b er of clusters to b e m uch higher than the num b er of cov ariates. These options are demonstrated in Section 2.7 . 2.2. Baseline hazard There are three w ays the baseline hazard can b e sp eciﬁed to generate surviv al data: as the in verse cumulativ e baseline hazard Λ − 1 0 , the cum ulative baseline hazard Λ 0 , or the baseline hazard λ 0 . If the cumulativ e baseline hazard function can b e directly in verted, then failure times can b e computed by T o ij = Λ − 1 0 n − ln ( U ij ) e − β > Z ij /ω i o , (3) Journal of Statistical Softw are 5 where U ij ∼ U (0 , 1) and T o ij is the failure time of member j of cluster i . Consequen tly , if Λ − 1 0 is provided as parameter Lambda_0_inv in genfrail , then surviv al times are determined by Equation 3 . This is the most eﬃcient wa y to generate surviv al data. When Λ 0 cannot b e in verted, one can use a univ ariate ro ot-ﬁnding algorithm to solv e S ij  T o ij | Z ij , ω i  − U ij = 0 (4) for failure time T o ij . Alternativ ely , taking the logarithm and solving − Λ 0  T o ij  ω i e β > Z ij − ln U ij = 0 (5) yields greater numerical stabilit y . Therefore, genfrail uses Equation 5 when Λ 0 is pro vided as parameter Lambda_0 in genfrail and uses the R function uniroot , whic h is based on Bren t’s algorithm ( Brent 2013 ). If neither Λ − 1 0 or Λ 0 are provided to genfrail , then the baseline hazard function λ 0 m ust b e passed as parameter lambda_0 . In this case, Λ 0 ( t ) = Z t 0 λ 0 ( s ) ds (6) is ev aluated numerically . Using the integrate function in the stats package ( R Core T eam 2018 ), whic h implemen ts adaptiv e quadrature, Equation 5 can b e n umerically solv ed for T o ij . This approac h is the most computationally exp ensiv e since it requires n umerical integration to b e p erformed for eac h observ ation ij and at each iteration in the ro ot-ﬁnding algorithm. Section 2.7 demonstrates generating data using each of the ab o ve metho ds, which all generate failure times in the range [0 , ∞ ) . Th e computational complexity of each metho d is O ( n ) under the assumption that a constant amount of w ork needs to b e p erformed for each observ ation. Despite this, the constant amount of work p er observ ation v aries greatly dep ending on how the baseline hazard is sp eciﬁed. Using the inv erse cum ulativ e baseline hazard, there exists an analytic solution for eac h observ ation and only arithmetic op erations are required. Sp ecifying the cum ulative baseline hazard requires ro ot ﬁnding for each observ ation, and sp ecifying the baseline hazard requires b oth ro ot ﬁnding and numerical integration for eac h observ ation. Since the time to p erform ro ot ﬁnding and n umerical integration is not a function of n , the complexit y remains linear in eac h case. App endix C.1 contains b enc hmark sim ulations that compare the timings of eac h metho d. 2.3. Shared frailty Shared frailt y v ariates ω 1 , . . . , ω n are generated according to the sp eciﬁed frailty distribution, through parameters frailty and theta of genfrail , resp ectiv ely . The a v ailable distributions are gamma with mean 1 and v ariance θ ; PVF with mean 1 and v ariance 1 − θ ; log-normal with mean exp( θ / 2) and v ariance exp(2 θ ) − exp( θ ) ; and in verse Gaussian with mean 1 and v ariance θ . genfrail can also generate frailty v ariates from a p ositiv e stable (PS) distribution, although estimation is not supp orted due to the PS having an inﬁnite mean. The supp orted frailt y distributions are describ ed in detail in App endix A . Sp ecifying parameters that induce a degenerate frailty distribution, or passing frailty = "none" , will generate non-clustered data. Hierarc h ical clustering is curren tly not supp orted b y frailt ySurv . 6 frailt ySurv : General Semiparametric Shared F railt y Mo del in R 0.0 0.5 1.0 1.5 0 1 2 3 ω Density Gamma(0.86) LN(1.17) IG(2.03) PVF(0.08) PS(0.70) κ = 0.30 0 1 2 3 ω Gamma(1.64) LN(2.71) IG(15.28) PS(0.55) κ = 0.45 0 1 2 3 ω Gamma(4.67) LN(12.55) PS(0.30) κ = 0.70 Figure 1: F railty distribution densities. The dep endence b et ween tw o cluster members can b e measured by Kendall’s tau 1 , giv en by κ = 4 Z ∞ 0 s L ( s ) L (2) ( s ) ds − 1 , (7) where L is the Laplace transform of the frailty distribution and L ( m ) , m = 1 , 2 , . . . are the m th deriv atives of L . If failure times of the t wo cluster members are indep enden t, κ = 0 . Figure 1 sho ws the densities of the supp orted distributions for v arious v alues of κ . Note that gamma, IG, and PS are sp ecial cases of the PVF. F or gamma, LN, and IG, κ = 0 when θ = 0 , and for PVF, κ = 0 when θ = 1 . Also, for gamma and LN, lim θ →∞ κ = 1 , for IG lim θ →∞ κ = 1 / 2 , for PVF lim θ → 0 κ = 1 / 3 , and for PS κ = 1 − θ . 2.4. Cluster sizes In practice, the cluster sizes m i , i = 1 , . . . , n , can b e ﬁxed or may v ary . F or example, in the Diab etic Retinopathy Study , t wo failure times are observ ed for eac h sub ject, corresponding to the left and right eye. Hence, observ ations are clustered by sub ject, and each cluster has exactly tw o members. If instead the observ ations were clustered b y geographical lo cation, the cluster sizes would v ary , e.g., according to a discrete p o w er law distribution. genfrail is able to generate data with ﬁxed or v arying cluster sizes. F or ﬁxed cluster size, the cluster size parameter K of genfrail is simply an in teger. Alter- nativ ely , the cluster sizes may b e sp eciﬁed by passing a length- N vector of the desired cluster sizes to parameter K . T o generate v aried cluster sizes, K is the name of the distribution to generate from, and K.param sp eciﬁes the distribution parameters. Cluster sizes can b e generated from a k -truncated Poisson with K = "poisson" ( Gey er 2018 ). The truncated Poisson is used to ensure there are no zero-sized clusters and to enforce a 1 Kendall’s tau is denoted by κ to a void confusion with τ , the end of the follo w-up perio d. Journal of Statistical Softw are 7 minim um cluster size. The exp ected cluster size is giv en b y        λ  1 − e − λ P k j =0 λ j j !  − 1 k = 0 , λ − e − λ P k j =1 λ j ( j − 1)! 1 − e − λ P k j =0 λ j j ! k > 0 , (8) where λ is a shap e parameter and k is the truncation p oin t such that min { m 1 , . . . , m n } > k . The t ypical case is with k = 0 for a zero-truncated P oisson. F or example, with λ = 2 and k = 0 , the exp ected cluster size equals 2.313. The parameters of the k -truncated Poisson are determined in K.param = c(lambda, k) of genfrail . A discrete Pareto (or zeta) distribution can also b e used to generate cluster sizes with K = "pareto" . Accurately ﬁtting and generating from a discrete pow er-la w distribution is generally diﬃcult, and genfrail uses a truncated discrete P areto to av oid some of the pitfalls as describ ed in Clauset, Shalizi, and Newman ( 2009 ). The probability mass function is given b y P ( M = m ) = ( m − l ) − s /ζ ( s ) P u − l j =1 j − s /ζ ( s ) , s > 1 , u > l , m = l + 1 , . . . , u, (9) where ζ ( s ) is the Riemann zeta function, s is a scaling parameter, l is the noninclusiv e lo wer b ound, and u is the inclusiv e upper b ound. With large enough u and s  1 , the distribution b eha ves similar to the discrete Pareto distribution and the exp ected cluster size equals 1 ζ ( s ) P ∞ j =1 1 j s − 1 . The distribution parameters are sp eciﬁed as K.param = c(s, u, l) . Finally , a discrete uniform distribution can b e sp eciﬁed b y K = "uniform" in genfrail . The resp ectiv e parameters to K.param are c(l, u) , where l is the noninclusiv e low er b ound and u is the inclusive upp er b ound. Similar to the truncated zeta, the support is { l + 1 , . . . , u } while each cluster size is uniformly selected from this set of v alues. Since the lo w er b ound is noninclusiv e, the exp ected cluster size equals (1 + l + u ) / 2 . 2.5. Censoring The observ ed times T ij and failure indicators δ ij are determined by the failure times T o ij and righ t-censoring times C ij suc h that the observed time of observ ation ij is given by T ij = min  T o ij , C ij  , j = 1 , . . . , m i i = 1 , . . . , n, (10) and the failure indicator is given by δ ij = I  T o ij ≤ C ij  , j = 1 , . . . , m i i = 1 , . . . , n . (11) Curren tly , only right-censoring is supp orted b y frailt ySurv . The censoring distribution is sp eciﬁed by the parameters censor.distr and censor.param for the distribution name and parameters’ vector, resp ectiv ely . A normal distribution is used by default. A log-normal censoring distribution is sp eciﬁed by censor.distr = "lognormal" and censor.param = c(mu, sigma) , where mu is the mean and sigma is the standard deviation of the censoring distribution. Lastly , a uniform censoring distribution can b e sp eciﬁed by censor.distr = "uniform" and censor.param = c(lower, upper) for the low er and upp er bounds on the in terv al, resp ectiv ely . 8 frailt ySurv : General Semiparametric Shared F railt y Mo del in R Sometimes a particular censoring rate is desired. Typically , the censoring distribution pa- rameters are v aried to obtain a desired censoring rate. genfrail can av oid this eﬀort on b ehalf of the user by letting the desired censoring rate b e sp eciﬁed instead. In this case, the appropriate parameters for the censoring distribution are determined to ac hieve the desired censoring rate, given the generated failure times. Let F and G b e the failure time and censoring time cumulativ e distributions, resp ectiv ely . Then, the censoring rate equals E { I ( T o 11 > C 11 ) } = Z ∞ 0 G ( t ) dF ( t ) , (12) where the exp ectation of I ( T o 11 > C 11 ) equals the exp ectation of any random sub ject from the p opulation. The ab o ve formula can b e estimated by b E { I ( T o 11 > C 11 ) } = Z ∞ 0 G ( t ) d b F ( t ) , (13) where b F is the empirical cumulativ e distribution fu nction. T o obtain a particular censoring rate 0 < R < 1 , as a function of the parameters of G , one can solv e R − b E { I ( T o 11 > C 11 ) } = 0 . (14) F or example, if G is the normal cumulativ e distribution function with mean µ and v ariance σ 2 , σ 2 should b e pre-sp eciﬁed (otherwise the problem is non-identiﬁable), and Equation 14 is solv ed for µ . This metho d works with any empirical distribution of failure times. genfrail uses this approac h to achiev e a desired censoring rate, sp eciﬁed b y censor.rate , with normal, log-normal, or uniform censoring distributions. Lastly , user-supplied censorship times can b e supplied through the censor.time parameter, which m ust b e a vector of length N * K , where N is the n umber of clusters and K is the size of each cluster. Because of this, censor.time cannot b e used with v ariable-sized clusters. 2.6. Rounding In some applications the observed times are rounded and tied failure times can o ccur. F or example, the age at onset of certain diseases are often recorded as y ears of age rounded to the nearest integer. T o simulate tied data, the simulated observed times may optionally b e rounded to the nearest in teger of multiple of B by ˙ T ij = B $ T ij B + 0 . 5 % . (15) If B = 1 , the observed times are simply rounded to the nearest integer. The v alue of B is sp eciﬁed by the parameter round.base of genfrail , with the default b eing the non-rounded setting. 2.7. Examples The b est w ay to see how genfrail w orks is through examples. R and frailtySurv v ersions are given by the following commands. Journal of Statistical Softw are 9 R> R.Version()$version.string [1] "R version 3.4.3 (2017-11-30)" R> packageDescription("frailtySurv", fields = "Version") [1] "1.3.5" Consider the surviv al mo del deﬁned in Equation 2 with baseline hazard function λ 0 ( t ) = n d ( ct ) d o t − 1 , (16) where c = 0 . 01 and d = 4 . 6 . Let Gamma (2) b e the frailty distribution, tw o indep enden t standard normally distributed cov ariates, and N  130 , 15 2  the censoring distribution. The resulting surviv al times are represen tative of a late onset disease and with ∼ 40% censoring rate. Generating surviv al data from this mo del, with 300 clusters and 2 mem b ers within each cluster, is accomplished by 2 R> set.seed(2015) R> dat <- genfrail(N = 300, K = 2, beta = c(log(2), log(3)), + frailty = "gamma", theta = 2, + lambda_0 = function(t, c = 0.01, d = 4.6) (d * (c * t) ^ d) / t) R> head(dat, 3) family rep time status Z1 Z2 1 1 1 87.95447 1 -1.5454484 0.9944159 2 1 2 110.04615 0 -0.5283932 -0.9053164 3 2 1 119.94127 1 -1.0867588 0.5240979 Similarly , to generate surviv al data with uniform cov ariates from, e.g., 0 . 1 to 0 . 2 , specify covar.distr = "uniform" and covar.param = c(0.1, 0.2) in the ab o ve example. The co v ariates may also b e sp eciﬁed explicitly in a c(N * K, length(beta)) matrix as the covar.matrix parameter. In the ab o ve example, the baseline hazard function was sp eciﬁed by the lambda_0 parameter. The same dataset can b e generated more eﬃciently using the Lambda_0 parameter if the cu- m ulative baseline hazard function is known. This is accomplished by integrating Equation 16 to get the cumulativ e baseline hazard function Λ 0 ( t ) = ( ct ) d (17) and passing this function as an argument to Lambda_0 when calling genfrail : R> set.seed(2015) R> dat.cbh <- genfrail(N = 300, K = 2, beta = c(log(2),log(3)), + frailty = "gamma", theta = 2, + Lambda_0 = function(t, c = 0.01, d = 4.6) (c * t) ^ d) R> head(dat.cbh, 3) 2 Note that N and K are the parameters of genfrail that corresp ond to math notation n (num b er of clusters) and m i (cluster size), respectively . 10 frailt ySurv : General Semiparametric Shared F railt y Mo del in R family rep time status Z1 Z2 1 1 1 87.95447 1 -1.5454484 0.9944159 2 1 2 110.04615 0 -0.5283932 -0.9053164 3 2 1 119.94127 1 -1.0867588 0.5240979 The cum ulative baseline hazard in Equation 17 is in vertible and it would b e ev en more eﬃcient to sp ecify Λ − 1 0 as Λ − 1 0 ( t ) = c − 1 t 1 /d . (18) This av oids the numerical integration, required by Equation 6 , and ro ot ﬁnding, required by Equation 5 . Equation 18 should b e passed to genfrail as the Lambda_0_inv parameter, again pro ducing the same data when the same seed is used: R> set.seed(2015) R> dat.inv <- genfrail(N = 300, K = 2, beta = c(log(2),log(3)), + frailty = "gamma", theta = 2, + Lambda_0_inv = function(t, c = 0.01, d = 4.6) (t ^ (1 / d)) / c) R> head(dat.inv, 3) family rep time status Z1 Z2 1 1 1 87.95449 1 -1.5454484 0.9944159 2 1 2 110.04615 0 -0.5283932 -0.9053164 3 2 1 119.94127 1 -1.0867588 0.5240979 A diﬀerent frailt y distribution can b e sp eciﬁed while ensuring an exp ected censoring rate b y using the censor.rate parameter. F or example, consider a PVF (0 . 3) frailty distribution while maintaining the 40% censoring rate in the previous example. The censoring distribution parameters are d etermined by genfrail as describ ed in Section 2.5 by sp ecifying censor.rate = 0.4 . This av oids the need to man ually adjust the censoring distribution to ac hieve a particular censoring rate. The resp ective co de and output are: R> set.seed(2015) R> dat.pvf <- genfrail(N = 300, K = 2, beta = c(log(2),log(3)), + frailty = "pvf", theta = 0.3, censor.rate = 0.4, + Lambda_0_inv = function(t, c = 0.01, d = 4.6) (t ^ (1 / d)) / c) R> summary(dat.pvf) genfrail created : 2018-06-14 13:46:36 Observations : 600 Clusters : 300 Avg. cluster size : 2.00 Right censoring rate : 0.39 Covariates : normal(0, 1) Coefficients : 0.6931, 1.0986 Frailty : pvf(0.3) Baseline hazard : Lambda_0 = function (t, tau = 4.6, C = 0.01) (t^(1/tau))/C Journal of Statistical Softw are 11 3. Mo del estimation The fitfrail function in frailtySurv estimates the regression co eﬃcien t vector β , the frailty distribution’s parameter θ , and the non-parametric cumulativ e baseline hazard Λ 0 . The observ ed data consist of { T ij , Z ij , δ ij } for i = 1 , . . . , n and j = 1 , . . . , m i , where the n clusters are indep enden t. fitfrail takes a complete observ ation approach, and observ ations with missing v alues are ignored with a w arning. There are t wo estimation strategies that can b e used. The log-likelihoo d can b e maximized directly , by using control parameter fitmethod = "loglik" , or a system of score equations can b e solved with con trol parameter fitmethod = "score" . Both methods hav e comparable computational requirements and yield comparable results. In b oth metho ds, the estimation pro cedure consists of a doubly-nested loop, with an outer lo op that ev aluates the ob jec- tiv e function and gradien ts and an inner lo op that estimates the piecewise constant hazard, p erforming n umerical integration at eac h time step if necessary . As a result, the estimator implemen ted in frailtySurv has computationally complexit y on the order of O  n 2  . 3.1. Log-likelihoo d The full likelihoo d can b e written as L ( β , θ , Λ 0 ) = n Y i =1 Z m i Y j =1 { λ ij ( T ij | Z ij , ω ) } δ ij S ij ( T ij | Z ij , ω ) f ( ω ) dω = n Y i =1 m i Y j =1 n λ 0 ( T ij ) e β > Z ij o δ ij n Y i =1 ( − 1) N i. ( τ ) L ( N i. ( τ )) { H i. ( τ ) } , (19) where τ is the end of follo w-up p eriod, f is the frailty’s densit y function, N ij ( t ) = δ ij I ( T ij ≤ t ) , N i. ( t ) = P m i j =1 N ij ( t ) , H ij ( t ) = Λ 0 ( T ij ∧ t ) e β > Z ij , and H i. ( t ) = P m i j =1 H ij ( t ) , j = 1 , . . . , m i , i = 1 , . . . , n . Note that the m th deriv ative of the Laplace transform ev aluated at H i. ( τ ) equals ( − 1) N i. ( τ ) R ω N i. ( τ ) exp {− ω H i. ( τ ) } f ( ω ) dω , i = 1 , . . . , n . The log-lik eliho o d equals ` ( β , θ , Λ 0 ) = n X i =1 m i X j =1 δ ij log n λ 0 ( T ij ) e β > Z ij o + n X i =1 log L { N i. ( τ ) } { H i. ( τ ) } . (20) Eviden tly , to obtain estimators b β and b θ based on the log-lik eliho o d, an estimator of Λ 0 , denoted by b Λ 0 , is required. F or given v alues of β and θ , Λ 0 is estimated b y a step function with jumps at the ordered observed failure times τ k , k = 1 , . . . , K , deﬁned b y ∆ b Λ 0 ( τ k ) = d k P n i =1 ψ i  γ , b Λ 0 , τ k − 1  P m i j =1 Y ij ( τ k ) e β > Z ij , k = 1 , . . . , K, (21) where d k is the num b er of failures at time τ k , ψ i ( γ , Λ , t ) = φ 2 i ( γ , Λ , t ) /φ 1 i ( γ , Λ , t ) , Y ij ( t ) = I ( T ij ≥ t ) , and φ ai ( γ , Λ 0 , t ) = L ( N i. ( t )+ a − 1) { H i. ( t ) } a = 1 , 2 . F or the detailed deriv ation of the ab o ve baseline hazard estimation the reader is referred to Gorﬁne et al. ( 2006 ). The estimator of the cumulativ e baseline hazard at time τ k is giv en b y b Λ 0 ( τ k ) = k X l =1 ∆ b Λ 0 ( τ l ) , (22) 12 frailt ySurv : General Semiparametric Shared F railt y Mo del in R and is a function of P m i i =1 b Λ 0 ( T ij ∧ τ k − 1 ) e β > Z ij , i.e., at each τ k , the cum ulative baseline hazard estimator is a function of b Λ 0 ( t ) with t < τ k . Then, for obtaining b β and b θ , b Λ 0 is substituted in to ` ( β , θ , Λ 0 ) . In summary , the estimation pro cedure of Gorﬁne et al. ( 2006 ) consists of the following steps: Step 1. Use standard Cox regression softw are to obtain initial estimates of β , and set the initial v alue of θ to b e its v alue under within-cluster indep endence or under v ery week dep endency (see also the discussion at the end of Section 3.2 ). Step 2. Use the curren t v alues of β and θ to estimate Λ 0 based on the estimation pro cedure deﬁned by Equation 21 . Step 3. Using the current v alue of b Λ 0 , estimate β and θ b y maximizing l ( β , θ , b Λ 0 ) . Step 4. Iterate b et ween Steps 2 and 3 un til con vergence. F or frailty distributions with no closed-form Laplace transform, the in tegral can b e ev aluated n umerically . This adds a considerable o verhead to each iteration in the estimation pro cedure since the integrations must b e p erformed for the baseline hazard estimator that is required for estimating β and θ , as H i. ( τ ) = P m i i =1 Λ 0 ( T ij ∧ τ ) e β > Z ij . With con trol parameter fitmethod = "loglik" , the log-lik eliho o d is the ob jectiv e function maximized directly with resp ect to γ = ( β > , θ ) > , for any giv en Λ 0 , by optim in the stats pac kage using the L-BF GS-B algorithm ( Byrd, Lu, No cedal, and Zh u 1995 ). Bo x constraints sp ecify b ounds on the frailty distribution parameters, t ypically θ ∈ (0 , ∞ ) except for PVF whic h has θ ∈ (0 , 1) . Conv ergence is determined by the relative reduction in the ob jectiv e function through the reltol con trol parameter. By default, this is 10 − 6 . As an example, consider ﬁtting a mo del to the data generated in Section 2 . The following result shows that con vergence is reached after 11 iterations and 15.8 seconds, running Red Hat 6.5, R version 3.2.2, and 2.6 GHz In tel Sandy Bridge pro cessor: R> fit <- fitfrail(Surv(time, status) ~ Z1 + Z2 + cluster(family), + dat, frailty = "gamma", fitmethod = "loglik") R> fit Call: fitfrail(formula = Surv(time, status) ~ Z1 + Z2 + cluster(family), dat = dat, frailty = "gamma", fitmethod = "loglik") Covariate Coefficient Z1 0.719 Z2 1.194 Frailty distribution gamma(1.716), VAR of frailty variates = 1.716 Log-likelihood -2507.725 Converged (method) 11 iterations, 6.75 secs (maximized log-likelihood) Journal of Statistical Softw are 13 3.2. Score equations Instead of maximizing the log-lik eliho o d, one can solve the score equations. The score function with resp ect to β is given by U β = ∂ ∂ β ` ( β , θ , Λ 0 ) = n X i =1   m i X j =1 δ ij Z ij + ∂ ∂ β H i. ( τ ) ∂ ∂ H i. ( τ ) L { N i. ( τ ) } ( H i. ( τ )) L { N i. ( τ ) } ( H i. ( τ ))   = n X i =1   m i X j =1 δ ij Z ij + m i X j =1 H ij ( T ij ) Z ij L { N i. ( τ )+1 } ( H i. ( τ )) L { N i. ( τ ) } ( H i. ( τ ))   . (23) Note that L ( N i. ( τ )+1) { H i. ( τ ) } / L ( N i. ( τ )) { H i. ( τ ) } corresp onds to ψ i in Gorﬁne et al. ( 2006 ). The score function with resp ect to θ is given by U θ = ∂ ∂ θ ` ( β , θ , Λ 0 ) = n X i =1 ∂ ∂ θ L ( N i. ( τ )) ( H i. ( τ )) L ( N i. ( τ )) ( H i. ( τ )) . (24) The score equations are given by U ( β , θ , Λ 0 ) = ( U β , U θ ) = 0 and the estimator of γ = ( β > , θ ) is deﬁned as the v alue of ( β > , θ ) that solv es the score equations for any giv en Λ 0 . Sp eciﬁcally , the only change required in the ab o ve summary of the estimation pro cedure, is to replace Step 3 with the follo wing Step 3’ . Using the current v alue of b Λ 0 , estimate β and θ b y solving U ( β , θ , b Λ 0 ) = 0 . frailt ySurv uses Newton’s metho d implemen ted by the nleqslv pac kage to solve the system of equations ( Hasselman 2017 ). Con vergence is reached when the relativ e reduction of each parameter estimate or absolute v alue of each normalized score is b elow the threshold sp eciﬁed b y reltol or abstol , resp ectiv ely . The default is a relative reduction of b γ less than 10 − 6 , i.e., reltol = 1e-6 . As an example, in the follo wing lines of code and output w e consider again the data gener- ated in Section 2 . The results are comparable to the ﬁtted mo del in Section 3.1 . The score equations can usually b e solved in few er iterations than maximizing the lik eliho o d, although solving the system of equations requires more work in eac h iteration. F or this reason, max- imizing the lik eliho o d is typically more computationally eﬃcien t for large datasets when a p ermissiv e conv ergence criterion is sp eciﬁed. R> fit.score <- fitfrail(Surv(time, status) ~ Z1 + Z2 + cluster(family), + dat, frailty = "gamma", fitmethod = "score") R> fit.score Call: fitfrail(formula = Surv(time, status) ~ Z1 + Z2 + cluster(family), dat = dat, frailty = "gamma", fitmethod = "score") Covariate Coefficient Z1 0.719 Z2 1.194 14 frailt ySurv : General Semiparametric Shared F railt y Mo del in R Frailty distribution gamma(1.716), VAR of frailty variates = 1.716 Log-likelihood -2507.725 Converged (method) 10 iterations, 6.50 secs (solved score equations) L-BF GS-B, used for maximizing the log-lik eliho o d, allo ws for (p ossibly op en-ended) b o x con- strain ts. In con trast, Newton’s metho d, used for solving the system of score equations, do es not supp ort the use of b o x constrain ts and, therefore, has a risk of conv erging to a degenerate parameter v alue. In this case, it is more imp ortan t to hav e a sensible starting v alue. In b oth estimation metho ds, the regression co eﬃcien t vector β is initialized to the estimates giv en b y coxph with no shared frailt y . The frailt y distribution parameters are initialized such that the dep endence b et ween mem b ers in each cluster is small, i.e, with κ ≈ 0 . 3 . 3.3. Baseline hazard The estimated cum ulativ e baseline hazard deﬁned b y Equation 22 is accessible from the resulting mo del ob ject through the fit$Lambda member, which provides a data.frame with the estimates at each observ ed failure time, or the fit$Lambda.fun member, which deﬁnes a scalar R function that takes a time argumen t and returns the estimated cumulativ e baseline hazard. The estimated surviv al curve or cumulativ e baseline hazard can also b e summarized b y the summary metho d for ob jects returned by fitfrail resulting in a data.frame . In the example b elo w, the n.risk column contains the num b er of observ ations still at risk at time t − and the n.event column contains the n umber of failures from the previous time listed to time t + . The output is similar to that of the summary metho d for ‘ survfit ’ ob jects in the surviv al package. R> head(summary(fit), 3) time n.risk n.event surv 1 23.37616 600 1 0.9992506 2 24.38503 599 1 0.9984604 3 25.14435 598 1 0.9976600 R> tail(summary(fit), 3) time n.risk n.event surv 384 139.5629 42 1 0.0016570493 385 140.5862 39 1 0.0011509892 386 141.3295 36 1 0.0007665802 By default, the surviv al curv e estimates at observ ed failure times are returned. Estimates at the censored observ ed times are included if censored = TRUE is passed to the summary metho d for ‘ fitfrail ’ ob jects. The cumulativ e baseline hazard estimates are summarized b y parameter type = "cumhaz" . The estimates can also b e ev aluated at sp eciﬁc times passed to the summary metho d for ‘ fitfrail ’ ob jects through the Lambda.times parameter, demon- strated by: Journal of Statistical Softw are 15 R> summary(fit, type = "cumhaz", Lambda.times = c(20, 50, 80, 110)) time n.risk n.event cumhaz 1 20 600 0 0.00000000 2 50 566 34 0.03248626 3 80 439 127 0.33826069 4 110 274 147 1.69720757 3.4. Standard errors There are t wo wa ys the standard errors can b e obtained for a ﬁtted model. The cov ariance matrix of b γ , the estimators of the regression co eﬃcients and the frailty parameter, can b e obtained explicitly based on the sandwich-t yp e consistent estimator describ ed in Gorﬁne et al. ( 2006 ) and Zuc ker et al. ( 2008 ). The co v ariance matrix is calculated by the vcov function applied to the ‘ fitfrail ’ ob ject returned by fitfrail . Optionally , standard errors can also b e obtained in the call to fitfrail by passing se = TRUE . Using the ab o ve ﬁtted mo del, the co v ariance matrix of b γ is obtained by R> COV.est <- vcov(fit) R> sqrt(diag(COV.est)) Z1 Z2 theta.1 0.09343685 0.12673624 0.36020143 frailt ySurv can also estimate standard errors through a w eighted b o otstrap approach, in whic h the v ariance of both b γ and b Λ 0 are determined 3 . The weigh ted b o otstrap pro cedure consists of indep enden t and identically distributed p ositiv e random weigh ts applied to each cluster. This is in con trast to a nonparametric b ootstrap, wherein eac h b o otstrap sample consists of a random sample of clusters with replacement. The resampling pro cedure of the nonpara- metric b ootstrap usually yields an increased num b er of ties compared to the original data, whic h sometimes causes conv ergence problems. Therefore, w e adopt the weigh ted b ootstrap approac h which do es not c hange the num b er of tied observ ations in the original data. The w eighted b ootstrap is summarized as follows. 1. Sample n random v alues { v ∗ i , i = 1 , . . . , n } from an exp onen tial distribution with mean 1. Standardize the v alues by the empirical mean to obtain standardized weigh ts v 1 , . . . , v n . 2. In the estimation pro cedure, each function of the form P n i =1 h ( T i , δ i , Z i ) is replaced b e the corresp onding weigh ted function P n i =1 v i h ( T i , δ i , Z i ) , where T i = ( T i 1 , . . . , T im i ) , δ i = ( δ i 1 , . . . , δ im i ) , and Z i = ( Z i 1 , . . . , Z im i ) , i = 1 , . . . , n . 3. Rep eat Steps 1–2 B times and tak e the empirical v ariance (and co v ariance) of the B parameter estimates to obtain the weigh ted b ootstrap v ariance (and co v ariance). F or smaller datasets, this pro cess is generally more time-consuming than the explicit estima- tor. If the parallel pac kage is av ailable, all a v ailable cores are used to obtain the bo otstrap parameter estimates in parallel ( R Core T eam 2018 ). Without the parallel package, vcov runs in serial. 3 The sandwich estimator currently only provides the cov ariance matrix of b γ and not b Λ 0 . 16 frailt ySurv : General Semiparametric Shared F railt y Mo del in R R> set.seed(2015) R> COV.boot <- vcov(fit, boot = TRUE, B = 500) R> sqrt(diag(COV.boot))[1:8] Z1 Z2 theta.1 Lambda. 0.00000 0.0742560635 0.0984509739 0.2568936409 0.0000000000 Lambda. 23.37616 Lambda. 24.38503 Lambda. 25.14435 Lambda. 25.33731 0.0006340182 0.0010267995 0.0012781985 0.0014768459 In the preceding example, the full co v ariance matrix for  b γ , b Λ 0  is obtained. If only certain time p oin ts of the estimated cumulativ e baseline hazard function are desired, these can b e sp eciﬁed by the Lambda.times parameter. Since calls to the vcov metho d for ‘ fitfrail ’ ob- jects are typically computationally expensive, the results are cac hed when the same argumen ts are provided. 3.5. Control parameters Con trol parameters provided to fitfrail determine the sp eed, accuracy , and type of es- timates returned. The default control parameters to fitfrail are given b y calling the function fitfrail.control() . This returns a named list with the follo wing mem b ers. fitmethod : Parameter estimation pro cedure. Either "score" to solve the system of score equations or "loglik" to estimate using the log-likelihoo d. Default is "loglik" . abstol : Absolute tolerance for conv ergence. Default is 0 (ignored). reltol : Relative tolerance for con vergence. Default is 1e-6 . maxit : The maxim um n umber of iterations before terminating the estimation procedure. Default is 100. int.abstol : Absolute tolerance for numerical in tegration conv ergence. Default is 0 (ig- nored). int.reltol : Relative tolerance for n umerical in tegration con vergence. Default is 1. int.maxit : The maximum n umber of function ev aluations in numerical in tegration. Default is 1000. verbose : If verbose = TRUE , the parameter estimates and log-lik eliho o d are printed at each iteration. Default is FALSE . The parameters int.abstol , int.reltol , and int.maxit are only used for frailty distri- butions that require n umerical integration, as they sp ecify con vergence criteria of numerical in tegration in the estimation pro cedure inner loop. These con trol parameters can b e adjusted to obtain an sp eed-accuracy tradeoﬀ, whereb y lo wer int.abstol and int.reltol (and higher int.maxit ) yield more accurate numerical integration at the exp ense of more w ork p erformed in the inner lo op of the estimation pro cedure. The abstol , reltol , and maxit parameters sp ecify conv ergence criteria of the outer lo op of the estimation pro cedure. Similar to the numerical in tegration conv ergence parameters, these Journal of Statistical Softw are 17 can also b e adjusted to obtain a sp eed-accuracy tradeoﬀ using either estimation pro cedure ( fitmethod = "loglik" or fitmethod = "score" ). If fitmethod = "loglik" , con vergence is reached when the absolute or relative reduction in log-lik eliho o d is less than abstol or reltol , resp ectiv ely . Using fitmethod = "score" and sp ecifying abstol > 0 (with reltol = 0 ), con vergence is reached when the absolute v alue of each score equation is b elo w abstol . Alternativ ely , using fitmethod = "score" and sp ecifying reltol > 0 (with abstol = 0 ), con vergence is reac hed when the relative reduction of parameter estimates b γ is b elo w reltol . Note that with fitmethod = "score" , abstol and reltol corresp ond to parameters ftol and xtol of nleqslv ::nleqslv , resp ectively . The default con v ergence criteria w ere c hosen to yield approximately the same results with either estimation strategy . 3.6. Mo del ob ject The resulting mo del ob ject returned by fitfrail con tains the regression co eﬃcien ts’ v ector, the frailty distribution’s parameters, and the cumulativ e baseline hazard. Sp eciﬁcally: beta : Estimated regression co eﬃcien ts’ vector named by the input data columns. theta : Estimated frailty distribution parameter. loglik : The resulting log-lik eliho o d. Lambda : data.frame with the cumulativ e baseline hazard at the observ ed failure times. Lambda.all : data.frame with the cumulativ e baseline hazard at all observ ed times. Lambda.fun : Scalar R function that returns the cum ulative baseline hazard at any time p oin t. The mo del ob ject also contains some standard attributes, such as call for the function call. If se = TRUE w as passed to fitfrail , then the mo del ob ject will also con tain mem b ers se.beta and se.theta for the standard error of the regression co eﬃcien ts’ v ector and frailt y parameter estimates, resp ectiv ely . 4. Sim ulation As an empirical proof of implementation, and to demonstrate ﬂexibility , sev eral sim ulations w ere conducted. The simfrail function can b e used to run a v ariety of simulation settings. Sim ulations are run in parallel if the parallel package is a v ailable, and the mc.cores parameter sp eciﬁes how many pro cessor cores to use. F or example, R> set.seed(2015) R> sim <- simfrail(1000, + genfrail.args = alist(beta = c(log(2),log(3)), frailty = "gamma", + censor.rate = 0.30, N = 300, K = 2, theta = 2, + covar.distr = "uniform", covar.param = c(0, 1), + Lambda_0 = function(t, c = 0.01, d = 4.6) (c * t) ^ d), + fitfrail.args = alist( + formula = Surv(time, status) ~ Z1 + Z2 + cluster(family), + frailty = "gamma"), Lambda.times = 1:120) R> summary(sim) 18 frailt ySurv : General Semiparametric Shared F railt y Mo del in R Simulation: 1000 reps, 300 clusters (avg. size 2), gamma frailty Serial runtime (s): 9680.18 (9.68 +/- 1.53 per rep) beta.1 beta.2 theta.1 Lambda.30 Lambda.60 Lambda.90 value 0.6931 1.0986 2.0000 0.003933 0.09539 0.6159 mean.hat 0.6821 1.0929 1.9752 0.003995 0.09716 0.6236 sd.hat 0.2472 0.2529 0.2659 0.001876 0.02248 0.1387 mean.se 0.3130 0.3156 0.3442 NA NA NA cov.95CI 0.9890 0.9850 0.9780 NA NA NA The ab o ve results indicate that the empirical co v erage rates are reasonably close to the nom- inal 95% co verage rate. These results can also b e compared to the estimates obtained by coxph which applies the PPL approac h with gamma frailty mo del: R> set.seed(2015) R> sim.coxph <- simcoxph(1000, + genfrail.args = alist(beta = c(log(2), log(3)), frailty = "gamma", + censor.rate = 0.30, N = 300, K = 2, theta = 2, + covar.distr = "uniform", covar.param = c(0, 1), + Lambda_0 = function(t, c = 0.01, d = 4.6) (c * t) ^ d), + coxph.args = alist( + formula = Surv(time, status) ~ Z1 + Z2 + frailty.gamma(family)), + Lambda.times = 1:120) R> summary(sim.coxph) Simulation: 1000 reps, 300 clusters (avg. size 2), gamma frailty Serial runtime (s): 113.27 (0.11 +/- 0.02 per rep) beta.1 beta.2 theta.1 Lambda.30 Lambda.60 Lambda.90 value 0.6931 1.0986 2.0000 0.003933 0.09539 0.6159 mean.hat 0.6783 1.0913 1.9843 0.004003 0.09754 0.6282 sd.hat 0.2447 0.2522 0.2665 0.001869 0.02221 0.1375 mean.se 0.2456 0.2468 NA NA NA NA cov.95CI 0.9470 0.9440 NA NA NA NA The ab o ve output indicates that the frailt ySurv and PPL approach with gamma frailty dis- tribution provide similar results. Note that the theta.1 mean SE and cov erage rate are NA since coxph do es not provide the SE for the estimated frailty distribution parameter . The correlation b et ween regression co eﬃcien t and frailt y distribution parameter estimates of b oth metho ds is given by R> sapply(names(sim)[grepl("^hat.beta|^hat.theta", names(sim))], + function(name) cor(sim[[name]], sim.coxph[[name]])) hat.beta.1 hat.beta.2 hat.theta.1 0.9912442 0.9911590 0.9982390 The mean correlation b etw een cum ulative baseline hazard estimates is giv en b y Journal of Statistical Softw are 19 R> mean(sapply(names(sim)[grepl("^hat.Lambda", names(sim))], + function(name) cor(sim[[name]], sim.coxph[[name]])), na.rm = TRUE) [1] 0.9867021 F ull simulation results are provided in App endix B and include the following settings: gamma frailt y with v arious num b er of clusters; large cluster size; discrete observed times; oscillating baseline hazard; PVF frailty with ﬁxed and random cluster size; log-normal frailty; and inv erse Gaussian frailt y . It is evident that for all the av ailable frailty distributions our estimation pro cedure and implementation work very w ell in terms of bias, and the sandwic h-type v ariance estimator is dramatically improv ed as the cluster size increases (for example, from 2 to 6). The b ootstrap v ariance estimators are shown to b e accurate even with small cluster size. 5. Case study T o demonstrate the applicability of frailtySurv , results are obtained for tw o diﬀeren t datasets. The ﬁrst is a clinical dataset, for whic h sev eral benchmark results exist. The second is a hard driv e failure dataset from a large cloud backup storage pro vider. Both datasets are pro- vided with frailt ySurv as data("drs", package = "frailtySurv") and data("hdfail", package = "frailtySurv") , resp ectiv ely . 5.1. Diab etic Retinopath y Study The Diab etic Retinopathy Study (DRS) was p erformed to determine whether the onset of blindness in 197 high-risk diab etic patien ts could be dela yed by laser treatmen t ( The Diab etic Retinopath y Study Researc h Group 1976 ). The treatment was administered to one randomly- selected eye in each patien t, lea ving the other eye untreated. Th us, there are 394 observ ations whic h are clustered by patient due to unobserved patient-speciﬁc eﬀects. A failure o ccurred when visual acuity dropp ed to b elo w 5/200, and appro ximately 61% of observ ations are right- censored. All patients had a visual acuity of at least 20/100 at the b eginning of the study . A mo del with gamma shared frailty is estimated from the data. R> data("drs", package = "frailtySurv") R> fit.drs <- fitfrail(Surv(time, status) ~ treated + cluster(subject_id), + drs, frailty = "gamma") R> COV.drs <- vcov(fit.drs) R> fit.drs Call: fitfrail(formula = Surv(time, status) ~ treated + cluster(subject_id), dat = drs, frailty = "gamma") Covariate Coefficient treated -0.918 Frailty distribution gamma(0.876), VAR of frailty variates = 0.876 Log-likelihood -1005.805 Converged (method) 7 iterations, 1.36 secs (maximized log-likelihood) 20 frailt ySurv : General Semiparametric Shared F railt y Mo del in R −1.0 −0.5 0.0 0.5 1.0 1 2 3 4 5 6 7 Iteration Estimate P arameter beta.treated theta.1 P arameter estimate trace −1006.3 −1006.2 −1006.1 −1006.0 −1005.9 −1005.8 1 2 3 4 5 6 7 Iteration Log−liklihood Log−likelihood tr ace Figure 2: Parameter and log-likelihoo d trace. R> sqrt(diag(COV.drs)) treated theta.1 0.1975261 0.3782775 The regression co eﬃcien t for the binary treated v ariable is estimated to b e − 0 . 918 with 0 . 198 estimated standard error, whic h indicates a 60% decrease in hazard with treatment. The p v alue for testing the n ull hypothesis that the treatment has no eﬀect against a tw o sided alternative equals 3 . 5 × 10 − 6 (calculated by 2 * pnorm(-0.918/0.198) ). The parameter trace can b e plotted to determine the path taken by the optimization pro cedure, as follows (see Figure 2 ): R> plot(fit.drs, type = "trace") The long stretc h of nearly-constan t parameter estimates and log-lik eliho od indicates a lo cal maxim um in the ob jectiv e function. In general, a global optim um solution is not guaranteed with numerical tec hniques. The estimated baseline hazard with p oin t-wise 95% b ootstrapp ed conﬁdence interv als is given by (see Figure 3 ): R> set.seed(2015) R> plot(fit.drs, type = "cumhaz", CI = 0.95) where the seed is used to generate the weigh ts in the b ootstrap pro cedure of the cumulativ e baseline hazard plot function. Individual failures are shown b y the rug plot directly ab o ve the time axis. Note that any other CI interv al can b e sp eciﬁed by the CI parameter of the plot metho d for ‘ fitfrail ’ ob jects. Subsequen t calls to the vcov metho d for ‘ fitfrail ’ ob jects with the same arguments will use a cac hed v alue and a void rep eating the computationally- exp ensiv e b o otstrap or sandwich v ariance estimation pro cedures. F or comparison, the following results were obtained with coxph in the surviv al package based on the PPL approach: R> library("survival") R> coxph(Surv(time, status) ~ treated + frailty.gamma(subject_id), drs) Journal of Statistical Softw are 21 0.0 0.5 1.0 1.5 2.0 0 10 20 30 40 50 60 Time Cumulativ e baseline hazard Figure 3: Estimated baseline hazard with p oin t-wise 95% b ootstrapp ed conﬁdence interv als. Call: coxph(formula = Surv(time, status) ~ treated + frailty.gamma(subject_id), data = drs) coef se(coef) se2 Chisq DF p treated -0.910 0.174 0.171 27.295 1.0 1.7e-07 frailty.gamma(subject_id) 114.448 84.6 0.017 Iterations: 6 outer, 30 Newton-Raphson Variance of random effect= 0.854 I-likelihood = -850.9 Degrees of freedom for terms= 1.0 84.6 Likelihood ratio test=201 on 85.6 df, p=2.57e-11 n= 394 5.2. Hard drive failure A dataset of hard drive monitoring statistics and failure w as analyzed. Daily snapshots of a large bac kup storage provider ov er tw o y ears w ere made publicly av ailable 4 . On each da y , the Self-Monitoring, Analysis, and Rep orting T ec hnology (SMAR T) statistics of op erational driv es were recorded. When a hard drive was no longer op erational, it was marked as a failure and remov ed from the subsequent daily snapshots. New hard drives were also con tinuously added to the p opulation. In total, there are ov er 52,000 unique hard driv es o ver approximately t wo y ears of follo w-up and 2885 (5.5%) failures. The data must b e pre-pro cessed in order to extract the SMAR T statistics and failure time of eac h unique hard drive. In some cases, a hard drive fails to rep ort any SMAR T statistics up to sev eral days b efore failing and the most recen t SMAR T statistics b efore failing are recorded. The script for pre-pro cessing is publicly a v ailable 5 . Although there are 40 SMAR T statistics altogether, many (older) drives only rep ort a partial list. The current study is restricted to the co v ariates describ ed in T able 2 , whic h are presen t for all but one hard driv e in the dataset. 4 https://www.backblaze.com/hard- drive- test- data.html 5 https://github.com/vmonaco/frailtySurv- jss 22 frailt ySurv : General Semiparametric Shared F railt y Mo del in R Name Description temp Con tinuous co v ariate, which gives the in ternal temp erature in ◦ C . rer Binary cov ariate, where 1 indicates a non-zero rate of errors that o ccur in hardw are when reading from data from disk. rsc Binary cov ariate, where 1 indicates sectors that encoun tered read, write, or veriﬁcation errors. psc Binary cov ariate, where 1 indicates there w ere sectors w aiting to b e remapp ed due to an unrecov erable error. T able 2: Hard driv e failure cov ariates. The hard drive lifetimes are though t to b e clustered by mo del and man ufacturer. There are 85 unique mo dels ranging in capacity from 80 gigab ytes to 6 terabytes. The cluster sizes lo osely follo w a p o wer-la w distribution, with anywhere from 1 to ov er 15,000 hard drives of a particular mo del. F or a fair comparison, the hard drives of a single manufacturer were selected. The subset of W estern Digital hard drives consists of 40 diﬀeren t mo dels with 178 failures out of 3530 hard driv es. The hard drives are clustered by mo del, and cluster sizes range from 1 to 1190 with a mean of 88.25. A gamma shared frailty mo del w as ﬁtted to the data using the "score" ﬁt metho d and default conv ergence criteria. R> data("hdfail", package = "frailtySurv") R> hdfail.sub <- subset(hdfail, grepl("WDC", model)) R> fit.hd <- fitfrail( + Surv(time, status) ~ temp + rer + rsc + psc + cluster(model), + hdfail.sub, frailty = "gamma", fitmethod = "score") R> fit.hd Call: fitfrail(formula = Surv(time, status) ~ temp + rer + rsc + psc + cluster(model), dat = hdfail.sub, frailty = "gamma", fitmethod = "score") Covariate Coefficient temp -0.0145 rer 0.7861 rsc 0.9038 psc 2.4414 Frailty distribution gamma(1.501), VAR of frailty variates = 1.501 Log-likelihood -1305.134 Converged (method) 10 iterations, 15.78 secs (solved score equations) Bo otstrapped standard errors for the regression co eﬃcien ts and frailty distribution parameter are given by R> set.seed(2015) R> COV <- vcov(fit.hd, boot = TRUE) R> se <- sqrt(diag(COV)[c("temp", "rer", "rsc", "psc", "theta.1")]) R> se Journal of Statistical Softw are 23 0 1 2 3 0 219 438 657 876 1095 1314 1533 1752 1971 2190 Time Cumulativ e baseline hazard Figure 4: Estimated baseline h azard with 95% conﬁdence interv al. temp rer rsc psc theta.1 0.03095664 0.62533725 0.18956662 0.36142850 0.32433275 Signiﬁcance of the regression co eﬃcien t estimates are giv en b y their corresp onding p v alues, R> pvalues <- pnorm(abs(c(fit.hd$beta, fit.hd$theta)) / se, + lower.tail = FALSE) * 2 R> pvalues temp rer rsc psc theta.1 6.400162e-01 2.087038e-01 1.861996e-06 1.429667e-11 3.690627e-06 Only the estimated regression co eﬃcien ts of the reallocated sector count ( rsc ) and p ending sector count ( psc ) are statistically signiﬁcan t at the 0.05 level. Generally , SMAR T statistics are though t to be relativ ely weak predictors of hard drive failure ( Pinheiro, W eb er, and Barroso 2007 ). A hard drive is ab out t wice as likely to fail with at least one previous bad sector (given b y rsc > 0 ), while the hazard increases by a factor of 11 with the presence of bad sectors w aiting to b e remapped. The estimated baseline hazard with 95% CI is also plotted, up to 6 y ears, in Figure 4 . This time span includes all but one hard drive that failed after 15 years (mo del: WDC WD800BB). R> plot(fit.hd, type = "cumhaz", CI = 0.95, end = 365 * 6) 6. Discussion frailt ySurv provides a suite of functions for generating clustered surviv al data, ﬁtting shared frailt y mo dels under a wide range of frailty distributions, and visualizing the output. The semi-parametric mo del has better asymptotic prop erties than most existing implementations, including consisten t and asymptotically-normal estimators, which p enalized partial likelihoo d estimation lac ks. Moreo ver, this is the ﬁrst R package that implemen ts semi-parametric estimators with in verse Gaussian and PVF frailt y mo dels. The complete set of supp orted 24 frailt ySurv : General Semiparametric Shared F railt y Mo del in R frailt y distributions, including implementation details, are describ ed in App endix A . The ﬂexibilit y and robustness of data generation and mo del ﬁtting functions are demonstrated in App endix B through a series of sim ulations. The main limitation of frailt ySurv is the computational complexity , which is approximately an order of magnitude greater than PPL. Despite this, critical sections of co de hav e b een op- timized to provide reasonable p erformance for small and medium sized datasets. Sp eciﬁcally , frailt ySurv caches computationally-exp ensiv e results, parallelizes indep endent computations, and makes extensive use of natively-compiled C++ functions through the Rcpp R package ( Eddelbuettel and F rançois 2011 ). As a remedy for relatively larger computational complex- it y , control parameters allow for ﬁne-grained con trol ov er numerical integration and outer lo op conv ergence, leading to a sp eed-accuracy tradeoﬀ in parameter estimation. The runtime performance and sp eed-accuracy tradeoﬀ of core frailt ySurv functions are exam- ined empirically in App endix C . These sim ulations conﬁrm the O ( n ) complexit y of genfrail and O  n 2  complexit y of fitfrail using either log-likelihoo d maximization or normalized score equations. F railty distributions without analytic Laplace transforms hav e the additional o verhead of numerical integration inside the double-nested lo op, although the growth in run- time is comparable to those without numerical in tegration. Co v ariance matrix estimation also has complexity O  n 2  , dominated b y memory management and matrix op erations. In order to obtain a tradeoﬀ betw een sp eed and accuracy , the con v ergence criteria of the outer lo op estimation pro cedure and conv ergence of n umerical in tegration (for LN and IG frailty) can b e sp eciﬁed through parameters to fitfrail . A ccuracy of the regression co eﬃcien t estimates and frailty distribution parameter, as measured by the residuals, decreases as the absolute and relative reduction criteria in the outer lo op are relaxed (Figure 18 in App endix B ). The sim ulations also indicate a clear reduction in run time as numerical in tegration criteria are relaxed without a signiﬁcant loss in accuracy (Figure 19 in App endix B ). Cho osing a prop er frailty distribution is a challenging problem, although extensive sim ulation studies suggest that missp eciﬁcation of the frailt y distribution do es not aﬀect the bias and eﬃciency of the regression co eﬃcient estimators substantially , despite the observ ation that a diﬀeren t frailt y distribution could lead to appreciably diﬀeren t asso ciation structures ( Glid- den and Vittinghoﬀ 2004 ; Gorﬁne, De-Picciotto, and Hsu 2012 ). There are sev eral existing w orks on tests and graphical pro cedures for chec king the dep endence structures of clusters of size tw o ( Glidden 1999 ; Shih and Louis 1995 ; Cui and Sun 2004 ; Glidden 2007 ). How ever, implemen tation of these pro cedures requires substantial extension to the current package, whic h will b e considered in a separate w ork. A c kno wledgmen ts The authors w ould lik e to thank Go ogle, which partially funded dev elopment of frailtySurv through the 2015 Go o gle Summer of Co de , and NIH grants (R01CA195789 and P01CA53996). References Bern tsen J, Espelid TO, Genz A (1991). “An A daptive Algorithm for the Appro ximate Calculation of Multiple Integrals. ” A CM T r ansactions on Mathematic al Softwar e , 17 (4), 437–451. doi:10.1145/210232.210233 . Journal of Statistical Softw are 25 Bren t RP (2013). A lgorithms for Minimization without Derivatives . Courier Corp oration. Breslo w N (1974). “Co v ariance Analysis of Censored Surviv al Data. ” Biometrics , 30 (1), 89–99. doi:10.2307/2529620 . Brezger A, Kneib T, Lang S (2005). “ Ba yesX : Analyzing Bay esian Structured Additiv e Regres- sion Mo dels. ” Journal of Statistic al Softwar e , 14 (11), 1–22. doi:10.18637/jss.v014.i11 . Byrd RH, Lu P , Nocedal J, Zh u C (1995). “A Limited Memory Algorithm for Bound Con- strained Optimization. ” SIAM Journal on Scientiﬁc Computing , 16 (5), 1190–1208. doi: 10.1137/0916069 . Calhoun P , Su X, Nunn M, F an J (2018). “Constructing Multiv ariate Surviv al T rees: The MST P ackage for R . ” Journal of Statistic al Softwar e , 83 (12), 1–21. doi:10.18637/jss. v083.i12 . Clauset A, Shalizi CR, Newman MEJ (2009). “P ow er-Law Distributions in Empirical Data. ” SIAM R eview , 51 (4), 661–703. doi:10.1137/070710111 . Cla yton DG (1978). “A Mo del for Asso ciation in Biv ariate Life T ables and Its Application in Epidemiological Studies of F amilial T endency in Chronic Disease Incidence. ” Biometrika , 65 (1), 141–151. doi:10.1093/biomet/65.1.141 . Co x DR (1972). “Regression Mo dels and Life-T ables. ” Journal of the R oyal Statistic al So ciety B , 34 (2), 187–220. Co x DR (1975). “Partial Lik eliho od. ” Biometrika , 62 (2), 269–276. doi:10.1093/biomet/ 62.2.269 . Cui S, Sun Y (2004). “Checking for the Gamma F railty Distribution under the Marginal Prop ortional Hazards F railty Mo del. ” Statistic a Sinic a , 14 (1), 249–267. Dardis C (2016). survMisc : Misc el lane ous F unctions for Survival Data . R package v ersion 0.5.4, URL https://CRAN.R- project.org/package=survMisc . Do Ha I, Lee Y, kee Song J (2001). “Hierarc hical Likelihoo d Approach for F railty Mo dels. ” Biometrika , 88 (1), 233–233. doi:10.1093/biomet/88.1.233 . Do Ha I, Noh M, Lee Y (2018). frailtyHL : F r ailty Mo dels via H-Likeliho o d . R pac kage version 2.1, URL https://CRAN.R- project.org/package=frailtyHL . Donoh ue MC, Xu R (2017). phmm : Pr op ortional Hazar ds Mixe d-Eﬀe cts Mo dels . R pac kage v ersion 0.7-10, URL https://CRAN.R- project.org/package=phmm . Duc hateau L, Janssen P (2007). The F r ailty Mo del . Springer-V erlag. Eddelbuettel D, F rançois R (2011). “ Rcpp : Seamless R and C++ Integration. ” Journal of Statistic al Softwar e , 40 (8), 1–18. doi:10.18637/jss.v040.i08 . Genz AC, Malik A (1980). “Remarks on Algorithm 006: An Adaptiv e Algorithm for Numerical In tegration ov er an N-Dimensional Rectangular Region. ” Journal of Computational and A pplie d Mathematics , 6 (4), 295–302. doi:10.1016/0771- 050x(80)90039- x . 26 frailt ySurv : General Semiparametric Shared F railt y Mo del in R Gey er CJ (2018). aster : A ster Mo dels . R package version 0.9.1.1, URL https://CRAN. R- project.org/package=aster . Glidden D V (1999). “Checking the Adequacy of the Gamma F railty Mo del for Multiv ariate F ailure Times. ” Biometrika , 86 (2), 381–393. doi:10.1093/biomet/86.2.381 . Glidden D V (2007). “P airwise Dep endence Diagnostics for Clustered F ailure-Time Data. ” Biometrika , 94 (2), 371–385. doi:10.1093/biomet/asm024 . Glidden DV, Vittinghoﬀ E (2004). “Mo delling Clustered Surviv al Data from Multicentre Clinical T rials. ” Statistics in Me dicine , 23 (3), 369–388. doi:10.1002/sim.1599 . Go edman R, Grothendieck G, Højsgaard S, Pinkus A, Mazur G (2016). Ry acas : R Interfac e to the y acas Computer A lgebr a System . R package version 0.3-1, URL https://CRAN. R- project.org/package=Ryacas . Gorﬁne M, De-Picciotto R, Hsu L (2012). “Conditional and Marginal Estimates in Case- Con trol F amily Data – Extensions and Sensitivity Analyses. ” Journal of Statistic al Com- putation and Simulation , 82 (10), 1449–1470. doi:10.1080/00949655.2011.581669 . Gorﬁne M, Zuck er DM, Hsu L (2006). “Prosp ectiv e Surviv al Analysis with a General Semi- parametric Shared F railt y Mo del: A Pseudo F ull Likelihoo d Approach. ” Biometrika , 93 (3), 735–741. doi:10.1093/biomet/93.3.735 . Gu C (2014). “Smo othing Spline ANO V A Mo dels: R Pac kage gss . ” Journal of Statistic al Softwar e , 58 (5), 1–25. doi:10.18637/jss.v058.i05 . Hanagal DD (2009). “Mo deling Heterogeneit y for Biv ariate Surviv al Data by Po wer V ariance F unction Distribution. ” Journal of R eliability and Statistic al Studies , 2 (1), 14–27. Hasselman B (2017). nleqslv : Solve Systems of Nonline ar Equations . R pac kage version 3.3.1, URL https://CRAN.R- project.org/package=nleqslv . Hirsc h K, Wienke A (2012). “Soft ware for Semiparametric Shared Gamma and Log-Normal F railty Mo dels: An Overview. ” Computer Metho ds and Pr o gr ams in Biome dicine , 107 (3), 582–597. doi:10.1016/j.cmpb.2011.05.004 . Johnson SG (2013). cubature . C library version 1.0.2, URL http://ab- initio.mit.edu/ wiki/index.php/Cubature . Monaco JV, Gorﬁne M, Hsu L (2018). frailt ySurv : Gener al Semip ar ametric Shar e d F r ailty Mo del . R pac kage v ersion 1.3.5, URL https://CRAN.R- project.org/package= frailtySurv . Moriña D, Na v arro A (2014). “The R P ackage survsim for the Sim ulation of Simple and Complex Surviv al Data. ” Journal of Statistic al Softwar e , 59 (2), 1–20. doi:10.18637/jss. v059.i02 . Munda M, Rotolo F, Legrand C (2012). “ parfm : Parametric F railty Mo dels in R . ” Journal of Statistic al Softwar e , 51 (11), 1–20. doi:10.18637/jss.v051.i11 . Pinheiro E, W eb er WD, Barroso LA (2007). “F ailure T rends in a Large Disk Drive Popula- tion. ” In F AST , volume 7, pp. 17–23. Journal of Statistical Softw are 27 R Core T eam (2018). R : A L anguage and Envir onment for Statistic al Computing . R F oun- dation for Statistical Computing, Vienna, Austria. URL https://www.R- project.org/ . Ridout MS (2009). “Generating Random Numbers from a Distribution Sp eciﬁed by its Laplace T ransform. ” Statistics and Computing , 19 (4), 439–450. doi:10.1007/s11222- 008- 9103- x . Rondeau V, Mazroui Y, Gonzalez JR (2012). “ frailt ypack : An R Pac kage for the Analysis of Correlated Surviv al Data with F railty Mo dels Using Penalized Likelihoo d Estimation or Parametrical Estimation. ” Journal of Statistic al Softwar e , 47 (4), 1–28. doi:10.18637/ jss.v047.i04 . Shih JH, Louis T A (1995). “Inferences on the Asso ciation Parameter in Copula Mo dels for Biv ariate Surviv al Data. ” Biometrics , 51 (4), 1384–1399. doi:10.2307/2533269 . Sm yth G, Hu Y, Dunn P , Phipson B, Chen Y (2017). statmo d : Statistic al Mo deling . R pac kage v ersion 1.4.30, URL https://CRAN.R- project.org/package=statmod . The Diab etic Retinopathy Study Researc h Group (1976). “Preliminary Rep ort on Eﬀects of Photo coagulation Therapy . ” A meric an Journal of Ophthalmolo gy , 81 (4), 383–396. doi: 10.1016/0002- 9394(76)90292- 0 . Therneau TM (2018a). coxme : Mixe d Eﬀe cts Cox Mo dels . R pac kage v ersion 2.2-7, URL https://CRAN.R- project.org/package=coxme . Therneau TM (2018b). surviv al : A Package for Survival A nalysis in S . R package v ersion 2.42-3, URL https://CRAN.R- project.org/package=survival . Wienk e A (2010). F r ailty Mo dels in Survival A nalysis . CR C Press. doi:10.1201/ 9781420073911 . Zuc ker DM, Gorﬁne M, Hsu L (2008). “Pseudo-F ull Likelihoo d Estimation for Prosp ec- tiv e Surviv al Analysis with a General Semiparametric Shared F railty Mo del: Asymp- totic Theory . ” Journal of Statistic al Planning and Infer enc e , 138 (7), 1998–2016. doi: 10.1016/j.jspi.2007.08.005 . 28 frailt ySurv : General Semiparametric Shared F railt y Mo del in R A. F railt y distributions All the frailty distributions used in frailtySurv hav e supp ort ω ∈ (0 , ∞ ) . Identiﬁabilit y prob- lems are a voided b y constraining the parameters when necessary . The gamma and PVF hav e a closed-form analytic expression for the Laplace transform, while the log-normal and in verse Gaussian Laplace transforms must b e ev aluated numerically . Analytic deriv ativ es of the gamma and PVF Laplace transform were determined using the Ry acas R pac kage ( Go edman, Grothendiec k, Hø jsgaard, Pinkus, and Mazur 2016 ). The resulting symbolic expressions were v eriﬁed by comparison to n umerical results. All the frailty distribution functions hav e b oth R and C++ implementations, while the C++ functions are used in parameter estimation. The Rcpp R package provides an interface to compiled nativ e co de ( Eddelbuettel and F rançois 2011 ). Numerical in tegration is p erformed by h-adaptive cubature (multi-dimensional integra- tion o ver hypercub es), provided by the cubature C library ( Johnson 2013 ), whic h implements algorithms describ ed in Genz and Malik ( 1980 ) and Berntsen, Esp elid, and Genz ( 1991 ). F or the gamma, log-normal, and inv erse Gaussian, there is a p ositiv e relationship b et ween the distribution parameter θ and the strength of dep endence b et w een cluster members. As θ increases, intra-cluster failure-times dep endency increases. The opp osite is true for the PVF, and as θ increases, the dep endence b et ween failure-times of the cluster’s members decreases. F or frailt y distributions with closed-form Laplace transforms, frailt y v ariates are generated using a mo diﬁed Newton-Raphson algorithm for n umerical transform inv ersion ( Ridout 2009 ). Note that while frailtySurv can generate surviv al data from a p ositiv e stable (PS) frailty distribution with Laplace transform L ( s ) = exp ( − αs α /α ) where 0 < α < 1 , it cannot estimate parameters for this mo del since the PS has inﬁnite mean. F railty v alues from a log-normal distribution are generated in the usual wa y , and in verse Gaussian v ariates are generated using a transformation metho d in the statmo d package ( Sm yth, Hu, Dunn, Phipson, and Chen 2017 ). A.1. Gamma Gamma distribution, denoted by Gamma ( θ − 1 ) ≡ Gamma( θ − 1 , θ − 1 ) , is of mean 1 and v ariance θ . The frailtySurv package uses a one-parameter gamma distribution with shap e and scale b oth θ − 1 , so the density function b ecomes f ( ω ; θ ) = ω 1 θ − 1 exp  − ω θ  θ 1 θ Γ( 1 θ ) . (25) The sp ecial case with θ = 0 is the degenerate distribution in which ω ≡ 1 , i.e., there is no unobserv ed frailty eﬀect. Integrals in the log-likelihoo d function of Equation 20 can b e solved using the Laplace transform deriv atives, given b y L ( m ) ( s ) = ( − 1) m θ − 1 θ  θ − 1 + s  − ( 1 θ + m ) Γ  θ − 1 + m  / Γ  θ − 1  , m = 0 , 1 , 2 , . . . , (26) where L (0) = L . The ﬁrst and second deriv atives of the Laplace transform with resp ect to θ are also required for estimation. Due to their length, these expressions are omitted. See the deriv_lt_dgamma_r and deriv_deriv_lt_dgamma_r internal functions for the explicit expressions. Journal of Statistical Softw are 29 A.2. Po wer v ariance function The p o wer v ariance function distribution is denoted by PVF( θ , δ, θ ) and with density f ( ω ; θ , δ, µ ) = exp − µω + δ θ θ ! 1 π ∞ X k =1 Γ ( k θ + 1) k !  − 1 ω  θk +1 sin ( θ k π ) , (27) where 0 < θ ≤ 1 , µ ≥ 0 , δ > 0 . T o av oid identiﬁabilit y problems, w e let δ = µ = 1 as in Hanagal ( 2009 ), and get a one-parameter PVF density f ( ω ; θ ) = exp  − ω + θ − 1  1 π ∞ X k =1 Γ ( k θ + 1) k !  − 1 ω  θk +1 sin ( θ k π ) . (28) When θ = 1 , the degenerate distribution with ω ≡ 1 is obtained. PVF has exp ectation 1 and v ariance 1 − θ . The Laplace transform is giv en b y L ( s ) = exp h − n (1 + s ) θ − 1 o /θ i . (29) The Laplace transform deriv atives are given b y L ( m ) ( s ) = ( − 1) m L ( s ) m X j =1 c m,j ( θ ) (1 + s ) j θ − m , m = 1 , 2 , . . . (30) with co eﬃcien ts c m,m ( θ ) = 0 c m, 1 ( θ ) = Γ ( m − θ ) Γ (1 − θ ) c m,j ( θ ) = c m − 1 ,j − 1 ( θ ) + c m − 1 ,j ( θ ) { ( m − 1) − j θ } . The partial deriv ativ es of the Laplace transform with resp ect to θ are giv en b y ∂ ∂ θ L ( m ) ( s ) = ∂ ∂ θ   ( − 1) m L ( s ) m X j =1 c m,j ( θ ) (1 + s ) j θ − m   = ( − 1) m  ∂ ∂ θ L ( s )  m X j =1 c m,j ( θ ) (1 + s ) j θ − m + ( − 1) m L ( s ) m X j =1 ( ∂ ∂ θ c m,j ( θ ) (1 + s ) j θ − m + c m,j ( θ ) j (1 + s ) j θ − m ln (1 + s ) ) , (31) where ∂ ∂ θ L ( s ) = exp ( 1 − ( s + 1) θ θ ) ( − 1 − ( s + 1) θ θ 2 − ( s + 1) θ log ( s + 1) θ ) 30 frailt ySurv : General Semiparametric Shared F railt y Mo del in R and the partial deriv atives of the co eﬃcien ts are ∂ ∂ θ c m,m ( θ ) = 0 ∂ ∂ θ c m, 1 ( θ ) = Γ ( m − θ ) n ψ (0) (1 − θ ) − ψ (0) ( m − θ ) o Γ (1 − θ ) ∂ ∂ θ c m,j ( θ ) = ∂ ∂ θ c m − 1 ,j − 1 ( θ ) + ∂ ∂ θ c m − 1 ,j ( θ ) { ( m − 1) − j θ } − j c m − 1 ,j ( θ ) . A.3. Log-normal The log-normal distribution is denoted by LN( θ ) and with density function f ( ω ; θ ) = 1 ω √ θ 2 π exp ( − (ln ω ) 2 2 θ ) , (32) so the mean and v ariance are exp( θ / 2) and exp(2 θ ) − exp( θ ) , resp ectiv ely . The Laplace transform and its deriv atives equal L ( m ) ( s ) = Z ∞ 0 ( − ω ) m e − sω f ( ω ; θ ) dω , m = 0 , 1 , 2 , . . . . (33) Similar to the gamma distribution, the sp ecial case of θ = 0 implies that ω ≡ 1 . The density’s partial deriv ative with resp ect to θ is given b y ∂ ∂ θ f ( ω ; θ ) = ln 2 ( ω ) exp  − ln 2 ω 2 θ  2 √ 2 π θ 5 / 2 ω − exp  − ln 2 ω 2 θ  2 √ 2 π θ 3 / 2 ω . (34) A.4. Inv erse Gaussian The inv erse Gaussian distribution is denoted b y IG( θ ) , with mean 1 and v ariance θ . The densit y is given by f ( ω ; θ ) =  2 π θω 3  − 1 / 2 exp ( − ( ω − 1) 2 2 θ ω ) , (35) where θ > 0 . Th e Laplace transform and its deriv ativ es equal L ( m ) ( s ) = Z ∞ 0 ( − ω ) m e − sω f ( ω ; θ ) dω , m = 1 , 2 , . . . . (36) Similar to the gamma and log-normal, ω ≡ 1 when θ = 0 . The partial deriv ative of the densit y function with resp ect to θ is giv en b y ∂ ∂ θ f ( ω ; θ ) = ( ω − 1) 2 exp n − ( ω − 1) 2 2 θω o 2 √ 2 π θ 2 ω √ θ ω 3 − ω 3 exp n − ( ω − 1) 2 2 θω o 2 √ 2 π ( θ ω 3 ) 3 / 2 . Journal of Statistical Softw are 31 B. Sim ulation results All sim ulations w ere run with 1000 repetitions, n = 300 , ﬁxed cluster size with m = 2 mem b ers within each cluster, cov ariates sampled from U (0 , 1) , regression co eﬃcient v ector β = (log 2 , log 3) > , 30% censorship rate, and Λ 0 as in Equation 17 with c = 0 . 01 and d = 4 . 6 , unless otherwise sp eciﬁed. The same seed is used for eac h conﬁguration. F unction calls are omitted for brevity and can b e seen in the co de rep ository 6 . B.1. Benchmark sim ulation As a b enc hmark simulation, we consider gamma frailt y , with Gamma (2) . The results are summarized as follows: Simulation: 1000 reps, 300 clusters (avg. size 2), gamma frailty Serial runtime (s): 9812.27 (9.81 +/- 1.69 per rep) beta.1 beta.2 theta.1 Lambda.30 Lambda.60 Lambda.90 value 0.6931 1.0986 2.0000 0.003933 0.09539 0.6159 mean.hat 0.6821 1.0929 1.9752 0.003995 0.09716 0.6236 sd.hat 0.2472 0.2529 0.2659 0.001876 0.02248 0.1387 mean.se 0.3130 0.3156 0.3442 NA NA NA cov.95CI 0.9890 0.9850 0.9780 NA NA NA The cum ulative baseline hazard true and estimated functions, with 95% p oin t-wise conﬁdence in terv al, is sho wn in Figure 5 . Figure 6 indicates that by increasing the n umber of clusters, n , the bias and the v ariance of the estimators conv erge to zero, as exp ected. B.2. Large clusters Increasing cluster size improv es the estimated v ariances, esp ecially of the frailty distribution parameter’s estimator. The follo wing simulation results are of Gamma (2) , n = 100 and ﬁxed cluster size with m = 6 , see also Figure 7 . Simulation: 1000 reps, 100 clusters (avg. size 6), gamma frailty Serial runtime (s): 3665.11 (3.67 +/- 0.78 per rep) beta.1 beta.2 theta.1 Lambda.30 Lambda.60 Lambda.90 value 0.6931 1.0986 2.0000 0.003933 0.09539 0.6159 mean.hat 0.7077 1.0910 1.9979 0.003860 0.09554 0.6184 sd.hat 0.1958 0.2038 0.3100 0.001757 0.02191 0.1355 mean.se 0.2135 0.2182 0.3247 NA NA NA cov.95CI 0.9620 0.9660 0.9530 NA NA NA B.3. Discrete observ ation times Data generation allows for failure times to b e rounded with resp ect to a sp eciﬁed base. The observ ed follow-up times w ere rounded to the nearest multiple of 10. The follo wing sim ulation results indicate that even under the setting of ties, the empirical bias is reasonably small, and 6 https://github.com/vmonaco/frailtySurv- jss 32 frailt ySurv : General Semiparametric Shared F railt y Mo del in R 0.0 0.5 1.0 1.5 2.0 0 25 50 75 100 125 Time Cumulative baseline hazard Legend Actual Empirical (0.95 CI) Figure 5: Cum ulative baseline hazard true and estimated functions, with 95% p oin t-wise conﬁdence interv al. ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● beta.1 beta.2 theta.1 Lambda.30 Lambda.60 Lambda.90 25 50 250 500 25 50 250 500 25 50 250 500 25 50 250 500 25 50 250 500 25 50 250 500 −2 0 2 N Bias Figure 6: Distribution of the diﬀerence b etw een estimated and true parameters in dep endence of sample size. 0.0 0.5 1.0 1.5 2.0 0 25 50 75 100 125 Time Cumulative baseline hazard Legend Actual Empirical (0.95 CI) Figure 7: Cum ulative baseline hazard true and estimated functions, with 95% p oin t-wise conﬁdence interv al. Journal of Statistical Softw are 33 0 1 2 3 0 25 50 75 100 125 Time Cumulative baseline hazard Legend Actual Empirical (0.95 CI) Figure 8: Cum ulative baseline hazard true and estimated functions, with 95% p oin t-wise conﬁdence interv al. the empirical co verage rates of the conﬁdence interv als are reasonably close to the nominal lev el. S ee the results b elo w and Figure 8 . Simulation: 1000 reps, 300 clusters (avg. size 2), gamma frailty Serial runtime (s): 10160.15 (10.16 +/- 1.73 per rep) beta.1 beta.2 theta.1 Lambda.30 Lambda.60 Lambda.90 value 0.6931 1.0986 2.0000 0.003933 0.09539 0.6159 mean.hat 0.6646 1.0736 1.9733 0.008242 0.14299 0.8182 sd.hat 0.2457 0.2487 0.2630 0.003010 0.03230 0.1837 mean.se 0.3127 0.3147 0.3454 NA NA NA cov.95CI 0.9900 0.9850 0.9800 NA NA NA B.4. Oscillating baseline hazard Consider the baseline hazard function λ 0 ( t ) = a sin( bπ t ) n d ( ct ) d o t − 1 t > 0 (37) where a = 2 , b = 0 . 1 , c = 0 . 01 , and d = 4 . 6 . Suc h an oscillatory comp onen t may b e at ypical in surviv al data, but demonstrates the ﬂexibilit y of frailtySurv data generation and parameter estimation capabilities, as evident in the following simulation results (see also Figure 9 ). Simulation: 1000 reps, 300 clusters (avg. size 2), gamma frailty Serial runtime (s): 9560.76 (9.57 +/- 1.62 per rep) beta.1 beta.2 theta.1 Lambda.30 Lambda.60 Lambda.90 value 0.6931 1.0986 2.0000 0.005658 0.09050 0.7641 mean.hat 0.6810 1.0928 1.9747 0.005746 0.09226 0.7726 sd.hat 0.2462 0.2541 0.2656 0.002339 0.02150 0.1732 mean.se 0.3132 0.3157 0.3445 NA NA NA cov.95CI 0.9880 0.9840 0.9830 NA NA NA B.5. Po wer v ariance function frailt y P ow er v ariance function frailty , with PVF (0 . 3) is considered, and the sim ulation results are 34 frailt ySurv : General Semiparametric Shared F railt y Mo del in R 0.0 0.5 1.0 1.5 2.0 2.5 0 25 50 75 100 125 Time Cumulative baseline hazard Legend Actual Empirical (0.95 CI) Figure 9: Cum ulative baseline hazard true and estimated functions, with 95% p oin t-wise conﬁdence interv al. 0.0 0.5 1.0 1.5 2.0 0 25 50 75 100 125 Time Cumulative baseline hazard Legend Actual Empirical (0.95 CI) Figure 10: Cum ulativ e baseline hazard true and estimated functions, with 95% p oin t-wise conﬁdence interv al. summarized b elo w and in Figure 10 . Simulation: 1000 reps, 300 clusters (avg. size 2), pvf frailty Serial runtime (s): 9004.42 (9.00 +/- 2.01 per rep) beta.1 beta.2 theta.1 Lambda.30 Lambda.60 Lambda.90 value 0.6931 1.0986 0.3000 0.003933 0.09539 0.6159 mean.hat 0.6888 1.0899 0.3245 0.003990 0.09627 0.6223 sd.hat 0.2144 0.2127 0.1127 0.001756 0.01922 0.1124 mean.se 0.2643 0.2687 0.1266 NA NA NA cov.95CI 0.9750 0.9880 0.9670 NA NA NA B.6. Poisson cluster sizes Up un til now, the cluster sizes ha ve b een held constant. V arying cluster sizes are typical in, e.g., geographical clustering and family stud ies. Consider the case in which the family size is randomly sampled from a zero-truncated Poisson with 2 . 313 mean family size. The follo wing simulation results use PVF (0 . 3) . The results are v ery go od in terms of bias and the conﬁdence interv als’ co verage rates; see also Figure 11 . Simulation: 1000 reps, 300 clusters (avg. size 2.315), pvf frailty Journal of Statistical Softw are 35 0.0 0.5 1.0 1.5 2.0 0 25 50 75 100 125 Time Cumulative baseline hazard Legend Actual Empirical (0.95 CI) Figure 11: Cum ulativ e baseline hazard true and estimated functions, with 95% p oin t-wise conﬁdence interv al. Serial runtime (s): 13383.31 (13.38 +/- 2.56 per rep) beta.1 beta.2 theta.1 Lambda.30 Lambda.60 Lambda.90 value 0.6931 1.0986 0.30000 0.003933 0.09539 0.6159 mean.hat 0.6830 1.0777 0.31944 0.004006 0.09767 0.6219 sd.hat 0.1837 0.2020 0.09878 0.001620 0.01838 0.1030 mean.se 0.2361 0.2411 0.10604 NA NA NA cov.95CI 0.9870 0.9790 0.95700 NA NA NA B.7. Log-normal frailty In this simulation, LN (2) was used. The frailt y v ariance equals 47.2. See results b elo w an d Figure 12 . Simulation: 1000 reps, 300 clusters (avg. size 2), lognormal frailty Serial runtime (s): 68060.81 (68.06 +/- 15.07 per rep) beta.1 beta.2 theta.1 Lambda.30 Lambda.60 Lambda.90 value 0.6931 1.0986 2.0000 0.003933 0.09539 0.6159 mean.hat 0.6902 1.0794 1.9597 0.004173 0.09885 0.6282 sd.hat 0.2374 0.2416 0.3805 0.001634 0.02387 0.1280 mean.se 0.3402 0.3557 0.5066 NA NA NA cov.95CI 0.9950 0.9900 0.9650 NA NA NA B.8. Inv erse Gaussian frailty Finally , we used IG (2) , where the frailty v ariance equals 2. Simulation: 1000 reps, 300 clusters (avg. size 2), invgauss frailty Serial runtime (s): 83183.12 (83.18 +/- 17.43 per rep) beta.1 beta.2 theta.1 Lambda.30 Lambda.60 Lambda.90 value 0.6931 1.0986 2.0000 0.003933 0.09539 0.6159 mean.hat 0.6898 1.0862 1.9489 0.004077 0.09648 0.6203 sd.hat 0.2280 0.2305 0.6226 0.001855 0.02108 0.1328 36 frailt ySurv : General Semiparametric Shared F railt y Mo del in R 0.0 0.5 1.0 1.5 2.0 2.5 0 25 50 75 100 125 Time Cumulative baseline hazard Legend Actual Empirical (0.95 CI) Figure 12: Cum ulativ e baseline hazard true and estimated functions, with 95% p oin t-wise conﬁdence interv al. 0.0 0.5 1.0 1.5 2.0 0 25 50 75 100 125 Time Cumulative baseline hazard Legend Actual Empirical (0.95 CI) Figure 13: Cum ulativ e baseline hazard true and estimated functions, with 95% p oin t-wise conﬁdence interv al. mean.se 0.2692 0.2685 0.8916 NA NA NA cov.95CI 0.9840 0.9770 0.9520 NA NA NA See also Figure 13 . C. P erformance analysis R untime was measured by the R function system.time , which measures the CPU time to ev aluate an expression. All runs used 100 clusters of size 2, cov ariates sampled from U (0 , 1) , regression co eﬃcien t vector β = (log 2 , log 3) > , N (130 , 15) censorship distribution, Λ 0 as in Equation 17 with c = 0 . 01 and d = 4 . 6 , and 100 rep etitions of eac h conﬁguration, unless otherwise sp eciﬁed. The benchmark simulations were p erformed using a cluster of Red Hat 6.5 compute no des, eac h with 2 × 2.6 GHz Intel Sandy Bridge (8 core) pro cessors and 64 GB memory . C.1. Core functions The runtimes of frailtySurv functions genfrail and fitfrail , and the vcov metho d for ‘ fitfrail ’ ob jects were determined for increasing v alues of n , ranging from 50 to 200 in Journal of Statistical Softw are 37 Baseline hazard Cumulativ e baseline hazard Inv erse cumulative baseline hazard 0 1 2 3 4 0 1 2 3 0 1 2 3 50 100 150 200 N Runtime (s) Runtime (s) Runtime (s) Gamma LN IG PVF Figure 14: genfrail timings using eac h metho d of baseline hazard sp eciﬁcation for increasing v alues of n . It is most eﬃcient to sp ecify the inv erse cumulativ e baseline hazard to av oid solving for the ro ot in Equation 5 and ev aluating the integral in Equation 6 . incremen ts of 10. F or each function, the runtime was determined for eac h of the four frailty distributions and eac h estimation procedure, where applicable. The b o otstrap co v ariance run time, i.e., vcov for ‘ fitfrail ’ objects with boot = TRUE , was not analyzed since this con- sists primarily of rep etitions of the parameter estimation function, fitfrail . The resulting run times are shown in Figures 14 , 15 , and 16 , resp ectiv ely . Figure 14 shows the runtime of genfrail , whic h is linear in n , i.e., on the order of O ( n ) , although slop e v aries greatly dep ending on ho w the baseline hazard is sp eciﬁed. This is due to the amoun t of w ork that must b e performed p er observ ation. Sp ecifying the cumulativ e baseline hazard or in verse cumulativ e baseline hazard to genfrail results in nearly-constant run time. The linear increase in runtime is more apparent when the baseline hazard is sp eciﬁed since b oth ro ot ﬁnding and numerical in tegration must b e p erformed for each observ ation. The cumulativ e baseline hazard requires only ro ot-ﬁnding to b e p erformed, and the inv erse cum ulative baseline hazard has an analytic solution. The runtimes of fitfrail using each estimation pro cedure and frailt y distribution are shown in Figure 15 . Both estimation pro cedures (log-lik eliho o d reduction and normalized score equations) are on the order of O  n 2  due to the doubly-nested lo op. This complexity is more 38 frailt ySurv : General Semiparametric Shared F railt y Mo del in R Loglikelihood Score 0 20 40 60 0 20 40 60 50 100 150 200 N Runtime (s) Runtime (s) Gamma LN IG PVF Figure 15: fitfrail timings using each estimation metho d for increasing v alues of n . The run time for frailt y distributions requiring numerical integration (inv erse Gaussian and log- normal) grows quick er than those with analytic Laplace transforms (gamma and PVF). 0 20 40 60 50 100 150 200 N Runtime (s) Gamma LN IG PVF Figure 16: Timings for the vcov metho d for ‘ fitfrail ’ ob jects for increasing v alues of n using the analytic cov ariance estimator, i.e., with boot = FALSE . apparen t for the log-normal and inv erse Gaussian frailt y distributions, which b oth ha v e the additional ov erhead of numerical integration. Gamma and PVF frailty distributions ha ve analytic Laplace transforms, thus numerical integration is not performed in the cumulativ e baseline hazard estimation inner lo op. Finally , the runtimes of the vcov metho d for ‘ fitfrail ’ ob jects for each frailt y distribution are shown in Figure 16 . This function is also on the order of O  n 2  , and the sandwic h v ariance estimation pro cedure is dominated by memory management and matrix op erations to compute the Jacobian. As a result, the runtimes of frailt y distributions that require n umerical in tegration (LN and IG) are only marginally larger than those that do not (gamma and PVF). Journal of Statistical Softw are 39 fitfrail co xph frailtyP enal 0 20 40 60 0.00 0.05 0.10 0.15 0.20 0 5 10 15 20 50 100 150 200 N Runtime (s) Runtime (s) Runtime (s) Gamma LN Figure 17: Comparison of frailty mo del estimation run times using frailt ySurv ::fitfrail , surviv al ::coxph , and frailt ypack ::frailtyPenal for increasing v alues of n . fitfrail uses fitmethod = "score" , coxph uses default parameters, and frailtyPenal uses n.knots = 10 and kappa = 2 . C.2. Comparison to other packages The run time of fitfrail using fitmethod = "score" is compared to the functions coxph and frailtyPenal for increasing v alues of n . The runtimes for gamma and log-normal frailty distributions are determined, since this is the largest in tersection of frailty distributions that all three functions supp ort. The resulting run times are sho wn in Figure 17 . Each estimation pro cedure exhibits quadratic complexity on a diﬀerent scale. coxph remains roughly an order of magnitude quick er than fitfrail and frailtyPenal for log-normal frailty . frailtyPenal exhibits a large diﬀerence in p erformance b et w een frailt y distributions. C.3. Sp eed-accuracy tradeoﬀ A sp eed-accuracy tradeoﬀ is achiev ed b y v arying the conv ergence control parameters of the outer lo op estimation pro cedure and n umerical in tegration in the inner lo op. The abstol and reltol parameters con trol the outer lo op conv ergence, and int.abstol and int.reltol con trol the con vergence of the adaptive cubature numerical in tegration in the inner lo op. 40 frailt ySurv : General Semiparametric Shared F railt y Mo del in R 0 50 100 150 −0.2 −0.1 0.0 0.1 0.2 −1.0 −0.5 0.0 0.5 1.0 10 0 10 − 1 10 − 2 10 − 3 10 − 4 10 − 5 10 − 6 10 − 7 10 − 8 10 − 9 10 0 10 − 1 10 − 2 10 − 3 10 − 4 10 − 5 10 − 6 10 − 7 10 − 8 10 − 9 abstol reltol Runtime (s) β − β θ − θ Gamma LN IG PVF Figure 18: Sp eed and accuracy curv es obtained by v arying th e outer lo op con vergence con trol parameters, abstol and reltol . As abstol is v aried, reltol is set to 0 (i.e., it is ignored), and vice versa. b β − β and b θ − θ are the residuals of regression co eﬃcien t and frailt y distribution parameter estimates, resp ectiv ely . Shaded areas cov er the 95% conﬁdence interv als. Sp eed is measured b y the run time of fitfrail , and accuracy is measured by the estimated parameter residuals. In this set of simulations, the scalar regression co eﬃcien t β = log 2 , is used. F railt y distri- bution parameters are chosen such that κ = 0 . 3 ( β = 0 . 857 for gamma, β = 1 . 172 for LN, β = 2 . 035 for IG, and β = 0 . 083 for PVF). With N (130 , 15) right censorship distribution, this results in censoring rates 0.25 for gamma and PVF, 0.16 for LN, and 0.30 for IG. Both the runtime and residuals for β and θ are rep orted using the "score" ﬁt metho d. Figure 18 shows the sp eed and accuracy curves for increasing v alues of abstol and reltol on a log scale, taking on v alues in { 10 − 9 , . . . , 10 0 } . The runtime and residuals remain approxi- mately constant up to 10 − 3 for b oth abstol and reltol . Beyond 10 − 3 , the tradeoﬀ b et w een run time and accuracy is apparen t, esp ecially for frailty distributions requiring n umerical inte- gration. As the con vergence criterion is relaxed, run time decreases and a bias is in tro duced to the parameter estimates. Run times for gamma and PVF frailty are nearly iden tical as these frailt y distributions do not require n umerical in tegration. Journal of Statistical Softw are 41 0 200 400 600 −0.10 −0.05 0.00 0.05 0.10 −0.2 −0.1 0.0 0.1 0.2 10 0 10 − 1 10 − 2 10 − 3 10 − 4 10 − 5 10 − 6 10 − 7 10 − 8 10 − 9 10 0 10 − 1 10 − 2 10 − 3 10 − 4 10 − 5 10 − 6 10 − 7 10 − 8 10 − 9 int.abstol int.reltol Runtime (s) β − β θ − θ LN IG Figure 19: Sp eed and accuracy curves obtained b y v arying the inner lo op n umerical integra- tion conv ergence control parameters, int.abstol and int.reltol . As int.abstol is v aried, int.reltol is set to 0 (i.e., it is ignored), and vice v ersa. b β − β and b θ − θ are the residuals of regression co eﬃcien t and frailt y distribution parameter estimates, resp ectiv ely . Shaded areas co ver the 95% conﬁdence interv als. Figure 19 shows the sp eed and accuracy curves obtained if the parameters int.abstol and int.reltol are v aried ov er the same set of v alues for LN and IG frailty distributions. On this scale, the decrease in run time is appro ximately linear, while residuals of the regression co eﬃcien t and frailty distribution parameters do not signiﬁcan tly change. W e v eriﬁed that the same b eha vior o ccurs using fitmethod = "loglik" . This suggests that the estimation pro cedure is robust to lo w-precision n umerical integration with whic h signiﬁcan tly faster run times can b e ac hieved. A strategy for parameter estimation on larger datasets might then b e to ﬁrst ﬁt a mo del with a high v alue of int.abstol or int.reltol and iterativ ely decrease the numerical integration conv ergence until the parameter estimates do not change. 42 frailt ySurv : General Semiparametric Shared F railt y Mo del in R Aﬃliation: John V. Monaco Departmen t of Computer Science Na v al Postgraduate School Mon terey , CA 93943, United States of America E-mail: vinnie.monaco@nps.edu URL: http://www.vmonaco.com/ Malka Gorﬁne Departmen t of Statistics and Op erations Research T el A viv Universit y Ramat A viv T el A viv, 6997801, Israel E-mail: gorfinem@post.tau.ac.il URL: http://www.tau.ac.il/~gorfinem/ Li Hsu Public Health Sciences Division Biostatistics and Biomathematics Program F red Hutchinson Cancer Researc h Cen ter 1100 F airview A v e. N., M2-B500 Seattle, W A 98109-1024, United States of America E-mail: lih@fredhutch.org URL: https://www.fredhutch.org/en/labs/profiles/hsu- li.html Journal of Statistical Software http://www.jstatsoft.org/ published by the F oundation for Op en Access Statistics http://www.foastat.org/ A ugust 2018, V olume 86, Issue 4 Submitte d: 2015-10-06 doi:10.18637/jss.v086.i04 A c c epte d: 2017-10-22

General Semiparametric Shared Frailty Model Estimation and Simulation with frailtySurv

Original Paper

Comments & Academic Discussion

Leave a Comment

Original Paper

Related Papers

Comments & Academic Discussion

Leave a Comment