Estimating Zero-inflated Negative Binomial GAMLSS via a Balanced Gradient Boosting Approach with an Application to Antenatal Care Data from Nigeria

Alexandra Daub 1,* and Elisabeth Bergherr 1

1 Chair of Spatial Data Science and Statistical Learning, Georg-August-Universität Göttingen, Platz der Göttinger Sieben 3, 37073 Göttingen, Germany
* Correspondence: alexandra.daub@uni-goettingen.de

Abstract

Statistical boosting algorithms are renowned for their intrinsic variable selection and enhanced predictive performance compared to classical statistical methods, making them especially useful for complex models such as generalized additive models for location, scale and shape (GAMLSS). Boosting this model class can suffer from imbalanced updates across the distribution parameters as well as long computation times. Shrunk optimal step lengths have been shown to address these issues. To examine the influence of socio-economic factors on the distribution of the number of antenatal care visits in Nigeria, we generalize boosting of GAMLSS with shrunk optimal step lengths to base-learners beyond simple linear models and to a more complex response variable distribution. In an extensive simulation study and in the application we demonstrate that shrunk optimal step lengths yield a more balanced regularization of the overall model and enhance computational efficiency across diverse settings, in particular in the presence of base-learners penalizing the size of the fit.

Keywords: Antenatal Care; Gradient Boosting; GAMLSS; Regularized Regression; Step Length; Variable Selection.

1 Introduction

Generalized additive models for location, scale and shape (GAMLSS; Rigby and Stasinopoulos, 2005) have received increasing attention in recent years (Kneib, 2013; Klein, 2024), being able to model parameters beyond the mean and to incorporate a variety of effects.
They extend generalized additive models (GAMs; Hastie and Tibshirani, 1990) in that all distribution parameters are modeled based on an additive predictor instead of only one. Due to their ability to model the response variable distribution in a very flexible manner, GAMLSS are used in a variety of fields. GAMLSS are, for instance, recommended for estimating child growth curves, where they are used to model the distribution of characteristics like height and weight as a function of age (World Health Organization, 2006). Beyond biostatistical settings, GAMLSS have also been applied in fields such as finance and for modeling weather and climate-related measurements (Ganegoda and Evans, 2012; Villarini et al., 2010). While typically estimated using penalized maximum likelihood, alternative estimation techniques have emerged to take advantage of their particular strengths, one of which is statistical boosting (Mayr et al., 2012a; Thomas et al., 2018). In statistical boosting algorithms (Friedman, 2001; Bühlmann and Yu, 2003), regression models with a low degree of freedom are iteratively fitted to the current (pseudo-)residuals, resulting in an ensemble of weak learners as the final model. Stopping the iterative updating scheme early allows the method to perform variable selection as well as coefficient shrinkage, often enhancing predictive performance compared to traditional methods. Because of this ability to yield sparse models, statistical boosting algorithms are particularly advantageous for complex model classes such as GAMLSS. The increased model complexity, however, also poses challenges when boosting GAMLSS. While in general the non-cyclical algorithm introduced by Thomas et al. (2018) is favorable, it has been shown to be prone to imbalanced predictor updates and long computation times (Zhang et al., 2022; Daub et al., 2025).
These issues arise when the potential updates of the different submodels are on different scales, causing the selection procedure to favor updates of certain submodels over others. They originate from structural differences in the sizes of the negative gradients (hence reflecting the structure of the respective likelihood) and occur, for example, in a Gaussian location and scale model with a large variance (Zhang et al., 2022). One way to address such imbalances between submodels is to apply adaptive rather than fixed step lengths (Zhang et al., 2022; Daub et al., 2025), whereby the adaptive step lengths compensate for otherwise small update sizes and thus enable a fairer update selection. A more detailed explanation of this topic is provided in section 2.

To date, adaptive step lengths have only been applied for boosting GAMLSS with two-parameter response variable distributions and simple linear base-learners. Zhang et al. (2022) consider a Gaussian location and scale model with simple linear base-learners and propose using shrunk optimal step lengths to address the balancing issue. Daub et al. (2025) investigate a different type of adaptive step length, in addition to shrunk optimal step lengths, for negative binomial and Weibull distributed response variables. In this work, we generalize this approach with respect to two aspects. First and most importantly, a wider variety of base-learner types is considered, specifically for non-linear, categorical and spatial effects, extending beyond simple linear base-learners. Second, we apply the boosting algorithm to a zero-inflated negative binomial model for location, scale and shape (ZINB-GAMLSS), which has a more complex response variable distribution with substantial dependencies among its parameters.
The motivation for this extension is to enable a more sophisticated investigation of the relationship between antenatal care utilization in Nigeria and socio-economic, demographic and contextual characteristics of the mother. With 560 deaths per 100,000 live births reported in 2013, compared to 16 deaths per 100,000 live births in OECD member countries (World Health Organization et al., 2014), Nigeria is among the countries with the highest maternal mortality worldwide. Moreover, antenatal care in the country is insufficiently utilized: 69.8% of the women receive fewer than the recommended minimum number of seven antenatal care visits for pregnancies without complications, and 34.2% report having received no antenatal care at all (National Population Commission (NPC) [Nigeria] and ICF International, 2014; World Health Organization, 2016). Since antenatal care enables the identification of pregnancy-related and delivery-related risks and offers timely and appropriate interventions, strengthening antenatal care utilization is regarded as a key strategy for reducing maternal mortality (World Health Organization, 2016). To investigate the effects on antenatal care utilization, we consider data from the 2013 Nigeria Demographic and Health Survey (National Population Commission (NPC) [Nigeria] and ICF International, 2014; Gayawan and Omolofe, 2016). In addition to the number of antenatal care visits, it comprises, inter alia, data on the mother's education, her age at birth, access to mass media and the region of residence.

From a statistical perspective, the number of antenatal care visits is characterized by excess zeros and pronounced overdispersion. As receiving no antenatal care is especially problematic and the factors associated with this outcome are therefore of great interest, a zero-inflated negative binomial model for location, scale and shape is considered.
This model framework makes it possible to distinguish between the covariate effects on the conditional mean and those on the zero-inflation probability, and accommodates a variety of effect types, which is advantageous given the presence of continuous, categorical and spatial covariates. In addition to the main effects, interaction terms are included to assess whether specific covariate combinations are particularly problematic. As this gives rise to 83 potential effects for each distribution parameter, an estimation method with intrinsic variable selection and regularization is required, for which statistical boosting is a prominent choice. When applying non-cyclical boosting with fixed step lengths, we encountered the mentioned problems of long run times and imbalanced predictor updates.

The remainder of this article is structured as follows. Section 2 introduces GAMLSS as well as component-wise gradient boosting and expands on how using shrunk optimal step lengths improves the balancedness of the estimated model. In addition, the special properties of shrunk optimal step lengths when combined with penalized base-learners as well as an alternative numerical approach to obtain optimal step lengths for a ZINB-GAMLSS are presented. Section 3 contains simulation results for a Gaussian location and scale model with different effect types as well as for the zero-inflated negative binomial model. In section 4 the results from modeling the number of antenatal care visits in Nigeria are presented, and section 5 concludes with an overview of our findings and a discussion.

2 Boosting Generalized Additive Models for Location, Scale and Shape

2.1 Generalized Additive Models for Location, Scale and Shape

In GAMLSS, the response variable Y is assumed to follow a distribution D whose parameters θ_1, ..., θ_K all depend on covariates X = (X_1, ..., X_J)^⊤ or suitable subsets thereof (Rigby and Stasinopoulos, 2005).
To account for the parameter space restrictions of D, each distribution parameter θ_k, k ∈ {1, ..., K}, is related to a structured additive predictor η_{θ_k} through a bijective link function g_{θ_k}, such that g_{θ_k}(θ_k) = η_{θ_k} (Fahrmeir et al., 2013). For given covariate information X = x, the conditional distribution of Y is specified as Y | X = x ∼ D(θ_1(x), ..., θ_K(x)) with

  g_{θ_k}(θ_k(x)) = η_{θ_k}(x) = β_{0,θ_k} + Σ_{j=1}^J f_{j,θ_k}(x_j),   k ∈ {1, ..., K},

where f_{j,θ_k} is the assumed type of effect of covariate X_j, j ∈ {1, ..., J}, on predictor η_{θ_k}. Motivated by the application, the effect types we will focus on in this work are:

• linear effects, i.e. f(x_j) = x_j β_{j,θ_k}
• non-linear effects modeled via penalized splines with a B-spline basis of degree three and a second-order difference penalty (Eilers and Marx, 1996)
• categorical effects modeled as a joint dummy-coded linear effect
• discrete spatial effects represented by a Markov random field with a first-order neighborhood structure (Rue and Held, 2005)

For a zero-inflated negative binomial response variable, for example, the three distribution parameters to be modeled are θ_1 = µ, θ_2 = α and θ_3 = π, where π represents the probability that the response variable Y is generated by a zero process, while µ and α correspond to the location and scale parameters of the negative binomial distribution that generates Y otherwise. Using the canonical link functions g_µ = ln, g_α = ln and g_π = logit, the zero-inflated negative binomial model for location, scale and shape (ZINB-GAMLSS) is Y | X = x ∼ ZINB(µ(x), α(x), π(x)) with

  µ(x) = exp( β_{0,µ} + Σ_{j=1}^J f_{j,µ}(x_j) )
  α(x) = exp( β_{0,α} + Σ_{j=1}^J f_{j,α}(x_j) )
  π(x) = logit^{−1}( β_{0,π} + Σ_{j=1}^J f_{j,π}(x_j) ).
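To make this parametrization concrete, the link inversions and the resulting ZINB probability mass function can be sketched in a few lines of code (an illustrative Python sketch; the function names are our own and not part of any of the cited packages):

```python
import math

def inv_logit(eta):
    """Inverse logit (logistic) function mapping the real line to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))

def zinb_params(eta_mu, eta_alpha, eta_pi):
    """Map the three additive predictors to the ZINB parameters via the
    canonical links g_mu = ln, g_alpha = ln, g_pi = logit, i.e. the
    inverse links exp, exp and inv_logit."""
    return math.exp(eta_mu), math.exp(eta_alpha), inv_logit(eta_pi)

def zinb_pmf(y, mu, alpha, pi):
    """ZINB probability mass function in the (mu, alpha) parametrization
    of the text: P(Y=0) = pi + (1-pi)*(1 + alpha*mu)^(-1/alpha) and
    P(Y=y) = (1-pi) * NB(y; mu, alpha) for y > 0."""
    if y == 0:
        return pi + (1.0 - pi) * (1.0 + alpha * mu) ** (-1.0 / alpha)
    # negative binomial pmf with size r = 1/alpha and mean mu
    r = 1.0 / alpha
    log_pmf = (math.lgamma(y + r) - math.lgamma(y + 1) - math.lgamma(r)
               + r * math.log(r / (r + mu)) + y * math.log(mu / (r + mu)))
    return (1.0 - pi) * math.exp(log_pmf)
```

For η_µ = η_α = η_π = 0 this gives µ = α = 1 and π = 0.5, and the probabilities sum to one over the support, matching the log-likelihood terms used in section 2.5.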
2.2 Model-based Boosting

As part of the statistical learning toolbox, model-based boosting arose from the field of machine learning (Freund and Schapire, 1996) but gained much attention in the statistical community (Hastie et al., 2009), as it can be applied to estimate statistical models (Friedman et al., 2000; Friedman, 2001). Advantages of boosting compared to classical statistical methods include its applicability to high-dimensional data and an intrinsic variable selection (Bühlmann, 2006; Mayr et al., 2014b). Boosting relies on an iterative estimation procedure to minimize a predefined loss function, where weak learners, often referred to as base-learners, are fitted in every iteration. The overall model is obtained by combining these weak learners into an ensemble (Mayr et al., 2014a). In statistical boosting, the base-learners are statistical models of low complexity, for example simple linear models or splines with a low degree of freedom. For fitting (generalized) additive models, each covariate x_j is assigned a base-learner that represents the corresponding effect f_j, j ∈ {1, ..., J}. The fitted base-learners selected across all iterations m ∈ {1, ..., m_max}, which are typically denoted by ĥ_j^[m], j ∈ {1, ..., J}, and each only capture part of the effect f_j, are then linearly combined to form the effect estimate f̂_j.

This work focuses on component-wise gradient boosting (Bühlmann and Yu, 2003; Bühlmann and Hothorn, 2007), which will be referred to as boosting from this point on. In order to minimize the loss function, in this boosting approach the base-learners are fitted to the negative gradient vector. In every iteration the model is only updated by the best-fitting base-learner with respect to a least squares criterion, and the fitted base-learner is scaled by a small step length factor in order to avoid overfitting. In general, base-learners can range from simple linear models to trees.
Here, we focus on regression-type base-learners that take the form of a penalized linear model. The fitted base-learners ĥ_j, j ∈ {1, ..., J}, applied to the observed covariate values x_j, can thus be expressed as

  ĥ_j(x_j) = X_j ( X_j^⊤ X_j + λ̃_j K_j )^{−1} X_j^⊤ u,   (2.1)

where u is the negative gradient evaluated at the model in the current iteration, X_j and K_j are the design and penalty matrix of the base-learner associated with h_j, and λ̃_j is the respective penalty parameter. For example, when modeling a linear effect f_j, X_j = (1, x_j) and λ̃_j = 0. An overview of different base-learner types can be found in Hofner et al. (2012).

Algorithm 1 Non-cyclical component-wise gradient boosting algorithm for GAMLSS
 1: Initialize the predictors η̂_{θ_1}^[0], ..., η̂_{θ_K}^[0] with offset values.
 2: for m = 1 to m_stop do
 3:   for k = 1 to K do
 4:     Compute the estimated negative gradient vector
 5:       u_{θ_k}^[m] = ( −∂/∂η_{θ_k} ℓ( g_{θ_1}^{−1}(η_{θ_1}), ..., g_{θ_K}^{−1}(η_{θ_K}); y ) |_{η_{θ_1} = η̂_{θ_1}^[m−1](x_i), ..., η_{θ_K} = η̂_{θ_K}^[m−1](x_i), y = y_i} )_{i=1,...,n}.
 6:     Fit every base-learner to the negative gradient vector:
 7:       ( x_{ij}, u_{θ_k,i}^[m] )_{i=1,...,n} —base-learner→ ĥ_{j,θ_k}^[m] for j = 1, ..., J
 8:     Select the best-fitting base-learner ĥ_{j*_{θ_k},θ_k}^[m], where
 9:       j*_{θ_k} = argmin_{j ∈ {1,...,J}} Σ_{i=1}^n ( u_{θ_k,i}^[m] − ĥ_{j,θ_k}^[m](x_{ij}) )².
10:     Choose the step length ν_{θ_k}^[m], e.g. ν_{θ_k}^[m] = 0.1.
11:     Compute the update candidate
12:       η̂_{θ_k}^{c[m]} = η̂_{θ_k}^[m−1] + ν_{θ_k}^[m] ĥ_{j*_{θ_k},θ_k}^[m].
13:   end for
14:   Select the best update candidate η̂_{θ_{k*}}^{c[m]}, where
15:     k* = argmin_{k ∈ {1,...,K}} −Σ_{i=1}^n ℓ( g_{θ_1}^{−1}(η̂_{θ_1}^[m−1](x_i)), ..., g_{θ_k}^{−1}(η̂_{θ_k}^{c[m]}(x_i)), ..., g_{θ_K}^{−1}(η̂_{θ_K}^[m−1](x_i)); y ).
16:   Set
17:     η̂_{θ_k}^[m] = η̂_{θ_k}^{c[m]} if k = k*, and η̂_{θ_k}^[m] = η̂_{θ_k}^[m−1] otherwise.
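The fit in (2.1) is an ordinary penalized least-squares computation. A minimal sketch (with hypothetical helper names; the simple linear case serves as the example):

```python
import numpy as np

def fit_base_learner(X, u, K=None, lam=0.0):
    """Penalized least-squares fit of a base-learner to the negative
    gradient u, following (2.1): h_j(x_j) = X (X'X + lam*K)^{-1} X' u.
    With lam = 0 this reduces to an ordinary linear fit."""
    p = X.shape[1]
    if K is None:
        K = np.eye(p)  # identity penalty as a default
    coef = np.linalg.solve(X.T @ X + lam * K, X.T @ u)
    return X @ coef, coef

# toy example: simple linear base-learner (intercept + covariate)
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=200)
u = 2.0 * x + rng.normal(scale=0.1, size=200)   # pseudo-residuals
X = np.column_stack([np.ones_like(x), x])
fitted, coef = fit_base_learner(X, u)           # unpenalized: lam = 0
_, coef_pen = fit_base_learner(X, u, K=np.eye(2), lam=100.0)  # ridge-shrunk
```

The penalized fit (`coef_pen`) shrinks the slope toward zero relative to the unpenalized one, which is exactly the "size of the fit" regularization discussed for penalized base-learners in section 2.4.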
18: end for

This generic gradient-based boosting algorithm is aimed at estimating a model with a single additive predictor. When boosting GAMLSS, the negative log-likelihood −ℓ(θ_1, ..., θ_K; y) is considered as the loss function and K different additive predictors η_{θ_k} = g_{θ_k}(θ_k) have to be fitted (Mayr et al., 2012a; Thomas et al., 2018). The gradient-boosting routine outlined above is, therefore, performed for every additive predictor η_{θ_k}, k ∈ {1, ..., K}. In particular, in iteration m the negative gradient

  u_{θ_k}^[m] = ( −∂/∂η_{θ_k} ℓ( g_{θ_1}^{−1}(η_{θ_1}), ..., g_{θ_K}^{−1}(η_{θ_K}); y ) |_{η_{θ_1} = η̂_{θ_1}^[m−1](x_i), ..., η_{θ_K} = η̂_{θ_K}^[m−1](x_i), y = y_i} )_{i=1,...,n}

is computed, all base-learners are fitted to the negative gradient vector and the best fitted base-learner ĥ_{θ_k}^{*[m]} is determined. Based on the selected base-learner, the update candidate η̂_{θ_k}^[m−1] + ν ĥ_{θ_k}^{*[m]} is computed, where ν represents the step length. In order to ensure a fair base-learner selection, all base-learners should exhibit a comparable degree of flexibility. This is typically achieved by assigning the same number of equivalent degrees of freedom to the base-learners (Kneib et al., 2009).

Once all update candidates are determined, the predictor η̂_{θ_{k*}} that is updated in iteration m is selected by comparing the negative log-likelihood of models with a single update candidate and the remaining predictors set to the values from the previous iteration, i.e.

  k* = argmin_{k ∈ {1,...,K}} −Σ_{i=1}^n ℓ( g_{θ_1}^{−1}(η̂_{θ_1}^[m−1](x_i)), ..., g_{θ_k}^{−1}(η̂_{θ_k}^[m−1](x_i) + ν ĥ_{θ_k}^{*[m]}(x_{i j*_{θ_k}})), ..., g_{θ_K}^{−1}(η̂_{θ_K}^[m−1](x_i)); y_i ).

The model is then only updated with respect to this best update candidate, which is called non-cyclical boosting.
The non-cyclical boosting algorithm is displayed in Algorithm 1, where we deliberately leave the step length ν more flexible than originally introduced by Thomas et al. (2018).

The iterative update procedure in boosting algorithms is typically stopped before convergence, as this yields several advantages. First, early stopping induces coefficient shrinkage and often improves the predictive performance compared to a fully converged model. In addition, if not all covariate effects are selected until the stopping iteration, an intrinsic variable selection is achieved, resulting in a sparser model. With the step length typically set to a small, predefined value (Bühlmann and Hothorn, 2007), the stopping iteration m_stop is the main tuning parameter of boosting algorithms. It is commonly obtained via cross-validation or by other criteria based on out-of-sample predictive performance. Alternatively, information criteria can be applied to find an appropriate stopping iteration (Mayr et al., 2012b).

In non-cyclical boosting algorithms, the relative importance of the additive predictors is determined intrinsically by the sequence of selected predictor updates. As a result, only a single stopping iteration has to be tuned, as in single-predictor models. This constitutes a key difference to the cyclical updating scheme (Mayr et al., 2012a), in which all additive predictors are updated sequentially in each iteration. In cyclical boosting, an individual stopping iteration is assigned to each additive predictor, which has the drawback of a more complex and computationally intensive tuning procedure (Hofner et al., 2016).
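To make the updating scheme of Algorithm 1 concrete, the following is a deliberately minimal sketch of non-cyclical boosting for a Gaussian location-scale model with simple linear base-learners (no intercept base-learner) and a fixed step length. It illustrates the mechanics only and is not the gamboostLSS implementation; all names are our own:

```python
import numpy as np

def nll(y, eta_mu, eta_sig):
    """Negative Gaussian log-likelihood (up to constants) with
    identity link for mu and log link for sigma."""
    sig = np.exp(eta_sig)
    return np.sum(np.log(sig) + (y - eta_mu) ** 2 / (2.0 * sig ** 2))

def boost_gaussian_lss(y, X, m_stop=300, nu=0.1):
    n, p = X.shape
    eta_mu = np.full(n, y.mean())           # offset initialization
    eta_sig = np.full(n, np.log(y.std()))
    coefs = {"mu": np.zeros(p), "sigma": np.zeros(p)}
    for _ in range(m_stop):
        sig2 = np.exp(2.0 * eta_sig)
        # negative gradients of the loss w.r.t. the two predictors
        grads = {"mu": (y - eta_mu) / sig2,
                 "sigma": (y - eta_mu) ** 2 / sig2 - 1.0}
        best = None
        for par, u in grads.items():
            # least-squares fit of every simple linear base-learner
            b = X.T @ u / (X ** 2).sum(axis=0)
            sse = ((u[:, None] - X * b) ** 2).sum(axis=0)
            j = int(np.argmin(sse))         # best-fitting base-learner
            upd = nu * b[j] * X[:, j]       # fixed step length nu
            loss = nll(y,
                       eta_mu + upd if par == "mu" else eta_mu,
                       eta_sig + upd if par == "sigma" else eta_sig)
            if best is None or loss < best[0]:
                best = (loss, par, j, nu * b[j], upd)
        _, par, j, delta, upd = best        # non-cyclical: one update only
        coefs[par][j] += delta
        if par == "mu":
            eta_mu = eta_mu + upd
        else:
            eta_sig = eta_sig + upd
    return coefs

rng = np.random.default_rng(42)
n = 500
X = rng.uniform(-1, 1, size=(n, 4))         # x1 informative, x2-x4 noise
y = 1.5 * X[:, 0] + rng.normal(scale=1.0, size=n)
coefs = boost_gaussian_lss(y, X)
```

In this toy run the informative µ-coefficient is recovered while the noise coefficients stay near zero, illustrating the intrinsic variable selection described above.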
2.3 Non-cyclical Boosting with Shrunk Optimal Step Lengths

Determining the relative importance of the additive predictors intrinsically has limitations when the negative gradients are on different scales, as differences in the gradient size ‖u_{θ_k}‖_2 directly translate to the size of the fitted base-learners ĥ_{θ_k}^* evaluated at x_{j*_{θ_k}}, k ∈ {1, ..., K}. Additive predictors associated with larger gradients therefore have an advantage in the likelihood-based comparison of update candidates due to their larger update size ‖ν ĥ_{θ_k}^*(x_{j*_{θ_k}})‖_2 (Algorithm 1, line 15). As a result, these additive predictors tend to be selected more often for an update (as long as they have not converged yet), which leads to differences in convergence speed and degree of regularization across predictors at the joint stopping iteration. In the final model, the degree of coefficient shrinkage then differs among the additive predictors and the variable selection deteriorates.

Fig. 1 (a) illustrates this problem. Either the more slowly converging additive predictor (η_µ in this example) misses informative covariates and exhibits a strong coefficient shrinkage (dashed exemplary stopping iteration), or the faster converging predictor (η_σ here) includes non-informative covariates and shows almost no shrinkage in the informative effects (dotted exemplary stopping iteration), or a combination of the two (stopping iterations in between). In the second case, the algorithm needs many iterations until η_µ is adequately fitted, which leads to a long run time.

A way to address these issues is to replace the fixed step length ν by a shrunk optimal step length ν_{θ_k}^[m], which can vary across additive predictors and iterations. In the context of GAMLSS, this was first introduced by Zhang et al. (2022) for boosting a Gaussian location and scale model with linear effects.
Figure 1: Coefficient paths for a Gaussian location and scale model in the simulation setting without additional non-linear effect (3.1). Dark blue paths represent informative and gray paths uninformative effects. The dashed and dotted vertical lines represent potential stopping iterations. Panel (a): fixed step lengths; panel (b): shrunk optimal step lengths.

The shrunk optimal step length for distribution parameter θ_k in iteration m is defined as

  ν_{θ_k}^[m] = λ ν_{θ_k}^{*[m]}   with
  ν_{θ_k}^{*[m]} = argmin_ν −Σ_{i=1}^n ℓ( g_{θ_1}^{−1}(η̂_{θ_1}^[m−1](x_i)), ..., g_{θ_k}^{−1}(η̂_{θ_k}^[m−1](x_i) + ν ĥ_{θ_k}^{*[m]}(x_{i j*_{θ_k}})), ..., g_{θ_K}^{−1}(η̂_{θ_K}^[m−1](x_i)); y_i ),   (2.2)

where λ represents a shrinkage factor, e.g. λ = 0.1. When using a shrunk optimal step length, the size of the candidate update ν_{θ_k} ĥ_{θ_k}^* evaluated at x_{j*_{θ_k}} reflects its potential to reduce the loss function. Thereby, structurally small fitted base-learners can be compensated, and the size of candidate updates corresponding to a small negative gradient is increased. This has two implications. First, the candidate updates are of similar size and thus a fairer decision of which additive predictor is updated can be achieved. Second, given that a candidate update corresponding to small negative gradients is selected, the updates are larger and fewer updates are required until convergence. This leads to a larger relative convergence speed of additive predictors with a small negative gradient and thus a more similar convergence speed among predictors. Due to the more balanced convergence behavior, using shrunk optimal step lengths results in a similar degree of regularization among additive predictors and an improved variable selection in the overall model (see Fig. 1 (b)).
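The inner optimization in (2.2) is a one-dimensional problem. As a stand-in for any specific likelihood, it can be solved by a generic golden-section line search; the following illustrative sketch uses a Gaussian toy loss whose optimum is known in closed form (function names are hypothetical):

```python
import numpy as np

def optimal_step(loss_along, lo=-50.0, hi=50.0, tol=1e-6):
    """Golden-section search for nu* = argmin_nu loss_along(nu) on
    [lo, hi]; a stand-in for the inner optimization in (2.2)."""
    phi = (np.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        c = b - phi * (b - a)   # two interior probe points
        d = a + phi * (b - a)
        if loss_along(c) < loss_along(d):
            b = d               # minimum lies in [a, d]
        else:
            a = c               # minimum lies in [c, b]
    return (a + b) / 2.0

# Gaussian toy case: for L(nu) = 0.5 * ||y - eta - nu*h||^2 the
# optimum is nu* = <y - eta, h> / <h, h> in closed form.
rng = np.random.default_rng(0)
y = rng.normal(size=100)
eta = np.zeros(100)
h = rng.normal(size=100)
loss = lambda nu: 0.5 * np.sum((y - eta - nu * h) ** 2)
nu_star = optimal_step(loss)
nu_shrunk = 0.1 * nu_star   # shrinkage factor lambda = 0.1
```

Section 2.5 discusses why such a search interval can be hard to choose well in practice and how a Newton-type root finder avoids the issue.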
Increasing the convergence speed of originally slowly converging additive predictors, moreover, reduces the number of iterations required until the stopping criterion is reached and therefore the overall run time. The absolute convergence speed can be adjusted by modifying the shrinkage factor λ. The shrinkage factor is subject to the usual trade-off between the size of the step length and the stopping iteration that is also observed for fixed step lengths (Friedman, 2001; Bühlmann and Hothorn, 2007).

2.4 Combining Shrunk Optimal Step Lengths with Non-linear Base-learners

Until now, shrunk optimal step lengths have only been combined with simple linear base-learners. In principle, the outlined mechanism of compensating for small negative gradients by increasing the update size of a fitted base-learner works in the same way for other base-learner types. There is, however, one important difference when considering penalized base-learners like splines, Markov random fields or jointly dummy-coded categorical effects with reduced equivalent degrees of freedom. If the coefficient size is regularized by a base-learner, its potential for improving the loss function in the direction of the fit ĥ_j(x_j) is not exhausted.
This is picked up by the optimal step length, leading to a larger step length than for unpenalized base-learners of the same additive predictor.

Figure 2: Shrunk optimal step lengths for varying levels of penalization λ̃ of the base-learner representing the categorical effect (columns) in the Gaussian simulation setting (3.1).

Figure 3: Comparison of the penalty parameter and the mean step length in the first 100 iterations of a simulation run for base-learners representing different effects (columns) in the Gaussian simulation setting (3.1). For the explicit specification of the base-learners, see Sect. 3.

The magnitude of the difference between the optimal step lengths depends on the strength of the penalty. Fig. 2 illustrates this association for a jointly dummy-coded effect with five categories and varying levels of penalization. More specifically, the relationship between the level of the optimal step length and the penalty parameter λ̃ of a penalized linear base-learner is approximately linear, as illustrated in Fig. 3 for a Gaussian location and scale model and different base-learner types. This relationship, however, only holds this distinctly when considering the same base-learner. Across base-learner types and covariates, the optimal step length is influenced by multiple additional factors beyond the penalty parameter.
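The effect shown in Fig. 2 can be reproduced in a small numerical experiment: for a squared-error loss the optimal step length has the closed form ν* = ⟨u, ĥ⟩/⟨ĥ, ĥ⟩, and evaluating it for a ridge-penalized dummy-coded base-learner at increasing penalties shows the step length growing with λ̃ (an illustrative sketch; all names are our own):

```python
import numpy as np

def ridge_fit(X, u, lam):
    """Penalized base-learner fit X (X'X + lam*I)^{-1} X' u,
    as in (2.1) with an identity penalty matrix."""
    p = X.shape[1]
    return X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ u)

def optimal_step_gaussian(u, h):
    """Closed-form optimal step length for the squared-error loss
    L(nu) = 0.5 * ||u - nu*h||^2, i.e. nu* = <u, h> / <h, h>."""
    return float(np.dot(u, h) / np.dot(h, h))

rng = np.random.default_rng(4)
n = 300
X = np.eye(5)[rng.integers(0, 5, n)]    # dummy-coded effect, 5 categories
u = rng.normal(size=n) + X @ np.array([1.0, -1.0, 0.5, 0.0, -0.5])
steps = {lam: optimal_step_gaussian(u, ridge_fit(X, u, lam))
         for lam in (0.0, 400.0, 800.0, 1200.0)}
```

For λ̃ = 0 the base-learner fit is an orthogonal projection and the optimal step length is exactly one; the stronger the ridge penalty shrinks the fit, the larger the step length that compensates for it.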
With the equivalent degrees of freedom of a penalized linear base-learner associated with h_j defined as (Hastie and Tibshirani, 1990)

  df_j(λ̃_j) = trace( X_j ( X_j^⊤ X_j + λ̃_j K_j )^{−1} X_j^⊤ ),

the relationship between the number of equivalent degrees of freedom and the corresponding optimal step length cannot be isolated as cleanly and depends on the concrete base-learner specification and the data. For example, when reducing the number of equivalent degrees of freedom of a categorical effect, the penalty parameter and thus the optimal step length is comparatively large, as the penalty matrix K is an identity matrix exclusively penalizing the size of the coefficients. A Markov random field base-learner with the same number of regions would have a lower λ̃ and thus a lower optimal step length, as not solely the size but also the variation of effects of neighboring regions is penalized.

The positive relationship between the degree of penalization and the optimal step length constitutes an additional advantage of shrunk optimal step lengths when penalized base-learners are included in the model. With fixed step lengths, penalized base-learners require more iterations until the effect is sufficiently fitted. Under strong penalization, this leads to a long run time and may even affect the variable selection. Reducing the penalization mitigates these issues, but simultaneously gives the base-learner an advantage over less flexible base-learners. As shrunk optimal step lengths increase the update size after the base-learner selection, the number of iterations needed until the base-learner is sufficiently fitted is reduced, while a fair base-learner selection can be maintained.

2.5 Computing Shrunk Optimal Step Lengths for ZINB-GAMLSS

Optimal step lengths can be determined in different ways. Most commonly, optimal step lengths are obtained numerically via line search as proposed by Friedman (2001) and Zhang et al. (2022).
Using a line search, however, has two notable drawbacks. First, an appropriate search interval must be specified that contains the optimum while not being unnecessarily large, as overly wide intervals increase the computational cost and may lead to numeric instabilities. Choosing a suitable upper bound can be challenging, in particular when optimal step lengths are large for certain distribution parameters or when strongly penalized base-learners are included in the boosting specification. Second, performing a line search to determine the optimal step length for each update candidate in every iteration substantially increases the computation time, which is especially pronounced for complex loss functions and large search intervals. As an alternative, analytical expressions or analytical approximations for optimal step lengths can be derived, which has been done for some response variable distributions and distribution parameters (Zhang et al., 2022; Daub et al., 2025). However, as the complexity of the distribution increases, the expressions of the approximated shrunk optimal step lengths get very lengthy and the accuracy of the approximations degrades. Deriving analytical approximations of the optimal step lengths is, therefore, not suitable here.

To avoid the outlined issues associated with a line search and analytical approximations, we determine the optimal step lengths using Newton's method applied to the first-order condition of the optimization problem in (2.2). The method is implemented using the R package nleqslv (Hasselman, 2023) and yields very similar results as a line search while being computationally more efficient. Since optimal step lengths for a given base-learner and additive predictor tend to be similar across boosting iterations, the Newton algorithm is initialized using the optimal step length from the iteration in which the base-learner was selected last.
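A generic warm-started Newton iteration of this kind might look as follows (an illustrative sketch, not the nleqslv-based implementation; the derivative of the first-order condition is approximated by central finite differences here, and the toy check uses a Gaussian case with a known root):

```python
import numpy as np

def newton_step_length(dloss, nu0=1.0, tol=1e-10, max_iter=50, eps=1e-6):
    """Newton's method on the first-order condition dloss(nu) = 0 of the
    step-length optimization; nu0 warm-starts at the step length from
    the iteration in which the base-learner was last selected (1.0 if
    it has never been selected)."""
    nu = nu0
    for _ in range(max_iter):
        g = dloss(nu)
        gp = (dloss(nu + eps) - dloss(nu - eps)) / (2.0 * eps)
        if abs(gp) < 1e-14:
            break                # derivative vanishes; give up
        step = g / gp
        nu -= step
        if abs(step) < tol:
            break                # converged
    return nu

# Gaussian-mu check where the root is known in closed form:
# dL/dnu = -sum h_i (y_i - eta_i - nu*h_i)  =>  nu* = <y-eta, h>/<h, h>
rng = np.random.default_rng(3)
y = rng.normal(size=100)
eta = np.zeros(100)
h = rng.normal(size=100)
dloss = lambda nu: -np.sum(h * (y - eta - nu * h))
nu_star = newton_step_length(dloss, nu0=1.0)
```

For the ZINB-GAMLSS, `dloss` would be the first-order condition (2.3) (and its analogues for α and π) rather than this Gaussian toy.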
If a base-learner has not yet been updated, the initial value is set to one.

The first-order conditions are derived from the negative log-likelihood of the ZINB-GAMLSS, given by

  −Σ_{i=1}^n ℓ(µ_i, α_i, π_i; y_i) = −Σ_{i: y_i=0} ln( π_i + (1 − π_i)(1 + α_i µ_i)^{−1/α_i} )
    − Σ_{i: y_i>0} [ ln(1 − π_i) + y_i ln( α_i µ_i / (1 + α_i µ_i) ) − (1/α_i) ln(1 + α_i µ_i) + ln Γ(y_i + 1/α_i) − ln Γ(y_i + 1) − ln Γ(1/α_i) ],

where ln Γ denotes the log-gamma function. For the optimal step length of µ in iteration m we, for example, obtain

  0 = −∂/∂ν_µ Σ_{i=1}^n ℓ( exp(η̂_{µ,i}^[m−1] + ν_µ ĥ_{µ,i}^{*[m]}), α̂_i^[m−1], π̂_i^[m−1]; y_i ) |_{ν_µ = ν_µ^{*[m]}}

    = Σ_{i: y_i=0} [ exp(η̂_{µ,i}^[m−1] + ν_µ^{*[m]} ĥ_{µ,i}^{*[m]}) ĥ_{µ,i}^{*[m]} ] / [ (π̂_i^[m−1]/(1 − π̂_i^[m−1])) (1 + α̂_i^[m−1] exp(η̂_{µ,i}^[m−1] + ν_µ^{*[m]} ĥ_{µ,i}^{*[m]}))^{1/α̂_i^[m−1] + 1} + (1 + α̂_i^[m−1] exp(η̂_{µ,i}^[m−1] + ν_µ^{*[m]} ĥ_{µ,i}^{*[m]})) ]

    − Σ_{i: y_i>0} [ ĥ_{µ,i}^{*[m]} ( y_i − exp(η̂_{µ,i}^[m−1] + ν_µ^{*[m]} ĥ_{µ,i}^{*[m]}) ) ] / [ 1 + α̂_i^[m−1] exp(η̂_{µ,i}^[m−1] + ν_µ^{*[m]} ĥ_{µ,i}^{*[m]}) ],   (2.3)

where for notational convenience we define η̂_{θ_k,i}^[m−1] = η̂_{θ_k}^[m−1](x_i), ĥ_{θ_k,i}^{*[m]} = ĥ_{θ_k}^{*[m]}(x_i) and θ̂_{k,i}^[m] = θ̂_k^[m](x_i). The first-order conditions for ν_α^{*[m]} and ν_π^{*[m]} can be derived analogously. Further details on the derivation of equation (2.3) as well as the remaining first-order conditions are provided in the Supplement.

3 Simulation Study

To evaluate the performance of the non-cyclical boosting algorithm with shrunk optimal step lengths, we conducted a simulation study with two response variable distributions and different effect types, which are motivated by our application.
In a first Gaussian setting, we investigate combining shrunk optimal step lengths with base-learners other than simple linear ones, focusing on the selection and degree of shrinkage of the effects. Secondly, we consider a zero-inflated negative binomial response variable in order to investigate whether shrunk optimal step lengths yield a more balanced regularization and variable selection behavior among additive predictors also for this model. In both settings, the results for shrunk optimal step lengths are compared to those from the fixed step length approach, using a fixed step length and shrinkage factor of 0.1. To determine the stopping iteration, we apply a robust 10-fold cross-validation variant that selects the earliest iteration within a 2% range of the minimum, as this yields more stable results in the presence of a flat likelihood. In each simulation setting, 100 runs are conducted. Estimation via boosting is carried out with our extension of the R add-on package gamboostLSS (Hofner et al., 2016). For comparison, the models are also estimated by maximum likelihood using the R add-on package gamlss (Stasinopoulos et al., 2008).

3.1 Gaussian setting

In the first setting, we generate n = 500 observations y_i from Y_i ∼ N(µ(x_i), σ(x_i)) with

  µ(x_i) = η_µ(x_i) = −1.5 x_{1i} + 2.5 x_{2i} + 1.5 x_{3i} − 2.5 x_{4i} + f_µ(z_i)
  log(σ(x_i)) = η_σ(x_i) = 2 + 0.2 x_{3i} + 0.5 x_{4i} − 0.2 x_{5i} − 0.5 x_{6i} + f_σ(z_i).   (3.1)

The covariates x_{1i}, x_{3i}, x_{5i} are independently drawn from U(−1, 1) and x_{2i}, x_{4i}, x_{6i} are independent realizations of B(1, 0.5), i ∈ {1, ..., n}. In addition, 20 non-informative covariates are included, where we have X_{7i}, X_{9i}, ..., X_{25i} ∼ U(−1, 1) and X_{8i}, X_{10i}, ..., X_{26i} ∼ B(1, 0.5). The function f represents different additional effects: either a categorical effect, a non-linear effect or a discrete spatial effect.
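The data-generating process of (3.1) can be sketched as follows (with the additional effect f set to zero for simplicity; the function name is our own):

```python
import numpy as np

def simulate_gaussian_setting(n=500, n_noise=20, seed=0):
    """Data-generating process of the Gaussian location-scale
    simulation (3.1), with the additional effect f omitted."""
    rng = np.random.default_rng(seed)
    # informative covariates: x1, x3, x5 uniform; x2, x4, x6 Bernoulli
    xu = rng.uniform(-1, 1, size=(n, 3))
    xb = rng.binomial(1, 0.5, size=(n, 3))
    x1, x3, x5 = xu.T
    x2, x4, x6 = xb.T
    eta_mu = -1.5 * x1 + 2.5 * x2 + 1.5 * x3 - 2.5 * x4
    eta_sigma = 2.0 + 0.2 * x3 + 0.5 * x4 - 0.2 * x5 - 0.5 * x6
    y = rng.normal(eta_mu, np.exp(eta_sigma))
    # 20 non-informative covariates, alternating uniform and Bernoulli
    noise = np.column_stack([
        rng.uniform(-1, 1, n) if j % 2 == 0 else rng.binomial(1, 0.5, n)
        for j in range(n_noise)])
    X = np.column_stack([x1, x2, x3, x4, x5, x6, noise])
    return y, X
```

Note the large intercept of 2 in η_σ: with σ(x) ≈ exp(2) the gradient u_µ becomes very small, which is exactly the imbalance this setting is designed to expose.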
These are considered as different settings; details on the specification of these effects are provided in the Supplement. When an informative categorical or non-linear effect is included, a non-informative effect is added as well. For the discrete spatial effect, the informative and non-informative effect are considered separately. Boosting is performed using linear base-learners with intercept for covariates $X_1, \ldots, X_{26}$, a linear base-learner with jointly dummy-coded categories for the categorical effect, a centered P-spline of degree three with a second-order difference penalty and 20 inner knots for the non-linear effect, and a centered Markov random field base-learner with the corresponding neighborhood structure for the discrete spatial effect. To ensure a fair base-learner selection, the effective degrees of freedom are set to two for all base-learners (Hofner et al., 2011). As this setting is designed to show the imbalancedness problem outlined in the previous section, the data exhibit high variability and one of the gradients ($u_\mu$) is very small. Consequently, the model is generally difficult to estimate, and the variability in the estimated effects is comparatively large for all methods. Note that in balanced settings, fixed and shrunk optimal step lengths yield very similar results (Zhang et al., 2022). Fig. 4 shows the distribution of the coefficient estimates in the setting with categorical effects. Note that shrinkage of the estimated effects is typically intended when applying boosting and can be reduced by stopping the algorithm later. With respect to the relative regularization across additive predictors, we find that the fixed step length approach results in relatively weak regularization for $\sigma$ and considerably stronger regularization for $\mu$.
The use of shrunk optimal step lengths increases the regularization of $\sigma$ and reduces the regularization of $\mu$, leading to a considerably smaller imbalance in the degree of regularization between the two additive predictors. These findings are in line with the results of Zhang et al. (2022). Comparing categorical with simple linear effects, we find that they behave similarly in terms of relative regularization across additive predictors, with the categorical effect being slightly less regularized overall. Along with the weaker coefficient shrinkage, the fixed step length approach includes substantially more non-informative covariates in $\sigma$ compared to the shrunk optimal step length approach. This is particularly pronounced for the categorical effect, where the non-informative categorical effect is selected in every simulation run (see Table 1, columns 1-4).

[Figure 4: Distribution of the coefficient estimates in the Gaussian simulation setting (3.1) with categorical effects. The red horizontal lines represent the true coefficients.]

Table 1: Number of selections of the respective covariate effect in the Gaussian simulation setting (3.1) with additional categorical (columns 1-4), non-linear (columns 5-8) and a non-informative spatial effect (columns 9-12) for 100 simulation runs. The informative effects are marked in bold.

             categorical              non-linear                spatial
         η_µ          η_σ          η_µ          η_σ          η_µ          η_σ
       fixed  opt   fixed  opt   fixed  opt   fixed  opt   fixed  opt   fixed  opt
x1        98  100      81   16     100  100      70    8      86   99      69    8
x2       100  100      81   13     100  100      75   15      99  100      75    7
x3        94   99     100  100     100  100     100  100      82   93     100  100
x4       100  100     100  100     100  100     100  100      96  100     100  100
x5         3    7     100  100       0    8     100  100       1    5     100  100
x6         2    8     100  100       1    7     100  100       0    1     100  100
x7         2    7      84   10       4   10      69   19       1    7      64    8
x8         9   14      80   15       4   12      65    9       1    2      57    3
x9         4    9      86   12       1    7      54    9       0    2      70   11
x10        3   12      80   16       2    8      65   16       0    3      74   10
z1       100  100     100  100     100  100     100  100       –    –       –    –
z2         2    5     100   10       5   21      99   32       0   11      99   11

For $\mu$, the differences in the selection behavior between the two step length approaches are less pronounced: the fixed step length approach includes slightly fewer non-informative effects but does not select the informative effects in all runs. In the simulation setting with non-linear effects, a similar pattern with respect to the relative regularization of $\mu$ compared to $\sigma$ can be observed, where the non-informative non-linear effect is selected substantially less frequently in the optimal step length approach than when fixed step lengths are used. Unlike the categorical effect, the non-informative spline is selected more often in both methods than the linear non-informative effects (see Table 1, columns 5-8). This likely results from the higher flexibility of the spline base-learner, despite having the same effective degrees of freedom (Hofner et al., 2011). With respect to the shrinkage of the informative non-linear effects, the difference between the step length approaches is small. The effect for $\mu$ is slightly less shrunk for fixed step lengths and the variability in the effect for $\sigma$ is slightly larger compared to the shrunk optimal step length approach (see Fig. 5).
[Figure 5: Partial effects of the informative non-linear effect in the Gaussian simulation setting (3.1). The red dashed lines represent the true partial effect.]

Considering an informative spatial effect yields similar results as the categorical setting. Since the region-specific effects are smoothed across neighboring regions, they are expected to exhibit differing levels of shrinkage, and consequently also the maximum likelihood estimates are regularized. In addition, slightly fewer covariates are selected overall in both spatial settings compared to the categorical setting, which is likely due to the missing second effect in this setting. Further results can be found in the Supplement. In order to evaluate the predictive performance, we consider the mean continuous ranked probability score (CRPS) with either the true distribution or the true point value as reference, the negative log-likelihood and the mean squared error (MSE) on test data. With respect to both mean CRPS and the negative log-likelihood, shrunk optimal step lengths yield a better predictive performance than fixed step lengths in all settings, where the strongest improvement is observed for the CRPS relative to the true distribution. The distributions of the predictive performance measures are displayed in the Supplement. Comparing further properties of the algorithm, we find that using shrunk optimal step lengths substantially reduces the number of iterations needed until stopping, which is in line with previous findings in this regard (Zhang et al., 2022; Daub et al., 2025). In addition, considering more strongly penalized base-learners does not increase the stopping iteration substantially when using shrunk optimal step lengths, while the fixed step length approach stops noticeably later (see Table 2).
The difference between the fixed and shrunk optimal approach is especially pronounced for the categorical effect, which is due to its penalty structure. This supports the discussion in Section 2.4 and constitutes an enhanced advantage of using shrunk optimal step lengths in combination with penalized base-learners. In addition to examining categorical, non-linear and spatial effects separately, we also consider a setting in which all effects are included simultaneously. The results are consistent with those obtained in the separate settings, with the required number of iterations increasing approximately proportionally, thereby amplifying the difference in stopping iteration.

Table 2: Mean and standard deviation (in brackets) of the stopping iterations for fixed (top rows) and shrunk optimal step lengths (bottom rows) in the different Gaussian simulation settings. In addition, the relative comparison of effective degrees of freedom of two and four (% mean) is displayed.

                            categorical     non-linear       spatial
fixed            df = 2    3,294 (663)    1,857 (242)    2,821 (496)
                 df = 4    1,841 (443)    1,545 (219)    2,248 (407)
                 % mean          1.789          1.202          1.255
shrunk optimal   df = 2       152 (22)       187 (25)       139 (21)
                 df = 4       146 (21)       175 (21)       136 (20)
                 % mean          1.041          1.069          1.022

3.2 ZINB setting

In the second setting, $n = 4{,}000$ observations are generated from $Y_i \sim \mathrm{ZINB}(\mu(x_i), \alpha(x_i), \pi(x_i))$ with

$$
\begin{aligned}
\log(\mu(x_i)) &= \eta_\mu(x_i) = 1.8 + 0.2\,x_{1i} - 0.35\,x_{2i} - 0.2\,x_{3i} + 0.35\,x_{4i} + f_\mu(z_i) \\
\log(\alpha(x_i)) &= \eta_\alpha(x_i) = -1.1 + 0.6\,x_{2i} + 0.5\,x_{3i} - 0.6\,x_{4i} - 0.5\,x_{5i} + f_\alpha(z_i) \\
\mathrm{logit}(\pi(x_i)) &= \eta_\pi(x_i) = -0.8 + x_{3i} - 1.25\,x_{4i} - x_{5i} + 1.25\,x_{6i} + f_\pi(z_i).
\end{aligned}
\tag{3.2}
$$

Again, $x_{1i}, x_{3i}, \ldots, x_{25i}$ are drawn independently from $U(-1,1)$, $x_{2i}, x_{4i}, \ldots, x_{26i}$ are independent realizations of $B(1, 0.5)$ and categorical, non-linear or spatial effects are added (for more details on the specification, see the Supplement).
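A ZINB response as in (3.2) can be drawn via the usual gamma-Poisson mixture representation of the negative binomial. The following numpy sketch (our own illustrative code, omitting the additional effects f(z) and the non-informative covariates) mirrors the data-generating process:

```python
import numpy as np

def sample_zinb(mu, alpha, pi, rng):
    """Draw from ZINB(mu, alpha, pi): a structural zero with probability pi,
    otherwise negative binomial with mean mu and Var = mu + alpha * mu^2,
    generated via its gamma-Poisson mixture representation."""
    lam = rng.gamma(shape=1.0 / alpha, scale=alpha * mu)  # E[lam] = mu
    counts = rng.poisson(lam)
    return np.where(rng.uniform(size=np.shape(mu)) < pi, 0, counts)

rng = np.random.default_rng(2)  # arbitrary seed for the sketch
n = 4000
# odd-indexed covariates ~ U(-1, 1), even-indexed ~ B(1, 0.5), as in the text
x = {j: rng.uniform(-1, 1, n) if j % 2 == 1 else rng.binomial(1, 0.5, n)
     for j in range(1, 7)}

# linear predictors of (3.2), without the additional effects f(z)
mu = np.exp(1.8 + 0.2 * x[1] - 0.35 * x[2] - 0.2 * x[3] + 0.35 * x[4])
alpha = np.exp(-1.1 + 0.6 * x[2] + 0.5 * x[3] - 0.6 * x[4] - 0.5 * x[5])
pi = 1.0 / (1.0 + np.exp(-(-0.8 + x[3] - 1.25 * x[4] - x[5] + 1.25 * x[6])))

y = sample_zinb(mu, alpha, pi, rng)
```

The gamma mixture is a standard construction: if $\lambda \sim \mathrm{Gamma}(1/\alpha,\ \alpha\mu)$ and $Y \mid \lambda \sim \mathrm{Poisson}(\lambda)$, then $Y$ is negative binomial with mean $\mu$ and variance $\mu + \alpha\mu^2$.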
The base-learners are specified as in the Gaussian setting. Due to the construction of the distribution and the dependence among the distribution parameters, a zero-inflated negative binomial model for location, scale and shape is considerably more difficult to estimate than a Gaussian location and scale model. The small gradient of $\alpha$ further amplifies the difficulty. Overall, our findings for the zero-inflated negative binomial model are consistent with the results from the Gaussian setting. However, a larger variability in the coefficient estimates for the maximum likelihood reference was observed, and the additive predictor for $\alpha$ is quite strongly regularized in both boosting approaches. As in the Gaussian setting, shrunk optimal step lengths yield a more similar degree of regularization among the additive predictors: $\alpha$ is less regularized, while $\mu$ and $\pi$ show a stronger regularization compared to the fixed step length approach. These differences are, however, less pronounced than in the Gaussian case, which can be attributed to the ZINB setting being slightly less imbalanced.

Table 3: Number of selections of the covariate effects in the ZINB simulation setting (3.2) with additional categorical (columns 1-6) and non-linear effects (columns 7-12) for 100 simulation runs. The informative effects are marked in bold.

                     categorical                            non-linear
         η_µ          η_α          η_π          η_µ          η_α          η_π
       fixed  opt   fixed  opt   fixed  opt   fixed  opt   fixed  opt   fixed  opt
x1       100  100       5   13      11    6     100  100      12   15      10    7
x2       100  100      83   94      62   45     100  100      88   88      58   53
x3       100  100      94   98     100  100     100  100      97   98     100  100
x4       100  100      78   89     100  100     100  100      92   96     100  100
x5        52   19     100  100     100  100      51   25     100  100     100  100
x6        54   17       4   11     100  100      46   15       7   13     100  100
x7        53   19       2    4      11    6      39   15       2    5       6    6
x8        38    9       7   11       6    4      45   20       4    9       4    3
x9        49   10       5   15       6    1      41   19       5   10       3    3
x10       53   20       2    7       8    4      55   30       5    6       3    3
z1       100  100      85   97     100  100     100  100      77   78     100  100
z2        72   17       1    2       1    0      82   40       8   20      13    8

In terms of variable selection, the shrunk optimal approach clearly outperforms the fixed step length approach (see Table 3). Compared to fixed step lengths, it substantially reduces the number of false positives in $\mu$ and false negatives in $\alpha$, although it still tends to select more non-informative effects for $\mu$ than for the other additive predictors. For $\pi$, all informative effects and only few non-informative effects are selected, with $x_2$ being the only exception. These selection difficulties likely arise from the challenge of disentangling effects across the dependent additive predictors, and the use of shrunk optimal step lengths does not substantially improve the number of false positives in this case. Further simulation results are provided in the Supplement. Comparing the methods with respect to the predictive performance, we find that the boosting approaches outperform the maximum likelihood reference in terms of CRPS and negative log-likelihood. Between the boosting approaches, no clear improvement of the shrunk optimal step length approach could be observed. Given the lower number of selected covariates, the predictive performance is, however, achieved with a sparser model for shrunk optimal step lengths.
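The CRPS used in these predictive comparisons can be approximated for any fitted distribution from Monte Carlo samples via the kernel representation $\mathrm{CRPS}(F, y) = \mathbb{E}|X - y| - \tfrac{1}{2}\mathbb{E}|X - X'|$ with $X, X' \sim F$ independent. A small self-contained sketch (our own helper, not part of the paper's implementation):

```python
import numpy as np

def crps_mc(samples, y):
    """Monte Carlo approximation of the CRPS for one observation y:
    CRPS(F, y) = E|X - y| - 0.5 * E|X - X'|, X, X' ~ F independently.
    `samples` is a 1-D array of draws from the predictive distribution F."""
    x = np.sort(np.asarray(samples, dtype=float))
    n = len(x)
    term1 = np.abs(x - y).mean()
    # E|X - X'| via the Gini-mean-difference identity on the sorted sample:
    # sum over all ordered pairs of |x_i - x_j| equals 2 * sum_k (2k - n - 1) x_(k)
    k = np.arange(1, n + 1)
    term2 = 2.0 * np.sum((2 * k - n - 1) * x) / (n * n)
    return term1 - 0.5 * term2
```

For a degenerate predictive distribution (all samples equal to a constant c) the formula collapses to |c - y|, which provides a quick correctness check.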
The distributions of the predictive performance measures for the different approaches are displayed in the Supplement.

4 Modeling the Number of Antenatal Care Visits in Nigeria

In the following, we model the number of antenatal care visits in Nigeria based on maternal socio-economic and demographic characteristics. For the analysis we use data from the 2013 Nigeria Demographic and Health Survey (National Population Commission (NPC) [Nigeria] and ICF International, 2014), compiled by Gayawan and Omolofe (2016). The data set contains 18,815 observations of 11 variables, which we split into a training and a validation set (2/3 vs. 1/3). Among the recorded numbers of antenatal care visits, 34.2% are zero, while the non-zero observations have a mean of 7.97 and an empirical variance of 34.93 (see Fig. 6 for the empirical distribution).

[Figure 6: Empirical distribution of the number of antenatal care visits.]

The large proportion of zeros combined with the overdispersion motivates the use of a zero-inflated negative binomial distribution for the response variable. Aside from the number of antenatal care visits, the data set contains different socio-economic, demographic and contextual characteristics of the mother: education (no education, primary school education and higher education), wealth (a five-level wealth index), working status (work) as well as access to mass media, specifically newspaper, radio and television, represent her economic and social status. In addition, the mother's age at birth (age), the duration of her marriage (married_yrs), her region of residence (zone) and whether she lives in an urban or rural area (urban) are included. Descriptive statistics on the covariates are provided in the Supplement.
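The reported moments alone already rule out a Poisson model. A quick back-of-the-envelope check (numbers taken from the text; treating the non-zero part as roughly negative binomial is only a heuristic, since it ignores the truncation at zero):

```python
# moments of the antenatal care visits as reported in the text
zero_share = 0.342          # share of zeros
mean_nonzero = 7.97         # mean of the non-zero counts
var_nonzero = 34.93         # empirical variance of the non-zero counts

# A Poisson model would force variance == mean; here the variance is more
# than four times the mean. Treating the non-zero part as roughly negative
# binomial (a heuristic ignoring truncation), Var = mu + alpha * mu^2
# implies a clearly positive overdispersion parameter alpha.
implied_alpha = (var_nonzero - mean_nonzero) / mean_nonzero ** 2
assert var_nonzero > mean_nonzero   # overdispersion
assert implied_alpha > 0            # NB rather than Poisson count part
```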
Based on the characteristics of the response variable, and because we are interested not only in the effects on its conditional mean but also on the other distribution parameters, a zero-inflated negative binomial model for location, scale and shape is considered to model the number of antenatal care visits. Aside from the conditional mean, the effects on the zero-inflation parameter are particularly relevant, since having no antenatal care is especially problematic from a medical point of view. For the estimation of the ZINB-GAMLSS via boosting, the categorical covariates (education, wealth) are dummy-coded and combined into a single effect, the continuous covariates (age, married_yrs) are modeled using P-splines decomposed into a linear effect and a smooth deviation, and the discrete spatial effect of zone is incorporated via a Markov random field. The number of equivalent degrees of freedom is set to two for these base-learners. The remaining covariates, all of which are binary, enter the model as simple linear effects. In addition to these main effects, we include all pairwise interactions between binary and categorical covariates. For the categorical covariates, the interactions are defined at the level of the individual categories, reflecting the possibility that interaction effects might only be present for certain category combinations. In the following, we compare the results from boosting with fixed and shrunk optimal step lengths, where the fixed step length and the shrinkage factor of the optimal step length are set to 0.1. The stopping iteration is again determined based on a robust 10-fold cross-validation variant. As a reference, the boosting results are additionally compared with direct maximum likelihood estimation for a model without interaction terms. The corresponding estimates are included in the Supplement. Fig. 7 displays the coefficient paths for the two step length approaches.
When using shrunk optimal step lengths, the paths for $\mu$ grow notably slower relative to the other predictors than in the fixed step length approach, while the paths for $\alpha$ grow faster.

[Figure 7: Coefficient paths for the antenatal care data from Nigeria using different step length approaches (columns). The dashed vertical lines represent the stopping iterations. Note that the x-axis scaling varies across the step length approaches.]

Table 4: Number of selected base-learners per additive predictor for the antenatal care data from Nigeria using different step length approaches (rows).

                   µ    α    π
fixed             32    8   17
shrunk optimal    22   10   15

The coefficient estimates of the different submodels therefore converge at a similar speed, indicating a more balanced update behavior. In contrast, with fixed step lengths the convergence rates of the additive predictors differ substantially: the coefficients for $\mu$ converge much faster, whereas those for $\alpha$ converge considerably more slowly than the coefficients for $\pi$. As a result, the coefficient estimates exhibit a different degree of shrinkage across the two step length approaches, where the effects on $\mu$ are often more regularized when using shrunk optimal step lengths.
For example, the estimated interaction effects of newspaper:wealth0, newspaper:education0 and urban:wealth0 are considerably higher for the fixed step length approach (an overview of all estimates is presented in the Supplement). Exceptions such as the effect of access to television on $\mu$, for which the coefficient of the shrunk optimal approach is larger, appear to result from partial compensation of the smaller interaction effects. In line with the stronger coefficient shrinkage, the shrunk optimal step length approach selects ten base-learners fewer than the fixed step length approach for $\mu$ (see Table 4). The additive predictor associated with $\alpha$, on the other hand, is subject to less regularization when optimal step lengths are used, which becomes apparent in the larger coefficient estimates of wealth, radio and television as well as in the selection of two additional base-learners. For the zero-inflation parameter $\pi$, both step length approaches yield a similar degree of regularization, with the shrunk optimal step length approach showing a slightly stronger shrinkage. The differing degrees of regularization are also reflected in the estimates of the non-linear and spatial effects. When shrunk optimal step lengths are used, the estimated effects of both age and married_yrs on $\mu$ are smaller or less flexible. For $\alpha$, shrunk optimal step lengths yield a larger effect for age, while married_yrs is not selected by either approach (see Fig. 8). As the effect estimates of both continuous covariates on $\pi$ are comparatively small, the slightly stronger regularization of $\pi$ for the shrunk optimal step length approach results in differences in the covariate and effect selection. Specifically, age is not selected at all, and the effect of married_yrs is regularized to a linear effect by the shrunk optimal approach, while the fixed step length approach includes both smooth deviations.
For the spatial effects on $\mu$ and $\pi$, the step length approaches yield a similar degree of regularization, whereas the spatial effect on $\alpha$ is again less regularized for shrunk optimal step lengths (see Fig. 9).

[Figure 8: Partial effects of the mother's age (left) and marriage duration (right) for the antenatal care data from Nigeria using different step length approaches (columns).]

[Figure 9: Spatial effects for the antenatal care data from Nigeria using different step length approaches (rows).]

Overall, the shrunk optimal step length approach thus leads to a more balanced model due to the more similar degree of regularization and smaller differences in the number of selected covariates across predictors. This improvement in balancedness is driven by substantial differences in the shrunk optimal step lengths across the additive predictors, which are shown in Fig. 10. For unpenalized base-learners, the optimal step lengths take values around 0.5 for $\mu$, 8 for $\alpha$ and 5 for $\pi$ (for some base-learners larger); a constant optimal step length of 1 over all iterations and across all additive predictors would coincide with the fixed step length approach. Compared to the fixed step length approach, the considerably larger step lengths for $\alpha$ and $\pi$ lead to a substantially faster convergence of their additive predictors, whereas the smaller values for $\mu$ slow down its convergence. This shift in relative convergence behavior results in the more similar convergence speed across additive predictors and thereby contributes to a more balanced model. Differences in the step length level also arise within the same additive predictor.
Consistent with previous findings, step lengths for penalized base-learners tend to take larger values than step lengths for unpenalized base-learners. In particular, step lengths corresponding to the categorical effects of wealth and education as well as the spatial effect of zone exhibit comparatively large values. While, for example, comparing the step lengths of education and wealth we observe larger optimal step lengths for the more strongly penalized base-learner, the relation between the optimal step length levels of different base-learners cannot be reduced to the degree of penalization alone. This is also reflected by the differences in the behavior of optimal step lengths across the additive predictors. In addition to an improved relative convergence behavior, shrunk optimal step lengths yield a faster overall convergence, resulting in a considerably earlier stopping iteration (253 compared to 3,584). In order to compare the predictive performance, we evaluate the CRPS (with the observed value as reference) and the negative log-likelihood on the validation set. Both approaches achieve a very similar predictive accuracy with respect to these measures; however, the shrunk optimal approach attains this performance with a sparser model. These findings are consistent with the simulation results; more detailed results are provided in the Supplement. Concerning the covariate effects on the distribution of antenatal care visits, we obtain the following results from boosting with shrunk optimal step lengths (see Fig. 8, Fig. 9 and the Supplement for the coefficient table). For the mean parameter $\mu$, the largest positive effect can be observed for access to television, while access to the other mass media (radio, newspaper) exhibits more moderate positive effects. Although urban itself is not selected, strong effects are apparent for the interactions urban:education2, urban:wealth0 and urban:wealth1.
Moreover, wealth and education exert substantial effects on $\mu$ and additionally enter the model through interactions; aside from the interactions involving urban, the most notable ones are television:education2 and television:wealth4. Also for $\pi$, education, wealth, urban and television show large effects on the additive predictor. In contrast to $\mu$, these effects primarily arise from the main effects rather than from interactions with other covariates. The only interaction with a pronounced effect is wealth0:education0, while wealth1:education0, work as well as access to radio and newspaper exhibit more moderate effects. With respect to the overdispersion parameter $\alpha$, strong negative effects are observed for newspaper and urban. Access to the other mass media (radio, television) as well as wealth show moderate effects.

[Figure 10: Shrunk optimal step lengths for the antenatal care data from Nigeria.]

For the mother's age at birth we obtain a positive effect on $\mu$, which is slightly more enhanced in the age group 25 to 35, and a negative effect on $\alpha$. The effect of married_yrs on $\mu$ is considerably smaller compared to the effect of age, and its direction varies across the range: for a marriage duration below 10 years the effect on $\mu$ is negative, while it is positive for a longer marriage duration. With respect to $\pi$, married_yrs has a small positive effect.
Regarding the spatial effects on $\mu$ and $\alpha$, pronounced north-south differences are evident, with the northern zones exhibiting negative and the southern zones positive effects on both parameters. For $\pi$, the spatial pattern is more variable: a strong positive effect is evident for the North West and the South South zones, while the remaining zones show more moderate negative effects.

5 Conclusion and Discussion

Using shrunk optimal step lengths for non-cyclical boosting of GAMLSS has been shown to ensure a balanced overall model by enabling a fairer update selection and leveling the convergence speed of the different additive predictors. In the present work, we extended this approach to a larger variety of base-learners, considering jointly dummy-coded penalized linear base-learners for modeling categorical effects, P-spline base-learners representing non-linear effects and Markov random field base-learners to account for discrete spatial effects. When combined with base-learners that penalize the overall size of the fit, larger optimal step lengths were observed compared to unpenalized base-learners, reflecting the greater potential for reducing the loss function. This constitutes an additional benefit of shrunk optimal step lengths in this context, as it increases the otherwise slow convergence speed of such effects while maintaining a fair base-learner selection. Furthermore, we extended the class of GAMLSS boosted with shrunk optimal step lengths to three-parameter models, specifically a zero-inflated negative binomial model for location, scale and shape, motivated by our application.
Simulation studies and the application further demonstrate that, for effects not represented by simple linear base-learners, using shrunk optimal step lengths reduces the number of false positives and mitigates differences in regularization across additive predictors compared to the fixed step length approach by a similar magnitude as for simple linear base-learners. In the case of a zero-inflated negative binomial response variable, the relative regularization of the additive predictors and the variable selection behavior were consistent with those observed for other response variable distributions. As opposed to the Gaussian case, the predictive performance is not improved by using shrunk optimal step lengths here, but a similar performance is achieved with a sparser model. The analysis of the determinants of antenatal care utilization indicates that wealth, education, access to television, type of residential area and region of residence exert the strongest effects on the conditional mean and the zero-inflation probability. For the conditional mean, interactions between these covariates have a large impact, whereas the zero-inflation parameter is primarily driven by the main effects. The overdispersion parameter is most strongly influenced by access to newspaper, type of residential area and the region of residence. In summary, shrunk optimal step lengths can readily be applied to base-learners beyond simple linear ones as well as to a zero-inflated negative binomial response variable distribution, where they improve variable selection, lead to a more balanced regularization across additive predictors and enhance estimation efficiency. In none of the considered settings did shrunk optimal step lengths lead to overall inferior results compared to the fixed step length approach. Some limitations inherent to boosting methods remain, which are not related to the step length approach.
While the use of shrunk optimal step lengths mitigates the lower relative importance of additive predictors associated with small negative gradients, such predictors can nevertheless be challenging to estimate. This is, for instance, evident for the overdispersion parameter in the ZINB-GAMLSS. Further research is therefore warranted to better address this issue. In addition, a comparatively large number of false positives is selected across the additive predictors, which reflects a well-known characteristic of boosting-based approaches. Alternative stopping criteria, such as probing (Thomas et al., 2017), or a deselection procedure that removes covariate effects of minor importance a posteriori (Strömer et al., 2022) could be employed in future work to address this issue. Another inherent limitation of boosting methods is the lack of standard errors for the estimated effects and related inferential statistics. To obtain these quantities, resampling-based approaches such as bootstrapping could, for example, be applied. Our implementation of the boosting approach with shrunk optimal step lengths, building on the R package gamboostLSS, is publicly available on GitHub (https://github.com/AlexandraDaub/BoostZINB). As a next step, we plan to integrate the option of using shrunk optimal step lengths directly into the gamboostLSS package to facilitate application in practice. Future work will further explore extensions of this approach to additional base-learner types as well as to a wider range of response variable distributions. Moreover, an interesting direction for future research is the application of boosting with shrunk optimal step lengths to more complex model classes within the GAMLSS framework, such as distributional copula regression (Hans et al., 2022) or multivariate distributional regression (Strömer et al., 2023).
Lastly, the potential benefits of using shrunk optimal step lengths in model classes with a single predictor remain to be explored, particularly given that our findings suggest that optimal step lengths may also be advantageous for boosting generalized additive models in settings with certain types of penalized base-learners.

Acknowledgements

The authors would like to thank Ezra Gayawan for providing the data and for valuable input regarding the consistency and interpretation of the results.

Funding

This work was supported by the DFG (Deutsche Forschungsgemeinschaft; Projekt BE 7939/2-2).

References

Bühlmann, P. (2006). Boosting for high-dimensional linear models. The Annals of Statistics, 34(2):1001–1024.

Bühlmann, P. and Hothorn, T. (2007). Boosting algorithms: Regularization, prediction and model fitting. Statistical Science, 22(4):477–505.

Bühlmann, P. and Yu, B. (2003). Boosting with the L2 loss. Journal of the American Statistical Association, 98(462):324–339.

Daub, A., Mayr, A., Zhang, B., and Bergherr, E. (2025). A balanced statistical boosting approach for GAMLSS via new step lengths. Computational Statistics, 40(8):4741–4773.

Eilers, P. H. C. and Marx, B. D. (1996). Flexible smoothing with B-splines and penalties. Statistical Science, 11(2):89–121.

Fahrmeir, L., Kneib, T., Lang, S., and Marx, B. (2013). Regression: Models, Methods and Applications. Springer Berlin, Heidelberg.

Freund, Y. and Schapire, R. E. (1996). Experiments with a new boosting algorithm. In Proceedings of the Thirteenth International Conference on Machine Learning Theory, pages 148–156, San Francisco, CA.

Friedman, J., Hastie, T., and Tibshirani, R. (2000). Additive logistic regression: a statistical view of boosting (with discussion and a rejoinder by the authors). The Annals of Statistics, 28(2):337–407.

Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. The Annals of Statistics, 29(5):1189–1232.
Ganegoda, A. and Evans, J. (2012). A scaling model for severity of operational losses using generalized additive models for location scale and shape (GAMLSS). Annals of Actuarial Science, 7(1):61–100.

Gayawan, E. and Omolofe, O. T. (2016). Analyzing spatial distribution of antenatal care utilization in West Africa using a geo-additive zero-inflated count model. Spatial Demography, 4(3):245–262.

Hans, N., Klein, N., Faschingbauer, F., Schneider, M., and Mayr, A. (2022). Boosting distributional copula regression. Biometrics, 79(3):2298–2310.

Hasselman, B. (2023). nleqslv: Solve Systems of Nonlinear Equations. R package version 3.3.5.

Hastie, T. and Tibshirani, R. (1990). Generalized Additive Models. Chapman & Hall, London.

Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer New York.

Hofner, B., Hothorn, T., Kneib, T., and Schmid, M. (2011). A framework for unbiased model selection based on boosting. Journal of Computational and Graphical Statistics, 20(4):956–971.

Hofner, B., Mayr, A., Robinzonov, N., and Schmid, M. (2012). Model-based boosting in R: a hands-on tutorial using the R package mboost. Computational Statistics, 29(1–2):3–35.

Hofner, B., Mayr, A., and Schmid, M. (2016). gamboostLSS: An R package for model building and variable selection in the GAMLSS framework. Journal of Statistical Software, 74(1):1–31.

Klein, N. (2024). Distributional regression for data analysis. Annual Review of Statistics and Its Application, 11:321–346.

Kneib, T. (2013). Beyond mean regression. Statistical Modelling, 13(4):275–303.

Kneib, T., Hothorn, T., and Tutz, G. (2009). Variable selection and model choice in geoadditive regression models. Biometrics, 65(2):626–634.

Mayr, A., Binder, H., Gefeller, O., and Schmid, M. (2014a). The evolution of boosting algorithms. Methods of Information in Medicine, 53(06):419–427.
Mayr, A., Binder, H., Gefeller, O., and Schmid, M. (2014b). Extending statistical boosting: An overview of recent methodological developments. Methods of Information in Medicine, 53:428–435.

Mayr, A., Fenske, N., Hofner, B., Kneib, T., and Schmid, M. (2012a). Generalized additive models for location, scale and shape for high dimensional data - a flexible approach based on boosting. Journal of the Royal Statistical Society: Series C (Applied Statistics), 61(3):403–427.

Mayr, A., Hofner, B., and Schmid, M. (2012b). The importance of knowing when to stop. Methods of Information in Medicine, 51(02):178–186.

National Population Commission (NPC) [Nigeria] and ICF International (2014). Nigeria demographic and health survey 2013. Abuja, Nigeria, and Rockville, Maryland, USA: NPC and ICF.

Rigby, R. A. and Stasinopoulos, D. M. (2005). Generalized additive models for location, scale and shape (with discussion). Journal of the Royal Statistical Society: Series C (Applied Statistics), 54(3):507–554.

Rue, H. and Held, L. (2005). Gaussian Markov Random Fields. Chapman and Hall/CRC.

Stasinopoulos, M., Rigby, B., and Akantziliotou, C. (2008). Instructions on how to use the gamlss package in R, second edition.

Strömer, A., Klein, N., Staerk, C., Klinkhammer, H., and Mayr, A. (2023). Boosting multivariate structured additive distributional regression models. Statistics in Medicine, 42(11):1779–1801.

Strömer, A., Staerk, C., Klein, N., Weinhold, L., Titze, S., and Mayr, A. (2022). Deselection of base-learners for statistical boosting - with an application to distributional regression. Statistical Methods in Medical Research, 31(2):207–224.

Thomas, J., Hepp, T., Mayr, A., and Bischl, B. (2017). Probing for sparse and fast variable selection with model-based boosting. Computational and Mathematical Methods in Medicine, 2017:1–8.

Thomas, J., Mayr, A., Bischl, B., Schmid, M., Smith, A., and Hofner, B. (2018). Gradient boosting for distributional regression: faster tuning and improved variable selection via noncyclical updates. Statistics and Computing, 28(3):673–687.

Villarini, G., Smith, J. A., and Napolitano, F. (2010). Nonstationary modeling of a long record of rainfall and temperature over Rome. Advances in Water Resources, 33(10):1256–1267.

World Health Organization (2006). WHO child growth standards: methods and development: length/height-for-age, weight-for-age, weight-for-length, weight-for-height and body mass index-for-age. Geneva, Switzerland: World Health Organization.

World Health Organization (2016). WHO recommendations on antenatal care for a positive pregnancy experience. World Health Organization.

World Health Organization et al. (2014). Trends in maternal mortality: 1990 to 2013 - Estimates by WHO, UNICEF, UNFPA, the World Bank and the United Nations Population Division.

Zhang, B., Hepp, T., Greven, S., and Bergherr, E. (2022). Adaptive step-length selection in gradient boosting for Gaussian location and scale models. Computational Statistics, 37(5):2295–2332.

Supplement

A Derivation of the first-order condition for optimal step lengths in ZINB-GAMLSS

For a ZINB-GAMLSS, we have a log-likelihood of

\[
\sum_{i=1}^{n} \ell(\mu_i, \alpha_i, \pi_i; y_i)
= \sum_{i:\, y_i = 0} \ln\!\left[ \pi_i + (1 - \pi_i)\left(1 + \alpha_i \mu_i\right)^{-\frac{1}{\alpha_i}} \right]
+ \sum_{i:\, y_i > 0} \left[ \ln(1 - \pi_i) + y_i \ln\!\left( \frac{\alpha_i \mu_i}{1 + \alpha_i \mu_i} \right) - \frac{1}{\alpha_i} \ln\!\left(1 + \alpha_i \mu_i\right) + \ln \Gamma\!\left( y_i + \frac{1}{\alpha_i} \right) - \ln \Gamma\!\left( y_i + 1 \right) - \ln \Gamma\!\left( \frac{1}{\alpha_i} \right) \right],
\]

where \(\Gamma\) denotes the gamma function.
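The log-likelihood above can be evaluated directly; the following short Python sketch (illustrative only, the paper's implementation is in R) computes the per-observation contribution and can be cross-checked against the standard negative binomial parameterization:

```python
import math
from scipy.special import gammaln

def zinb_loglik(y, mu, alpha, pi):
    """Log-likelihood contribution of a single observation under the
    zero-inflated negative binomial parameterization used above:
    mean mu > 0, overdispersion alpha > 0, zero-inflation pi in (0, 1)."""
    if y == 0:
        # ln[ pi + (1 - pi) * (1 + alpha*mu)^(-1/alpha) ]
        return math.log(pi + (1.0 - pi) * (1.0 + alpha * mu) ** (-1.0 / alpha))
    # ln(1 - pi) + y ln(alpha*mu / (1 + alpha*mu)) - (1/alpha) ln(1 + alpha*mu)
    #            + ln Gamma(y + 1/alpha) - ln Gamma(y + 1) - ln Gamma(1/alpha)
    return (math.log(1.0 - pi)
            + y * math.log(alpha * mu / (1.0 + alpha * mu))
            - (1.0 / alpha) * math.log(1.0 + alpha * mu)
            + gammaln(y + 1.0 / alpha) - gammaln(y + 1.0) - gammaln(1.0 / alpha))
```

As a convenient cross-check, the negative binomial part coincides with scipy.stats.nbinom using size r = 1/alpha and success probability p = 1/(1 + alpha*mu).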
A.1 Derivation of the first-order condition of \(\nu^*_\mu\)

In order to determine the optimal step length \(\nu^*_\mu\) in iteration \(m\), the following optimization problem has to be solved:

\[
\nu_{\mu}^{*[m]} = \arg\min_{\nu_{\mu}} \; -\sum_{i=1}^{n} \ell\!\left( \exp\!\left( \hat{\eta}_{\mu,i}^{[m-1]} + \nu_{\mu} \hat{h}_{\mu,i}^{*[m]} \right), \hat{\alpha}_i^{[m-1]}, \hat{\pi}_i^{[m-1]}; y_i \right), \tag{A.1}
\]

where for notational convenience we define here and in the following \(\hat{\eta}_{\mu,i}^{[m]} = \hat{\eta}_{\mu}^{[m]}(x_i)\), \(\hat{h}_{\mu,i}^{*[m]} = \hat{h}_{\mu}^{*[m]}(x_i)\), \(\hat{\mu}_i^{[m]} = \hat{\mu}^{[m]}(x_i)\), \(\hat{\alpha}_i^{[m]} = \hat{\alpha}^{[m]}(x_i)\) and \(\hat{\pi}_i^{[m]} = \hat{\pi}^{[m]}(x_i)\). For the first-order condition of (A.1) it holds that

\[
0 = -\sum_{i=1}^{n} \frac{\partial}{\partial \nu_{\mu}} \ell\!\left( \exp\!\left( \hat{\eta}_{\mu,i}^{[m-1]} + \nu_{\mu} \hat{h}_{\mu,i}^{*[m]} \right), \hat{\alpha}_i^{[m-1]}, \hat{\pi}_i^{[m-1]}; y_i \right)
= -\sum_{i=1}^{n} \left( \frac{\partial \ell}{\partial \mu_i^{[m]}} \right) \left( \frac{\partial \mu_i^{[m]}}{\partial \eta_{\mu,i}^{[m]}} \right) \left( \frac{\partial \eta_{\mu,i}^{[m]}}{\partial \nu_{\mu}} \right).
\]

With \(\mu_i^{[m]} = \exp\!\big(\eta_{\mu,i}^{[m]}\big)\) and \(\eta_{\mu,i}^{[m]} = \eta_{\mu,i}^{[m-1]} + \nu_{\mu} h_{\mu,i}^{*[m]}\), we obtain for the partial derivatives:

\[
\frac{\partial \ell}{\partial \mu_i^{[m]}} =
\begin{cases}
\dfrac{-\left(1 - \pi_i^{[m-1]}\right) \left(1 + \alpha_i^{[m-1]} \mu_i^{[m]}\right)^{-\frac{1}{\alpha_i^{[m-1]}} - 1}}{\pi_i^{[m-1]} + \left(1 - \pi_i^{[m-1]}\right) \left(1 + \alpha_i^{[m-1]} \mu_i^{[m]}\right)^{-\frac{1}{\alpha_i^{[m-1]}}}} = \dfrac{-1}{\frac{\pi_i^{[m-1]}}{1 - \pi_i^{[m-1]}} \left(1 + \alpha_i^{[m-1]} \mu_i^{[m]}\right)^{\frac{1}{\alpha_i^{[m-1]}} + 1} + \left(1 + \alpha_i^{[m-1]} \mu_i^{[m]}\right)} & \text{if } y_i = 0, \\[3ex]
\dfrac{y_i}{\mu_i^{[m]} \left(1 + \alpha_i^{[m-1]} \mu_i^{[m]}\right)} - \dfrac{1}{1 + \alpha_i^{[m-1]} \mu_i^{[m]}} = \dfrac{y_i - \mu_i^{[m]}}{\mu_i^{[m]} \left(1 + \alpha_i^{[m-1]} \mu_i^{[m]}\right)} & \text{if } y_i > 0,
\end{cases}
\]

\[
\frac{\partial \mu_i^{[m]}}{\partial \eta_{\mu,i}^{[m]}} = \exp\!\left(\eta_{\mu,i}^{[m]}\right) = \mu_i^{[m]}, \qquad
\frac{\partial \eta_{\mu,i}^{[m]}}{\partial \nu_{\mu}} = h_{\mu,i}^{*[m]}.
\]

The first-order condition of (A.1) thus is

\[
0 = \sum_{i:\, y_i = 0} \frac{\mu_i^{[m]} h_{\mu,i}^{*[m]}}{\frac{\pi_i^{[m-1]}}{1 - \pi_i^{[m-1]}} \left(1 + \alpha_i^{[m-1]} \mu_i^{[m]}\right)^{\frac{1}{\alpha_i^{[m-1]}} + 1} + \left(1 + \alpha_i^{[m-1]} \mu_i^{[m]}\right)}
- \sum_{i:\, y_i > 0} \frac{h_{\mu,i}^{*[m]} \left(y_i - \mu_i^{[m]}\right)}{1 + \alpha_i^{[m-1]} \mu_i^{[m]}},
\]

where \(\mu_i^{[m]} = \exp\!\left(\eta_{\mu,i}^{[m-1]} + \nu_{\mu}^{*[m]} h_{\mu,i}^{*[m]}\right)\), so that the condition is a nonlinear equation in \(\nu_{\mu}^{*[m]}\).

A.2 Derivation of the first-order condition of \(\nu^*_\alpha\)

In order to determine the optimal step length \(\nu^*_\alpha\) in iteration \(m\), the following optimization problem has to be solved:

\[
\nu_{\alpha}^{*[m]} = \arg\min_{\nu_{\alpha}} \; -\sum_{i=1}^{n} \ell\!\left( \hat{\mu}_i^{[m-1]}, \exp\!\left( \hat{\eta}_{\alpha,i}^{[m-1]} + \nu_{\alpha} \hat{h}_{\alpha,i}^{*[m]} \right), \hat{\pi}_i^{[m-1]}; y_i \right). \tag{A.2}
\]

For the first-order condition of (A.2) it holds that

\[
0 = -\sum_{i=1}^{n} \left( \frac{\partial \ell}{\partial \alpha_i^{[m]}} \right) \left( \frac{\partial \alpha_i^{[m]}}{\partial \eta_{\alpha,i}^{[m]}} \right) \left( \frac{\partial \eta_{\alpha,i}^{[m]}}{\partial \nu_{\alpha}} \right).
\]

With \(\alpha_i^{[m]} = \exp\!\big(\eta_{\alpha,i}^{[m]}\big)\) and \(\eta_{\alpha,i}^{[m]} = \eta_{\alpha,i}^{[m-1]} + \nu_{\alpha} h_{\alpha,i}^{*[m]}\), the partial derivatives simplify to

\[
\frac{\partial \ell}{\partial \alpha_i^{[m]}} =
\begin{cases}
\dfrac{\frac{1}{\alpha_i^{[m]}} \left( \frac{1}{\alpha_i^{[m]}} \ln\!\left(1 + \alpha_i^{[m]} \mu_i^{[m-1]}\right) - \frac{\mu_i^{[m-1]}}{1 + \alpha_i^{[m]} \mu_i^{[m-1]}} \right)}{\frac{\pi_i^{[m-1]}}{1 - \pi_i^{[m-1]}} \left(1 + \alpha_i^{[m]} \mu_i^{[m-1]}\right)^{\frac{1}{\alpha_i^{[m]}}} + 1} & \text{if } y_i = 0, \\[3ex]
\dfrac{y_i - \mu_i^{[m-1]}}{\alpha_i^{[m]} \left(1 + \alpha_i^{[m]} \mu_i^{[m-1]}\right)} + \dfrac{1}{\big(\alpha_i^{[m]}\big)^2} \left[ \ln\!\left(1 + \alpha_i^{[m]} \mu_i^{[m-1]}\right) - \psi\!\left(y_i + \frac{1}{\alpha_i^{[m]}}\right) + \psi\!\left(\frac{1}{\alpha_i^{[m]}}\right) \right] & \text{if } y_i > 0,
\end{cases}
\]

\[
\frac{\partial \alpha_i^{[m]}}{\partial \eta_{\alpha,i}^{[m]}} = \exp\!\left(\eta_{\alpha,i}^{[m]}\right) = \alpha_i^{[m]}, \qquad
\frac{\partial \eta_{\alpha,i}^{[m]}}{\partial \nu_{\alpha}} = h_{\alpha,i}^{*[m]},
\]

where \(\psi\) denotes the digamma function. The first-order condition of (A.2) thus is

\[
0 = -\sum_{i:\, y_i = 0} \frac{h_{\alpha,i}^{*[m]} \left( \frac{1}{\alpha_i^{[m]}} \ln\!\left(1 + \alpha_i^{[m]} \mu_i^{[m-1]}\right) - \frac{\mu_i^{[m-1]}}{1 + \alpha_i^{[m]} \mu_i^{[m-1]}} \right)}{\frac{\pi_i^{[m-1]}}{1 - \pi_i^{[m-1]}} \left(1 + \alpha_i^{[m]} \mu_i^{[m-1]}\right)^{\frac{1}{\alpha_i^{[m]}}} + 1}
- \sum_{i:\, y_i > 0} \left[ \frac{h_{\alpha,i}^{*[m]} \left(y_i - \mu_i^{[m-1]}\right)}{1 + \alpha_i^{[m]} \mu_i^{[m-1]}} + \frac{h_{\alpha,i}^{*[m]}}{\alpha_i^{[m]}} \left( \ln\!\left(1 + \alpha_i^{[m]} \mu_i^{[m-1]}\right) - \psi\!\left(y_i + \frac{1}{\alpha_i^{[m]}}\right) + \psi\!\left(\frac{1}{\alpha_i^{[m]}}\right) \right) \right],
\]

where \(\alpha_i^{[m]} = \exp\!\left(\eta_{\alpha,i}^{[m-1]} + \nu_{\alpha}^{*[m]} h_{\alpha,i}^{*[m]}\right)\).

A.3 Derivation of the first-order condition of \(\nu^*_\pi\)

In order to determine the optimal step length \(\nu^*_\pi\) in iteration \(m\), the following optimization problem has to be solved:

\[
\nu_{\pi}^{*[m]} = \arg\min_{\nu_{\pi}} \; -\sum_{i=1}^{n} \ell\!\left( \hat{\mu}_i^{[m-1]}, \hat{\alpha}_i^{[m-1]}, \operatorname{logit}^{-1}\!\left( \hat{\eta}_{\pi,i}^{[m-1]} + \nu_{\pi} \hat{h}_{\pi,i}^{*[m]} \right); y_i \right). \tag{A.3}
\]

For the first-order condition of (A.3) it holds that

\[
0 = -\sum_{i=1}^{n} \left( \frac{\partial \ell}{\partial \pi_i^{[m]}} \right) \left( \frac{\partial \pi_i^{[m]}}{\partial \eta_{\pi,i}^{[m]}} \right) \left( \frac{\partial \eta_{\pi,i}^{[m]}}{\partial \nu_{\pi}} \right).
\]

With \(\pi_i^{[m]} = \exp\!\big(\eta_{\pi,i}^{[m]}\big) \big/ \big(1 + \exp\!\big(\eta_{\pi,i}^{[m]}\big)\big)\) and \(\eta_{\pi,i}^{[m]} = \eta_{\pi,i}^{[m-1]} + \nu_{\pi} h_{\pi,i}^{*[m]}\), we obtain for the partial derivatives:

\[
\frac{\partial \ell}{\partial \pi_i^{[m]}} =
\begin{cases}
\dfrac{1 - \left(1 + \alpha_i^{[m-1]} \mu_i^{[m-1]}\right)^{-\frac{1}{\alpha_i^{[m-1]}}}}{\pi_i^{[m]} + \left(1 - \pi_i^{[m]}\right)\left(1 + \alpha_i^{[m-1]} \mu_i^{[m-1]}\right)^{-\frac{1}{\alpha_i^{[m-1]}}}} = \dfrac{1}{\dfrac{\left(1 + \alpha_i^{[m-1]} \mu_i^{[m-1]}\right)^{-\frac{1}{\alpha_i^{[m-1]}}}}{1 - \left(1 + \alpha_i^{[m-1]} \mu_i^{[m-1]}\right)^{-\frac{1}{\alpha_i^{[m-1]}}}} + \pi_i^{[m]}} & \text{if } y_i = 0, \\[3ex]
-\dfrac{1}{1 - \pi_i^{[m]}} & \text{if } y_i > 0,
\end{cases}
\]

\[
\frac{\partial \pi_i^{[m]}}{\partial \eta_{\pi,i}^{[m]}} = \pi_i^{[m]}\left(1 - \pi_i^{[m]}\right), \qquad
\frac{\partial \eta_{\pi,i}^{[m]}}{\partial \nu_{\pi}} = h_{\pi,i}^{*[m]}.
\]

The first-order condition of (A.3) thus is

\[
0 = \sum_{i:\, y_i = 0} \frac{\left(1 - \pi_i^{[m]}\right) h_{\pi,i}^{*[m]}}{\dfrac{\left(1 + \alpha_i^{[m-1]} \mu_i^{[m-1]}\right)^{-\frac{1}{\alpha_i^{[m-1]}}}}{\pi_i^{[m]} \left( 1 - \left(1 + \alpha_i^{[m-1]} \mu_i^{[m-1]}\right)^{-\frac{1}{\alpha_i^{[m-1]}}} \right)} + 1}
- \sum_{i:\, y_i > 0} \pi_i^{[m]} h_{\pi,i}^{*[m]},
\]

where \(\pi_i^{[m]} = \operatorname{logit}^{-1}\!\left(\eta_{\pi,i}^{[m-1]} + \nu_{\pi}^{*[m]} h_{\pi,i}^{*[m]}\right)\).

B Simulation

B.1 Gaussian setting

The effects \(f_\mu(z_i)\) and \(f_\sigma(z_i)\) are specified by:

(a) categorical effect

\[
f_{\mu}(z_i) = \begin{cases} -2 & \text{if } z_i = 1 \\ -1.5 & \text{if } z_i = 2 \\ 0 & \text{if } z_i = 3 \\ 1.5 & \text{if } z_i = 4 \\ 2 & \text{if } z_i = 5 \end{cases}
\qquad
f_{\sigma}(z_i) = \begin{cases} -0.4 & \text{if } z_i = 1 \\ -0.2 & \text{if } z_i = 2 \\ 0 & \text{if } z_i = 3 \\ 0.2 & \text{if } z_i = 4 \\ 0.4 & \text{if } z_i = 5, \end{cases}
\]
where \(Z_i \sim U\{1, \ldots, 5\}\)

(b) non-linear effect

\[
f_{\mu}(z_i) = 8 \sin\!\left(\tfrac{\pi}{2} + z_i\right) - 6.5, \qquad
f_{\sigma}(z_i) = \tfrac{1}{6}\,(2.85\, z_i)^3 - 2.85\, z_i,
\]
where \(Z_i \sim U(-1, 1)\)

(c) discrete spatial effect: We consider the neighborhood structure in Nigeria with 6 regions \(z_i \in \{1, \ldots, 6\}\). The continuous spatial effect \(f(s_1, s_2) = s_1 + s_2\) is averaged over each region and the resulting regional effects are centered around 0 and scaled to the chosen effect size. This results in a spatial effect of approximately

\[
f_{\mu}(z_i) = \begin{cases} -0.01 & \text{if } z_i = 1 \\ 3 & \text{if } z_i = 2 \\ 1.41 & \text{if } z_i = 3 \\ -1.03 & \text{if } z_i = 4 \\ -1.45 & \text{if } z_i = 5 \\ -1.91 & \text{if } z_i = 6 \end{cases}
\qquad
f_{\sigma}(z_i) = \begin{cases} -0.001 & \text{if } z_i = 1 \\ 0.3 & \text{if } z_i = 2 \\ 0.141 & \text{if } z_i = 3 \\ -0.103 & \text{if } z_i = 4 \\ -0.145 & \text{if } z_i = 5 \\ -0.191 & \text{if } z_i = 6, \end{cases}
\]
where \(Z_i \sim U\{1, \ldots, 6\}\).

B.2 ZINB setting

The effects \(f_\mu(z_i)\), \(f_\alpha(z_i)\) and \(f_\pi(z_i)\) are specified by:

(a) categorical effect

\[
f_{\mu}(z_i) = \begin{cases} 0.2 & \text{if } z_i = 1 \\ 0.1 & \text{if } z_i = 2 \\ 0 & \text{if } z_i = 3 \\ -0.1 & \text{if } z_i = 4 \\ -0.2 & \text{if } z_i = 5 \end{cases}
\qquad
f_{\alpha}(z_i) = \begin{cases} 0.25 & \text{if } z_i = 1 \\ 0.15 & \text{if } z_i = 2 \\ 0 & \text{if } z_i = 3 \\ -0.15 & \text{if } z_i = 4 \\ -0.25 & \text{if } z_i = 5 \end{cases}
\qquad
f_{\pi}(z_i) = \begin{cases} 0.8 & \text{if } z_i = 1 \\ 0.6 & \text{if } z_i = 2 \\ 0 & \text{if } z_i = 3 \\ -0.6 & \text{if } z_i = 4 \\ -0.8 & \text{if } z_i = 5, \end{cases}
\]
where \(Z_i \sim U\{1, \ldots, 5\}\)

(b) non-linear effect

\[
f_{\mu}(z_i) = -0.7\left(\ln(z_i + 1.15) - 0.9\, z_i\right) - 0.03, \qquad
f_{\alpha}(z_i) = 2.1 \sin\!\left(\tfrac{\pi}{2} + z_i\right) - 1.767, \qquad
f_{\pi}(z_i) = \tfrac{1}{3}\,(2 z_i)^3 - 2 z_i,
\]
where \(Z_i \sim U(-1, 1)\)

(c) discrete spatial effect: We consider the neighborhood structure in Nigeria with 6 regions \(z_i \in \{1, \ldots, 6\}\). The continuous spatial effect \(f(s_1, s_2) = s_1 + s_2\) is averaged over each region and the resulting regional effects are centered around 0 and scaled to the chosen effect size. This yields approximately

\[
f_{\mu}(z_i) = \begin{cases} 0.001 & \text{if } z_i = 1 \\ -0.3 & \text{if } z_i = 2 \\ -0.141 & \text{if } z_i = 3 \\ 0.103 & \text{if } z_i = 4 \\ 0.145 & \text{if } z_i = 5 \\ 0.191 & \text{if } z_i = 6 \end{cases}
\qquad
f_{\alpha}(z_i) = f_{\pi}(z_i) = \begin{cases} -0.001 & \text{if } z_i = 1 \\ 0.3 & \text{if } z_i = 2 \\ 0.141 & \text{if } z_i = 3 \\ -0.103 & \text{if } z_i = 4 \\ -0.145 & \text{if } z_i = 5 \\ -0.191 & \text{if } z_i = 6, \end{cases}
\]
where \(Z_i \sim U\{1, \ldots, 6\}\).

B.3 Further Results for the Gaussian Setting

Figure 11: Distribution of the coefficient estimates in the Gaussian simulation setting (3.1) with an informative spatial effect. The red horizontal lines represent the true coefficients

Table 5: Number of selections of the covariate effects in the Gaussian simulation setting (3.1) with informative spatial effect for 100 simulation runs. The informative effects are marked in bold

                 η_µ               η_σ
             fixed   opt       fixed   opt
x1             97    100         83     12
x2            100    100         86     14
x3             91     98        100    100
x4             99    100        100    100
x5              3      7        100    100
x6              2      5        100    100
x7              6     10         76     13
x8              3      4         71      4
x9              2      2         88     13
x10             4      6         84     10
z_inf         100    100        100    100
z_non-inf       –      –          –      –
Figure 12: Distribution of different predictive performance measures (columns) in the Gaussian simulation setting (3.1) with different additional effects (rows)

B.4 Further Results for the ZINB Setting

Figure 13: Distribution of the coefficient estimates in the ZINB simulation setting (3.2) with categorical effects (left) as well as an informative spatial effect (right). The red horizontal lines represent the true coefficients

Figure 14: Partial effects of the informative non-linear effect in the ZINB simulation setting (3.2). The red dashed lines represent the true partial effect

Table 6: Number of selections of the covariate effects in the ZINB simulation setting (3.2) with additional informative (columns 1-6) and non-informative spatial effect (columns 7-12) for 100 simulation runs. The informative effects are marked in bold

                    categorical                             non-linear
         η_µ         η_α         η_π            η_µ         η_α         η_π
      fixed  opt  fixed  opt  fixed  opt     fixed  opt  fixed  opt  fixed  opt
x1     100  100     12   13      5    5       100  100      5   10      4    3
x2     100  100     87   95     46   45       100  100     87   93     36   33
x3     100  100     92   92    100  100       100  100     93   96    100  100
x4     100  100     88   95    100  100       100  100     84   96    100  100
x5      45   19    100  100    100  100        52   11     98   99    100  100
x6      45   12      2    3    100  100        33    9      2    5    100  100
x7      51   15      1    7      4    3        25    3      2    2      1    1
x8      56   16      3    8      2    3        25    1      0    3      1    0
x9      54   16      4    6      4    5        29    4      0    3      1    1
x10     45   10      3    8      4    4        35    6      0    6      3    3
z1     100  100     82   96    100  100         –    –      –    –      –    –
z2       –    –      –    –      –    –        64   11      0    6      1    2

Figure 15: Distribution of different predictive performance measures (columns) in the ZINB simulation setting (3.2) with different additional effects (rows). For better visibility, some outliers of the maximum likelihood method are excluded

C Application

C.1 Information on the Data

Table 7: Overview of the variables and their summary statistics for the antenatal care data from Nigeria

variable      short description                 mean/frequency   standard dev.   min   max
antenatal     number of antenatal care visits   5.25             6.11            0     30
age           age in years                      28.46            7.08            13    50
married_yrs   marriage duration in years        11.63            7.51            0     39
education     0 = no education                  47.1%
              1 = primary school education      20.4%
              2 = higher education              32.5%
wealth        0 = poorest                       22.4%
              1 = poor                          23.1%
              2 = middle                        19.8%
              3 = richer                        18.4%
              4 = richest                       16.2%
work          1 = employed                      69.4%
              0 = unemployed                    30.6%
television    1 = access to television          45.8%
              0 = no access                     54.2%
radio         1 = access to radio               59.1%
              0 = no access                     40.9%
newspaper     1 = access to newspaper           14.3%
              0 = no access                     85.7%
urban         1 = urban area of residence       33.1%
              0 = rural area of residence       66.9%
zone          221 = North Central               15.5%
              222 = North East                  20.1%
              223 = North West                  32.0%
              224 = South East                  8.1%
              225 = South South                 11.1%
              226 = South West                  13.2%

C.2 Further Results

Table 8: Selected coefficients for the antenatal care data from Nigeria using different step length approaches. The shrunk optimal approach is referred to by "opt"

                           η_µ                η_α                η_π
                      fixed    opt       fixed    opt       fixed    opt
urban                 0.016   0.000     -0.192  -0.204     -0.619  -0.641
newsp                 0.020   0.025     -0.266  -0.258     -0.143  -0.140
radio                 0.048   0.034     -0.051  -0.073     -0.175  -0.170
telv                  0.047   0.058     -0.033  -0.041     -0.472  -0.482
work                  0.000   0.000      0.000   0.000     -0.218  -0.208
education0           -0.033  -0.027      0.000   0.000      0.603   0.594
education2            0.019   0.020      0.000   0.000     -0.457  -0.450
wealth0              -0.033  -0.028      0.013   0.033      0.519   0.497
wealth1              -0.041  -0.037     -0.007  -0.002      0.408   0.382
wealth3               0.028   0.025     -0.034  -0.058     -0.298  -0.289
wealth4               0.033   0.031     -0.053  -0.089     -0.517  -0.479
urban:radio          -0.015   0.000      0.000   0.000      0.000   0.000
urban:telv           -0.007   0.000      0.000   0.000      0.000   0.000
urban:work            0.006   0.006      0.000   0.000      0.000   0.000
urban:wealth0        -0.096  -0.048      0.000   0.000      0.039   0.038
urban:wealth1        -0.043  -0.025      0.000   0.000      0.000   0.000
urban:wealth3        -0.009   0.000      0.000   0.000      0.000   0.000
urban:wealth4         0.022   0.023      0.000   0.000     -0.117  -0.051
urban:education0     -0.005   0.000      0.000   0.000      0.000   0.000
urban:education2      0.042   0.046      0.000   0.000      0.000   0.000
newsp:work            0.021   0.010      0.000   0.000      0.000   0.000
newsp:wealth0         0.092   0.000      0.000   0.000      0.000   0.000
newsp:wealth1         0.013   0.000      0.000   0.000      0.000   0.000
newsp:education0     -0.105   0.000      0.000   0.000      0.000   0.000
radio:wealth0         0.000   0.000      0.000   0.000      0.023   0.015
radio:wealth3         0.004   0.000      0.000   0.000      0.000   0.000
radio:wealth4         0.000   0.000     -0.023  -0.034     -0.067  -0.093
radio:education2      0.000   0.016      0.000   0.000      0.000   0.000
telv:wealth1         -0.022  -0.013      0.000   0.000      0.000   0.000
telv:wealth4          0.031   0.022      0.000   0.000      0.000   0.000
telv:education0      -0.005   0.000      0.000   0.000      0.000   0.000
telv:education2       0.040   0.025      0.000   0.000      0.000   0.000
work:wealth0         -0.028   0.000      0.000   0.000      0.000   0.000
work:wealth1         -0.018  -0.004      0.000   0.000      0.000   0.000
work:wealth3          0.021   0.013      0.000   0.000      0.000   0.000
work:wealth4          0.000   0.008      0.000   0.000      0.000   0.000
wealth0:education0    0.000   0.000      0.000   0.000      0.563   0.552
wealth0:education2   -0.038   0.000      0.000   0.000      0.000   0.000
wealth1:education0    0.000   0.000      0.000   0.000      0.115   0.118
wealth1:education2    0.000   0.000      0.000   0.031      0.000   0.000
wealth4:education0    0.000   0.000      0.000  -0.089      0.000   0.000
zone221              -0.052  -0.047      0.045   0.084     -0.131  -0.133
zone222              -0.356  -0.372     -0.583  -0.679     -0.286  -0.284
zone223              -0.381  -0.386     -0.580  -0.686      0.662   0.661
zone224               0.157   0.147      0.291   0.330     -0.581  -0.543
zone225               0.085   0.116      0.499   0.567      0.772   0.722
zone226               0.548   0.542      0.327   0.384     -0.436  -0.423

Figure 16: Distribution of different predictive performance measures applied on each observation of the validation set of the antenatal care data from Nigeria (columns) using different step length approaches. For better visibility, a few outliers are excluded

C.3 Comparison of Boosting Approaches with Direct Maximum Likelihood Estimation without Interactions

Table 9: Coefficients for the antenatal care data from Nigeria without interactions using different step length approaches. "ml" refers to the direct maximum likelihood estimation via the gamlss package and the shrunk optimal approach is abbreviated by "opt"

                       η_µ                         η_α                         η_π
               ml    fixed    opt         ml    fixed    opt         ml    fixed    opt
urban       -0.005   0.029   0.032     -0.093  -0.188  -0.202     -0.667  -0.611  -0.630
newsp        0.020   0.039   0.034     -0.280  -0.268  -0.256     -0.461  -0.115  -0.119
radio        0.052   0.033   0.035     -0.071  -0.045  -0.058     -0.249  -0.176  -0.179
telv         0.039   0.076   0.066      0.014  -0.025  -0.045     -0.263  -0.444  -0.441
work         0.017   0.000   0.000      0.081   0.000   0.000     -0.325  -0.225  -0.230
education0  -0.054  -0.047  -0.044     -0.249   0.000   0.000      0.878   0.741   0.741
education2   0.068   0.058   0.057      0.072   0.000   0.000     -0.325  -0.503  -0.503
wealth0     -0.087  -0.048  -0.037      0.176   0.018   0.028      1.151   0.972   0.971
wealth1     -0.080  -0.060  -0.052     -0.041  -0.005  -0.002      0.620   0.445   0.443
wealth3      0.064   0.033   0.023     -0.202  -0.040  -0.050     -0.305  -0.368  -0.368
wealth4      0.117   0.079   0.070     -0.293  -0.071  -0.089     -1.453  -0.668  -0.659
zone221     -0.055  -0.052  -0.047      0.077   0.048   0.074     -0.105  -0.148  -0.149
zone222     -0.317  -0.352  -0.371     -0.563  -0.598  -0.655     -0.435  -0.301  -0.312
zone223     -0.346  -0.375  -0.384     -0.557  -0.595  -0.656      0.403   0.658   0.650
zone224      0.145   0.147   0.141      0.233   0.307   0.325     -0.333  -0.622  -0.618
zone225      0.047   0.079   0.115      0.490   0.512   0.555      0.656   0.837   0.852
zone226      0.527   0.552   0.547      0.319   0.326   0.357     -0.185  -0.425  -0.423

Figure 17: Partial effects of the mother's age for the antenatal care data from Nigeria modeled without interactions using different step length approaches (columns). "ml" refers to the direct maximum likelihood estimation via the gamlss package

Figure 18: Partial effects of the marriage duration for the antenatal care data from Nigeria modeled without interactions using different step length approaches (columns). "ml" refers to the direct maximum likelihood estimation via the gamlss package

Figure 19: Spatial effects for the antenatal care data from Nigeria modeled without interactions using different step length approaches (rows). "ml" refers to the direct maximum likelihood estimation via the gamlss package
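Several of the performance comparisons above report the CRPS for count data, which for any integer-valued predictive distribution can be computed directly from its CDF. A minimal Python sketch of this standard formula (illustrative only; this is not the evaluation code used for the paper, and the parameter values below are made up):

```python
from scipy.stats import nbinom

def crps_discrete(cdf, y, upper=500):
    """CRPS for an integer-valued predictive distribution with CDF F,
    evaluated at the observed count y:
        CRPS(F, y) = sum_k ( F(k) - 1{y <= k} )^2,
    with the infinite sum truncated at `upper`."""
    return sum((cdf(k) - (1.0 if y <= k else 0.0)) ** 2 for k in range(upper + 1))

# Hypothetical ZINB predictive distribution with mean mu, overdispersion
# alpha and zero-inflation probability pi (size r = 1/alpha,
# success probability p = 1/(1 + alpha*mu) for the NB part):
mu, alpha, pi = 2.0, 0.5, 0.3
zinb_cdf = lambda k: pi + (1 - pi) * nbinom.cdf(k, 1 / alpha, 1 / (1 + alpha * mu))
score = crps_discrete(zinb_cdf, 0)  # non-negative; 0 only for a perfect point forecast
```

A predictive distribution that puts all mass exactly on the observed count attains a CRPS of zero, which makes the score easy to sanity-check.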
