Least Squares Approximation for a Distributed System

Xuening Zhu¹, Feng Li², and Hansheng Wang³

¹School of Data Science, Fudan University, Shanghai, China.
²School of Statistics and Mathematics, Central University of Finance and Economics, Beijing, China.
³Guanghua School of Management, Peking University, Beijing, China.

Abstract

In this work, we develop a distributed least squares approximation (DLSA) method that is able to solve a large family of regression problems (e.g., linear regression, logistic regression, and Cox's model) on a distributed system. By approximating the local objective function using a local quadratic form, we are able to obtain a combined estimator by taking a weighted average of local estimators. The resulting estimator is proved to be statistically as efficient as the global estimator. Moreover, it requires only one round of communication. We further conduct shrinkage estimation based on the DLSA estimation using an adaptive Lasso approach. The solution can be easily obtained by using the LARS algorithm on the master node. It is theoretically shown that the resulting estimator possesses the oracle property and is selection consistent by using a newly designed distributed Bayesian information criterion (DBIC). The finite sample performance and computational efficiency are further illustrated by an extensive numerical study and an airline dataset. The airline dataset is 52 GB in size. The entire methodology has been implemented in Python for a de facto standard Spark system. The proposed DLSA algorithm on the Spark system takes 26 minutes to obtain a logistic regression estimator, which is more efficient and memory friendly than conventional methods.

KEYWORDS: Distributed System; Least Squares Approximation; Shrinkage Estimation; Distributed BIC.

Feng Li is the corresponding author.
Xuening Zhu is supported by the National Natural Science Foundation of China (nos. 11901105, 71991472, U1811461), the Shanghai Sailing Program for Youth Science and Technology Excellence (19YF1402700), and funds from Fudan University. Feng Li's research is supported by the National Natural Science Foundation of China (no. 11501587). Hansheng Wang's research is partially supported by the National Natural Science Foundation of China (no. 11831008).

1 Introduction

Modern data analysis often needs to address huge datasets. In many cases, the size of the dataset could be too large to be conveniently handled by a single computer. Consequently, the dataset must be stored and processed on many connected computer nodes, which thereafter are referred to as a distributed system. More precisely, a distributed system refers to a large cluster of computers, which are typically connected with each other via wire protocols such as RPC and HTTP (Zaharia et al., 2012). Consequently, they are able to communicate with each other and accomplish the intended data analysis tasks at huge scales in a collective manner. By using a distributed system, we are able to break a large-scale computation problem into many small pieces and then solve them in a distributed manner. A key challenge faced by statistical computation on a distributed system is the communication cost, which refers to the wall-clock time needed for data communication between different computer nodes and which could be expensive in distributed systems (Zhang et al., 2013; Shamir et al., 2014; Jordan et al., 2019). In this work, we consider a "master-and-worker"-type distributed system with strong workers. We assume that the workers are strong in the sense that they are modern computers with reasonable storage and computing capacities.
For example, a worker with 32 CPU cores, 128 GB RAM, and a 512 GB SSD hard disk could be a very strong worker. As one will see later, the most widely used distributed environments, Hadoop (Apache Software Foundation, 2019a, version 2.7.2) and Spark (Apache Software Foundation, 2019b, version 2.3.1), belong to this category. Typically, the workers do not communicate with each other directly. However, they should be connected to a common master node, which is another computer with outstanding capacities. Consequently, most data should be distributed to the workers, and most computations should be conducted by the workers. This enables us to solve a large-scale computation problem in a distributed manner. In contrast, the master should take the responsibility to coordinate with the different workers. For this "master-and-worker"-type distributed system, the communication cost is mostly between the master and the workers. One can easily verify that good algorithms for some simple moment estimates (e.g., the sample mean) can be easily developed using this type of distributed system. For example, to compute the sample mean on a distributed system, one can first compute the sample mean on each worker, which is known as a map process. Then, each worker reports to the master the resulting sample mean and the associated sample size. Thereafter, the master can compute the overall sample mean by a weighted average of the sample means from the workers, which is known as a reduce process. Such a "MapReduce" algorithm requires only one "master-and-worker" communication for each worker. It requires no direct communication between workers. Because most computations are accomplished by the workers, it also makes good use of the strong worker capacities. As a result, the algorithm can be considered effective. Unfortunately, cases such as the sample mean are rather rare in statistical analysis.
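The map and reduce steps just described can be sketched in a few lines of Python; the function and variable names below are ours and purely illustrative:

```python
# Illustrative sketch of the "MapReduce" sample-mean example: each worker
# reports its local mean and sample size (the map step), and the master
# combines the reports by a weighted average (the reduce step).
import numpy as np

def worker_map(local_data):
    # Each worker computes and reports (local mean, local sample size).
    local_data = np.asarray(local_data, dtype=float)
    return local_data.mean(), local_data.size

def master_reduce(reports):
    # The master forms the overall mean as a weighted average of the
    # worker means, weighted by the local sample sizes.
    total_n = sum(n for _, n in reports)
    return sum(mean * n for mean, n in reports) / total_n

# One round of master-and-worker communication; no worker-to-worker traffic.
chunks = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
reports = [worker_map(c) for c in chunks]
overall = master_reduce(reports)  # equals np.mean([1, 2, 3, 4, 5, 6]) = 3.5
```

The same pattern, local summaries followed by a weighted combination on the master, is what the DLSA method exploits for general regression estimators.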
Most statistical algorithms do not have an analytical solution (e.g., the maximum likelihood estimation of a logistic regression model) and thus require multiple iterations (e.g., Newton-Raphson iterations or stochastic-gradient-descent-type algorithms). These iterations unfortunately lead to substantial "master-and-worker" communication, which is expensive. Therefore, developing algorithms for distributed systems that are highly efficient computationally, communicationally, and statistically has become a problem of great interest. In the literature, the common wisdom for addressing a distributed statistical problem can be classified into two categories. The first category is the "one-shot" (OS) or "embarrassingly parallel" approach, which requires only one round of communication. Specifically, the local workers compute their estimators in parallel and then communicate them to the master to obtain an averaged global estimator (Zhang et al., 2013; Liu and Ihler, 2014; Lee et al., 2015; Battey et al., 2015; Fan et al., 2017; Chang et al., 2017b,a). Although this approach is highly efficient in terms of communication, it might not achieve the best statistical estimation efficiency on most occasions (Shamir et al., 2014; Jordan et al., 2019). The second category includes iterative algorithms, which require multiple rounds of communication between the master and the workers. This approach typically requires multiple iterations so that the estimation efficiency can be refined to match the global (or centralized) estimator (Shamir et al., 2014; Wang et al., 2017a,b; Jordan et al., 2019). In addition, see Yang et al. (2016); Heinze et al. (2016); Smith et al. (2018); Li et al. (2019) for distributed statistical modelling methods when the data are distributed according to features rather than samples.
The aforementioned two approaches have also been studied for the sparse learning problem using $\ell_1$ shrinkage estimation. For the first approach, Lee et al. (2015) investigated distributed high-dimensional sparse regression using the OS approach by combining local debiased $\ell_1$ estimates. Battey et al. (2015) revisited the same problem but further considered distributed testing and estimation methods in a unified likelihood framework, in which a refitted estimation is used to obtain an oracle convergence rate. For the second approach, both Wang et al. (2017a) and Jordan et al. (2019) have developed iterative algorithms to solve the sparse estimation problem, and they theoretically proved that the error bounds match the centralized estimator. Beyond $\ell_1$ shrinkage estimation, Chen and Xie (2014) studied a penalized likelihood estimator with more general penalty function forms in a high-dimensional setting. However, to the best of our knowledge, there are no guarantees that simultaneously ensure model selection consistency (Fan and Li, 2001) and establish a criterion for consistent tuning parameter selection (Wang et al., 2007). In addition, all of the above methods assume independent and identically distributed samples stored by each worker, which is questionable in practice because the distributed dataset might exhibit great heterogeneity from worker to worker. We would like to remark that this heterogeneity cannot be avoided because it is mainly due to the practical need to record data across time or space (for example). Ignoring the heterogeneity will lead to suboptimal or even biased estimation results. In this work, we aim to develop a novel methodology to address a sparse estimation problem with low dimensions ($p < n$, where $n$ is the local sample size). Under the high-dimensional setting, we refer to the recent work of Li et al.
(2020) for a distributed pre-feature screening procedure, which consumes a communication cost of $O(p)$. Hence, the feature dimension can be greatly reduced before we launch our estimation procedure. Specifically, we assume the data possessed by different workers are allowed to be heterogeneous but share the same regression relationship. The proposed method borrows the idea of the least squares approximation (LSA, Wang and Leng, 2007) and can be used to handle a large class of parametric regression models on a distributed system. Specifically, let $Y \in \mathbb{R}$ be the response of interest, let $X$ be the associated predictor, and let $\theta \in \mathbb{R}^p$ be the corresponding regression coefficient. The objective is to estimate the regression parameter $\theta$ and conduct variable selection on a distributed system that has one master and many strong workers. Assume the data, denoted by $(Y_i, X_i)$ with $1 \le i \le N$, are distributed across different workers. Further, assume that the sample size on each worker is sufficiently large and of the same order. Under this setting, we propose a distributed LSA (DLSA) method. The key idea is as follows:

(1) First, we estimate the parameter $\theta$ on each worker separately by using the local data on the distributed workers. This can be done efficiently by using standard statistical estimation methods (e.g., maximum likelihood estimation). Assuming that the sample size of each worker is sufficiently large, the resulting estimator and its asymptotic covariance estimate should be consistent but not statistically efficient, as compared with the global estimates.

(2) Next, each worker passes the local estimator of $\theta$ and its asymptotic covariance estimate to the master. Because we do not consider a high-dimensional model setting, the communication cost in this regard should be negligible.
(3) Once the master receives all the local estimators from the workers, a weighted least squares-type objective function can be constructed. This can be viewed as a local quadratic approximation of the global log-likelihood function. As one can expect, the resulting estimator shares the same asymptotic covariance as the full-sample MLE (i.e., the global estimator) under appropriate regularity conditions.

The major steps of the DLSA method are further illustrated in Figure 1.

Figure 1: Illustration of the DLSA method.

As one can see, the DLSA method reduces communication costs mainly by using only one round of communication and avoiding further iterative steps. Given the DLSA objective function on the master node, we can further conduct shrinkage estimation on the master. This is done by formulating an adaptive Lasso-type objective function (Zou, 2006; Zhang and Lu, 2007). The objective function can be easily solved by the LARS algorithm (Efron et al., 2004) with minimal computation cost on the master. Thus, no communication is required. Accordingly, a solution path can be obtained on the master node. Thereafter, the best estimator can be selected from the solution path in conjunction with the proposed distributed Bayesian information criterion (DBIC). We theoretically show that the resulting estimator is selection consistent and as efficient as the oracle estimator, which is the global estimator obtained under the true model.

To summarize, we aim to make the following important contributions to the existing literature. First, we propose a master with strong workers (MSW) distributed system framework, which solves a large-scale computation problem in a communication-efficient way. Second, given this MSW system, we propose a novel DLSA method, which easily handles a large class of classical regression models such as linear regression, generalized linear regression, and Cox's model.
Third, due to the simple quadratic form of the objective function, the analytical solution path can be readily obtained using the LARS algorithm on the master. Then, the best model can be easily selected by the DBIC criterion. Finally, but also most importantly, the proposed DLSA method fully takes advantage of the specialty of the MSW system, which pushes the intensive computation to the workers and is therefore as computationally, communicationally, and statistically efficient as possible. Furthermore, we would like to remark that although the proposed DLSA is designed for a distributed system, it can also be applied to a single computer when there are memory constraints (Chen et al., 2018; Wang et al., 2018).

The remainder of this article is organized as follows. Section 2 introduces the model setting and the least squares approximation method. Section 3 presents a communication-efficient shrinkage estimation and a distributed BIC criterion. Numerical studies are given in Section 4. An application to U.S. Airline data with datasets greater than 52 GB is illustrated using the DLSA method on the Spark system in Section 5. The article concludes with a brief discussion in Section 6. All technical details are delegated to the Appendix.

2 Statistical modelling on distributed systems

2.1 Model and Notations

Suppose that in the distributed system there are in total $N$ observations, which are indexed as $i = 1, \cdots, N$. The $i$-th observation is denoted as $Z_i = (X_i^\top, Y_i)^\top \in \mathbb{R}^{p+1}$, where $Y_i \in \mathbb{R}$ is the response of interest and $X_i \in \mathbb{R}^p$ is the corresponding covariate vector. Specifically, the observations are distributed across $K$ local workers. Define $\mathcal{S} = \{1, \cdots, N\}$ to be the full set of sample indices. Decompose $\mathcal{S} = \cup_{k=1}^K \mathcal{S}_k$, where $\mathcal{S}_k$ collects the observations distributed to the $k$-th worker. Obviously, we should have $\mathcal{S}_{k_1} \cap \mathcal{S}_{k_2} = \emptyset$ for any $k_1 \ne k_2$.
Define $n = N/K$ as the average sample size for each worker. Then, we assume $|\mathcal{S}_k| = n_k$ and that all $n_k$ diverge at the same order $O(n)$; specifically, $c_1 \le \min_k n_k/n \le \max_k n_k/n \le c_2$ for some positive constants $c_1$ and $c_2$. We know immediately that $N = \sum_k n_k$. In practice, due to the data storage strategy, the data on different workers could be quite heterogeneous; e.g., they might be collected according to spatial regions. Despite this heterogeneity, we assume the workers share the same regression relationship, and the parameter of interest is given by $\theta_0 \in \mathbb{R}^p$. We focus on the case in which $p$ is fixed.

Let $\mathcal{L}(\theta; Z)$ be a plausible twice-differentiable loss function. Define the global loss function as $\mathcal{L}(\theta) = N^{-1}\sum_{i=1}^N \mathcal{L}(\theta; Z_i)$, whose global minimizer is $\hat\theta = \arg\min_\theta \mathcal{L}(\theta)$, and the true value is $\theta_0$. It is assumed that $\hat\theta$ admits the following asymptotic rule: $\sqrt{N}(\hat\theta - \theta_0) \to_d N(0, \Sigma)$ for some positive definite matrix $\Sigma \in \mathbb{R}^{p\times p}$ as $N \to \infty$. If $\mathcal{L}(\theta; Z)$ is the negative log-likelihood function, then $\hat\theta$ is the global MLE. Correspondingly, define the local loss function on the $k$-th worker as $\mathcal{L}_k(\theta) = n_k^{-1}\sum_{i\in\mathcal{S}_k} \mathcal{L}(\theta; Z_i)$, whose minimizer is $\hat\theta_k = \arg\min_\theta \mathcal{L}_k(\theta)$. We assume that $\sqrt{n_k}(\hat\theta_k - \theta_0) \to_d N(0, \Sigma_k)$ as $n_k \to \infty$ for a positive definite matrix $\Sigma_k$. The goal is to conduct statistical analysis based on the data on the local workers while minimizing the communication cost as much as possible.

2.2 Least Squares Approximation and Variance Optimality

In this section, we motivate our approach through a least squares approximation to the global loss function, which takes a local quadratic form.
To motivate this idea, we begin by decomposing and approximating the global loss function using Taylor expansion techniques as follows:

$$\mathcal{L}(\theta) = N^{-1}\sum_{k=1}^K \sum_{i\in\mathcal{S}_k} \mathcal{L}(\theta; Z_i) = N^{-1}\sum_{k=1}^K \sum_{i\in\mathcal{S}_k} \big\{\mathcal{L}(\theta; Z_i) - \mathcal{L}(\hat\theta_k; Z_i)\big\} + C_1 \approx N^{-1}\sum_{k=1}^K \sum_{i\in\mathcal{S}_k} (\theta - \hat\theta_k)^\top \ddot{\mathcal{L}}(\hat\theta_k; Z_i)(\theta - \hat\theta_k) + C_2, \quad (2.1)$$

where the last equation uses the fact that $\dot{\mathcal{L}}_k(\hat\theta_k) = 0$, and $C_1$ and $C_2$ are constants. Typically, the minimizer $\hat\theta_k$ will achieve the convergence rate $\sqrt{n_k}$. Intuitively, the quadratic form in (2.1) should be a good local approximation of the global loss function (Wang and Leng, 2007). This inspires us to consider the following weighted least squares objective function:

$$\tilde{\mathcal{L}}(\theta) = N^{-1}\sum_k (\theta - \hat\theta_k)^\top \Big\{\sum_{i\in\mathcal{S}_k} \ddot{\mathcal{L}}(\hat\theta_k; Z_i)\Big\}(\theta - \hat\theta_k) \stackrel{\mathrm{def}}{=} \sum_k (\theta - \hat\theta_k)^\top \alpha_k \hat\Sigma_k^{-1}(\theta - \hat\theta_k),$$

where $\alpha_k = n_k/N$. This leads to a weighted least squares estimator (WLSE), which takes the following analytical form:

$$\tilde\theta = \arg\min_\theta \tilde{\mathcal{L}}(\theta) = \Big(\sum_k \alpha_k \hat\Sigma_k^{-1}\Big)^{-1} \sum_k \alpha_k \hat\Sigma_k^{-1} \hat\theta_k. \quad (2.2)$$

It is remarkable that the estimator $\tilde\theta$ in (2.2) can be easily computed on a distributed system. Specifically, each local worker sends $\hat\theta_k$ and $\hat\Sigma_k$ to the master node, and then the master node produces the WLSE by (2.2). As a result, the above WLSE requires only one round of communication. Hence, it is highly efficient in terms of communication. Note that instead of taking a simple average of the local estimators $\hat\theta_k$ as in the literature, the analytical solution in (2.2) takes a weighted average of the $\hat\theta_k$ using the weights $\hat\Sigma_k^{-1}$. This results in higher statistical efficiency when the data are stored heterogeneously. To investigate the asymptotic properties of the WLSE, we assume the following conditions.

(C1) (Parameter Space) The parameter space $\Theta$ is a compact and convex subset of $\mathbb{R}^p$.
In addition, the true value $\theta_0$ lies in the interior of $\Theta$.

(C2) (Covariate Distribution) Assume the covariates $X_i$ ($i \in \mathcal{S}_k$) from the $k$-th worker are independently and identically distributed with distribution $F_k(x)$.

(C3) (Identifiability) For any $\delta > 0$, there exists $\epsilon > 0$ such that

$$\liminf_{n\to\infty} P\Big\{\inf_{\|\theta^* - \theta_0\| \ge \delta,\, 1\le k\le K} \mathcal{L}_k(\theta^*) - \mathcal{L}_k(\theta_0) \ge \epsilon\Big\} = 1, \quad\text{and}\quad E\Big\{\frac{\partial \mathcal{L}_k(\theta)}{\partial\theta}\Big|_{\theta=\theta_0}\Big\} = 0.$$

(C4) (Local Convexity) Define

$$\Omega_k(\theta) = E\Big\{\frac{\partial \mathcal{L}(\theta; Z_i)}{\partial\theta}\frac{\partial \mathcal{L}(\theta; Z_i)}{\partial\theta^\top}\,\Big|\, i\in\mathcal{S}_k\Big\} = E\Big\{\frac{\partial^2 \mathcal{L}(\theta; Z_i)}{\partial\theta\,\partial\theta^\top}\,\Big|\, i\in\mathcal{S}_k\Big\}.$$

Assume that $\Omega_k(\theta)$ is nonsingular at the true value $\theta_0$. In addition, let $\Sigma_k = \{\Omega_k(\theta_0)\}^{-1}$ and $\Sigma = \{\sum_k \alpha_k \Omega_k(\theta_0)\}^{-1}$.

(C5) (Smoothness) Define $\mathcal{B}(\delta) = \{\theta^* \in \Theta : \|\theta^* - \theta_0\| \le \delta\}$ as a ball around the true value $\theta_0$ with radius $\delta > 0$. Assume for almost all $Z \in \mathbb{R}^{p+1}$ that the loss function $\mathcal{L}(\theta; Z)$ admits all third derivatives $\partial^3 \mathcal{L}(\theta; Z)/(\partial\theta_i\partial\theta_j\partial\theta_l)$ for all $\theta \in \mathcal{B}(\delta)$. In addition, assume that there exist functions $M_{ijl}(Z)$ and $\delta > 0$ such that

$$\Big|\frac{\partial^3}{\partial\theta_i\partial\theta_j\partial\theta_l}\mathcal{L}(\theta^*; Z)\Big| \le M_{ijl}(Z) \quad\text{for all } \theta^* \in \mathcal{B}(\delta), \quad (2.3)$$

where $E\{M_{ijl}(Z_m)\,|\, m\in\mathcal{S}_k\} < \infty$ for all $1\le i,j,l\le p$ and $1\le k\le K$.

The above conditions are standard for establishing the asymptotic properties of M-estimators. First, Condition (C1) assumes the parameter space to be convex (Jordan et al., 2019). Next, Condition (C2) concerns the distribution of the covariates $\{X_i : i\in\mathcal{S}_k\}$. Specifically, the distributions $F_k(x)$ may differ across $1\le k\le K$; in particular, this allows for heterogeneous covariate distributions across workers. We would like to remark that heterogeneity is a common phenomenon in distributed systems, and it has been ignored in much of the literature. Condition (C3) ensures the identifiability of the local loss functions across all workers.
Finally, Conditions (C4) and (C5) are standard regularity conditions on the loss functions, which require certain degrees of local convexity and smoothness. These conditions are widely assumed in the literature to guarantee the asymptotic convergence of the estimators (Fan and Li, 2001; Lehmann and Casella, 2006; Jordan et al., 2019). Given these conditions, we can establish the asymptotic properties of the WLSE in the following Proposition 1 and Theorem 1.

Proposition 1. Assume Conditions (C1)–(C5). Then, we have

$$\sqrt{N}(\tilde\theta - \theta_0) = V(\theta_0) + B(\theta_0) \quad (2.4)$$

with $\mathrm{cov}\{V(\theta_0)\} = \Sigma$ and $B(\theta_0) = O_p(K/\sqrt{N})$, where $\Sigma = (\sum_{k=1}^K \alpha_k \Sigma_k^{-1})^{-1}$.

The proof of Proposition 1 is given in Appendix A.1. Proposition 1 separates $\sqrt{N}(\tilde\theta - \theta_0)$ into two parts, namely, the variance part and the bias part. In particular, one should note that the variance order is the same as that of the global estimator $\hat\theta$, which is $O(N^{-1})$, while the bias order is related to the number of local workers $K$. In addition, the covariance of $V(\theta_0)$ is exactly the same as that of the global estimator (i.e., $\Sigma$). Consequently, if the local sample size is sufficiently large, the bias will be sufficiently small, and thus the estimation efficiency will be the same as that of the global estimator. We state this result in the following theorem.

Theorem 1. (Global Asymptotic Normality) Assume Conditions (C1)–(C5) and further assume $n/N^{1/2} \to \infty$. Then, we have $\sqrt{N}(\tilde\theta - \theta_0) \to_d N(0, \Sigma)$, which achieves the same asymptotic normality as the global estimator $\hat\theta$.

The proof of Theorem 1 is given in Appendix A.2. It can be concluded that we require the local sample size to be of order larger than $\sqrt{N}$, which is easy to satisfy in practice. Otherwise, we would have $N/n^2 = K/n \to \infty$, which implies that the number of workers is even larger than the average local sample size $n$.
This is obviously not the case in practice if we have limited computational resources.

2.3 Two-Step Estimation

In the previous analysis, to guarantee the $\sqrt{N}$-consistency of the WLSE, we require that $n_k/\sqrt{N} \to \infty$. This assumption might not be suitable if we have sufficient computational resources (e.g., a large number of workers), in which case the condition $K/n \to 0$ could be violated if $K$ is much larger than $n$. To allow for the case with large $K$ and small $n$, we further propose a two-step estimator to refine $\tilde\theta$. The details are as follows. In the first step, we obtain the WLSE $\tilde\theta$ using one round of communication. Next, the WLSE is broadcast from the master node to the workers. In the second step, on the $k$-th worker, we perform a one-step iteration using $\tilde\theta$ as the initial value to obtain

$$\hat\theta_k^{(2)} = \tilde\theta - \Big\{\frac{\partial^2 \mathcal{L}_k(\theta)}{\partial\theta\,\partial\theta^\top}\Big|_{\theta=\tilde\theta}\Big\}^{-1} \frac{\partial \mathcal{L}_k(\theta)}{\partial\theta}\Big|_{\theta=\tilde\theta}. \quad (2.5)$$

Subsequently, the local estimator $\hat\theta_k^{(2)}$ as well as $\hat\Sigma_k^{(2)} = \{\partial^2 \mathcal{L}_k(\tilde\theta)/\partial\theta\,\partial\theta^\top\}^{-1}$ are transmitted from the worker to the master. Similar to the design of the WLSE, we obtain a two-step WLSE (TWLSE) on the master as

$$\tilde\theta^{(2)} = \Big\{\sum_k \alpha_k \big(\hat\Sigma_k^{(2)}\big)^{-1}\Big\}^{-1} \sum_k \alpha_k \big(\hat\Sigma_k^{(2)}\big)^{-1} \hat\theta_k^{(2)}. \quad (2.6)$$

As we show in the theoretical analysis, the TWLSE enjoys a smaller bias rate than the WLSE. This is mainly because it takes a global-type estimator as the initial value and performs a further one-step iteration. Hence, it is able to borrow the power of the global estimator and further reduce the bias from $O(n^{-1})$ to $O(n^{-2})$. We state the result rigorously in the following theorem.

Theorem 2. Assume Conditions (C1)–(C5). Then we have $\sqrt{N}(\tilde\theta^{(2)} - \theta_0) = V_2(\theta_0) + B_2(\theta_0)$ with $\mathrm{cov}\{V_2(\theta_0)\} = \Sigma$ and $B_2(\theta_0) = O_p(n^{-2}N^{1/2})$. In addition, we have $\sqrt{N}(\tilde\theta^{(2)} - \theta_0) \to_d N(0, \Sigma)$ if we assume $n/N^{1/4} \to \infty$.

The proof of Theorem 2 is given in Appendix A.3.
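To make the estimators concrete, the sketch below implements the master-side combination step of (2.2) and (2.6) and the worker-side one-step refinement (2.5) in plain NumPy. The function and variable names are ours, and `grad_k` / `hess_k` stand for the local score and Hessian of $\mathcal{L}_k$, which each worker is assumed to be able to evaluate; this is an illustration of the formulas, not the authors' Spark implementation.

```python
import numpy as np

def dlsa_combine(thetas, sigmas, n_ks):
    # Master-side combination as in (2.2) and (2.6):
    #   (sum_k alpha_k Sigma_k^{-1})^{-1} sum_k alpha_k Sigma_k^{-1} theta_k,
    # with alpha_k = n_k / N.
    N = sum(n_ks)
    p = len(thetas[0])
    lhs, rhs = np.zeros((p, p)), np.zeros(p)
    for theta_k, sigma_k, n_k in zip(thetas, sigmas, n_ks):
        w_k = (n_k / N) * np.linalg.inv(sigma_k)  # alpha_k * Sigma_k^{-1}
        lhs += w_k
        rhs += w_k @ theta_k
    return np.linalg.solve(lhs, rhs)

def worker_one_step(theta_tilde, grad_k, hess_k):
    # Worker-side refinement (2.5): one Newton-Raphson step from the
    # broadcast WLSE, returning the updated local estimator and the
    # local covariance estimate for the second combination (2.6).
    H = hess_k(theta_tilde)
    theta2_k = theta_tilde - np.linalg.solve(H, grad_k(theta_tilde))
    return theta2_k, np.linalg.inv(H)
```

One can check that for the least squares loss the quadratic approximation (2.1) is exact, so combining local OLS estimators with $\hat\Sigma_k^{-1} = n_k^{-1} X_k^\top X_k$ reproduces the global OLS estimator after a single round of communication; the one-step refinement matters when the local loss is genuinely non-quadratic (e.g., logistic regression).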
The assumption $n/N^{1/4} \to \infty$ is much milder than that for the WLSE, and in the meantime it allows $K = O(N^{3/4})$. We would like to remark that the same theoretical result can be achieved if we use the one-shot estimator in the first step to reduce the communication cost. One can see that our TWLSE can be easily extended to an $m$-WLSE with $m$ steps of iteration. Specifically, we could repeat steps (2.5) and (2.6) for $m$ rounds to obtain $\hat\theta_k^{(m)}$ and $\tilde\theta^{(m)}$, respectively. In particular, if we let $m/\log N \to \infty$, we can allow for the extreme case $n = O(1)$. This is suitable for the case in which the local workers have memory constraints. The theoretical result is given in the following proposition.

Proposition 2. Assume Conditions (C1)–(C5) and $m/\log N \to \infty$. Then we have $\sqrt{N}(\tilde\theta^{(m)} - \theta_0) \to_d N(0, \Sigma)$.

The proof of Proposition 2 is given in Appendix A.4. To achieve the desired property, we trade off the communication cost to reduce the local computational burden when there is a large number of workers.

3 Communication-efficient shrinkage estimation

3.1 Distributed Adaptive Lasso Estimation and Oracle Property

Variable selection is a classical but critically important problem. That is because, in practice, the number of available covariates is typically large, but only a small number of them are related to the response. Given an appropriate variable selection technique, one can discover the important variables with high probability. In recent decades, various variable selection techniques have been well studied (Tibshirani, 1996; Fan and Li, 2001; Zou, 2006; Wang and Leng, 2007; Zhang, 2010). However, how to conduct variable selection on a distributed system has not been sufficiently investigated. Existing approaches mostly focus on $\ell_1$ shrinkage estimation and develop corresponding algorithms (Lee et al., 2015; Battey et al., 2015; Wang et al., 2017a; Jordan et al.
, 2019). However, to the best of our knowledge, three problems remain unsolved on a distributed system: (a) most works do not establish the oracle properties of the shrinkage estimators, (b) no consistent tuning parameter selection criterion is given or investigated, and (c) the computation will be heavy if one needs to conduct estimation and select the tuning parameters simultaneously.

To solve the above problems, we first define some notation. Without loss of generality, we assume the first $d_0$ ($0 < d_0 < p$) components of $\theta_0$ to be nonzero, i.e., $\theta_j \ne 0$ for $1\le j\le d_0$ and $\theta_j = 0$ for $j > d_0$. Correspondingly, we denote $\mathcal{M}_T = \{1, \cdots, d_0\}$ to be the true model. In addition, let $\mathcal{M} = \{i_1, \cdots, i_d\}$ be an arbitrary candidate model with size $|\mathcal{M}| = d$. For an arbitrary vector $v = (v_j : 1\le j\le p)^\top$, define $v^{(\mathcal{M})} = (v_i : i\in\mathcal{M})^\top \in \mathbb{R}^{|\mathcal{M}|}$ and $v^{(-\mathcal{M})} = (v_i : i\notin\mathcal{M})^\top \in \mathbb{R}^{p-|\mathcal{M}|}$. For an arbitrary matrix $M = (m_{ij})$, define $M^{(\mathcal{M})} = (m_{j_1 j_2} : j_1, j_2 \in \mathcal{M}) \in \mathbb{R}^{|\mathcal{M}|\times|\mathcal{M}|}$. For simultaneous variable selection and parameter estimation, we follow the idea of Wang and Leng (2007) and consider the adaptive Lasso objective function on the master (Zou, 2006; Zhang and Lu, 2007):

$$Q_\lambda(\theta) = \tilde{\mathcal{L}}(\theta) + \sum_j \lambda_j |\theta_j|. \quad (3.1)$$

By the adaptive Lasso method, different amounts of shrinkage $\lambda_j$ are imposed on each
estimator to improve the estimation efficiency (Zou, 2006; Zou and Li, 2008). Compared to the LSA approach of Wang and Leng (2007), we have the following key differences. First, $\tilde\theta$ is the combined WLSE from the local workers. Second, $\hat\Sigma$ is constructed from the local asymptotic covariance estimators $\hat\Sigma_k$. Consequently, to achieve a global convergence rate, one needs to carefully balance the local convergence rates of $\hat\theta_k$ and $\hat\Sigma_k$ against those of the global estimates.

Remark 1. Under the high-dimensional setting with large $p$, the communication of $\hat\Sigma_k$ is costly and inefficient. In such a case, we recommend conducting a pre-feature screening procedure (Li et al., 2020) to screen important features using one round of communication, which consumes a communication cost of $O(p)$. This enables us to greatly reduce the feature dimension $p$ to a feasible size. Next, we can communicate the covariance estimators of the important features to the master to complete the shrinkage estimation using (3.1), which is communication friendly under the high-dimensional setting. We give the details of the pre-feature screening procedure in Section 4.2.

Define $\tilde\theta_\lambda = \arg\min_\theta Q_\lambda(\theta)$. Then, we can establish the $\sqrt{N}$-consistency as well as the selection consistency of $\tilde\theta_\lambda$ under certain conditions on $\lambda = (\lambda_1, \cdots, \lambda_p)^\top$.

Theorem 3. Assume Conditions (C1)–(C5) and $n/N^{1/2} \to \infty$. Let $a_\lambda = \max\{\lambda_j, j\le d_0\}$ and $b_\lambda = \min\{\lambda_j, j > d_0\}$. Then, the following results hold.

a. ($\sqrt{N}$-consistency). If $\sqrt{N} a_\lambda \to_p 0$, then $\tilde\theta_\lambda - \theta_0 = O_p(N^{-1/2})$.

b. (Selection Consistency). If $\sqrt{N} a_\lambda \to_p 0$ and $\sqrt{N} b_\lambda \to_p \infty$, then

$$P\big(\tilde\theta_\lambda^{(-\mathcal{M}_T)} = 0\big) \to 1. \quad (3.2)$$

The proof of Theorem 3 is given in Appendix B.1. Note here that $a_\lambda$ controls the largest amount of penalty on the true nonzero parameters. Consequently, this amount cannot be too large; otherwise, it would result in a highly biased estimator. In contrast, $b_\lambda$ is responsible for producing sparse solutions for the irrelevant covariates. Therefore, $b_\lambda$ should be sufficiently large to produce an effective amount of shrinkage.

Remark 2. Note that the objective function (3.1) consists of two parts. The first
part is the weighted least squares-type objective function, and the second part is a penalty term. Hence, Theorem 3 and the following theoretical properties are established based on Theorem 1. If we allow for more steps of iteration (with the TWLSE or $m$-WLSE), we could further relax the condition on $n$ here, as in Theorem 2 and Proposition 2.

By Theorem 3, we know that with probability tending to one, we have $\tilde\theta_\lambda^{(-\mathcal{M}_T)} = 0$. Meanwhile, $\tilde\theta_\lambda^{(\mathcal{M}_T)} - \theta_0^{(\mathcal{M}_T)} = O_p(N^{-1/2})$. It is then natural to ask whether the statistical efficiency of $\tilde\theta_\lambda^{(\mathcal{M}_T)}$ can be as good as that of the oracle estimator, which is the global estimator obtained under the true model, i.e., $\hat\theta^{\mathrm{oracle}} = \arg\min_{\theta\in\mathbb{R}^p,\, \theta_j = 0\,\forall j\notin\mathcal{M}_T} \mathcal{L}(\theta)$. To this end, we require the following technical condition.

(C6) (Covariance Assumption) Define the global unpenalized estimator as $\hat\theta_{\mathcal{M}} = \arg\min_{\{\theta\in\mathbb{R}^p :\, \theta_j = 0\,\forall j\notin\mathcal{M}\}} \mathcal{L}(\theta)$. Assume for the global estimator $\hat\theta_{\mathcal{M}}$ with $\mathcal{M} \supset \mathcal{M}_T$ that $\sqrt{N}(\hat\theta_{\mathcal{M}}^{(\mathcal{M})} - \theta_0^{(\mathcal{M})}) \to_d N(0, \Sigma_{\mathcal{M}}) = N(0, \Omega_{\mathcal{M}}^{-1})$, where $\Sigma_{\mathcal{M}} \in \mathbb{R}^{|\mathcal{M}|\times|\mathcal{M}|}$ is a positive definite matrix and $\hat\theta_{\mathcal{M}}^{(\mathcal{M})} = (\hat\theta_{\mathcal{M},j} : j\in\mathcal{M})^\top \in \mathbb{R}^{|\mathcal{M}|}$. Further assume for any $\mathcal{M} \supset \mathcal{M}_T$ that $\Omega_{\mathcal{M}} = \Omega_{\mathcal{M}_F}^{(\mathcal{M})}$ holds, where $\mathcal{M}_F = \{1, \cdots, p\}$ denotes the full model.

Condition (C6) does not seem very intuitive. Nevertheless, it is a condition that is well satisfied by most maximum likelihood estimators. For instance, consider a logistic regression model with response $Y_i \in \{0, 1\}$, i.e.,

$$P(Y_i = 1\,|\,X_i) \stackrel{\mathrm{def}}{=} p\big(X_i^{(\mathcal{M}_T)}\big) = \frac{\exp\big(\theta_0^{(\mathcal{M}_T)\top} X_i^{(\mathcal{M}_T)}\big)}{1 + \exp\big(\theta_0^{(\mathcal{M}_T)\top} X_i^{(\mathcal{M}_T)}\big)}.$$

For an overfitted model, the inverse asymptotic covariance of $\sqrt{N}\hat\theta_{\mathcal{M}}^{(\mathcal{M})}$ is $\Omega_{\mathcal{M}} = \sum_{k=1}^K \alpha_k E\big\{p(X_i^{(\mathcal{M}_T)})\big(1 - p(X_i^{(\mathcal{M}_T)})\big) X_i^{(\mathcal{M})} X_i^{(\mathcal{M})\top}\big\}$, which is a submatrix of $\Omega_{\mathcal{M}_F} = \sum_{k=1}^K \alpha_k E\big\{p(X_i^{(\mathcal{M}_T)})\big(1 - p(X_i^{(\mathcal{M}_T)})\big) X_i^{(\mathcal{M}_F)} X_i^{(\mathcal{M}_F)\top}\big\}$. Hence, Condition (C6) holds.
By a similar argument, one can show that Condition (C6) is satisfied for any regression model with a likelihood function of the form $\prod_i f(Y_i, X_i^\top \theta)$. A more detailed discussion is provided by Wang and Leng (2007). We then have the oracle property in the following theorem.

Theorem 4. (Oracle Property) Assume Conditions (C1)–(C6) and $n/N^{1/2} \to \infty$. Let $\sqrt{N} a_\lambda \to_p 0$ and $\sqrt{N} b_\lambda \to_p \infty$; then, it holds that
$$\sqrt{N}\big(\widetilde\theta_\lambda^{(\mathcal{M}_T)} - \theta_0^{(\mathcal{M}_T)}\big) \to_d N\big(0, \Sigma_{\mathcal{M}_T}\big). \eqno(3.3)$$

By Theorems 3 and 4, we know that as long as the tuning parameters are appropriately selected, the resulting estimator is selection consistent and as efficient as the oracle estimator. It is worth noting, however, that tuning a total of $p$ parameters simultaneously is not feasible in practice. To fix this problem, we follow the tradition of Zou (2006) and Wang et al. (2007) and specify $\lambda_j = \lambda_0 |\widetilde\theta_j|^{-1}$. Since $\widetilde\theta_j$ is $\sqrt{N}$-consistent, as long as $\lambda_0$ satisfies $\lambda_0 \sqrt{N} \to 0$ and $\lambda_0 N \to \infty$, the conditions $\sqrt{N} a_\lambda \to_p 0$ and $\sqrt{N} b_\lambda \to_p \infty$ are satisfied. Thereafter, the original problem of tuning parameter selection for $\lambda$ can be replaced by selection for $\lambda_0$.

3.2 The Distributed Bayesian Information Criterion

Although it has been shown that, asymptotically, the oracle property can be guaranteed as long as the tuning parameters are appropriately selected, it is still unclear how to conduct variable selection in practice. This motivates us to design a BIC-type criterion that can select the true model consistently in a completely data-driven manner (Zhang and Lu, 2007; Chen and Chen, 2008; Zou and Zhang, 2009; Wang et al., 2013).
Specifically, to consistently recover the sparsity pattern, we consider a distributed Bayesian information criterion (DBIC) as follows:
$$\mathrm{DBIC}_\lambda = (\widetilde\theta_\lambda - \widetilde\theta)^\top \widehat\Sigma^{-1} (\widetilde\theta_\lambda - \widetilde\theta) + \log N \times df_\lambda, \eqno(3.4)$$
where $\widehat\Sigma = (\sum_{k=1}^K \alpha_k \widehat\Sigma_k^{-1})^{-1}$ and $df_\lambda$ is the number of nonzero elements in $\widetilde\theta_\lambda$. The design of the DBIC is in the spirit of the BIC criterion used in Wang and Leng (2007). The difference is that the DBIC uses the WLSE $\widetilde\theta$ and the average of the distributed covariance estimators $\widehat\Sigma$ to construct the least squares objective function. Intuitively, if $\widetilde\theta$ and $\widehat\Sigma$ approximate the global estimator $\widehat\theta = \arg\min \mathcal{L}(\theta)$ and its asymptotic covariance very well, then the DBIC should be able to facilitate consistent tuning parameter selection. Specifically, the resulting model should be selection consistent (Shao, 1997).

To formally investigate the theoretical performance of the DBIC, we first define some notation. Denote the set of nonzero elements of $\widetilde\theta_\lambda$ by $\mathcal{M}_\lambda$. Given a tuning parameter $\lambda$, $\mathcal{M}_\lambda$ could be underfitted, correctly fitted, or overfitted. We then have the following partition:
$$\mathcal{R}_- = \{\lambda \in \mathbb{R}^p : \mathcal{M}_\lambda \not\supset \mathcal{M}_T\}, \quad \mathcal{R}_0 = \{\lambda \in \mathbb{R}^p : \mathcal{M}_\lambda = \mathcal{M}_T\}, \quad \mathcal{R}_+ = \{\lambda \in \mathbb{R}^p : \mathcal{M}_\lambda \supset \mathcal{M}_T, \mathcal{M}_\lambda \ne \mathcal{M}_T\},$$
where $\mathcal{R}_-$ corresponds to underfitted models and $\mathcal{R}_+$ to overfitted models. We show in the following theorem that the DBIC can consistently identify the true model.

Theorem 5. Assume Conditions (C1)–(C6) and $n/N^{1/2} \to \infty$. Define a reference tuning parameter sequence $\{\lambda_N \in \mathbb{R}^p\}$, where the first $d_0$ elements of $\lambda_N$ are $1/N$ and the remaining elements are $\log N/\sqrt{N}$. Then, we have
$$P\Big(\inf_{\lambda \in \mathcal{R}_- \cup \mathcal{R}_+} \mathrm{DBIC}_\lambda > \mathrm{DBIC}_{\lambda_N}\Big) \to 1.$$

By Theorems 3 and 4, we know that with probability tending to one, $\mathcal{M}_{\lambda_N} = \mathcal{M}_T$. Consequently, the sequence $\lambda_N$ plays the role of a reference sequence that leads to the true model.
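To make the criterion concrete, the DBIC in (3.4) can be evaluated entirely on the master from the communicated local summaries. The following is a minimal single-machine sketch, in which the local estimators $\widehat\theta_k$ and covariance estimates $\widehat\Sigma_k$ are mocked rather than computed on real workers, and the weighted least squares form of $\widetilde\theta$ is an assumption based on Equation (2.2):

```python
import numpy as np

# Hypothetical local summaries from K = 3 workers: local estimators
# theta_hat[k] and local covariance estimates Sigma_hat[k], assumed already
# communicated to the master; alpha[k] = n_k / N (equal sizes here).
rng = np.random.default_rng(0)
K, p = 3, 4
alpha = np.full(K, 1.0 / K)
theta_hat = [np.array([3.0, 1.5, 0.0, 2.0]) + 0.05 * rng.standard_normal(p)
             for _ in range(K)]
Sigma_hat = [np.eye(p) * (1.0 + 0.1 * k) for k in range(K)]

# Combined covariance and WLSE, as used in (3.4):
# Sigma = (sum_k alpha_k Sigma_k^{-1})^{-1}
Sigma_inv = sum(a * np.linalg.inv(S) for a, S in zip(alpha, Sigma_hat))
Sigma = np.linalg.inv(Sigma_inv)
theta_tilde = Sigma @ sum(a * np.linalg.inv(S) @ t
                          for a, S, t in zip(alpha, Sigma_hat, theta_hat))

def dbic(theta_lambda, theta_tilde, Sigma_inv, N):
    """DBIC_lambda = (t_l - t)' Sigma^{-1} (t_l - t) + log(N) * df_lambda."""
    diff = theta_lambda - theta_tilde
    df = np.count_nonzero(theta_lambda)       # df_lambda: nonzero count
    return float(diff @ Sigma_inv @ diff) + np.log(N) * df
```

In this mock setup, a candidate that zeroes out the truly null coordinate attains a smaller DBIC than the unpenalized $\widetilde\theta$, since the $\log N$ penalty on the extra degree of freedom dominates the tiny quadratic loss.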
Accordingly, Theorem 5 implies that the optimal $\lambda$ selected by the DBIC will consistently identify the true model, because any $\lambda$ leading to an inconsistent model selection result performs worse than $\lambda_N$ in terms of the DBIC value. Namely, if we denote the estimated set selected by the DBIC as $\widehat{\mathcal{M}}_{\mathrm{DBIC}}$, then Theorem 5 implies that $\widehat{\mathcal{M}}_{\mathrm{DBIC}} = \mathcal{M}_T$ with probability tending to one.

4 Numerical studies

4.1 Simulation Models and Settings

To demonstrate the finite sample performance of the DLSA method, we conduct a number of simulation studies in this section. Five classical regression models are presented, and the corresponding DLSA algorithms are implemented. For each model, we consider two typical settings to verify the numerical performance of the proposed method. They represent two different data storing strategies together with competing methods. The first strategy is to distribute the data in a completely random manner. Thus, the covariates on different workers are independent and identically distributed (i.i.d.). In contrast, the second strategy allows the covariate distributions on different workers to be heterogeneous. The estimation efficiency as well as the variable selection accuracy are evaluated. The examples are given as follows.

Example 1. (Linear Regression). We first consider one of the most popular regression analysis tools, i.e., linear regression. In particular, we generate the continuous response $Y_i$ by a linear relationship with the covariates $X_i$ as follows: $Y_i = X_i^\top \theta_0 + \varepsilon_i$, where the noise term $\varepsilon_i$ is independently generated from a standard normal distribution $N(0, 1)$. Following Fan and Li (2001), the true parameter is set as $\theta_0 = (3, 1.5, 0, 0, 2, 0, 0, 0)^\top$.

Example 2. (Logistic Regression). The logistic regression is a classical model that addresses binary responses (Hosmer Jr et al., 2013).
In this example, we generate the response $Y_i$ independently from the Bernoulli distribution given the covariate $X_i$ as
$$P(Y_i = 1 \mid X_i) = \frac{\exp(X_i^\top \theta_0)}{1 + \exp(X_i^\top \theta_0)}.$$
We follow Wang and Leng (2007) and set the true parameter $\theta_0 = (3, 0, 0, 1.5, 0, 0, 2, 0)^\top$.

Example 3. (Poisson Regression). In this example, we consider the Poisson regression, which is used to model count responses (Cameron and Trivedi, 2013). The responses are generated according to the Poisson distribution as
$$P(Y_i \mid X_i, \theta_0) = \frac{\lambda_i^{Y_i}}{Y_i!} \exp(-\lambda_i), \eqno(4.1)$$
where $\lambda_i = \exp(X_i^\top \theta_0)$. The true parameter $\theta_0$ is set to $(0.8, 0, 0, 1, 0, 0, -0.4, 0, 0)^\top$.

For each example, two different data storage strategies are considered. They lead to different covariate distributions $F_x(x)$. Specifically, the following two settings are investigated.

• Setting 1 (i.i.d. Covariates). We first consider the setting in which the data are distributed independently and identically across the workers. Specifically, the covariates $X_{ij}$ ($1 \le i \le N$, $1 \le j \le p$) are sampled from the standard normal distribution $N(0, 1)$.

• Setting 2 (Heterogeneous Covariates). Next, we look at the case whereby the covariates distributed across the workers are heterogeneous. This is a common case in practice. Specifically, on the $k$th worker, the covariates are sampled from the multivariate normal distribution $N(\mu_k, \Sigma_k)$, where $\mu_k = (k/K)\mathbf{1}$ and $\Sigma_k = (\sigma_{k, j_1 j_2})$ with $\sigma_{k, j_1 j_2} = \rho_k^{|j_1 - j_2|}$ and $\rho_k$ sampled from $U[0.2, 0.3]$.

The i.i.d. setting is consistent with existing methods such as the one-shot estimation method. The second, heterogeneous data setting is not widely considered in the literature but is more realistic in practice. We evaluate the numerical performance of the proposed method and existing methods under both settings.
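The heterogeneous data generation in Setting 2 can be sketched as follows; `make_worker_data` is a hypothetical helper name, and the linear model of Example 1 is used for the response:

```python
import numpy as np

def make_worker_data(k, K, n_k, p, theta0, rng):
    """Generate one worker's data under Setting 2 (heterogeneous covariates):
    rows of X are drawn from N(mu_k, Sigma_k) with mu_k = (k/K) * 1_p and
    Sigma_k[j1, j2] = rho_k ** |j1 - j2|, where rho_k ~ U[0.2, 0.3]."""
    rho_k = rng.uniform(0.2, 0.3)
    idx = np.arange(p)
    Sigma_k = rho_k ** np.abs(idx[:, None] - idx[None, :])  # AR(1)-type matrix
    mu_k = np.full(p, k / K)
    X = rng.multivariate_normal(mu_k, Sigma_k, size=n_k)
    y = X @ theta0 + rng.standard_normal(n_k)  # linear model of Example 1
    return X, y
```

Calling this once per worker index $k = 1, \ldots, K$ reproduces the across-worker heterogeneity in both the covariate means and the covariance structure.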
4.2 Pre-Feature Screening for High Dimensional Data

For high dimensional data, transmitting the covariance estimators $\widehat\Sigma_k \in \mathbb{R}^{p \times p}$ could be costly. To reduce the feature dimension, we use a pre-feature screening procedure to screen important features before we conduct model estimation. As a consequence, the communication cost can be well controlled. In recent work, Li et al. (2020) design a distributed feature screening method using a componentwise debiasing approach. The method first expresses a specific correlation measure as a nonlinear function of several component parameters and then conducts distributed unbiased estimation of each component parameter. The method is shown to have better performance than a simple average of the correlation measures among the workers.

Specifically, we express the correlation measure between the response $Y$ and the $j$th feature as $\omega_j = g(\nu_{j1}, \cdots, \nu_{js})$, where $g$ is a pre-specified function and $\nu_{j1}, \cdots, \nu_{js}$ are $s$ components. Suppose we obtain the estimate of $\nu_{jm}$ on the $k$th worker as $\widehat\nu_{jm}^{(k)}$. Then, by distributed estimation, we can obtain $\widehat\nu_{jm} = K^{-1} \sum_{k=1}^K \widehat\nu_{jm}^{(k)}$, and this enables us to assemble $\widehat\omega_j = g(\widehat\nu_{j1}, \cdots, \widehat\nu_{js})$ on the master. As a result, the pre-feature screening procedure uses one round of communication with a communication cost of $O(p)$. After the screening procedure, we broadcast the indexes of the screened important features to the workers for further estimation.

For illustration purposes, we give the following two examples to explain how the correlation measure is calculated. We also include the Kendall $\tau$ rank correlation (Li et al., 2012a) and the SIRS correlation (Zhu et al., 2011) for comparison. See Li et al. (2020) for more details.

1. Pearson Correlation. The Pearson correlation is used by Fan and Lv (2008) in the sure independence screening (SIS) procedure.
In this case, we have
$$\omega_j = g(\nu_{j1}, \cdots, \nu_{j5}) = \frac{E(X_{ij} Y_i) - E(X_{ij}) E(Y_i)}{\sqrt{\big(E X_{ij}^2 - E^2(X_{ij})\big)\big(E Y_i^2 - E^2(Y_i)\big)}},$$
where $\nu_{j1} = E(X_{ij} Y_i)$, $\nu_{j2} = E(X_{ij})$, $\nu_{j3} = E(Y_i)$, $\nu_{j4} = E(X_{ij}^2)$, and $\nu_{j5} = E(Y_i^2)$. Using simple moment estimation, we can estimate the above five components; for instance, $\widehat\nu_{j1}^{(k)} = n_k^{-1} \sum_{i \in \mathcal{S}_k} X_{ij} Y_i$. As a result, the assembled $\widehat\omega_j$ is exactly the same as that computed using the whole dataset in this case.

2. Distance Correlation. The distance correlation (DC) is used by Li et al. (2012b) as a model-free screening index. In this case, $\omega_j$ can be expressed as
$$\omega_j = g(\nu_{j1}, \cdots, \nu_{j8}) = \frac{\nu_{j1} + \nu_{j2}\nu_{j3} - 2\nu_{j4}}{\sqrt{\big(\nu_{j5} + \nu_{j2}^2 - 2\nu_{j6}\big)\big(\nu_{j7} + \nu_{j3}^2 - 2\nu_{j8}\big)}},$$
where $\nu_{j1} = E\{|Y_i - Y_i'| \cdot |X_{ij} - X_{ij}'|\}$, $\nu_{j2} = E\{|Y_i - Y_i'|\}$, $\nu_{j3} = E\{|X_{ij} - X_{ij}'|\}$, $\nu_{j4} = E\{E(|Y_i - Y_i'| \mid Y_i) E(|X_{ij} - X_{ij}'| \mid X_{ij})\}$, $\nu_{j5} = E\{|Y_i - Y_i'|^2\}$, $\nu_{j6} = E\{E^2(|Y_i - Y_i'| \mid Y_i)\}$, $\nu_{j7} = E\{|X_{ij} - X_{ij}'|^2\}$, $\nu_{j8} = E\{E^2(|X_{ij} - X_{ij}'| \mid X_{ij})\}$, and $(Y_i', X_{ij}')$ is an independent copy of $(Y_i, X_{ij})$. On the $k$th worker, we can estimate $\nu_{j1}$ to $\nu_{j8}$ using the local dataset. For instance, we can estimate $\nu_{j1}$ as follows:
$$\widehat\nu_{j1}^{(k)} = \frac{1}{n_k(n_k - 1)} \sum_{i_1 \ne i_2,\, i_1, i_2 \in \mathcal{S}_k} |Y_{i_1} - Y_{i_2}| \cdot |X_{i_1 j} - X_{i_2 j}|.$$
We refer to Li et al. (2012b) for the detailed estimators of $\nu_{j2}$ to $\nu_{j8}$.

Under the high dimensional setting, we set $(N, p, K) = (2000, 10^3, 5)$ and $(10^4, 10^4, 8)$, respectively. Hence, the second case is more challenging with $N = p$. Correspondingly, the nonzero coefficients are set as $\theta_{0j} = U_1 \cdot \mathrm{sign}(U_2)$ for $1 \le j \le 15$, where $U_1 \sim U[0.5, 2]$ and $U_2 \sim U[-0.2, 0.8]$.
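The componentwise scheme for the Pearson correlation above can be sketched as follows; the function names are hypothetical, and equal worker sample sizes are assumed so that the simple average of components matches the global moments exactly:

```python
import numpy as np

def local_components(X_k, y_k):
    """Per-worker moment estimates of the five Pearson components for every
    feature j: E(X_j Y), E(X_j), E(Y), E(X_j^2), E(Y^2); returns shape (5, p)."""
    p = X_k.shape[1]
    return np.stack([
        (X_k * y_k[:, None]).mean(axis=0),         # nu_j1
        X_k.mean(axis=0),                          # nu_j2
        np.full(p, y_k.mean()),                    # nu_j3
        (X_k ** 2).mean(axis=0),                   # nu_j4
        np.full(p, (y_k ** 2).mean()),             # nu_j5
    ])

def assemble_pearson(component_list):
    """Master step: average the communicated (5, p) component arrays over the
    K workers, then apply g to obtain omega_j for all p features at once."""
    nu1, nu2, nu3, nu4, nu5 = np.mean(component_list, axis=0)
    return (nu1 - nu2 * nu3) / np.sqrt((nu4 - nu2 ** 2) * (nu5 - nu3 ** 2))
```

Only the $5 \times p$ component arrays are communicated, i.e., a cost of $O(p)$ in one round, and with equal $n_k$ the assembled $\widehat\omega_j$ coincides with the whole-data Pearson correlation.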
The numerical performance is evaluated for the linear regression (Example 1) and logistic regression (Example 2) models, respectively. After the pre-screening procedure, we select the top $p_0 = 40$ features with the highest screening measure values ($\widehat\omega_j$) as our candidate important feature set $\widehat{\mathcal{M}}_{\mathrm{screen}}$. Based on $\widehat{\mathcal{M}}_{\mathrm{screen}}$, we further conduct post-estimation with our DLSA method. We would like to further remark that if post-inference is needed, the pre-feature screening procedure may introduce post-selective bias (Berk et al., 2013; Lee and Taylor, 2014; Taylor and Tibshirani, 2015; Lee et al., 2016). Such bias may not be a major issue for estimation but is critical for statistical inference. Hence, selection-adjusted statistical inference should be investigated and devised. We leave this problem as a future research topic.

4.3 Performance Measurements

In this section, we give the performance measurements and summarize the simulation results with respect to the estimation efficiency as well as the variable selection accuracy. The sample sizes are set as $N = (20, 100) \times 10^3$. Correspondingly, the number of workers is set to $K = (10, 20)$. For a reliable evaluation, we repeat the experiment $R = 500$ times.

Take the WLSE for example. For the $r$th replication, denote $\widehat\theta^{(r)}$ and $\widetilde\theta^{(r)}$ as the global estimator and the WLSE, respectively. To measure the estimation efficiency, we calculate the root mean square error (RMSE) for the $j$th estimator as $\mathrm{RMSE}_{\widetilde\theta, j} = \{R^{-1} \sum_r \|\widetilde\theta_j^{(r)} - \theta_{0j}\|^2\}^{1/2}$. The RMSE for the global estimator $\widehat\theta^{(r)}$ can be defined similarly. Then, the relative estimation efficiency (REE) with respect to the global estimator is given by $\mathrm{REE}_j = \mathrm{RMSE}_{\widehat\theta, j} / \mathrm{RMSE}_{\widetilde\theta, j}$ for $j = 1, \cdots, p$. Next, for each $1 \le j \le p$, we construct a 95% confidence interval for $\theta_{0j}$ as $\mathrm{CI}_j^{(r)} = (\widetilde\theta_j^{(r)} - z_{0.975} \widetilde{\mathrm{SE}}_j^{(r)}, \widetilde\theta_j^{(r)} + z_{0.975} \widetilde{\mathrm{SE}}_j^{(r)})$, where $\widetilde{\mathrm{SE}}_j^{(r)}$ is the standard error obtained from the $j$th diagonal element of $\widehat\Sigma = (\sum_k \alpha_k \widehat\Sigma_k^{-1})^{-1}$, and $z_\alpha$ is the $\alpha$th quantile of the standard normal distribution. Then, the coverage probability (CP) is computed as $\mathrm{CP}_j = R^{-1} \sum_{r=1}^R I(\theta_{0j} \in \mathrm{CI}_j^{(r)})$.

Next, based on the TWLSE, we further conduct the shrinkage estimation on the master node. Let $\widehat{\mathcal{M}}^{(r)}$ be the set of variables selected in the $r$th replication using the DBIC. Correspondingly, $\widetilde\theta_\lambda^{(r)}$ is the shrinkage estimator. To measure the sparse discovery accuracy, we calculate the average model size (MS) as $\mathrm{MS} = R^{-1} \sum_r |\widehat{\mathcal{M}}^{(r)}|$. Next, the percentage of the true model being correctly identified is given by $\mathrm{CM} = R^{-1} \sum_r I(\widehat{\mathcal{M}}^{(r)} = \mathcal{M}_T)$. In addition, to further investigate the estimation accuracy, we calculate the REE of the shrinkage estimator with respect to the global estimator as $\mathrm{REE}_j^s = \mathrm{RMSE}_{\widehat\theta, j} / \mathrm{RMSE}_{\widetilde\theta_\lambda, j}$ for $j \in \mathcal{M}_T$.

Lastly, for the high dimensional setting, we further evaluate the performance of the four screening measures and the post-estimation results. Specifically, we select the top 40 features with the highest screening measure values ($\widehat\omega_j$), denoted by $\widehat{\mathcal{M}}_{\mathrm{screen}}^{(r)}$ in the $r$th replication. The coverage rate is defined as $\mathrm{CR}_{\mathrm{screen}} = R^{-1} \sum_{r=1}^R I(\mathcal{M}_T \subset \widehat{\mathcal{M}}_{\mathrm{screen}}^{(r)})$. Similarly, one can define $\mathrm{CR}_{\mathrm{select}}$ for the post-shrinkage estimation results. Due to the high dimensionality, we define an overall $\mathrm{REE} = (\sum_j \mathrm{RMSE}_{\widehat\theta, j}) / (\sum_j \mathrm{RMSE}_{\widetilde\theta, j})$ for the estimator $\widetilde\theta$.

4.4 Simulation Results

We compare the proposed DLSA method with (a) the OS estimator (Zhang et al., 2013) and (b) the CSL estimator (Jordan et al., 2019). Specifically, we denote the DLSA method with two-step estimation (TWLSE) as the TDLSA method. The simulation results for the i.i.d. setting are presented in Tables 7–8 (in the Appendix). In addition, the results for the heterogeneous setting are summarized in Tables 1–3.
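The REE and CP measures of Section 4.3 are straightforward to compute from replication arrays; a minimal sketch (hypothetical helper names, $(R, p)$ arrays of replicated estimates and standard errors) is:

```python
import numpy as np

Z975 = 1.959964  # z_{0.975}, the 97.5% quantile of the standard normal

def ree(est_reps, global_reps, theta0):
    """REE_j = RMSE_{global, j} / RMSE_{est, j}, with RMSE_j computed over the
    R replications stored in the rows of each (R, p) array."""
    rmse = lambda reps: np.sqrt(np.mean((reps - theta0) ** 2, axis=0))
    return rmse(global_reps) / rmse(est_reps)

def coverage(est_reps, se_reps, theta0):
    """CP_j: fraction of replications whose 95% normal CI covers theta_{0j}."""
    lo = est_reps - Z975 * se_reps
    hi = est_reps + Z975 * se_reps
    return np.mean((lo <= theta0) & (theta0 <= hi), axis=0)
```

A sanity check: feeding the same replications to both arguments of `ree` returns 1 for every coordinate, and correctly calibrated normal estimates yield empirical coverage close to 95%.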
First, in the i.i.d. case, one can observe that all three methods are as efficient as the global estimator when $N$ increases. For example, for the linear regression (i.e., Table 7), all the methods achieve $\mathrm{REE} \approx 1$ when $N = 100{,}000$ and $K = 20$. However, in the heterogeneous setting (i.e., Setting 2), the finite sample performances of the competing methods are quite different. The proposed DLSA and TDLSA methods achieve higher efficiency than the other methods and are asymptotically as efficient as the global estimator. Furthermore, the TDLSA method is more efficient than the DLSA method. For instance, the REEs of the TDLSA estimator for the logistic regression (i.e., Table 2) are near 1 in the second setting with $N = 100{,}000$ and $K = 20$, while the REEs for $\theta_1$ of the OS, CSL, and DLSA methods are approximately 0.77, 0.12, and 0.92, respectively. Although the OS estimator is less efficient, it is still consistent as $N$ increases. The CSL method performs worst in this situation because it only uses the local Hessian matrix, which can result in a highly biased estimator. Lastly, the CPs for both the DLSA and TDLSA methods are all around 95%, while the CP of the CSL method falls far below the nominal level, especially under the heterogeneous setting.

With respect to the shrinkage estimation, one can observe that the adaptive Lasso estimator is able to achieve higher estimation efficiency than the global estimator in the second setting. For example, the REE for the shrinkage DLSA (SDLSA) method can even be higher than 1 for the Poisson model in Setting 2 (i.e., Table 3). In addition, we observe that the proposed DBIC does a great job of identifying the nonzero variables with high accuracy. Next, we summarize the simulation results for the high dimensional setting in Tables 10–11 for the linear and logistic regression models.
Specifically, the screening accuracies of the four screening methods are presented in Table 10, and the post-estimation results are given in Table 11. Among all the screening methods, the SIS and DC are able to achieve higher screening coverage rates. For the post-estimation results, both the TDLSA and SDLSA methods have better estimation accuracy than the other methods.

5 Application to airline data

For illustration purposes, we study a large real-world dataset. Specifically, the dataset considered here is the U.S. Airline Dataset, which is available at http://stat-computing.org/dataexpo/2009. It contains detailed flight information about U.S. airlines from 1987 to 2008. The task is to predict the delayed status of a flight given all the other flight information with a logistic regression model.

Each sample in the data corresponds to one flight record, which consists of a binary response variable for the delayed status (Delayed) and the departure time, arrival time, distance of the flight, flight date, delay status at departure, carrier information, and origin and destination as regressors. The complete variable details are described in Table 4.

Table 1: Simulation results for Example 1 (Setting 2) with 500 replications. The numerical performances are evaluated for different sample sizes N (×10^3) and numbers of workers K. The REE is reported for all estimators. For the CSL, DLSA, and TDLSA methods, the CP is further reported in parentheses. Finally, the MS and CM are reported for the SDLSA method to evaluate the variable selection accuracy.

Est.     θ1      θ2      θ3      θ4      θ5      θ6      θ7      θ8      MS    CM
N = 20, K = 10
OS       0.98    1.00    0.99    1.00    1.00    0.99    0.99    0.98
CSL      0.94    0.93    0.94    0.94    0.95    0.96    0.93    0.96
        (94.2)  (93.6)  (95.4)  (92.2)  (92.8)  (92.6)  (91.0)  (91.2)
DLSA     1.00    1.00    1.00    1.00    1.00    1.00    1.00    1.00
        (96.0)  (94.8)  (96.8)  (94.0)  (94.6)  (94.0)  (94.0)  (93.0)
TDLSA    1.00    1.00    1.00    1.00    1.00    1.00    1.00    1.00
        (96.0)  (94.8)  (96.8)  (94.0)  (94.6)  (94.0)  (94.0)  (93.0)
SDLSA    1.08    -       -       1.13    -       -       1.19    -       3.01  1.00
N = 100, K = 20
OS       1.00    0.99    1.00    1.00    1.00    0.99    1.00    0.99
CSL      0.93    0.91    0.92    0.91    0.93    0.92    0.92    0.92
        (93.2)  (91.0)  (92.0)  (92.6)  (92.0)  (91.6)  (92.4)  (90.6)
DLSA     1.00    1.00    1.00    1.00    1.00    1.00    1.00    1.00
        (95.4)  (95.4)  (94.6)  (94.6)  (95.6)  (93.8)  (94.4)  (94.6)
TDLSA    1.00    1.00    1.00    1.00    1.00    1.00    1.00    1.00
        (95.4)  (95.4)  (94.6)  (94.6)  (95.6)  (93.8)  (94.4)  (94.6)
SDLSA    1.10    -       -       1.13    -       -       1.16    -       3.01  1.00

Table 2: Simulation results for Example 2 (Setting 2) with 500 replications. The numerical performances are evaluated for different sample sizes N (×10^3) and numbers of workers K. The REE is reported for all estimators. For the CSL, DLSA, and TDLSA methods, the CP is further reported in parentheses. Finally, the MS and CM are reported for the SDLSA method to evaluate the variable selection accuracy.

Est.     θ1      θ2      θ3      θ4      θ5      θ6      θ7      θ8      MS    CM
N = 20, K = 10
OS       0.76    0.87    0.87    0.79    0.89    0.91    0.78    0.88
CSL      0.15    0.17    0.18    0.16    0.19    0.17    0.16    0.17
        (40.8)  (43.0)  (42.6)  (42.0)  (44.8)  (42.0)  (43.6)  (40.0)
DLSA     0.88    1.01    1.01    0.94    1.01    1.01    0.90    1.01
        (90.0)  (95.8)  (95.8)  (92.6)  (96.0)  (96.0)  (92.4)  (95.0)
TDLSA    1.00    1.00    1.00    1.00    1.00    1.00    1.00    1.00
        (95.6)  (96.0)  (95.6)  (95.0)  (95.6)  (95.4)  (95.0)  (95.0)
SDLSA    1.02    -       -       1.09    -       -       1.05    -       3.00  1.00
N = 100, K = 20
OS       0.77    0.95    0.95    0.84    0.95    0.93    0.82    0.93
CSL      0.12    0.13    0.13    0.12    0.14    0.13    0.12    0.12
        (31.8)  (29.0)  (29.2)  (29.8)  (30.0)  (26.2)  (29.4)  (29.2)
DLSA     0.92    1.00    1.01    0.94    1.00    1.01    0.92    1.01
        (92.6)  (94.6)  (94.2)  (93.2)  (93.2)  (96.0)  (92.8)  (95.4)
TDLSA    1.00    1.00    1.00    1.00    1.00    1.00    1.00    1.00
        (95.6)  (94.8)  (94.2)  (94.0)  (93.4)  (95.6)  (94.6)  (95.4)
SDLSA    1.02    -       -       1.11    -       -       1.07    -       3.00  1.00

Table 3: Simulation results for Example 3 (Setting 2) with 500 replications. The numerical performances are evaluated for different sample sizes N (×10^3) and numbers of workers K. The REE is reported for all estimators. For the CSL, DLSA, and TDLSA methods, the CP is further reported in parentheses. Finally, the MS and CM are reported for the SDLSA method to evaluate the variable selection accuracy.

Est.     θ1      θ2      θ3      θ4      θ5      θ6      θ7      θ8      MS    CM
N = 20, K = 10
OS       0.76    0.87    0.87    0.79    0.89    0.91    0.78    0.88
CSL      0.15    0.17    0.18    0.16    0.19    0.17    0.16    0.17
        (40.8)  (43.0)  (42.6)  (42.0)  (44.8)  (42.0)  (43.6)  (40.0)
DLSA     0.88    1.01    1.01    0.94    1.01    1.01    0.90    1.01
        (90.0)  (95.8)  (95.8)  (92.6)  (96.0)  (96.0)  (92.4)  (95.0)
TDLSA    1.00    1.00    1.00    1.00    1.00    1.00    1.00    1.00
        (95.6)  (96.0)  (95.6)  (95.0)  (95.6)  (95.4)  (95.0)  (95.0)
SDLSA    1.02    -       -       1.09    -       -       1.05    -       3.00  1.00
N = 100, K = 20
OS       0.77    0.95    0.95    0.84    0.95    0.93    0.82    0.93
CSL      0.12    0.13    0.13    0.12    0.14    0.13    0.12    0.12
        (31.8)  (29.0)  (29.2)  (29.8)  (30.0)  (26.2)  (29.4)  (29.2)
DLSA     0.92    1.00    1.01    0.94    1.00    1.01    0.92    1.01
        (92.6)  (94.6)  (94.2)  (93.2)  (93.2)  (96.0)  (92.8)  (95.4)
TDLSA    1.00    1.00    1.00    1.00    1.00    1.00    1.00    1.00
        (95.6)  (94.8)  (94.2)  (94.0)  (93.4)  (95.6)  (94.6)  (95.4)
SDLSA    1.02    -       -       1.11    -       -       1.07    -       3.00  1.00

Table 4: Variable description for the U.S. airline data. Non-categorical numerical variables are standardized to have mean zero and variance one.

Variable      Description                                              Variable used in the model
Delayed       Whether the flight is delayed; 1 for Yes, 0 for No.      Used as the response variable
Year          Year between 1987 and 2008                               Used as numerical variable
Month         Which month of the year                                  Converted to 11 dummies
DayofMonth    Which day of the month                                   Used as numerical variable
DayofWeek     Which day of the week                                    Converted to 6 dummies
DepTime       Actual departure time                                    Used as numerical variable
CRSDepTime    Scheduled departure time                                 Used as numerical variable
CRSArrTime    Scheduled arrival time                                   Used as numerical variable
ElapsedTime   Actual elapsed time                                      Used as numerical variable
Distance      Distance between the origin and destination in miles     Used as numerical variable
Carrier       Flight carrier code for 29 carriers                      Top 7 carriers converted to 7 dummies
Destination   Destination of the flight (total 348 categories)         Top 75 destination cities converted to 75 dummies
Origin        Departing origin (total 343 categories)                  Top 75 origin cities converted to 75 dummies

The data contain six continuous variables and five categorical variables. The categorical variables are converted to dummies with appropriate dimensions. We treat the Year and DayofMonth variables as numerical to capture the time effects. To capture possible seasonal patterns, we also convert the time variables Month and DayofWeek to dummies. Ultimately, a total of 181 variables are used in the model. The total sample size is 113.9 million observations. This leads to the raw dataset being 12 GB on a hard drive. After the dummy transformation described in Table 4, the overall in-memory size is over 52 GB, even if all the dummies are stored in a sparse matrix format. Thus, this dataset can hardly be handled by a single computer. All the numerical variables are standardized to have a mean of zero and a variance of one.

5.1 The Spark System and MLE

To demonstrate our method, we set up a Spark-on-YARN cluster on the Alibaba cloud server (https://www.alibabacloud.com/products/emapreduce).
This is a standard industrial-level architecture for a distributed system. The system consists of one master node and two worker nodes. Each node contains 32 virtual cores, 128 GB of RAM, and two 80 GB SSD local hard drives. The dataset is stored on the Hadoop Distributed File System (HDFS). We use such a hardware specification to prevent the failure of Spark's default algorithm due to the well-known out-of-memory issue for comparison purposes. It is worth mentioning that our DLSA algorithm works well on workers with less than 64 GB of RAM.

Because the system's RAM is larger than the raw data size, one may wonder whether the logistic regression task can be run on a single node. Unfortunately, this is infeasible in practice, because much more memory (typically > 128 GB with a double-precision floating-point format) is needed for operating on matrices of such a huge size. Even for the Spark system, a task of this magnitude cannot be directly performed using an existing algorithm library (e.g., Spark ML) unless each worker node is equipped with at least 128 GB of RAM. This is because Spark is a very memory-intensive system. For example, to compute with a single distributed matrix with a size of approximately 1 GB in memory, one might need each worker to have 2-5 GB of memory in practice. This overhead memory consumption grows significantly as the size of the data matrix increases. For discussions, see the Spark documentation at https://spark.apache.org/docs/latest/tuning.html#memory-tuning and Chen and Guestrin (2016).

We compare our approach with Spark's built-in distributed algorithm (implemented in the spark.ml.classification.LogisticRegression module). For those who insist on computing the traditional MLE based on the entire dataset with a single node, we employ a stochastic gradient descent (SGD) algorithm (Zhang, 2004) in the Python scikit-learn module (Pedregosa et al., 2011) to allow for a memory-constrained situation.

Fortunately, both the proposed DLSA and OS methods allow us to develop a user-friendly Spark algorithm with minimal computing resources. Our DLSA works well on a Spark system with only 64 GB of RAM. As the algorithm is designed in a batched manner, it is highly efficient under memory constraints. The algorithm is developed with the Spark Python API (PySpark) and runs on a Spark system (version > 2.3); see Algorithm 1 for details. It can be freely downloaded from https://github.com/feng-li/dlsa. We also provide the implementations of the algorithms compared in Table 5 in the repository.

We then use our algorithm to fit a logistic regression model for the delayed status of a flight. To this end, the entire dataset is randomly partitioned into 1139 subgroups. The sample size for each subgroup is approximately 100,000. Next, for each subgroup of data, we create a virtual worker (i.e., an executor in the Spark system) so that the computation for each worker can be conducted in a parallel manner. By doing so, the computational power of the entire Spark system can be maximized for both the DLSA and OS methods. The corresponding log-likelihood values, computational times, and estimated communication times are reported in Table 5. We remark that the computing time for the MLE includes the data shuffling time with our DLSA and OS algorithms. This serves as an important benchmark to gauge the performance of the competing methods (e.g., the DLSA and OS methods). Comparing these results with that of the traditional MLE, we find that the traditional MLE is extremely difficult to compute without a distributed system. Table 5 also shows that the log-likelihood value of the DLSA is the best, with an affordable RAM consumption.
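Before turning to the comparison, the one-round combination at the heart of DLSA can be mocked on a single machine. The sketch below uses linear-model "workers" for illustration, and the weighted least squares combining form for $\widetilde\theta$ is an assumption based on Equation (2.2):

```python
import numpy as np

# Mock of the one-round DLSA combination: each "worker" fits its local model
# and reports (theta_hat_k, Sigma_hat_k^{-1}); the master then combines them as
#   theta_tilde = (sum_k a_k Sigma_k^{-1})^{-1} * sum_k a_k Sigma_k^{-1} theta_hat_k.
rng = np.random.default_rng(42)
K, n_k, p = 8, 2000, 5
theta0 = np.array([3.0, 1.5, 0.0, 0.0, 2.0])
alpha = np.full(K, 1.0 / K)           # a_k = n_k / N, equal sizes here

pieces = []
for k in range(K):
    X = rng.standard_normal((n_k, p))
    y = X @ theta0 + rng.standard_normal(n_k)
    XtX = X.T @ X
    theta_hat_k = np.linalg.solve(XtX, X.T @ y)   # local OLS estimator
    Sigma_inv_k = XtX / n_k                       # local inverse covariance (unit noise variance)
    pieces.append((theta_hat_k, Sigma_inv_k))

# Master step: one round of communication, then a p x p solve.
Sigma_inv = sum(a * S for a, (_, S) in zip(alpha, pieces))
theta_tilde = np.linalg.solve(
    Sigma_inv, sum(a * S @ t for a, (t, S) in zip(alpha, pieces)))
```

Only $K$ pairs of a $p$-vector and a $p \times p$ matrix reach the master, which is what keeps the communication to a single round regardless of $N$.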
The Spark ML module takes longer than DLSA because DLSA requires only one round of communication and the parameter optimization is carried out within each executor. A serialized SGD takes more than 15 hours and obtains an inferior result (i.e., a smaller log-likelihood value).

Table 5: Log-likelihood, computational time, and communication time comparison with the airline data. Note that the inevitable overhead for initializing the Spark task is also included in the computational time. It is difficult to monitor the exact communication time per code interaction in Spark due to the lazy evaluation technique it uses. The overall communication time is therefore estimated by subtracting the computational time used on the master machine and the average computational time used on the workers from the total computational time.

Algorithm         Log likelihood    Computational time (min)    Communication time (min)    Worker's RAM    No. of workers
DLSA              −1.62 × 10^8      26.2                        3.8                         64GB            2
TDLSA             −1.62 × 10^8      28.7                        4.4                         64GB            2
OS                −1.65 × 10^8      25.3                        2.3                         64GB            2
Spark ML          −1.63 × 10^8      54.7                        18.6                        128GB           2
Serialized SGD    −1.71 × 10^8      913.2                       —                           64GB            1

Algorithm 1: Spark implementation
Input: the model function for modelling each partitioned dataset.
Output: the weighted least squares estimator $\widetilde\theta$, the covariance matrix $\widehat\Sigma$, and the DBIC value $\mathrm{DBIC}_\lambda$.
Steps:
Step (1). Pre-determine the overall available cluster memory $M_{\rm ram}$, the total number of CPU cores $C_{\rm cores}$, and the total size of the data to be processed $D_{\rm total}$.
Step (2). Define the number of batched chunks $N_{\rm chunks}$ to allow for out-of-memory data processing. We recommend that $N_{\rm chunks}$ be at least $3 \times D_{\rm total}/M_{\rm ram}$ in a Spark system.
Step (3). Define the number of partitions $P_{\rm partition} = D_{\rm total}/(N_{\rm chunks} \times C_{\rm cores})$.
Step (4). Define a model function whose input is an $n \times (p+2)$ Python pandas DataFrame containing the response variable, the covariates, and the partition id, and whose output is a $p \times (p+1)$ pandas DataFrame whose first column is $\widehat\theta_k$ and whose remaining columns store $\widehat\Sigma_k^{-1}$.
Step (5). for $i$ in $1{:}N_{\rm chunks}$ do
  (a). Transfer the data chunk to Spark's distributed DataFrame if the data are stored in another format.
  (b). Randomly assign an integer partition label from $\{1, \ldots, P_{\rm partition}\}$ to each row of the Spark DataFrame.
  (c). Repartition the DataFrame in the distributed system if the data are not partitioned by the partition label.
  (d). Group the Spark DataFrame by the assigned partition label.
  (e). Apply the model function to each grouped dataset with Spark's grouped map Pandas UDFs API and obtain a $(pP_{\rm partition}) \times (p+1)$ distributed Spark DataFrame $R_i$.
end
Step (6). Aggregate $R_i$ over both partitions and chunks and return the $p \times (p+1)$ matrix $R_{\rm final}$.
Step (7). Return $\widetilde\theta$ by Equation (2.2), $\widehat\Sigma$, and $\mathrm{DBIC}_\lambda$ by Equation (3.4).

• Because the final step of the DLSA algorithm is carried out on the master node, and because data transformation from the worker nodes to the master node is required, a special tool called "Apache Arrow" (https://arrow.apache.org/) is plugged into our system to allow efficient data transformation between Spark's distributed DataFrame and Python's pandas DataFrame.

5.2 Variable Selection Results with BIC

We next apply the proposed shrinkage DLSA method (referred to as SDLSA) with the BIC criterion to conduct variable selection. Remarkably, this can be fully conducted on the master, and no further communication is needed. It takes only 0.2 seconds to accomplish the task. After the shrinkage estimation, we are able to reduce the 181 variables to 157 variables. The detailed results are summarized in Table 6.

First, with respect to time effects, both yearly and seasonal trends are found. The coefficient for Year is 6.12, which implies that as the years proceed, airline delays become more severe. Next, passengers are likely to encounter delays in May, October, November, and December (with coefficients of 6.03, 0.80, 0.40, and 0.19, respectively). In terms of days of the week, more delays are expected on certain working days, i.e., Tuesday, Wednesday, and Friday (with coefficients of 0.60, 0.85, and 0.39, respectively), compared to weekends. Finally, within a given day, we find that the coefficients for both the scheduled departure time (CRSDepTime, −0.12) and the scheduled arrival time (CRSArrTime, −0.13) are negative, indicating that late departing and arriving flights suffer less from delays. Next, with respect to the airline carriers, AA (American Airlines), DL (Delta Air Lines), NW (Northwest Airlines), and UA (United Airlines) have estimated coefficients of 0.49, 0.18, 0.39, and 0.70, which indicates that they are more likely to be delayed than the other airlines. In addition, one may be interested in which airports are more likely to have delayed flights. According to our estimation results, the top five origin airports that cause delays are IAH (George Bush Intercontinental Airport), LGA (LaGuardia Airport), PHL (Philadelphia International Airport), ONT (Ontario International Airport), and SMF (Sacramento International Airport), with coefficients of 0.82, 0.87, 0.94, 1.58, and 1.59, respectively. The top five destination airports that cause delays are PBI (Palm Beach International Airport), MCI (Kansas City International Airport), DCA (Ronald Reagan Washington National Airport), SAN (San Diego International Airport), and MEM (Memphis International Airport), with coefficients of 0.99, 1.00, 1.07, 1.15, and 1.16, respectively.
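The master-side aggregation in Steps (6) and (7) of Algorithm 1 amounts to a weighted least squares combination of the local summaries. The following minimal sketch (ours, with illustrative variable names; not the authors' released toolbox) demonstrates it for linear regression, where the local quadratic approximation is exact: each simulated "worker" $k$ returns $(\alpha_k, \widehat\theta_k, \widehat\Sigma_k^{-1})$ with $\widehat\Sigma_k^{-1} = X_k^\top X_k / n_k$, and the master forms $\widetilde\theta = (\sum_k \alpha_k \widehat\Sigma_k^{-1})^{-1} \sum_k \alpha_k \widehat\Sigma_k^{-1} \widehat\theta_k$, which recovers the global OLS estimator.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, K = 10_000, 5, 10
X = rng.normal(size=(N, p))
theta0 = np.arange(1, p + 1, dtype=float)
y = X @ theta0 + rng.normal(size=N)

# Split rows across K hypothetical workers.
parts = np.array_split(np.arange(N), K)

# Each "worker" returns (alpha_k, theta_hat_k, Sigma_inv_k), where
# Sigma_inv_k is the local average Hessian X_k' X_k / n_k.
summaries = []
for idx in parts:
    Xk, yk = X[idx], y[idx]
    nk = len(idx)
    Sigma_inv_k = Xk.T @ Xk / nk
    theta_k = np.linalg.solve(Xk.T @ Xk, Xk.T @ yk)
    summaries.append((nk / N, theta_k, Sigma_inv_k))

# Master: weighted least squares combination of the local estimators.
A = sum(a * S for a, _, S in summaries)       # sum_k alpha_k Sigma_inv_k
b = sum(a * S @ t for a, t, S in summaries)   # sum_k alpha_k Sigma_inv_k theta_k
theta_tilde = np.linalg.solve(A, b)

# For linear regression the quadratic approximation is exact, so the
# combined estimator coincides with the global OLS estimator.
theta_global = np.linalg.solve(X.T @ X, X.T @ y)
assert np.allclose(theta_tilde, theta_global)
```

For general likelihoods (e.g., logistic regression), the same combination is only an approximation of the global estimator, but the proposition above shows it attains the same asymptotic efficiency.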
6 Concluding remarks

In this article, we develop a novel DLSA algorithm that is able to perform large-scale statistical estimation and inference on a distributed system. The DLSA method can be applied to a large family of regression models (e.g., logistic regression, Poisson regression, and Cox's model). First, it is shown that the DLSA estimator is as statistically efficient as the global estimator. Moreover, it is computationally efficient and requires only one round of communication. Furthermore, we develop the corresponding shrinkage estimation by using an adaptive Lasso approach. The oracle property is theoretically proven. A new DBIC measure for distributed variable selection, which only needs to be computed on the master and requires no further communication, is designed. We prove the DBIC measure to be selection consistent. Finally, numerical studies are conducted with five classical regression examples. In addition, a Spark toolbox is developed, which is shown to be computationally efficient both in simulation and in the airline data analysis.

To facilitate future research, we now discuss several interesting topics. First, the DLSA method requires the objective function to have continuous second-order derivatives. This assumption might be restrictive and cannot be satisfied for certain regression models, e.g., quantile regression. Consequently, the relaxation of this assumption can be investigated, and corresponding distributed algorithms should be designed for such regression models. Second, the dimension considered in our framework is finite. As a natural extension, one could study the shrinkage estimation properties in high-dimensional settings. Furthermore, as we commented on our pre-feature screening procedure, further post-selection statistical inference should be investigated (Berk et al., 2013; Taylor and Tibshirani, 2015; Lee et al., 2016).
Third, the algorithm is designed for independent data. In practice, dependent data (e.g., time series data and network data) are frequently encountered. It is thus interesting to develop corresponding algorithms by considering the dependency structure. Lastly, note that the penalty term given in (3.1) can be added to other surrogate loss functions as in Jordan et al. (2019). Hence, the theoretical analysis in Section 3 has great potential to extend to other methodologies.

Table 6: Coefficients for the logistic model estimated with SDLSA using the BIC. The carrier and airport abbreviations are assigned by the International Air Transport Association. We denote "Airport", "Origin", and "Destination" by "A", "O", and "D", respectively. The variances of all coefficients are smaller than 0.001. The notation *** indicates significance at the 0.001 level.

Intercept   Year      CRSArrTime   Distance    CRSDepTime   DayofMonth   ElapsedTime   DepTime
−0.30       6.12***   −0.13        −5.68***    −0.12        −0.06        0.08          0.15

Month   Feb     Mar     Apr    May       Jun     Jul     Aug     Sep     Oct       Nov    Dec
        −0.13   −0.01   0.03   6.03***   −0.28   −0.09   −0.02   −0.07   0.80***   0.40   0.19

Day of Week   Tue       Wed       Thu     Fri    Sat    Sun
              0.60***   0.85***   −0.57   0.39   0.22   0.26

Carrier   AA        CO      DL     NW     UA        US      WN
          0.49***   −0.16   0.18   0.39   0.70***   −0.43   −0.60

A   ABQ     ANC     ATL     AUS     BDL     BHM     BNA     BOS    BUF     BUR     BWI
O   −0.46   −0.14   −0.02   0.39    0.38    0.13    −0.17   0.13   −0.55   −0.01   0.54
D   −0.61   0.78    −0.84   −0.73   −0.74   −0.46   −0.28   0.51   −0.50   −0.90   0.69

A   CLE     CLT     CMH     CVG     DAL     DAY     DCA       DEN     DFW     DTW     ELP
O   0.49    −0.87   −0.80   0.29    −0.08   0.40    −0.07     −0.22   0.53    0.32    −0.56
D   0.14    −0.46   0.64    −1.60   −0.72   −0.79   1.07***   0.37    −1.04   −0.19   −0.54

A   EWR     FLL     GSO    HNL    HOU     IAD     IAH       IND     JAX     JFK     LAS
O   0.21    −0.99   0.11   0.19   −0.51   −0.29   0.82***   0.46    −0.71   0.07    −0.94
D   −0.44   −1.70   0.77   0.79   −1.62   0.00    −0.54     −0.27   −0.12   −0.70   −0.29

A   LAX     LGA       MCI       MCO    MDW     MEM       MIA     MKE     MSP     MSY    OAK
O   0.38    0.87***   0.61      0.52   −0.48   0.71      0.39    −0.64   −0.10   0.21   −1.04
D   −0.20   0.13      1.00***   0.30   0.10    1.16***   −0.11   −0.83   −0.61   0.97   −1.47

A   OKC     OMA     ONT       ORD     ORF     PBI       PDX     PHL       PHX    PIT    PVD
O   −0.71   −0.85   1.58***   −0.41   −0.67   −1.12     −0.13   0.94***   0.37   0.18   −0.42
D   0.69    −0.74   −1.49     −0.72   −0.79   0.99***   −0.64   −0.54     0.83   0.33   −0.40

A   RDU     RIC    RNO     ROC     RSW    SAN       SAT     SDF     SEA     SFO     SJC
O   −5.80   0.02   −0.06   0.20    0.05   0.00      −0.04   0.00    0.24    0.24    0.14
D   0.97    0.83   0.71    −1.07   0.37   1.15***   0.78    −0.35   −0.79   −0.51   0.21

A   SJU     SLC    SMF       SNA     STL    SYR     TPA     TUL     TUS
O   −0.14   0.59   1.59***   −0.15   0.14   −1.21   −0.28   −0.36   −1.51
D   0.84    0.09   −0.03     0.02    0.38   −0.06   0.31    0.43    0.03

7 Acknowledgements

The authors thank the editor, associate editor, and two referees for their insightful comments, which have led to significant improvement of this article. We also thank Dr. Zhihui Jin for his constructive suggestions.

References

Apache Software Foundation (2019a), "Apache Hadoop (version 2.7.2)".
— (2019b), "Apache Spark (version 2.3.1)".
Battey, H., Fan, J., Liu, H., Lu, J., and Zhu, Z. (2015), "Distributed estimation and inference with statistical guarantees," arXiv preprint arXiv:1509.05457.
Berk, R., Brown, L., Buja, A., Zhang, K., Zhao, L., et al. (2013), "Valid post-selection inference," The Annals of Statistics, 41, 802–837.
Cameron, A. C. and Trivedi, P. K. (2013), Regression Analysis of Count Data, vol. 53, Cambridge University Press.
Chang, X., Lin, S.-B., and Wang, Y. (2017a), "Divide and conquer local average regression," Electronic Journal of Statistics, 11, 1326–1350.
Chang, X., Lin, S.-B., and Zhou, D.-X.
(2017b), "Distributed semi-supervised learning with kernel ridge regression," The Journal of Machine Learning Research, 18, 1493–1514.
Chen, J. and Chen, Z. (2008), "Extended Bayesian information criteria for model selection with large model spaces," Biometrika, 95, 759–771.
Chen, T. and Guestrin, C. (2016), "XGBoost: A scalable tree boosting system," in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, pp. 785–794.
Chen, X., Liu, W., and Zhang, Y. (2018), "Quantile regression under memory constraint," arXiv preprint arXiv:1810.08264.
Chen, X. and Xie, M.-g. (2014), "A split-and-conquer approach for analysis of extraordinarily large data," Statistica Sinica, 1655–1684.
Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004), "Least angle regression," Annals of Statistics, 32, 407–499.
Fan, J. and Li, R. (2001), "Variable selection via nonconcave penalized likelihood and its oracle properties," Journal of the American Statistical Association, 96, 1348–1360.
Fan, J. and Lv, J. (2008), "Sure independence screening for ultra-high dimensional feature space (with discussion)," Journal of the Royal Statistical Society, Series B, 70, 849–911.
Fan, J., Wang, D., Wang, K., and Zhu, Z. (2017), "Distributed estimation of principal eigenspaces," arXiv preprint arXiv:1702.06488.
Heinze, C., McWilliams, B., and Meinshausen, N. (2016), "Dual-loco: Distributing statistical estimation using random projections," in Artificial Intelligence and Statistics, pp. 875–883.
Hosmer Jr, D. W., Lemeshow, S., and Sturdivant, R. X. (2013), Applied Logistic Regression, vol. 398, John Wiley & Sons.
Jordan, M. I., Lee, J. D., and Yang, Y. (2019), "Communication-efficient distributed statistical inference," Journal of the American Statistical Association, 114, 668–681.
Kleiner, A., Talwalkar, A., Sarkar, P., and Jordan, M. I.
(2014), "A scalable bootstrap for massive data," Journal of the Royal Statistical Society: Series B: Statistical Methodology, 795–816.
Lee, J. D., Sun, D. L., Sun, Y., Taylor, J. E., et al. (2016), "Exact post-selection inference, with application to the lasso," Annals of Statistics, 44, 907–927.
Lee, J. D., Sun, Y., Liu, Q., and Taylor, J. E. (2015), "Communication-efficient sparse regression: a one-shot approach," arXiv preprint arXiv:1503.04337.
Lee, J. D. and Taylor, J. E. (2014), "Exact post model selection inference for marginal screening," arXiv preprint arXiv:1402.5596.
Lehmann, E. L. and Casella, G. (2006), Theory of Point Estimation, Springer Science & Business Media.
Li, G., Peng, H., Zhang, J., Zhu, L., et al. (2012a), "Robust rank correlation based screening," The Annals of Statistics, 40, 1846–1877.
Li, R., Zhong, W., and Zhu, L. (2012b), "Feature screening via distance correlation learning," Journal of the American Statistical Association, 107, 1129–1139.
Li, X., Li, R., Xia, Z., and Xu, C. (2019), "Distributed feature screening via componentwise debiasing," arXiv preprint arXiv:1903.03810.
— (2020), "Distributed feature screening via componentwise debiasing," Journal of Machine Learning Research, 21, 1–32.
Liu, Q. and Ihler, A. T. (2014), "Distributed estimation, information loss and exponential families," in Advances in Neural Information Processing Systems, pp. 1098–1106.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. (2011), "Scikit-learn: Machine learning in Python," Journal of Machine Learning Research, 12, 2825–2830.
Sengupta, S., Volgushev, S., and Shao, X. (2016), "A subsampled double bootstrap for massive data," Journal of the American Statistical Association, 111, 1222–1232.
Shamir, O., Srebro, N., and Zhang, T. (2014), "Communication-efficient distributed optimization using an approximate Newton-type method," in International Conference on Machine Learning, pp. 1000–1008.
Shao, J. (1997), "An asymptotic theory for linear model selection," Statistica Sinica, 221–242.
Smith, V., Forte, S., Ma, C., Takáč, M., Jordan, M. I., and Jaggi, M. (2018), "CoCoA: A general framework for communication-efficient distributed optimization," Journal of Machine Learning Research, 18, 1–49.
Taylor, J. and Tibshirani, R. J. (2015), "Statistical learning and selective inference," Proceedings of the National Academy of Sciences, 112, 7629–7634.
Tibshirani, R. (1996), "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society, Series B, 267–288.
Volgushev, S., Chao, S.-K., Cheng, G., et al. (2019), "Distributed inference for quantile regression processes," The Annals of Statistics, 47, 1634–1662.
Wang, H. and Leng, C. (2007), "Unified LASSO estimation by least squares approximation," Journal of the American Statistical Association, 102, 1039–1048.
Wang, H., Li, R., and Tsai, C.-L. (2007), "Tuning parameter selectors for the smoothly clipped absolute deviation method," Biometrika, 94, 553–568.
Wang, H., Zhu, R., and Ma, P. (2018), "Optimal subsampling for large sample logistic regression," Journal of the American Statistical Association, 113, 829–844.
Wang, J., Kolar, M., Srebro, N., and Zhang, T. (2017a), "Efficient distributed learning with sparsity," in Proceedings of the 34th International Conference on Machine Learning - Volume 70, JMLR.org, pp. 3636–3645.
Wang, J., Wang, W., and Srebro, N. (2017b), "Memory and communication efficient distributed stochastic optimization with minibatch prox," in Proceedings of the 2017 Conference on Learning Theory, eds. Kale, S.
and Shamir, O., Amsterdam, Netherlands: PMLR, vol. 65 of Proceedings of Machine Learning Research, pp. 1882–1919.
Wang, L., Kim, Y., and Li, R. (2013), "Calibrating non-convex penalized regression in ultra-high dimension," Annals of Statistics, 41, 2505.
Yang, J., Mahoney, M. W., Saunders, M., and Sun, Y. (2016), "Feature-distributed sparse regression: a screen-and-clean approach," in Advances in Neural Information Processing Systems, pp. 2712–2720.
Yu, Y., Chao, S.-K., and Cheng, G. (2020), "Simultaneous inference for massive data: Distributed bootstrap," arXiv preprint arXiv:2002.08443.
Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin, M. J., Shenker, S., and Stoica, I. (2012), "Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing," in Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, USENIX Association, pp. 2–2.
Zhang, C.-H. (2010), "Nearly unbiased variable selection under minimax concave penalty," Annals of Statistics, 38, 894–942.
Zhang, H. H. and Lu, W. (2007), "Adaptive Lasso for Cox's proportional hazards model," Biometrika, 94, 691–703.
Zhang, T. (2004), "Solving large scale linear prediction problems using stochastic gradient descent algorithms," in Proceedings of the Twenty-First International Conference on Machine Learning, ACM, p. 116.
Zhang, Y., Duchi, J. C., and Wainwright, M. J. (2013), "Communication-efficient algorithms for statistical optimization," The Journal of Machine Learning Research, 14, 3321–3363.
Zhu, L.-P., Li, L., Li, R., and Zhu, L.-X. (2011), "Model-free feature screening for ultrahigh-dimensional data," Journal of the American Statistical Association, 106, 1464–1475.
Zou, H. (2006), "The adaptive lasso and its oracle properties," Journal of the American Statistical Association, 101, 1418–1429.
Zou, H. and Li, R.
(2008), "One-step sparse estimates in nonconcave penalized likelihood models," Annals of Statistics, 36, 1509.
Zou, H. and Zhang, H. H. (2009), "On the adaptive elastic-net with a diverging number of parameters," Annals of Statistics, 37, 1733.

Appendix A

Appendix A.1: Proof of Proposition 1

Note that $\widetilde\theta - \theta_0$ takes the form
\[
\widetilde\theta - \theta_0 = \Big( \sum_k \alpha_k \widehat\Sigma_k^{-1} \Big)^{-1} \sum_k \alpha_k \widehat\Sigma_k^{-1} ( \widehat\theta_k - \theta_0 ).
\]
Define $\widehat\Sigma_k(\theta) = \{ \partial^2 \mathcal{L}_k(\theta) / \partial\theta \partial\theta^\top \}^{-1}$. In the following, we write $\widehat\Sigma_k$ as $\widehat\Sigma_k(\widehat\theta_k)$ for clarity. By Slutsky's theorem, to prove (2.4), it suffices to verify that
\[
\sum_k \alpha_k \{ \widehat\Sigma_k(\widehat\theta_k) \}^{-1} \to_p \Sigma^{-1}, \quad (A.1)
\]
\[
\sqrt{N} \Big[ \sum_k \alpha_k \{ \widehat\Sigma_k(\widehat\theta_k) \}^{-1} ( \widehat\theta_k - \theta_0 ) \Big] = V^*(\theta_0) + B^*(\theta_0), \quad (A.2)
\]
where $\mathrm{cov}\{ V^*(\theta_0) \} = \Sigma^{-1}$ and $B^*(\theta_0) = O_p(K/\sqrt{N})$. We prove these results in the following.

1. Proof of (A.1). Recall that $\widehat\theta_k$ is a $\sqrt{n}$-consistent estimator of $\theta_0$. This enables us to conduct a Taylor expansion of $\widehat\Sigma_k^{-1}(\widehat\theta_k)$ at $\theta_0$, which yields
\[
\widehat\Sigma_k^{-1}(\widehat\theta_k) - \Sigma_k^{-1}
= \widehat\Sigma_k^{-1}(\widehat\theta_k) - \widehat\Sigma_k^{-1}(\theta_0) + \widehat\Sigma_k^{-1}(\theta_0) - \Sigma_k^{-1}
= \sum_j \frac{\partial^3 \mathcal{L}_k(\theta)}{\partial\theta_j \partial\theta \partial\theta^\top} \Big|_{\theta=\theta^*} ( \theta_j^* - \theta_j ) + \widehat\Sigma_k^{-1}(\theta_0) - \Sigma_k^{-1},
\]
where $\theta^*$ lies on the line joining $\theta_0$ and $\widehat\theta_k$. By Condition (C5), we have $\partial^3 \mathcal{L}_k(\theta)/\partial\theta_j \partial\theta \partial\theta^\top |_{\theta=\theta^*} = O_p(1)$. Therefore, the first term is of order $O_p(1/\sqrt{n_k})$. In addition, we have $\widehat\Sigma_k^{-1}(\theta_0) - \Sigma_k^{-1} = \widehat\Sigma_k^{-1}(\theta_0) - E\{ \widehat\Sigma_k^{-1}(\theta_0) \} = O_p(n_k^{-1/2})$. Consequently, it can be derived that $\widehat\Sigma_k^{-1}(\widehat\theta_k) - \Sigma_k^{-1} = O_p(n_k^{-1/2})$. Further note that $\alpha_k = n_k/N$ and $\sum_k \alpha_k = 1$. Then, we have
\[
\widehat\Sigma^{-1} - \Sigma^{-1} = \sum_k \alpha_k \big[ \{ \widehat\Sigma_k(\widehat\theta_k) \}^{-1} - \Sigma_k^{-1} \big] = O_p(n^{-1/2}) = o_p(1). \quad (A.3)
\]
Hence (A.1) is proven.

2. Proof of (A.2). Recall that $\widehat\theta_k$ is the local minimizer of $\mathcal{L}_k(\theta)$.
Therefore, it holds that
\[
0 = \frac{\partial \mathcal{L}_k(\theta)}{\partial\theta} \Big|_{\theta=\widehat\theta_k}
= \frac{\partial \mathcal{L}_k(\theta)}{\partial\theta} \Big|_{\theta=\theta_0}
+ \frac{1}{n_k} \sum_{i \in S_k} \frac{\partial^2 \mathcal{L}(\theta; Z_i)}{\partial\theta \partial\theta^\top} \Big|_{\theta=\theta_0} ( \widehat\theta_k - \theta_0 )
+ \frac{1}{2 n_k} \sum_{i \in S_k} \sum_{j=1}^p ( \theta^* - \theta_0 )^\top \frac{\partial^3 \mathcal{L}(\theta; Z_i)}{\partial\theta_j \partial\theta \partial\theta^\top} \Big|_{\theta=\theta^*} ( \theta^* - \theta_0 ),
\]
where $\theta^*$ lies between $\theta_0$ and $\widehat\theta_k$. By standard arguments,
\[
\frac{\partial \mathcal{L}_k(\theta)}{\partial\theta} \Big|_{\theta=\theta_0} = O_p(n_k^{-1/2}),
\]
\[
\widehat\Sigma_k^{-1}(\theta_0) = E \Big\{ \frac{\partial^2 \mathcal{L}(\theta; Z_i, i \in S_k)}{\partial\theta \partial\theta^\top} \Big\} \Big|_{\theta=\theta_0} + O_p(n_k^{-1/2}) = \Sigma_k^{-1} + O_p(n_k^{-1/2}),
\]
\[
\frac{\partial^3 \mathcal{L}_k(\theta)}{\partial\theta_j \partial\theta \partial\theta^\top} \Big|_{\theta=\theta^*} = E \Big\{ \frac{\partial^3 \mathcal{L}(\theta; Z_i, i \in S_k)}{\partial\theta \partial\theta^\top \partial\theta_j} \Big|_{\theta=\theta^*} \Big\} + o_p(1).
\]
Further note that $\widehat\theta_k - \theta_0 = O_p(n_k^{-1/2})$. Then, we have
\[
\widehat\theta_k - \theta_0 = - \Sigma_k \frac{\partial \mathcal{L}_k(\theta)}{\partial\theta} \Big|_{\theta=\theta_0} + \frac{B_k(\theta_0)}{n_k} + O_p\Big( \frac{1}{n_k} \Big),
\]
where $B_k(\theta_0) = O(1)$ is the bias term. Then, it holds that
\[
\sqrt{N} \sum_k \alpha_k \widehat\Sigma_k^{-1}(\widehat\theta_k) ( \widehat\theta_k - \theta_0 )
= \sqrt{N} \sum_k \alpha_k \Sigma_k^{-1} \Big\{ - \Sigma_k \frac{\partial \mathcal{L}_k(\theta)}{\partial\theta} \Big|_{\theta=\theta_0} + \frac{B_k(\theta_0)}{n_k} + O_p(n_k^{-1}) \Big\}
+ \sqrt{N} \sum_k \alpha_k \{ \widehat\Sigma_k^{-1}(\widehat\theta_k) - \Sigma_k^{-1} \} ( \widehat\theta_k - \theta_0 )
\]
\[
= - \frac{1}{\sqrt{N}} \sum_k n_k \frac{\partial \mathcal{L}_k(\theta)}{\partial\theta} \Big|_{\theta=\theta_0}
+ \frac{1}{\sqrt{N}} \sum_k \Sigma_k^{-1} B_k(\theta_0)
+ O_p\Big( \frac{K}{\sqrt{N}} \Big), \quad (A.4)
\]
where the second equation is implied by $\sqrt{N} \sum_k \alpha_k \{ \widehat\Sigma_k^{-1}(\widehat\theta_k) - \Sigma_k^{-1} \} ( \widehat\theta_k - \theta_0 ) = O_p(K/\sqrt{N})$. By Condition (C4), it can be concluded that $\mathrm{cov}\{ N^{-1/2} \sum_k n_k \, \partial \mathcal{L}_k(\theta)/\partial\theta |_{\theta=\theta_0} \} = \Sigma^{-1}$. Consequently, (A.2) is proven.

Appendix A.2: Proof of Theorem 1

By Slutsky's theorem, the asymptotic normality is directly implied by (A.1) and (A.2), with $V^*(\theta_0) \to_d N(0, \Sigma^{-1})$ and $B^*(\theta_0) = o_p(1)$. First, by Condition (C6) and the Lyapunov central limit theorem, we have $V^*(\theta_0) \to_d N(0, \Sigma^{-1})$. Next, by the condition that $n \gg \sqrt{N}$, we have $K \ll \sqrt{N}$, and thus $B^*(\theta_0) = o_p(1)$.

Appendix A.3: Proof of Theorem 2

Note that we have
\[
\widetilde\theta^{(2)} = \Big( \sum_k \alpha_k \{ \widehat\Sigma_k^{(2)} \}^{-1} \Big)^{-1} \sum_k \alpha_k \{ \widehat\Sigma_k^{(2)} \}^{-1} \widehat\theta_k^{(2)}.
\]
It suffices to show that
\[
\sum_k \alpha_k \{ \widehat\Sigma_k^{(2)} \}^{-1} \to_p \Sigma^{-1}, \quad (A.5)
\]
\[
\sqrt{N} \Big[ \sum_k \alpha_k \{ \widehat\Sigma_k^{(2)} \}^{-1} ( \widehat\theta_k^{(2)} - \theta_0 ) \Big] = V_2^*(\theta_0) + B_2^*(\theta_0), \quad (A.6)
\]
\[
\sqrt{N} ( \widetilde\theta^{(2)} - \theta_0 ) \to_d N(0, \Sigma), \quad (A.7)
\]
where $\mathrm{cov}\{ V_2^*(\theta_0) \} = \Sigma^{-1}$ and $B_2^*(\theta_0) = O_p(N^{1/2}/n^2)$. We prove these in the following three steps.

Step 1. $\widehat\theta_k^{(2)} - \theta_0 = O_p(n^{-1/2})$. Note that we have
\[
\widehat\theta_k^{(2)} = \widetilde\theta - \Big\{ \frac{\partial^2 \mathcal{L}_k(\widetilde\theta)}{\partial\theta \partial\theta^\top} \Big\}^{-1} \frac{\partial \mathcal{L}_k(\widetilde\theta)}{\partial\theta}.
\]
By performing a Taylor expansion of $\partial \mathcal{L}_k(\widetilde\theta)/\partial\theta$ at $\theta_0$, we obtain
\[
\frac{\partial \mathcal{L}_k(\widetilde\theta)}{\partial\theta}
= \frac{\partial \mathcal{L}_k(\theta_0)}{\partial\theta} + \frac{\partial^2 \mathcal{L}_k(\bar\theta)}{\partial\theta \partial\theta^\top} ( \widetilde\theta - \theta_0 )
= \frac{\partial \mathcal{L}_k(\theta_0)}{\partial\theta}
+ \Big\{ \frac{\partial^2 \mathcal{L}_k(\bar\theta)}{\partial\theta \partial\theta^\top} - \frac{\partial^2 \mathcal{L}_k(\widetilde\theta)}{\partial\theta \partial\theta^\top} \Big\} ( \widetilde\theta - \theta_0 )
+ \frac{\partial^2 \mathcal{L}_k(\widetilde\theta)}{\partial\theta \partial\theta^\top} ( \widetilde\theta - \theta_0 ),
\]
where $\bar\theta$ lies on the line joining $\widetilde\theta$ and $\theta_0$. As a result, we have
\[
\widehat\theta_k^{(2)} = \theta_0
- \Big\{ \frac{\partial^2 \mathcal{L}_k(\widetilde\theta)}{\partial\theta \partial\theta^\top} \Big\}^{-1} \frac{\partial \mathcal{L}_k(\theta_0)}{\partial\theta}
- \Big\{ \frac{\partial^2 \mathcal{L}_k(\widetilde\theta)}{\partial\theta \partial\theta^\top} \Big\}^{-1} \Big\{ \frac{\partial^2 \mathcal{L}_k(\bar\theta)}{\partial\theta \partial\theta^\top} - \frac{\partial^2 \mathcal{L}_k(\widetilde\theta)}{\partial\theta \partial\theta^\top} \Big\} ( \widetilde\theta - \theta_0 )
\overset{\rm def}{=} \theta_0 - \Delta_{k1} - \Delta_{k2}. \quad (A.8)
\]
Using Proposition 1, we have $\Delta_{k1} = O_p(n^{-1/2})$. Then it suffices to derive the rate of $\Delta_{k2}$. By using a Taylor expansion again, we have
\[
\frac{\partial^2 \mathcal{L}_k(\bar\theta)}{\partial\theta \partial\theta^\top} - \frac{\partial^2 \mathcal{L}_k(\widetilde\theta)}{\partial\theta \partial\theta^\top}
= \sum_j \frac{\partial^3 \mathcal{L}_k(\bar\theta^{(2)})}{\partial\theta \partial\theta^\top \partial\theta_j} ( \bar\theta_j - \widetilde\theta_j ),
\]
where $\bar\theta^{(2)}$ lies on the line joining $\bar\theta$ and $\widetilde\theta$. By Proposition 1, we can obtain $\partial^3 \mathcal{L}_k(\bar\theta^{(2)})/\partial\theta \partial\theta^\top \partial\theta_j \to_p E\{ \partial^3 \mathcal{L}_k(\theta_0)/\partial\theta \partial\theta^\top \partial\theta_j \} \overset{\rm def}{=} L_{kj}^{(3)}$ and $\partial^2 \mathcal{L}_k(\widetilde\theta)/\partial\theta \partial\theta^\top \to_p \Sigma_k^{-1}$. By Proposition 1, we have
\[
\Big\{ \frac{\partial^2 \mathcal{L}_k(\widetilde\theta)}{\partial\theta \partial\theta^\top} \Big\}^{-1} \Big\{ \frac{\partial^2 \mathcal{L}_k(\bar\theta)}{\partial\theta \partial\theta^\top} - \frac{\partial^2 \mathcal{L}_k(\widetilde\theta)}{\partial\theta \partial\theta^\top} \Big\} ( \widetilde\theta - \theta_0 )
= \Sigma_k \sum_j L_{kj}^{(3)} \big\{ V^{*(j)}(\theta_0) + B^{*(j)}(\theta_0) \big\} \{ 1 + o_p(1) \}, \quad (A.9)
\]
where $V^{*(j)}(\theta_0) = \{ V(\theta_0) B_j(\theta_0) + B(\theta_0) V_j(\theta_0) \}/(n\sqrt{N}) = O_p(1/(n\sqrt{N}))$ and $B^{*(j)}(\theta_0) = B(\theta_0) B_j(\theta_0) = O_p(n^{-2})$.

Step 2. Proof of (A.5).
Following the same procedure as in the proof of (A.1) and using $\widehat\theta_k^{(2)} - \theta_0 = O_p(n^{-1/2})$ proved in Step 1, we obtain the result.

Step 3. Proof of (A.6). By (A.8) and (A.9), we have
\[
\sum_k \alpha_k \{ \widehat\Sigma_k^{(2)} \}^{-1} ( \widehat\theta_k^{(2)} - \theta_0 )
= \sum_k \alpha_k \Big[ - \frac{\partial \mathcal{L}_k(\theta_0)}{\partial\theta}
- \Big\{ \frac{\partial^2 \mathcal{L}_k(\bar\theta)}{\partial\theta \partial\theta^\top} - \frac{\partial^2 \mathcal{L}_k(\widetilde\theta)}{\partial\theta \partial\theta^\top} \Big\} ( \widetilde\theta - \theta_0 ) \Big]
= - \sum_k \alpha_k \frac{\partial \mathcal{L}_k(\theta_0)}{\partial\theta}
- \sum_k \alpha_k \sum_j L_{kj}^{(3)} \big\{ V^{*(j)}(\theta_0) + B^{*(j)}(\theta_0) \big\} \{ 1 + o_p(1) \}.
\]
Define
\[
V_2^*(\theta_0) \overset{\rm def}{=} - \sqrt{N} \sum_k \alpha_k \frac{\partial \mathcal{L}_k(\theta_0)}{\partial\theta}, \qquad
V_{22}(\theta_0) \overset{\rm def}{=} - \sqrt{N} \sum_k \alpha_k \sum_j L_{kj}^{(3)} V^{*(j)}(\theta_0) \{ 1 + o_p(1) \},
\]
\[
B_2^*(\theta_0) \overset{\rm def}{=} - \sqrt{N} \sum_k \alpha_k \sum_j L_{kj}^{(3)} B^{*(j)}(\theta_0) \{ 1 + o_p(1) \}.
\]
Then we have $\mathrm{cov}\{ V_2^*(\theta_0) \} = \Sigma^{-1}$, $V_{22}(\theta_0) = o_p(1)$, and $B_2^*(\theta_0) = O_p(n^{-2} N^{1/2})$ by Step 2.

Step 4. Proof of (A.7). By Condition (C6) and the Lyapunov central limit theorem, we have $V_2^*(\theta_0) \to_d N(0, \Sigma^{-1})$. Under the condition that $n^2/N^{1/2} \to \infty$, we conclude that $B_2^*(\theta_0) = o_p(1)$. Using Slutsky's theorem, we obtain the result.

Appendix A.4: Proof of Proposition 2

The result can be proved by induction. By Theorem 2, the bias term of $\widetilde\theta^{(2)}$ reduces from $O_p(n^{-1})$ to $O_p(n^{-2})$ when we take one more round of iteration. Applying the proof of Theorem 2 sequentially $m$ times, we can show that the bias term of $\widetilde\theta^{(m)}$ is of the order $O_p(n^{-m})$. If $n^{-m} \ll N^{-1/2}$, i.e., $m \gg \log N/\log n$, then the bias term can be ignored. As a result, following the same proof procedure as Theorem 2, we obtain the result.

Appendix B

Appendix B.1: Proof of Theorem 3

1. Proof of $\sqrt{N}$-consistency. Note that the objective function $Q_\lambda(\theta)$ in (3.1) is a strictly convex function. Then, the local minimizer is also a global minimizer.
To establish the $\sqrt{N}$-consistency result, it suffices to verify the following (Fan and Li, 2001): for an arbitrarily small $\varepsilon > 0$, there exists a sufficiently large constant $C$ such that
\[
\liminf_N P \Big\{ \inf_{u \in \mathbb{R}^p : \| u \| = C} Q_\lambda( \theta_0 + N^{-1/2} u ) > Q_\lambda( \theta_0 ) \Big\} > 1 - \varepsilon. \quad (B.1)
\]
Let $u = ( u_1, \cdots, u_p )^\top$ and $\widehat\Delta_N = \sum_k \alpha_k \widehat\Sigma_k^{-1}(\widehat\theta_k) \sqrt{N} ( \theta_0 - \widehat\theta_k )$. Then, we have
\[
N \big\{ Q_\lambda( \theta_0 + N^{-1/2} u ) - Q_\lambda( \theta_0 ) \big\}
= u^\top \widehat\Sigma^{-1} u + 2 u^\top \widehat\Delta_N + N \sum_{j=1}^p \lambda_j | \theta_{0j} + N^{-1/2} u_j | - N \sum_{j=1}^p \lambda_j | \theta_{0j} |
\]
\[
\ge u^\top \widehat\Sigma^{-1} u + 2 u^\top \widehat\Delta_N + N \sum_{j=1}^{d_0} \lambda_j \big( | \theta_{0j} + N^{-1/2} u_j | - | \theta_{0j} | \big)
\ge u^\top \widehat\Sigma^{-1} u + 2 u^\top \widehat\Delta_N - d_0 ( \sqrt{N} a_N ) \| u \|, \quad (B.2)
\]
where the first inequality holds because we assume $\theta_{0j} = 0$ for $j > d_0$. Further note that we assume $\sqrt{N} a_N \to_p 0$. Consequently, the last term in (B.2) is $o_p(1)$. Next, note that the first term of (B.2) is lower bounded by $\lambda_{\max}^{-1}(\widehat\Sigma) C^2$ because $\| u \| = C$. By (A.3), we have $\lambda_{\max}(\widehat\Sigma) \to_p \lambda_{\max}(\Sigma)$. Consequently, with probability tending to 1, the first term is uniformly larger than $0.5 \lambda_{\max}^{-1}(\Sigma) C^2$, which is positive due to Condition (C4). In addition, by $K/\sqrt{N} \to 0$, we have $\widehat\Delta_N = O_p(1)$. Consequently, as long as $C$ is sufficiently large, the first term will dominate the other two terms. Then, (B.1) is proven.

2. Proof of Selection Consistency. It suffices to verify that $P( \widetilde\theta_{\lambda,j} = 0 ) \to 1$ for any $d_0 < j \le p$. Note that $Q_\lambda(\theta)$ can be rewritten as
\[
Q_\lambda(\theta) = ( \theta - \widetilde\theta )^\top \widehat\Sigma^{-1} ( \theta - \widetilde\theta ) + \sum_j \lambda_j | \theta_j | + C,
\]
where $C$ is a constant. Define $\widehat\Omega = \widehat\Sigma^{-1}$, and let $\widehat\Omega_{(j)}$ denote the $j$-th row of the matrix $\widehat\Omega$. If $\widetilde\theta_{\lambda,j} \ne 0$ for some $j > d_0$, then the partial derivative can be calculated as
\[
\sqrt{N} \frac{\partial Q_\lambda(\theta)}{\partial \theta_j} \Big|_{\theta = \widetilde\theta_\lambda}
= 2 \widehat\Omega_{(j)}^\top \sqrt{N} ( \widetilde\theta_\lambda - \widetilde\theta ) + \sqrt{N} \lambda_j \, \mathrm{sign}( \widetilde\theta_{\lambda,j} ). \quad (B.3)
\]
Note that $\widehat\Omega \to_p \Sigma^{-1}$ and $\sqrt{N} ( \widetilde\theta_\lambda - \widetilde\theta ) = \sqrt{N} ( \widetilde\theta_\lambda - \theta_0 ) - \sqrt{N} ( \widetilde\theta - \theta_0 ) = O_p(1)$ by (A.1), Theorem 1, and Theorem 3(a). Consequently, the first term of (B.3) is $O_p(1)$. Next, by the condition, we know that $\sqrt{N} \lambda_j \ge \sqrt{N} b_\lambda \to \infty$ for $j > d_0$. Because $\widetilde\theta_{\lambda,j} \ne 0$, we have $\mathrm{sign}( \widetilde\theta_{\lambda,j} ) = 1$ or $-1$; thus, the second term of (B.3) goes to infinity. Obviously, the expression cannot equal zero. This implies $P( \widetilde\theta_{\lambda,j} = 0 ) \to 1$ as a result.

Appendix B.2: Proof of Theorem 4

We first rewrite the asymptotic covariance $\Sigma$ in the following block matrix form:
\[
\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix},
\]
where $\Sigma_{11} \in \mathbb{R}^{d_0 \times d_0}$. Similarly, we partition its inverse matrix $\Omega$ into four corresponding parts, $\Omega = ( \Omega_{11}, \Omega_{12}; \Omega_{21}, \Omega_{22} )$. By Theorem 3, with probability tending to 1, we have $\widetilde\theta_\lambda^{(-\mathcal{M}_T)} = 0$. Therefore, $\widetilde\theta_\lambda^{(\mathcal{M}_T)}$ should be the global minimizer of the objective function
\[
Q_{\lambda,0}( \theta^{(\mathcal{M}_T)} )
= ( \theta^{(\mathcal{M}_T)} - \widetilde\theta^{(\mathcal{M}_T)} )^\top \widehat\Omega_{11} ( \theta^{(\mathcal{M}_T)} - \widetilde\theta^{(\mathcal{M}_T)} )
- 2 ( \theta^{(\mathcal{M}_T)} - \widetilde\theta^{(\mathcal{M}_T)} )^\top \widehat\Omega_{12} \widetilde\theta^{(-\mathcal{M}_T)}
+ \widetilde\theta^{(-\mathcal{M}_T)\top} \widehat\Omega_{22} \widetilde\theta^{(-\mathcal{M}_T)}
+ \sum_{j=1}^{d_0} \lambda_j | \theta_j |.
\]
By Theorem 3, it can be concluded that, with probability tending to 1, $\widetilde\theta_\lambda^{(\mathcal{M}_T)}$ is nonzero (otherwise, the $\sqrt{N}$-consistency result in Theorem 3 would not hold). As a result, the partial derivative $\partial Q_\lambda(\theta)/\partial \theta_j$ exists for $1 \le j \le d_0$, which yields
\[
0 = \frac{1}{2} \frac{\partial Q_{\lambda,0}( \theta^{(\mathcal{M}_T)} )}{\partial \theta^{(\mathcal{M}_T)}} \Big|_{\theta^{(\mathcal{M}_T)} = \widetilde\theta_\lambda^{(\mathcal{M}_T)}}
= \widehat\Omega_{11} ( \widetilde\theta_\lambda^{(\mathcal{M}_T)} - \widetilde\theta^{(\mathcal{M}_T)} ) - \widehat\Omega_{12} \widetilde\theta^{(-\mathcal{M}_T)} + D( \widetilde\theta_\lambda^{(\mathcal{M}_T)} ), \quad (B.4)
\]
where $D( \widetilde\theta_\lambda^{(\mathcal{M}_T)} )$ is a $d_0$-dimensional vector whose $j$-th component is $0.5 \lambda_j \, \mathrm{sign}( \widetilde\theta_{\lambda,j} )$.
By (B.4), it can be derived that
\[
\sqrt{N} ( \widetilde\theta_\lambda^{(\mathcal{M}_T)} - \theta_0^{(\mathcal{M}_T)} )
= \sqrt{N} ( \widetilde\theta^{(\mathcal{M}_T)} - \theta_0^{(\mathcal{M}_T)} )
+ \widehat\Omega_{11}^{-1} \widehat\Omega_{12} ( \sqrt{N} \, \widetilde\theta^{(-\mathcal{M}_T)} )
- \widehat\Omega_{11}^{-1} \sqrt{N} D( \widetilde\theta_\lambda^{(\mathcal{M}_T)} )
= \sqrt{N} ( \widetilde\theta^{(\mathcal{M}_T)} - \theta_0^{(\mathcal{M}_T)} )
+ \Omega_{11}^{-1} \Omega_{12} ( \sqrt{N} \, \widetilde\theta^{(-\mathcal{M}_T)} ) + o_p(1), \quad (B.5)
\]
where the second equality follows because $\sqrt{N} \, \widetilde\theta^{(-\mathcal{M}_T)} = O_p(1)$ by Theorem 1, $\widehat\Omega_{11} \to_p \Omega_{11}$ and $\widehat\Omega_{12} \to_p \Omega_{12}$ by (A.1), and $\sqrt{N} \lambda_j = o_p(1)$ ($1 \le j \le d_0$) by Theorem 3. Furthermore, by the matrix inverse formula, it holds that $\Omega_{11}^{-1} \Omega_{12} = - \Sigma_{12} \Sigma_{22}^{-1}$. Consequently, we have
\[
\sqrt{N} ( \widetilde\theta_\lambda^{(\mathcal{M}_T)} - \theta_0^{(\mathcal{M}_T)} )
= \sqrt{N} ( \widetilde\theta^{(\mathcal{M}_T)} - \theta_0^{(\mathcal{M}_T)} )
- \Sigma_{12} \Sigma_{22}^{-1} ( \sqrt{N} \, \widetilde\theta^{(-\mathcal{M}_T)} ) + o_p(1).
\]
By Theorem 1, the above is asymptotically normal with mean 0, and the inverse asymptotic covariance matrix is given by $( \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21} )^{-1} = \Omega_{11}$. By Condition (C6), we have $\Omega_{11} = \Omega_{\mathcal{M}_T}$. Consequently, the estimator $\widetilde\theta_\lambda^{(\mathcal{M}_T)}$ shares the same asymptotic distribution as the oracle estimator $\widehat\theta_{\mathcal{M}_T}^{(\mathcal{M}_T)}$.

Appendix B.3: Proof of Theorem 5

To establish the selection consistency property of the DBIC, we consider the following two cases for any $\mathcal{M}_\lambda \ne \mathcal{M}_T$: the underfitted case and the overfitted case.

1. Underfitted Model. Note that $\lambda_N$ satisfies the condition in Theorem 3. Consequently, we have that $\widetilde\theta_{\lambda_N}$ is $\sqrt{N}$-consistent. We thus also have $\mathrm{DBIC}_{\lambda_N} = o_p(1)$. For $\mathcal{M} \not\supset \mathcal{M}_T$, it can be derived that
\[
\mathrm{DBIC}_\lambda = ( \widetilde\theta_\lambda - \widetilde\theta )^\top \widehat\Sigma^{-1} ( \widetilde\theta_\lambda - \widetilde\theta ) + df_\lambda (\log N)/N
\ge ( \widetilde\theta_\lambda - \widetilde\theta )^\top \widehat\Sigma^{-1} ( \widetilde\theta_\lambda - \widetilde\theta ).
\]
Define $\widetilde\theta^{\mathcal{M}} = \arg\min_{\theta \in \mathbb{R}^p : \theta_j = 0, \forall j \notin \mathcal{M}} ( \theta - \widetilde\theta )^\top \widehat\Sigma^{-1} ( \theta - \widetilde\theta )$ as the unpenalized estimator given the model identified by $\mathcal{M}$.
Consequently, by definition, we should have
\[
( \widetilde\theta_\lambda - \widetilde\theta )^\top \widehat\Sigma^{-1} ( \widetilde\theta_\lambda - \widetilde\theta )
\ge ( \widetilde\theta^{\mathcal{M}_\lambda} - \widetilde\theta )^\top \widehat\Sigma^{-1} ( \widetilde\theta^{\mathcal{M}_\lambda} - \widetilde\theta )
\ge \min_{\mathcal{M} \not\supset \mathcal{M}_T} ( \widetilde\theta^{\mathcal{M}} - \widetilde\theta )^\top \widehat\Sigma^{-1} ( \widetilde\theta^{\mathcal{M}} - \widetilde\theta )
\to_p \min_{\mathcal{M} \not\supset \mathcal{M}_T} ( \widetilde\theta^{\mathcal{M}} - \widetilde\theta )^\top \Sigma^{-1} ( \widetilde\theta^{\mathcal{M}} - \widetilde\theta ),
\]
where the last convergence is due to (A.1). Because $\Sigma$ is positive definite by Condition (C4), we have $\min_{\mathcal{M} \not\supset \mathcal{M}_T} ( \widetilde\theta^{\mathcal{M}} - \widetilde\theta )^\top \Sigma^{-1} ( \widetilde\theta^{\mathcal{M}} - \widetilde\theta ) > 0$ with probability tending to 1. One can then conclude immediately that $P( \inf_{\lambda \in \mathbb{R}_-} \mathrm{DBIC}_\lambda > \mathrm{DBIC}_{\lambda_N} ) \to 1$.

2. Overfitted Model. We next consider the overfitted model. In contrast, let $\lambda$ be an arbitrary tuning parameter that over-selects the parameters. We then have $df_\lambda - d_0 \ge 1$. It can then be concluded that
\[
N ( \mathrm{DBIC}_\lambda - \mathrm{DBIC}_{\lambda_N} )
= N ( \widetilde\theta_\lambda - \widetilde\theta )^\top \widehat\Sigma^{-1} ( \widetilde\theta_\lambda - \widetilde\theta )
- N ( \widetilde\theta_{\lambda_N} - \widetilde\theta )^\top \widehat\Sigma^{-1} ( \widetilde\theta_{\lambda_N} - \widetilde\theta )
+ ( df_\lambda - d_0 ) \log N
\]
\[
\ge N ( \widetilde\theta^{\mathcal{M}_\lambda} - \widetilde\theta )^\top \widehat\Sigma^{-1} ( \widetilde\theta^{\mathcal{M}_\lambda} - \widetilde\theta )
- N ( \widetilde\theta_{\lambda_N} - \widetilde\theta )^\top \widehat\Sigma^{-1} ( \widetilde\theta_{\lambda_N} - \widetilde\theta ) + \log N
\]
\[
\ge \inf_{\mathcal{M} \supset \mathcal{M}_T} N ( \widetilde\theta^{\mathcal{M}} - \widetilde\theta )^\top \widehat\Sigma^{-1} ( \widetilde\theta^{\mathcal{M}} - \widetilde\theta )
- N ( \widetilde\theta_{\lambda_N} - \widetilde\theta )^\top \widehat\Sigma^{-1} ( \widetilde\theta_{\lambda_N} - \widetilde\theta ) + \log N. \quad (B.6)
\]
First note that for $\mathcal{M} \supset \mathcal{M}_T$, $\widetilde\theta^{\mathcal{M}}$ is $\sqrt{N}$-consistent. As a result, the first term of (B.6) is $O_p(1)$. Similarly, by Theorem 3, $\widetilde\theta_{\lambda_N}$ is $\sqrt{N}$-consistent; thus, the second term of (B.6) is also $O_p(1)$. As a result, (B.6) diverges to infinity as $N \to \infty$. This implies that $P( \inf_{\lambda \in \mathbb{R}_+} \mathrm{DBIC}_\lambda > \mathrm{DBIC}_{\lambda_N} ) \to 1$. This completes the proof.

Appendix B.4: Comparison of the Shrinkage Estimations

One could consider an alternative approach to conduct variable selection. Based on the asymptotic theory given in Theorem 1, we could perform simultaneous hypothesis testing for variable selection. Specifically, for each $1 \le j \le p$, we could construct a 95% confidence interval for $\theta_{0j}$ as $CI_j = ( \widetilde\theta_j - z_{0.975} \widetilde{SE}_j, \widetilde\theta_j + z_{0.975} \widetilde{SE}_j )$, where $\widetilde{SE}_j$ is the $j$-th diagonal element of $\widehat\Sigma = ( \sum_k \alpha_k \widehat\Sigma_k^{-1} )^{-1}$ and $z_\alpha$ is the $\alpha$-th quantile of the standard normal distribution. As a result, if $0 \notin CI_j$, we identify the $j$-th variable as an important variable.

We compare the finite sample performance of the above variable selection method with the proposed shrinkage estimation method for the logistic regression model. We set $p = 30$, and for $1 \le j \le 15$ we specify $\theta_{0j} = U_1 \mathrm{sign}(U_2)$, where $U_1$ and $U_2$ are generated from the uniform distributions $U[0.5, 2]$ and $U[-0.2, 0.8]$, respectively. In addition, for $15 < j \le p$, we set $\theta_{0j} = 0$. We denote the percentages of covering the true model and identifying the true model as CR and CM, respectively. As shown in Table 12, both methods are able to cover the true model with high accuracy. However, the hypothesis testing method tends to over-select the important features, while the SDLSA method still guarantees the variable selection consistency property.

Appendix B.5: Interleaving with Bootstrap Methods

An alternative way to conduct statistical inference for a massive dataset is to use bootstrap methods. Recently, a surge of research has investigated related methodologies. For instance, Kleiner et al. (2014) and Sengupta et al. (2016) propose subsample and resample methods to reduce the computational cost of traditional bootstrap methods. Volgushev et al. (2019) and Yu et al. (2020) further discuss how to incorporate bootstrap methods into distributed learning algorithms. In particular, we note that if we have a large number of workers, we could adopt the idea of Volgushev et al. (2019) to transmit the estimators $\widehat\theta_k$ to the master and estimate the asymptotic covariance $\widehat\Sigma$ as $\widehat\Sigma = K^{-1} \sum_{k=1}^K ( \widehat\theta_k - \bar\theta )( \widehat\theta_k - \bar\theta )^\top$, where $\bar\theta = K^{-1} \sum_{k=1}^K \widehat\theta_k$.
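As a quick illustration (our own sketch with hypothetical simulation settings, not part of the paper's toolbox), this between-worker covariance estimator can be computed from the $K$ local estimators alone:

```python
import numpy as np

rng = np.random.default_rng(1)
K, p, n_k = 200, 3, 500
theta0 = np.array([1.0, -2.0, 0.5])

# Simulate K workers, each computing a local OLS estimator on its own data.
theta_hats = np.empty((K, p))
for k in range(K):
    X = rng.normal(size=(n_k, p))
    y = X @ theta0 + rng.normal(size=n_k)
    theta_hats[k] = np.linalg.solve(X.T @ X, X.T @ y)

# Between-worker covariance estimator:
#   Sigma_hat = K^{-1} sum_k (theta_k - theta_bar)(theta_k - theta_bar)'.
theta_bar = theta_hats.mean(axis=0)
centered = theta_hats - theta_bar
Sigma_hat = centered.T @ centered / K

# With i.i.d. standard normal noise the local estimators have covariance
# roughly (X'X)^{-1}, so the diagonal of Sigma_hat should be near 1/n_k.
assert Sigma_hat.shape == (p, p)
assert np.all(np.abs(np.diag(Sigma_hat) - 1.0 / n_k) < 1e-3)
```

Note that only the $p$-dimensional vectors $\widehat\theta_k$ are communicated here, which is exactly the appeal of this estimator discussed next.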
The advantage here is to avoid communicating the second-order derivative of $L_k(\theta)$. However, this estimation requires a large number of workers to be valid and, in addition, the estimated asymptotic covariance is that of the one-shot estimator. Hence, the methodology cannot be directly applied, especially in our heterogeneous data setting. As an alternative, we can interleave the bootstrap methods on the workers to estimate the covariance $\hat{\Sigma}_k$. This approach can be useful when an analytical form of the asymptotic covariance $\hat{\Sigma}_k$ is hard to obtain. To show that this is a possible direction for future study, we conduct a simulation study for the logistic regression model. Specifically, we use the subsampled double bootstrap (SDB) estimation method on the workers to obtain the covariance estimator $\hat{\Sigma}_k^{\mathrm{boot}}$. Next, we substitute $\hat{\Sigma}_k^{\mathrm{boot}}$ for $\hat{\Sigma}_k$ in (2.2) to obtain the bootstrapped WLSE (BWLSE) estimator. We set the bootstrap sample size $b = n^\gamma$ with $\gamma = 0.9$ and the number of resampling rounds $R = n_k$. The estimation REE with respect to the global estimator is presented in Table 9. Under the i.i.d. data setting, all three methods are comparable when $N = 20{,}000$ and $K = 10$. Under the heterogeneous data setting, however, the bootstrap method is better than the OS method and comparable to the DLSA method.

Appendix B.6: Supplementary Results for Simulation Studies

Table 7: Simulation results for Example 1 (Setting 1) with 500 replications. The numerical performances are evaluated for different sample sizes $N$ ($\times 10^3$) and numbers of workers $K$. The REE is reported for all estimators. For the CSL, DLSA, and TDLSA methods, the CP is further reported in parentheses. Finally, the MS and CM are reported for the SDLSA method to evaluate the variable selection accuracy.

Est.     θ1      θ2      θ3      θ4      θ5      θ6      θ7      θ8      MS    CM
N = 20, K = 10
OS       1.00    1.00    1.00    1.00    1.00    1.00    1.00    0.99
CSL      0.96    0.93    0.95    0.93    0.95    0.96    0.93    0.96
         (94.0)  (94.0)  (94.8)  (92.8)  (92.8)  (93.2)  (90.8)  (90.4)
DLSA     1.00    1.00    1.00    1.00    1.00    1.00    1.00    1.00
         (96.4)  (96.0)  (96.4)  (94.6)  (94.6)  (94.4)  (94.0)  (93.8)
TDLSA    1.00    1.00    1.00    1.00    1.00    1.00    1.00    1.00
         (96.4)  (96.0)  (96.4)  (94.6)  (94.6)  (94.4)  (94.0)  (93.8)
SDLSA    1.03    -       -       1.07    -       -       1.08    -       3.01  1.00
N = 100, K = 20
OS       1.00    1.00    1.00    1.00    1.00    1.00    1.00    1.00
CSL      0.95    0.92    0.93    0.94    0.95    0.93    0.92    0.96
         (93.8)  (92.2)  (91.4)  (92.6)  (92.2)  (92.2)  (92.0)  (94.6)
DLSA     1.00    1.00    1.00    1.00    1.00    1.00    1.00    1.00
         (95.4)  (94.8)  (94.8)  (95.6)  (95.2)  (94.0)  (94.6)  (95.2)
TDLSA    1.00    1.00    1.00    1.00    1.00    1.00    1.00    1.00
         (95.4)  (94.8)  (94.8)  (95.6)  (95.2)  (94.0)  (94.6)  (95.2)
SDLSA    1.03    -       -       1.05    -       -       1.08    -       3.01  1.00

Table 8: Simulation results for Example 2 (Setting 1) with 500 replications. The numerical performances are evaluated for different sample sizes $N$ ($\times 10^3$) and numbers of workers $K$. The REE is reported for all estimators. For the CSL, DLSA, and TDLSA methods, the CP is further reported in parentheses. Finally, the MS and CM are reported for the SDLSA method to evaluate the variable selection accuracy.

Est.     θ1      θ2      θ3      θ4      θ5      θ6      θ7      θ8      MS    CM
N = 20, K = 10
OS       0.84    0.98    0.99    0.90    0.98    0.99    0.88    0.99
CSL      0.88    0.98    0.94    0.91    0.95    0.94    0.86    0.96
         (91.0)  (93.6)  (93.0)  (93.8)  (92.0)  (92.4)  (90.2)  (92.0)
DLSA     0.93    1.01    1.01    0.96    1.01    1.01    0.94    1.01
         (92.2)  (95.6)  (94.8)  (94.6)  (94.2)  (95.0)  (92.8)  (93.2)
TDLSA    1.00    1.00    1.00    1.00    1.00    1.00    1.00    1.00
         (94.8)  (95.2)  (95.0)  (95.4)  (93.8)  (95.0)  (95.2)  (93.4)
SDLSA    1.00    -       -       1.01    -       -       1.03    -       3.01  1.00
N = 100, K = 20
OS       0.87    1.00    1.00    0.91    1.00    1.00    0.92    0.99
CSL      0.88    0.98    0.91    0.93    0.94    0.93    0.90    0.96
         (90.6)  (94.0)  (91.8)  (92.4)  (94.8)  (93.2)  (91.8)  (94.8)
DLSA     0.92    1.00    1.00    0.98    1.00    1.00    0.92    1.00
         (92.0)  (95.4)  (96.8)  (93.8)  (97.2)  (95.8)  (92.2)  (95.6)
TDLSA    1.00    1.00    1.00    1.00    1.00    1.00    1.00    1.00
         (95.0)  (94.8)  (96.8)  (94.4)  (97.0)  (95.6)  (95.4)  (95.6)
SDLSA    1.01    -       -       1.04    -       -       1.03    -       3.00  1.00

Table 9: Simulation results (with 100 replications) for the logistic regression model with the interleaved bootstrap methods. The REE of the one-shot (OS), DLSA, and bootstrap estimators is reported.

N    K    Est.        θ1      θ2      θ3      θ4      θ5      θ6      θ7      θ8
Setting 1: i.i.d. Covariates
5    8    OS          0.70    0.97    0.97    0.78    0.94    0.97    0.73    0.96
          DLSA        0.90    1.03    1.02    0.90    1.03    1.02    0.95    1.02
          Bootstrap   0.68    1.02    1.04    0.71    1.01    1.04    0.76    1.01
20   10   OS          0.90    0.99    0.99    0.91    0.98    1.00    0.92    0.99
          DLSA        0.89    1.00    1.01    0.97    1.00    1.00    0.92    1.01
          Bootstrap   0.86    0.97    1.00    0.95    1.00    0.99    0.90    0.99
Setting 2: Heterogeneous Covariates
5    8    OS          0.54    0.85    0.82    0.63    0.78    0.87    0.62    0.77
          DLSA        0.77    1.02    1.02    0.86    1.04    1.04    0.84    1.04
          Bootstrap   0.52    0.97    1.00    0.65    0.99    1.01    0.61    1.05
20   10   OS          0.68    0.88    0.99    0.79    0.91    0.91    0.75    0.89
          DLSA        0.90    1.02    1.01    1.01    1.00    1.01    0.92    1.02
          Bootstrap   0.83    1.04    0.99    0.98    0.98    0.99    0.86    1.00

Table 10: Simulation results for the screening procedure with 100 replications. The CRs of four screening methods are reported.

                 Linear Regression            Logistic Regression
N       K    SIS    SIRS   KC     DC       SIS    SIRS   KC     DC
Setting 1: i.i.d. Covariates
2000    5    1.00   0.71   0.87   0.91     0.99   0.48   0.73   0.89
10000   8    1.00   0.92   1.00   0.99     1.00   0.81   0.73   0.99
Setting 2: Heterogeneous Covariates
2000    5    0.98   0.73   0.79   0.84     0.95   0.55   0.73   0.85
10000   8    1.00   0.86   0.91   0.95     1.00   0.77   0.73   0.94

Table 11: Simulation results for the linear and logistic regression models after the screening procedure (using SIS) with 100 replications. The numerical performances are evaluated for different sample sizes $N$ ($\times 10^3$) and numbers of workers $K$. The CR_screen, the REE (of five modelling methods over all covariates), the CR_select, and the CM are reported.

N    K    CR_screen   OS     CSL    DLSA   TDLSA   SDLSA   CR_select   CM
Linear Regression
Setting 1: i.i.d. Covariates
2    5    1.00        0.97   0.92   1.00   1.00    1.02    0.97        0.97
10   8    1.00        0.98   0.92   1.00   1.00    1.02    1.00        1.00
Setting 2: Heterogeneous Covariates
2    5    0.98        1.00   1.00   1.00   1.00    1.00    0.74        0.74
10   8    1.00        0.98   0.87   1.00   1.00    1.01    1.00        1.00
Logistic Regression
Setting 1: i.i.d. Covariates
2    5    0.99        0.18   0.03   0.45   1.09    0.96    0.87        0.82
10   8    1.00        0.37   0.30   0.58   1.07    1.08    1.00        0.98
Setting 2: Heterogeneous Covariates
2    5    0.95        0.55   0.05   0.71   0.95    0.91    0.42        0.39
10   8    1.00        0.44   0.32   0.58   1.02    0.98    0.99        0.98

Table 12: Simulation results comparing the shrinkage estimation with the hypothesis testing approach, based on 500 replications. The coverage rate (CR) and the percentage of identifying the true model (CM) are presented.

             Setting 1: i.i.d. Case        Setting 2: Heterogeneous Case
             Testing        SDLSA          Testing        SDLSA
N     K      CR     CM      CR     CM      CR     CM      CR     CM
20    5      1.00   0.47    1.00   0.94    1.00   0.47    1.00   0.90
100   10     1.00   0.49    1.00   0.99    1.00   0.46    1.00   0.99
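For reference, the testing-based selection rule of Appendix B.4 (compared against SDLSA in Table 12) amounts to only a few lines of code. The sketch below is our own illustration; the function name and inputs are hypothetical, not the paper's implementation:

```python
import numpy as np

Z_0975 = 1.959964  # z_{0.975}, the 97.5% quantile of the standard normal

def ci_select(theta_tilde, se, z=Z_0975):
    """Flag coordinate j as important when 0 lies outside the 95%
    confidence interval (theta_j - z*SE_j, theta_j + z*SE_j)."""
    theta_tilde = np.asarray(theta_tilde, dtype=float)
    se = np.asarray(se, dtype=float)
    return (theta_tilde - z * se > 0) | (theta_tilde + z * se < 0)

# Example: a strong positive, a near-zero, and a strong negative coefficient;
# only the first and third coordinates are flagged as important.
selected = ci_select([2.0, 0.01, -1.5], [0.5, 0.5, 0.5])
```

Because each coordinate is tested at a fixed 5% level without any multiplicity adjustment, a few null coefficients are flagged by chance, which is consistent with the over-selection of this rule observed in Table 12.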