Reduced-bias estimation of spatial econometric models with incompletely geocoded data

Giuseppe Arbia, Maria Michela Dickson, Giuseppe Espa, Diego Giuliani, Flavio Santi

6th September 2019

Abstract

The application of state-of-the-art spatial econometric models requires that the information about the spatial coordinates of statistical units is completely accurate, which is usually the case in the context of areal data. With micro-geographic point-level data, however, such information is inevitably affected by locational errors, which can be generated intentionally by the data producer for privacy protection or can be due to inaccuracy of the geocoding procedures. This unfortunate circumstance can potentially limit the use of the spatial econometric modelling framework for the analysis of micro data. Indeed, some recent contributions (see e.g. Arbia, Espa and Giuliani 2016) have shown that the presence of locational errors may have a non-negligible impact on the results. In particular, wrong spatial coordinates can lead to downward bias and increased variance in the estimation of model parameters. This contribution aims at developing a strategy to reduce the bias and produce more reliable inference for spatial econometric models with location errors. The validity of the proposed approach is assessed by means of a Monte Carlo simulation study under different real-case scenarios. The study results show that the method is promising and can make the spatial econometric modelling of micro-geographic data possible.

1 Introduction

Traditional spatial econometric models are based on the implicit assumption that the information about the spatial location of statistical units is completely accurate.
Whilst this circumstance is the norm in the context of areal data (such as municipalities, counties or regions), it is rarely met when the observations are points in space (such as firms, houses or facilities), whose locations may be either missing or affected by locational errors (see Zimmerman 2008; Zimmerman and Li 2010; Arbia, Espa and Giuliani 2019).

Although geolocation may fail for some units because of technical reasons, incomplete positioning arises more frequently in geocoding processes, especially in those circumstances where units' coordinates are obtained by matching units' postal addresses with georeferenced street maps (see e.g. Kravets and Hadden 2007). Clearly, the quality of the resulting geolocation depends both on the correctness and completeness of postal addresses and on the effectiveness of matching algorithms and software. Nonetheless, if the position of some units is uncertain, this fact should be properly considered in the estimation process.

When an incomplete address is geocoded, the unit's position is conventionally imputed to the centroid of the area where the unit is located, as known from the address information. Such areas may be counties, municipalities or, more frequently, ZIP code areas (Zimmerman 2008).

From a statistical point of view, the presence of locational errors due to coarsened locations may have a significant impact on parameter estimates of spatial econometric models based on the Cliff-Ord approach (Cliff and Ord 1969), as positional errors lead to downward biased estimates for the spatial autoregressive parameters and inconsistent estimates for covariates' coefficients (Arbia, Espa and Giuliani 2016).

This paper tackles the problem of estimating spatial models where part of the units are affected by coarsening. In particular, we focus on the Spatial Lag Model (see e.g. Arbia 2014).
The proposed estimation strategy models both the spatial stochastic process and the coarsening mechanism by means of a marked point process whose intensity function is estimated according to the coarsened-data estimator proposed by Zimmerman (2008). The model is fitted through the maximisation of a doubly-marginalised likelihood function of the marked point process, which cleans out the effects of coarsening. The first marginalisation of the likelihood function allows the dimensionality of the spatial econometric model to be consistently reduced to non-coarsened points, and it is derived analytically. The second marginalisation is performed via Monte Carlo simulations over the locations of coarsened points.

The modelling approach and Monte Carlo experiments presented in the paper show the validity of the proposed estimation method in comparison with the estimates obtained by means of other estimation approaches. In particular, the comparison concerns the parameter estimates and the direct and indirect effects of model covariates on the dependent variable (Arbia, Bera et al. 2019).

The paper is organised as follows. Section 2 describes the modelling approach and the notation adopted in this paper. Section 3 illustrates and discusses the proposed estimation approach. Section 4 illustrates the results of Monte Carlo simulations where the finite-sample properties of the parameter estimators and of the direct and indirect impacts of regressors are studied. Section 5 concludes the paper.

2 Modelling approach and notation

Consider a population of $n$ units $i = 1, \dots, n$ for which a quantitative characteristic of interest $y_i \in \mathbb{R}$ and $k$ regressors $x_i \in \mathbb{R}^k$ are known. Assume that postal addresses are available for all $n$ units; however, only $p < n$ of them are complete, whereas $n - p$ are incomplete. Assume also that the $p$ units can be assigned to, say, the ZIP areas they actually belong to.
Under these conditions, if a spatial model is needed for modelling $y$ (a thorough illustration of the reasons why a spatial modelling approach may be necessary is available in LeSage and Pace 2009, ch. 2), the coarsening of the $n - p$ units' locations only affects the specification of the spatial weight matrix, as $y_i$ and $x_i$ are known for all units $i = 1, \dots, n$.

Consider, for example, the following isotropic Spatial Lag Model (SLM):

\[
\begin{cases}
y = \rho W y + X\beta + \varepsilon \\
\varepsilon \sim N_n(0, \sigma^2 I_n)
\end{cases}
\tag{1}
\]

where $X \in \mathbb{R}^{n \times k}$ is the design matrix which includes $k$ regressors, and $W \in \mathbb{R}^{n \times n}$ is the usual spatial weight matrix whose elements $w_{ij}$ take positive values according to some proximity criterion and zero if units $i$ and $j$ are not considered as neighbours. It can be verified that, if $p/n$ is the proportion of non-coarsened units, the share of elements of $W$ not affected by coarsening is only about $(p/n)^2$, whereas all elements change if $W$ is stochastic (that is, if $W$ is row-standardised). The magnitude of the effects of coarsening on the spatial weight matrix is the cause of the bias of estimators of the autoregressive parameter $\rho$ (Arbia, Espa and Giuliani 2016).

The estimation method proposed in this paper basically reduces the dimensionality of the model by concentrating the likelihood on the $p$ non-coarsened units, thus limiting the effects of the coarsened locations on model estimates and, at the same time, exploiting the available information about covariates and zone-based location of the coarsened units.

The problem is modelled as a marked point process where both the stochastic spatial process and the coarsening process are specified conditionally on the underlying point process. Let $(\Omega, \mathcal{F}, P)$ be a probability space, and let $Z \in \mathbb{R}^{n \times 2}$ be a realisation of $n$ points from a 2-dimensional point process $\{Z(s, \omega) : s \in S\}$ defined over a bounded metric space $(S, \|\cdot\|)$ where $S \subset \mathbb{R}^2$.
Let $\lambda : S \to \mathbb{R}^+$ be the intensity function of $\{Z(s, \omega) : s \in S\}$, defined as:

\[
\lambda(x) = \lim_{|\mathrm{d}x| \to 0} \frac{\mathrm{E}(N(x, \mathrm{d}x))}{\mathrm{d}x},
\]

where $N(x, \mathrm{d}x)$ is the count function for points in the neighbourhood $\mathrm{d}x \subset S$ centred in $x \in S$ (see e.g. Illian et al. 2008).

Conditionally on $Z$, the isotropic SLM (1) is defined for the spatial process $y$, where the spatial weight matrix $W$ is row-standardised and its elements $w_{ij}$ are defined as follows:

\[
w_{ij} =
\begin{cases}
\dfrac{\kappa(\|z_i - z_j\|)}{\sum_{h=1}^{n} \kappa(\|z_i - z_h\|)} & \text{if } i \neq j \text{ and } \sum_{h=1}^{n} \kappa(\|z_i - z_h\|) \neq 0 \\[2ex]
0 & \text{otherwise,}
\end{cases}
\tag{2}
\]

for any $i, j \in \{1, \dots, n\}$ and some non-increasing function $\kappa : \mathbb{R}^+ \to \mathbb{R}^+$ such that $\lim_{x \to \infty} \kappa(x) = 0$.

The coarsening process can be either dependent on the intensity function $\lambda$ and the realisation of the point process $\{Z(s, \omega) : s \in S\}$ or independent from them. Here we just assume that the coarsening is modelled by means of a random vector $\Phi$, which is a realisation of $n$ Bernoulli random variables independent from the spatial process $y$ conditionally on the point process $Z$. The components $\Phi_j$ of the random vector $\Phi$ are defined as follows:

\[
\Phi_j \sim B(p_j)
\tag{3}
\]

for $j = 1, \dots, n$, and take value $\Phi_j = 0$ if point $j$ is coarsened, whereas $\Phi_j = 1$ if point $j$ has been correctly geocoded.

Finally, let $\mathcal{S} = \{S_1, S_2, \dots, S_R\}$ be a partition of the space $S$ into $R$ regions such that, for any unit $i$ with coordinate $c_i \in S$, there exists one region $S_r$ such that $c_i \in S_r$ (see footnote 1). It is assumed that, for each coarsened unit $i$, the region $S_r$ where $i$ is located is known.

To sum up, for all units $i = 1, \dots, n$ the values of the dependent variable $y_i$ and the covariates $x_i$ are known. For non-coarsened units $i = 1, \dots, p$ the coordinates $c_i \in S$ are known, whereas for each coarsened unit $i = p + 1, \dots, n$ only the coarsening area $S_r$ such that $c_i \in S_r$ is known.
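As a rough illustration of the set-up described so far, the following Python sketch builds a row-standardised weight matrix of the form (2) with an indicator kernel, simulates the SLM (1) through its reduced form, and draws the Bernoulli coarsening indicators (3). It is a minimal sketch under stated assumptions, not the authors' implementation: the function names `weight_matrix` and `simulate_slm`, the homogeneous point pattern on a square, and the constant geocoding probability 0.6 are hypothetical choices, and the normalising sum in (2) is taken over $h \neq i$ so that every non-empty row sums to one.

```python
import numpy as np


def weight_matrix(z, radius):
    """Row-standardised W as in Eq. (2) with kappa(x) = 1{x <= radius}.

    Assumption: the normalising sum excludes h = i, so that each row
    with at least one neighbour sums to one.
    """
    dist = np.linalg.norm(z[:, None, :] - z[None, :, :], axis=-1)
    k = (dist <= radius).astype(float)
    np.fill_diagonal(k, 0.0)                      # w_ii = 0
    row_sums = k.sum(axis=1, keepdims=True)
    # Units without neighbours keep an all-zero row ("otherwise" branch).
    return np.divide(k, row_sums, out=np.zeros_like(k), where=row_sums > 0)


def simulate_slm(W, X, beta, rho, sigma2, rng):
    """Draw y from the reduced form y = (I - rho W)^{-1} (X beta + eps)."""
    n = W.shape[0]
    eps = rng.normal(scale=np.sqrt(sigma2), size=n)
    return np.linalg.solve(np.eye(n) - rho * W, X @ beta + eps)


rng = np.random.default_rng(0)
n = 250
z = rng.uniform(0.0, 6.0, size=(n, 2))            # hypothetical point pattern
W = weight_matrix(z, radius=0.5)
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = simulate_slm(W, X, beta=np.array([1.0, 1.0, -1.0]),
                 rho=0.5, sigma2=1.0, rng=rng)

# Coarsening indicators: Phi_j = 1 if geocoded, 0 if coarsened.
phi = rng.binomial(1, 0.6, size=n)
```

Solving the linear system instead of inverting $I - \rho W$ explicitly is a numerical convenience only; it does not change the model.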
Other missing or unknown information about model (1), such as the values of parameters and the coordinates of coarsened units, should be either learnt (through estimation) or made irrelevant (through marginalisation).

Before illustrating our proposal for tackling the estimation problem, we introduce the notation that will be used throughout the rest of the paper. We denote with subscript $P$ and subscript $C$ non-coarsened and coarsened points respectively (that is, points where $\Phi_j = 1$ and $\Phi_j = 0$ respectively). Conditionally on the random vector $\Phi$, SLM (1) can be restated as follows:

\[
\begin{bmatrix} y_P \\ y_C \end{bmatrix}
= \rho
\begin{bmatrix} W_{PP} & W_{PC} \\ W_{CP} & W_{CC} \end{bmatrix}
\begin{bmatrix} y_P \\ y_C \end{bmatrix}
+ \begin{bmatrix} X_P \\ X_C \end{bmatrix} \beta
+ \begin{bmatrix} \varepsilon_P \\ \varepsilon_C \end{bmatrix}
\tag{4}
\]

provided that the original SLM is properly permuted by means of a suitable permutation matrix $P_\Phi \in \{0, 1\}^{n \times n}$, that is:

\[
\begin{bmatrix} y_P \\ y_C \end{bmatrix} = P_\Phi y, \quad
\begin{bmatrix} W_{PP} & W_{PC} \\ W_{CP} & W_{CC} \end{bmatrix} = P_\Phi W P_\Phi^{T}, \quad
\begin{bmatrix} X_P \\ X_C \end{bmatrix} = P_\Phi X, \quad
\begin{bmatrix} \varepsilon_P \\ \varepsilon_C \end{bmatrix} = P_\Phi \varepsilon .
\]

Restatement (4) allows observations about coarsened ($C$) and non-coarsened ($P$) points to be organised in block matrices. We also define matrix $A \equiv I_n - \rho P_\Phi W P_\Phi^{T} \in \mathbb{R}^{n \times n}$, so that:

\[
A =
\begin{bmatrix} A_{PP} & A_{PC} \\ A_{CP} & A_{CC} \end{bmatrix}
=
\begin{bmatrix} I_p - \rho W_{PP} & -\rho W_{PC} \\ -\rho W_{CP} & I_{n-p} - \rho W_{CC} \end{bmatrix} .
\]

Finally, it can be proved (see e.g. Lu and Shou 2002) that the following relations hold for the inverse matrix $A^{-1}$:

\[
A^{-1} =
\begin{bmatrix}
A_{PP}^{-1} + A_{PP}^{-1} A_{PC} \tilde{\Xi}^{-1} A_{CP} A_{PP}^{-1} & -A_{PP}^{-1} A_{PC} \tilde{\Xi}^{-1} \\
-\tilde{\Xi}^{-1} A_{CP} A_{PP}^{-1} & \tilde{\Xi}^{-1}
\end{bmatrix}
\tag{5}
\]

where $\tilde{\Xi} \equiv A_{CC} - A_{CP} A_{PP}^{-1} A_{PC}$ is the Schur complement of $A_{PP}$, and

\[
A^{-1} \equiv
\begin{bmatrix}
(A^{-1})_{PP} & (A^{-1})_{PC} \\
(A^{-1})_{CP} & (A^{-1})_{CC}
\end{bmatrix}
=
\begin{bmatrix}
\Xi^{-1} & -\Xi^{-1} A_{PC} A_{CC}^{-1} \\
-A_{CC}^{-1} A_{CP} \Xi^{-1} & A_{CC}^{-1} + A_{CC}^{-1} A_{CP} \Xi^{-1} A_{PC} A_{CC}^{-1}
\end{bmatrix}
\tag{6}
\]

where $\Xi \equiv A_{PP} - A_{PC} A_{CC}^{-1} A_{CP}$ is the Schur complement of $A_{CC}$ (see e.g. Horn and Johnson 2013).

Footnote 1: In fact, this assumption is not crucial in our analysis, and can be easily generalised by assuming $\mathcal{S}$ to be a cover of $S$ such that $S \in \mathcal{S}$.
This generalisation permits various degrees of incompleteness in postal addresses to be modelled, including the situation where some units are only known to be located in $S$. The estimation method proposed later can be applied with no modifications also to this framework; however, for the sake of notational simplicity, in the rest of the paper only the case where $\mathcal{S}$ is a partition of $S$ is discussed.

3 Estimation strategy

3.1 Model fitting

Equation (5) allows the reduced form of model (4) to be restated as follows:

\[
y_P = \rho W_{PP} y_P + X_P \beta + \varepsilon_P
+ A_{PC} \tilde{\Xi}^{-1} \left[ A_{CP} A_{PP}^{-1} (X_P \beta + \varepsilon_P) - (X_C \beta + \varepsilon_C) \right],
\tag{7}
\]

where $\tilde{\Xi}$ is the Schur complement of $A_{PP}$ introduced in (5). The left-hand side of Equation (7), together with the first three terms of the right-hand side, describes an SLM amongst correctly geo-referenced points, sharing the same parameters as the complete model (1). Unfortunately, the last term on the right-hand side makes things more complicated. The fourth term on the right-hand side of Equation (7) shows that, in general, a subset of observations of an SLM does not follow an SLM. Indeed, it makes the estimation of an SLM with coarsened points particularly tricky, since Equation (7) includes blocks of matrix $A$ which depend on the (unknown) coordinates of coarsened points.

As previously stated, the estimation strategy proposed in this paper relies on a double marginalisation of SLM (1). In particular, the former marginalisation should be made with respect to $y_P$, thus concentrating the information about coarsened points into a lower dimensional space. A similar approach to the marginalisation of the SLM has already proved to be successful in the context of variance estimation in 2-dimensional systematic sampling (see Espa et al. 2017).
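The block relations behind Equation (7) can be checked numerically. The sketch below is only an illustration under assumed inputs (a randomly generated row-standardised $W$, arbitrary block sizes, units already ordered with the $P$ block first, and hypothetical variable names). It verifies that the diagonal blocks of $A^{-1}$ are the inverses of the two Schur complements, as in (5) and (6), and that $y_P$ satisfies both the restated reduced form, which uses the Schur complement of $A_{PP}$, and the marginal form obtained from the $P$ row of (6).

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, rho = 12, 7, 0.5            # hypothetical sizes: p geocoded, n - p coarsened

# Hypothetical row-standardised weight matrix with zero diagonal.
K = rng.uniform(size=(n, n))
np.fill_diagonal(K, 0.0)
W = K / K.sum(axis=1, keepdims=True)

A = np.eye(n) - rho * W
P, C = slice(0, p), slice(p, n)   # P block first, C block last
A_PP, A_PC, A_CP, A_CC = A[P, P], A[P, C], A[C, P], A[C, C]

Xi = A_PP - A_PC @ np.linalg.solve(A_CC, A_CP)        # Schur complement of A_CC
Xi_tilde = A_CC - A_CP @ np.linalg.solve(A_PP, A_PC)  # Schur complement of A_PP
A_inv = np.linalg.inv(A)

# Diagonal blocks of the inverse, as in (6) and (5).
ok_PP = np.allclose(A_inv[P, P], np.linalg.inv(Xi))
ok_CC = np.allclose(A_inv[C, C], np.linalg.inv(Xi_tilde))

# Simulate the full model and compare y_P with its two restatements.
X = rng.normal(size=(n, 3))
beta = np.array([1.0, 1.0, -1.0])
u = X @ beta + rng.normal(size=n)                     # X beta + eps
y = A_inv @ u

# Fourth term of the restated reduced form of y_P.
extra = A_PC @ np.linalg.solve(
    Xi_tilde, A_CP @ np.linalg.solve(A_PP, u[P]) - u[C])
ok_eq7 = np.allclose(y[P], rho * W[P, P] @ y[P] + u[P] + extra)

# Marginal form: y_P = Xi^{-1} (u_P - A_PC A_CC^{-1} u_C).
ok_marginal = np.allclose(
    y[P], np.linalg.solve(Xi, u[P] - A_PC @ np.linalg.solve(A_CC, u[C])))

print(ok_PP, ok_CC, ok_eq7, ok_marginal)
```

With $|\rho| < 1$ and a row-standardised $W$, the matrix $A$ and its diagonal blocks are strictly diagonally dominant, so all the inverses used above exist.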
The latter marginalisation should instead be made with respect to the point process of non-coarsened points $Z_P$, so as to include direct and indirect effects of positional errors in the (marginal) probability distribution of $y_P$.

The first marginalisation can be derived in closed form from the inverse formula (6) and equals

\[
y_P = \Xi^{-1} X_P \beta + \Xi^{-1} \varepsilon_P - \Xi^{-1} A_{PC} A_{CC}^{-1} (X_C \beta + \varepsilon_C),
\]

which is a restatement of Equation (7) and implies that:

\[
\mathrm{E}(y_P \mid Z, \Phi) = \Xi^{-1} X_P \beta + \rho\, \Xi^{-1} W_{PC} A_{CC}^{-1} X_C \beta,
\tag{8a}
\]

\[
\mathrm{cov}(y_P \mid Z, \Phi) = \sigma^2\, \Xi^{-1} \left( I_p + \rho^2 W_{PC} (A_{CC}^{T} A_{CC})^{-1} W_{PC}^{T} \right) (\Xi^{-1})^{T}.
\tag{8b}
\]

On the other hand, the second marginalisation requires the intensity function $\lambda$ to be estimated, so as to characterise the spatial point process $\{Z(s, \omega) : s \in S\}$ and, in turn, the probabilistic law of the spatial weight matrix $W$ under coarsened geocoding. According to Zimmerman (2008), for any $s \in S$, the intensity function of a spatial point pattern affected by incomplete geocoding can be estimated as follows:

\[
\hat{\lambda}(s) = \sum_{i=1}^{n} [\hat{\phi}(z_i)]^{-1} K_h(s - z_i),
\tag{9}
\]

where $K$ is some kernel function with bandwidth $h$, $z_i$ is an observed unit's point location, and $\hat{\phi}$ is an estimate of the geocoding propensity function $\phi : S \to (0, 1]$ (Zimmerman 2008).

The geocoding propensity function $\phi$ can be estimated in various ways, according to the available information about the coarsening process. In this paper, the values of the coarsening probabilities in (3) are assumed to be such that $p_j = \phi(z_j)$, given the coordinate $z_j \in S$ of unit $j$. It follows that:

\[
\hat{\phi}(s) = \frac{\sum_{r=1}^{R} \sum_{j=1}^{n} \Phi_j\, 1\{z_j \in S_r\}\, 1\{s \in S_r\}}{\sum_{r=1}^{R} \sum_{j=1}^{n} 1\{z_j \in S_r\}\, 1\{s \in S_r\}},
\tag{10}
\]

so that $\hat{\phi}$ is constant over each region $S_r \in \mathcal{S}$ and equals the proportion of non-coarsened points in $S_r$.
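Estimators (9) and (10) can be sketched in a few lines of Python. The code below is a minimal illustration under assumptions that are not taken from the paper: the kernel $K_h$ is an isotropic Gaussian density, the regions are 16 unit squares rather than hexagons, the sum in (9) is taken over the correctly geocoded units only (our reading of "observed unit's point location"), and the function names `geocoding_propensity` and `intensity` are hypothetical.

```python
import numpy as np


def geocoding_propensity(region, phi, n_regions):
    """Eq. (10): share of non-coarsened units in each region.

    Regions that contain no unit at all are returned as NaN.
    """
    total = np.bincount(region, minlength=n_regions)
    geocoded = np.bincount(region, weights=phi, minlength=n_regions)
    return np.divide(geocoded, total,
                     out=np.full(n_regions, np.nan), where=total > 0)


def intensity(s, z_obs, phi_hat_obs, h):
    """Eq. (9) with a Gaussian kernel K_h (an assumed choice).

    s           : (m, 2) evaluation points
    z_obs       : (q, 2) locations of the geocoded units
    phi_hat_obs : (q,) estimated propensity at each geocoded unit
    """
    d2 = ((s[:, None, :] - z_obs[None, :, :]) ** 2).sum(axis=-1)
    kern = np.exp(-0.5 * d2 / h**2) / (2.0 * np.pi * h**2)
    return kern @ (1.0 / phi_hat_obs)


rng = np.random.default_rng(2)
n = 250
z = rng.uniform(0.0, 4.0, size=(n, 2))             # hypothetical point pattern
# Hypothetical partition: 16 unit squares indexed 0..15.
region = np.floor(z[:, 0]).astype(int) * 4 + np.floor(z[:, 1]).astype(int)
phi = rng.binomial(1, 0.6, size=n)                 # 1 = geocoded, 0 = coarsened

phi_hat = geocoding_propensity(region, phi, n_regions=16)
obs = phi == 1
grid = np.array([[1.0, 1.0], [2.5, 3.0]])          # two evaluation points
lam_hat = intensity(grid, z[obs], phi_hat[region[obs]], h=0.5)
```

Each geocoded unit is weighted by the inverse of the estimated propensity of its region, so a region where few addresses were geocoded contributes proportionally more per observed point. The weights of the geocoded units in a region add up to the total number of units in that region.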
As stated in Zimmerman (2008), function (9) can be estimated via a weighted kernel intensity estimator (Diggle 1985).

To sum up, the solution we propose in this paper consists of four steps:

1. the intensity function of the coarsened point process $Z$ is estimated according to Zimmerman (2008) through estimators (9) and (10);

2. the likelihood of SLM (1) marginalised with respect to $y_P$ is derived from (8); we denote that likelihood function as $L(\rho, \beta, \sigma^2 \mid y, X, Z, \Phi)$;

3. the likelihood $L(\rho, \beta, \sigma^2 \mid y, X, Z, \Phi)$ is marginalised with respect to $Z_P$, that is:

\[
L(\rho, \beta, \sigma^2 \mid y, X, Z_P, \Phi) = \int_{S^{n-p}} L(\rho, \beta, \sigma^2 \mid y, X, Z_P, z_C, \Phi)\, \hat{\varrho}(z_C \mid Z_P)\, \mathrm{d}z_C
\tag{11}
\]

where $\hat{\varrho} : S^{n-p} \to \mathbb{R}^+$ is the conditional probability density function of $Z_C \mid Z_P$ implied by the estimated intensity function $\hat{\lambda}$;

4. the marginal likelihood $L(\rho, \beta, \sigma^2 \mid y, X, Z_P, \Phi)$ is maximised with respect to $\rho$, $\beta$ and $\sigma^2$.

As anticipated, marginalisation (11) has to be performed numerically since it seems impossible to compute it analytically. However, two issues may make the outlined method computationally unfeasible. Firstly, the high-dimensional integration space in (11) may substantially deteriorate the performance of Monte Carlo integration methods. Secondly, the need to evaluate integral (11) at every step of the optimisation procedure dramatically exacerbates the problem outlined in the previous point.

In order to overcome both problems (and the second in particular), we rely on the cross-entropy algorithm for the optimisation of noisy functions (Rubinstein and Kroese 2004), which iteratively marginalises and optimises the likelihood function $L(\rho, \beta, \sigma^2 \mid y, X, Z_P, Z_C, \Phi)$ at the same time. The Monte Carlo simulations discussed in the next section have been performed adopting the same parameters and instrumental distributions of the cross-entropy algorithm as in Bee et al. (2017), where the method has been applied to maximum likelihood estimation of generalised linear multilevel models (the only exception is the number $N$ of draws, as will be clarified later).

3.2 Theoretical properties and generalisations

As stated in the introduction, this paper aims at proposing an estimation method for spatial models à la Cliff-Ord (Cliff and Ord 1969) where a portion of the data is affected by coarsening; thus the primary interest is devoted to the parameters of that model and to the other measures of covariates' effects (e.g. direct, indirect and total impacts, which will be discussed in Section 3.3). However, the theoretical properties of the proposed estimation method cannot be easily derived, considering the composite nature of the model, which, in fact, consists of three elements: the point process, the coarsening process, and the spatial model (the SLM in this case).

The properties of the estimators of the model's parameters clearly depend on the statistical properties of both the geocoding propensity function estimator (10) and the intensity function estimator (9). The former is basically a frequency estimator which may be interpreted as an estimator of the mean value of the coarsening intensity function over each coarsening region. Asymptotic properties of estimator (9) along with estimator (10) are discussed in Zimmerman (2008). Finally, the asymptotics of maximum likelihood estimators of the SLM for non-coarsened data are analysed in depth in Lee (2004).

Since the estimation method proposed in this paper relies on a marginalisation of the full likelihood function of the marked point process which describes all three random processes, the major concern for consistency of the estimators is represented by misspecification problems in the model.
Although that issue is clearly important, it is worth stressing that the modelling approach proposed in this paper can be easily adapted or generalised to other coarsening mechanisms, point patterns, or stochastic spatial processes, as it is only required that the model can be identified and its likelihood marginalised. In general, if the estimators adopted for each component of the model (point process, coarsening process, and spatial process) are individually consistent, the marginalisation preserves such property for the parameter estimates once the effects of coarsening are considered.

3.3 Impact estimators

According to LeSage and Pace (2009), the effects of covariates on the dependent variable of an SLM do not solely depend on the regression coefficients $\beta$, as the spatially-lagged dependent variable induces an indirect effect resulting from the autoregressive parameter $\rho$ and the spatial weight matrix $W$. It follows that the overall impact of a regressor on the value of the dependent variable can be decomposed into a direct and an indirect impact, which, however, are not constant amongst units. For these reasons, averages of total ($T(\beta)$), direct ($D(\beta)$) and indirect ($M(\beta)$) impacts are usually computed (LeSage and Pace 2009):

\[
T(\beta) = n^{-1} \iota_n^{T} (I - \rho W)^{-1} \iota_n \beta,
\tag{12a}
\]

\[
D(\beta) = n^{-1} \operatorname{tr}\left[(I - \rho W)^{-1}\right] \beta,
\tag{12b}
\]

\[
M(\beta) = T(\beta) - D(\beta).
\tag{12c}
\]

According to the model described in Section 2, some elements of the spatial weight matrix $W$ are not known when geocoding is not complete. It follows that impacts should be estimated via Monte Carlo simulations where the weight matrices are generated from realisations of the point process $Z$ with estimated intensity function $\hat{\lambda}$. Thus, the Monte Carlo estimators of the impact measures (12) can be defined as follows:

\[
\widehat{(A^{-1})} = \frac{1}{N} \sum_{k=1}^{N} (I - \hat{\rho} W_k)^{-1}, \qquad
\hat{T}(\hat{\beta}) = n^{-1} \iota_n^{T}\, \widehat{(A^{-1})}\, \iota_n \hat{\beta},
\]

\[
\hat{D}(\hat{\beta}) = n^{-1} \operatorname{tr}\left[\widehat{(A^{-1})}\right] \hat{\beta}, \qquad
\hat{M}(\hat{\beta}) = \hat{T}(\hat{\beta}) - \hat{D}(\hat{\beta}).
\]

Since Monte Carlo estimation of matrix $\widehat{(A^{-1})}$ may be computationally demanding because of the inversions of the matrices $I - \hat{\rho} W_k$, a truncated geometric series expansion of $(I - \hat{\rho} W_k)^{-1}$ may substantially reduce the computational burden of the simulation:

\[
\widehat{(A^{-1})} = \frac{1}{N} \sum_{k=1}^{N} \sum_{h=0}^{m} \hat{\rho}^{h} W_k^{h},
\]

where $m$ represents the truncation point.

4 Monte Carlo simulations

The performance of the proposed estimation approach in finite samples has been studied by means of Monte Carlo simulations. The complexity of both the modelling setting and the estimation method considerably widens the variety of scenarios which should be considered for studying the estimators' properties in finite samples. In this section eight different scenarios are considered:

(A) a point pattern with $n = 250$ points is generated over an irregular area $S$ according to an inhomogeneous Poisson process with the intensity function $\lambda$ represented in Figure 1. The surface $S$ is partitioned into $R = 17$ hexagonal regions of equal size except for border zones (see Figure 1). The SLM includes two regressors (generated as realisations of a standard normal distribution) and a constant term, so that $X \in \mathbb{R}^{n \times 3}$. The parameters of the SLM are $\rho = 0.5$, $\beta = [1, 1, -1]^{T}$, $\sigma^2 = 1$, whereas the spatial weight matrix $W$ is computed according to (2) with $\kappa(x) = 1\{x \leq 0.5\}$ (note that the sides of the hexagons measure 1.5). Each unit of the point pattern is independently coarsened with probability 0.4.
Simulations are based on $N = 300$ replications, each of which shares the same point pattern and design matrix $X$;

(B) the same simulation settings as in point (A), except that $\rho = 0.3$;

(C) the same simulation settings as in point (A), except that $\rho = 0.7$;

(D) the same simulation settings as in point (A), except that $\sigma^2 = 2$;

(E) the same simulation settings as in point (A), except that $n = 500$ and $\kappa(x) = 1\{x \leq \sqrt{1/8}\}$. Function $\kappa$ has been redefined so that the average neighbourhood area per unit is the same as in case (A);

(F) the same simulation settings as in point (A), except that φ(s) ∝ 0.8 λ(s). Function $\phi$ is set so that the coarsening probability ranges between 0.2 and 0.75, whereas its average equals 0.4, in line with all the other simulation scenarios;

(G) the same simulation settings as in point (A), except that φ(s) ∝ −0.8 λ(s). Function $\phi$ is set so that the coarsening probability ranges between 0.04 and 0.60, whereas its average equals 0.4, in line with all the other simulation scenarios;

(H) the same simulation settings as in point (A), except that the sides of the hexagons measure 1, thus the number of regions is $R = 29$.

Figure 1: Intensity function $\lambda$ used for generating the point process (left) and the realisation of the process for $n = 250$ with hexagonal partition of the space (right).

For each scenario five estimation methods are considered:

• the maximum likelihood estimator based on a dataset where the locations of all units are known, and there is no coarsening. Hereinafter this estimator is referred to as NCM, which stands for non-coarsened model;

• the proposed estimator based on double marginalisation (hereinafter DME);

• the maximum likelihood estimator of the SLM based only on non-coarsened units (hereinafter REM).
In this case the weigh t matrix is computed using the same κ function as the data generating pro cess, but no standardisation is p erformed; • the maxim um likelihoo d estimator of the SLM based only on non-coarsened units. Unlik e the previous case, the spatial weigh t matrix is row-standardised (hereinafter SREM); • the maximum lik eliho od estimator of the SLM based on all points. Lo cation of coarsened p oin ts is imputed to the centroids of regions where p oin ts are lo cated, and a row-standardised weigh t matrix is derived according to the same κ function as the data generating pro cess. Hereinafter this metho d is referred to as CIP , whic h stands for c entr oid impute d p osition . Results of simulations are summarized in terms of relative ro ot mean squared error (RMSE) and relative bias in T ables 1 and 2. T ables 1 and 2 only rep ort 10 4 MONTE CARLO SIMULA TIONS T able 1: Relativ e ro ot mean squared error and relative bias (in paren thesis) of parameter and impact estimators for scenarios A, B, C, D of Monte Carlo sim ulations (see Section 4) for v arious estimation methods. Direct (D), indirect (M) and total (T) impact estimates refer to the second regressor (whose co eﬃcien t is β 1 ). All v alues are m ultiplied b y 100. Method ρ β 0 β 1 β 2 σ D ( β 1 ) M ( β 1 ) T ( β 1 ) Scenario A NCM 4 . 78 10 . 40 2 . 29 3 . 85 4 . 20 2 . 43 9 . 57 5 . 22 ( − 0 . 40) ( − 0 . 70) ( − 0 . 11) ( − 0 . 15) ( − 0 . 59) ( − 0 . 14) ( − 0 . 46) ( − 0 . 28) DME 23 . 74 24 . 58 3 . 99 6 . 16 25 . 41 4 . 59 37 . 50 18 . 81 ( − 22 . 25) ( − 17 . 98) (0 . 68) (0 . 28) (24 . 09) ( − 2 . 57) ( − 36 . 00) ( − 17 . 89) SREM 29 . 67 30 . 19 3 . 80 5 . 80 25 . 70 3 . 97 48 . 34 23 . 10 ( − 28 . 50) ( − 24 . 85) (0 . 74) (0 . 82) (24 . 23) ( − 1 . 33) ( − 47 . 44) − 22 . 46 CIP 33 . 01 27 . 50 3 . 14 4 . 57 44 . 09 3 . 77 47 . 87 23 . 46 ( − 32 . 11) ( − 23 . 75) (1 . 48) (1 . 36) (43 . 36) ( − 2 . 61) ( − 47 . 06) ( − 22 . 98) REM 83 . 95 39 . 36 4 . 13 5 . 88 40 . 72 4 . 
11 49 . 03 23 . 78 ( − 83 . 89) ( − 34 . 71) (2 . 00) ( − 0 . 02) (39 . 54) ( − 1 . 67) ( − 43 . 44) ( − 20 . 82) Scenario B NCM 9 . 84 9 . 93 2 . 17 3 . 50 4 . 51 2 . 20 13 . 81 4 . 63 ( − 0 . 19) ( − 0 . 63) ( − 0 . 05) ( − 0 . 23) ( − 0 . 85) ( − 0 . 03) (0 . 22) (0 . 04) DME 29 . 15 17 . 39 3 . 18 5 . 01 9 . 79 3 . 42 36 . 19 11 . 44 ( − 25 . 81) ( − 9 . 36) ( − 0 . 22) (0 . 06) (7 . 32) ( − 1 . 27) ( − 32 . 69) ( − 10 . 08) SREM 36 . 51 19 . 10 3 . 00 4 . 84 9 . 71 3 . 19 47 . 20 14 . 26 ( − 34 . 22) ( − 12 . 70) ( − 0 . 18) (0 . 38) (7 . 33) ( − 0 . 96) ( − 45 . 38) ( − 13 . 42) CIP 38 . 88 17 . 78 2 . 26 3 . 62 14 . 22 2 . 71 46 . 52 14 . 27 ( − 36 . 64) ( − 13 . 23) ( − 0 . 19) ( − 0 . 18) (13 . 07) ( − 1 . 50) ( − 44 . 43) ( − 13 . 54) REM 84 . 15 20 . 88 3 . 06 4 . 87 12 . 89 3 . 35 49 . 32 15 . 05 ( − 84 . 05) ( − 14 . 62) (0 . 00) ( − 0 . 23) (10 . 99) ( − 1 . 16) ( − 45 . 28) ( − 13 . 53) Scenario C NCM 2 . 42 10 . 99 2 . 21 3 . 76 4 . 92 2 . 36 7 . 79 5 . 54 ( − 0 . 34) ( − 0 . 93) ( − 0 . 20) ( − 0 . 13) ( − 0 . 86) ( − 0 . 33) ( − 0 . 91) ( − 0 . 70) DME 17 . 58 42 . 81 4 . 86 8 . 64 58 . 44 7 . 08 43 . 65 30 . 03 ( − 16 . 66) ( − 33 . 12) (2 . 04) (1 . 03) (56 . 86) ( − 5 . 79) ( − 42 . 98) ( − 29 . 48) SREM 22 . 30 51 . 97 5 . 29 7 . 21 59 . 68 4 . 53 51 . 00 33 . 02 ( − 21 . 60) ( − 44 . 62) (3 . 24) (2 . 08) (57 . 89) ( − 0 . 96) ( − 50 . 39) ( − 32 . 45) CIP 28 . 41 49 . 80 6 . 70 7 . 43 110 . 29 5 . 39 53 . 90 36 . 01 ( − 27 . 92) ( − 45 . 83) (5 . 81) (5 . 18) (109 . 49) ( − 4 . 42) ( − 53 . 50) ( − 35 . 69) REM 84 . 69 86 . 62 8 . 58 8 . 46 109 . 38 5 . 54 48 . 59 32 . 33 ( − 84 . 66) ( − 81 . 77) (7 . 35) (2 . 63) (108 . 13) ( − 2 . 27) ( − 40 . 87) ( − 26 . 86) Scenario D NCM 6 . 35 14 . 79 3 . 30 4 . 97 4 . 87 3 . 30 12 . 12 6 . 51 ( − 0 . 76) (0 . 69) (0 . 37) (0 . 54) ( − 0 . 80) (0 . 30) ( − 0 . 42) ( − 0 . 03) DME 24 . 23 27 . 68 5 . 00 7 . 58 15 . 30 5 . 13 37 . 42 18 . 65 ( − 21 . 96) ( − 16 . 14) (1 . 15) (1 . 23) (13 . 30) ( − 2 . 
06) (−35.01) (−17.16)
SREM 30.77 31.76 4.79 7.48 16.03 4.60 48.73 23.08
(−29.17) (−22.53) (1.40) (1.84) (14.30) (−0.77) (−47.46) (−22.17)
CIP 34.08 29.41 4.20 6.01 27.21 4.12 48.42 23.50
(−32.67) (−22.32) (2.03) (2.09) (26.27) (−2.09) (−47.14) (−22.74)
REM 84.20 39.79 5.34 7.35 25.13 4.86 48.32 23.31
(−84.14) (−31.45) (2.56) (0.82) (23.89) (−1.18) (−42.58) (−20.15)

impacts estimates about the first regressor, since estimates on other regressor impacts are similar.

Obviously, in all scenarios, the NCM estimator is the best performer for all parameters in terms of both bias and RMSE, as it relies on correct positions for all units. For this reason, it is not discussed further below.

Estimates in Tables 1 and 2 show two general results which basically hold under all scenarios. Firstly, the estimates obtained from all estimation methods are rather stable under all simulation settings for most parameters and impacts. The only remarkable exception is represented by the estimates of the error variance, which are rather sensitive to the values of parameters ρ and σ².

Table 2: Relative root mean squared error and relative bias (in parentheses) of parameter and impact estimators for scenarios E, F, G, H of Monte Carlo simulations (see Section 4) for various estimation methods. Direct (D), indirect (M) and total (T) impact estimates refer to the second regressor (whose coefficient is β₁). All values are multiplied by 100.

| Scenario | Method | ρ | β₀ | β₁ | β₂ | σ | D(β₁) | M(β₁) | T(β₁) |
|---|---|---|---|---|---|---|---|---|---|
| E | NCM | 3.42 (−0.17) | 7.01 (−0.20) | 1.47 (0.03) | 2.36 (−0.07) | 3.44 (−0.70) | 1.52 (0.02) | 6.68 (−0.08) | 3.52 (−0.02) |
| E | DME | 26.50 (−25.62) | 24.73 (−20.79) | 3.36 (1.87) | 5.17 (2.65) | 25.81 (24.89) | 3.46 (−2.09) | 40.36 (−39.42) | 19.73 (−19.14) |
| E | SREM | 29.77 (−29.01) | 26.93 (−23.84) | 3.17 (1.94) | 4.97 (2.86) | 25.99 (25.02) | 2.68 (−0.65) | 47.67 (−47.01) | 22.28 (−21.83) |
| E | CIP | 33.62 (−33.00) | 24.32 (−21.96) | 2.96 (2.34) | 5.52 (4.48) | 40.96 (40.44) | 2.91 (−2.26) | 48.02 (−47.43) | 23.23 (−22.90) |
| E | REM | 85.04 (−85.02) | 34.12 (−31.62) | 4.10 (3.18) | 7.02 (5.72) | 41.29 (40.67) | 2.95 (−0.21) | 37.40 (−26.80) | 17.85 (−12.35) |
| F | NCM | 4.67 (−0.10) | 10.23 (0.27) | 2.14 (0.01) | 3.72 (−0.02) | 4.36 (−0.86) | 2.18 (0.03) | 9.09 (0.22) | 4.82 (0.12) |
| F | DME | 21.86 (−20.49) | 24.29 (−17.05) | 3.66 (0.84) | 5.96 (0.33) | 24.63 (23.43) | 4.08 (−2.12) | 35.26 (−33.81) | 17.55 (−16.64) |
| F | SREM | 27.07 (−26.00) | 29.30 (−23.52) | 3.56 (1.15) | 5.75 (0.85) | 24.73 (23.40) | 3.50 (−0.34) | 45.52 (−44.62) | 21.30 (−20.63) |
| F | CIP | 31.00 (−30.14) | 25.80 (−21.94) | 2.95 (1.43) | 4.30 (1.28) | 41.10 (40.46) | 3.57 (−2.40) | 45.71 (−44.85) | 22.40 (−21.86) |
| F | REM | 80.09 (−80.05) | 33.36 (−27.57) | 3.67 (1.55) | 5.86 (0.81) | 37.55 (36.58) | 3.72 (−1.32) | 46.57 (−44.73) | 22.36 (−21.21) |
| G | NCM | 5.04 (−0.48) | 10.24 (−1.40) | 2.11 (0.05) | 3.85 (−0.19) | 4.58 (−1.01) | 2.19 (0.01) | 9.70 (−0.42) | 5.14 (−0.19) |
| G | DME | 24.43 (−22.95) | 27.55 (−21.33) | 3.63 (0.90) | 6.04 (0.13) | 24.82 (23.33) | 4.26 (−2.63) | 38.00 (−36.53) | 19.01 (−18.17) |
| G | SREM | 31.01 (−29.85) | 33.33 (−28.52) | 3.50 (1.26) | 5.69 (0.54) | 25.33 (23.89) | 3.56 (−1.41) | 49.13 (−48.23) | 23.46 (−22.86) |
| G | CIP | 32.87 (−31.94) | 27.86 (−24.23) | 3.04 (1.86) | 4.67 (1.34) | 44.74 (44.01) | 3.34 (−2.34) | 47.38 (−46.54) | 23.09 (−22.59) |
| G | REM | 89.09 (−89.06) | 50.58 (−46.77) | 4.86 (3.27) | 5.73 (−1.09) | 43.11 (42.11) | 4.09 (−1.71) | 54.25 (−48.39) | 26.19 (−23.10) |
| H | NCM | 4.84 (−0.27) | 9.44 (−0.47) | 2.01 (0.01) | 3.40 (−0.23) | 4.93 (−0.60) | 2.16 (0.00) | 9.57 (−0.07) | 5.13 (−0.03) |
| H | DME | 20.28 (−18.59) | 22.43 (−15.55) | 3.52 (0.47) | 6.00 (0.14) | 24.59 (22.91) | 4.16 (−2.41) | 33.16 (−31.33) | 16.73 (−15.66) |
| H | SREM | 29.18 (−28.07) | 29.64 (−24.61) | 3.50 (1.18) | 5.48 (0.57) | 25.88 (24.12) | 3.48 (−0.88) | 47.55 (−46.65) | 22.48 (−21.85) |
| H | CIP | 27.94 (−26.98) | 22.79 (−19.15) | 3.12 (1.76) | 4.21 (1.55) | 40.35 (39.58) | 3.27 (−2.05) | 41.89 (−40.91) | 20.44 (−19.86) |
| H | REM | 84.05 (−84.00) | 38.60 (−34.50) | 4.28 (2.54) | 5.87 (−0.43) | 40.73 (39.48) | 4.02 (−1.16) | 50.74 (−42.00) | 24.49 (−19.87) |

Secondly, the ranking of estimation methods in terms of both bias and RMSE is basically the same whatever scenario we consider, although some differences emerge amongst parameters. If the covariate coefficients are considered (that is, β₀, β₁, β₂), the DME estimator is the best performer in terms of relative bias. On the other hand, the CIP estimator exhibits the smallest RMSE, followed by the SREM estimator, whereas larger RMSEs result from the DME and REM estimators. In any case, for both bias and RMSE, differences amongst estimators are rather small if we consider the covariate coefficients β₁ and β₂, whereas larger variability emerges for β₀.

Things change if the autoregressive parameter ρ is considered. In this case, the DME clearly outperforms all other estimators in terms of both bias and RMSE in all considered scenarios, whereas the second-best estimator is SREM, followed by the CIP and REM estimators. Unlike for the regressor coefficients, differences amongst estimation methods are large in terms of bias and RMSE.
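The relative bias and relative RMSE reported in Tables 1 and 2 can be computed from Monte Carlo replications along the following lines. This is a minimal sketch, assuming that both quantities are expressed relative to the true parameter value and multiplied by 100; the exact definitions belong to the simulation design and are not restated here. The function name `relative_bias_and_rmse`, the true value of ρ and the simulated replications are hypothetical and serve only as an illustration.

```python
import numpy as np


def relative_bias_and_rmse(estimates, true_value):
    """Return (relative bias, relative RMSE), both multiplied by 100.

    Assumed definitions (not taken from the paper):
      relative bias = 100 * mean(estimate - true) / true
      relative RMSE = 100 * sqrt(mean((estimate - true)^2)) / |true|
    """
    estimates = np.asarray(estimates, dtype=float)
    errors = estimates - true_value
    rel_bias = 100.0 * errors.mean() / true_value
    rel_rmse = 100.0 * np.sqrt(np.mean(errors ** 2)) / abs(true_value)
    return rel_bias, rel_rmse


# Hypothetical replications of an estimator of rho that is centred
# below the true value, mimicking a downward-biased estimator.
rng = np.random.default_rng(0)
true_rho = 0.5
rho_hat = rng.normal(loc=0.4, scale=0.03, size=1000)

bias, rmse = relative_bias_and_rmse(rho_hat, true_rho)
print(f"relative bias = {bias:.2f}, relative RMSE = {rmse:.2f}")
```

Because the squared RMSE is the sum of the squared bias and the variance, the relative RMSE can never be smaller than the absolute relative bias. For instance, the REM entry for ρ under scenario E (RMSE 85.04, bias −85.02) implies a relative standard deviation of only about 2, so nearly all of its error is bias rather than sampling variability.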
If the error dispersion parameter σ is considered, the four estimation methods for coarsened data fall into two groups. The first includes the best performers, DME and SREM; the second consists of the CIP and REM estimators, whose relative bias and relative RMSE are almost double those of the estimators in the first group. It is interesting to note that the estimators within each group exhibit very similar relative bias and relative RMSE.

The performance of the estimators in assessing the impacts of covariates clearly reflects their statistical performance on parameters ρ, β₁, and β₂. Thus the CIP, REM, and SREM estimators perform well in estimating the direct impact, whereas the DME definitely outperforms the others when the indirect impact is estimated. The efficiency of DME in indirect impact estimation is large enough to make DME the most efficient estimator also for the total impact. Analogous results also hold in terms of bias.

Although the relative performance of the estimators is fairly stable across the scenarios considered in the simulations, it is worth stressing some stylised facts which emerged from the simulations and are in line with the behaviour that may be expected. Firstly, the relative bias and the relative RMSE of the estimators of the autoregressive parameter ρ and the error dispersion parameter σ are associated with ρ itself. In particular, the larger ρ is, the smaller the relative bias and the relative RMSE of the estimators of ρ and σ. This property seems to hold also for the other parameters (β₀, β₁, β₂), although the magnitude of the effect is small. The bias and the RMSE of impact estimates are related to the value of ρ too, as they depend on the bias and RMSE of the estimates of parameters ρ, β₁, β₂; thus, the higher the value of ρ, the higher the efficiency of the considered estimators.
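The way impact estimates inherit the error in ρ can be illustrated with the standard summary impact measures for the spatial lag model described by LeSage and Pace (2009), which the D, M and T impacts are assumed to follow here. The sketch below is only an illustration under stated assumptions: the k-nearest-neighbour weight matrix, the sample size, the value β_k = 1 and the two values of ρ are all hypothetical, and the helper names `sar_impacts` and `row_standardised_knn` are not taken from the paper.

```python
import numpy as np


def sar_impacts(rho, beta_k, W):
    """Average direct, indirect and total impacts of regressor k
    in the spatial lag model y = rho * W y + X beta + eps."""
    n = W.shape[0]
    # S_k = (I - rho W)^{-1} beta_k collects the partial derivatives dy_i / dx_jk.
    S = np.linalg.inv(np.eye(n) - rho * W) * beta_k
    direct = np.trace(S) / n   # average own-unit effect
    total = S.sum() / n        # average row sum of S_k
    return direct, total - direct, total


def row_standardised_knn(coords, k=4):
    """Row-standardised k-nearest-neighbour weights (an illustrative choice)."""
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)            # a unit is not its own neighbour
    nearest = np.argsort(dist, axis=1)[:, :k]
    W = np.zeros((len(coords), len(coords)))
    W[np.arange(len(coords))[:, None], nearest] = 1.0 / k
    return W


rng = np.random.default_rng(1)
coords = rng.uniform(size=(200, 2))           # hypothetical point locations
W = row_standardised_knn(coords)

# Hypothetical true rho versus an estimate biased downward by 30%.
for rho in (0.5, 0.35):
    d, m, t = sar_impacts(rho, beta_k=1.0, W=W)
    print(f"rho={rho:.2f}: direct={d:.3f} indirect={m:.3f} total={t:.3f}")
```

With a row-standardised W the total impact equals β_k / (1 − ρ), so an underestimated ρ shrinks the total and, above all, the indirect impact, while the direct impact moves very little. This is consistent with the pattern in the tables, where the relative bias of the indirect impact is far larger than that of the direct impact.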
Secondly, scenario (D) shows that, as expected, an increase in the error variance with respect to scenario (A) leads to a loss in efficiency of all estimation methods. On the other hand, if the size of the regions is reduced, the effects of coarsening are more limited, and this translates into an increase in the estimators' efficiency and a decrease in their biases, as the comparison of the results from scenarios (A) and (H) makes apparent.

Thirdly, if scenarios (A), (F), and (G) are compared, no clear pattern emerges, although it seems that the RMSE tends to increase slightly as we move from scenario (F) to (A), and from (A) to (G), suggesting that better estimates can be obtained if coarsening is more frequent in areas where the intensity of the point process is higher, as in scenario (F), whereas the opposite is true if the intensity of the point process and the coarsening probability are inversely related.

5 Conclusions

The estimation method proposed in this paper for tackling the problem of incompletely geocoded data is based on a modelling approach which integrates the point process, the coarsening process and the spatial process through a marked point process model whose likelihood function is then marginalised twice so as to remove the effects of coarsening. Monte Carlo simulations for the spatial lag model have shown that the proposed method is basically equivalent to other methods in terms of bias and RMSE in the estimation of regressor coefficients, whereas it returns more efficient and less biased estimates for the spatial autoregressive parameter, the error variance, indirect impacts, and total impacts. The gains in efficiency and the reductions in bias are substantial, and they clearly emerge under the various simulation settings.

The proposed methodology can be generalised in various directions to account for other forms of data incompleteness typically emerging when analysing large spatial datasets related to individual economic agents.
References

Arbia, G. (2014). A Primer for Spatial Econometrics. Palgrave Macmillan, UK.
Arbia, G., A. Bera et al. (2019). ‘Testing Impact Measures in Spatial Autoregressive Models’. In: International Regional Science Review.
Arbia, G., G. Espa and D. Giuliani (2016). ‘Dirty spatial econometrics’. In: The Annals of Regional Science 56.1, pp. 177–189. doi: 10.1007/s00168-015-0726-5.
Arbia, G., G. Espa and D. Giuliani (2019). Spatial Microeconometrics. Routledge, London (UK).
Bee, M. et al. (2017). ‘A cross-entropy approach to the estimation of generalised linear multilevel models’. In: Journal of Computational and Graphical Statistics 26.3, pp. 695–708. doi: 10.1080/10618600.2016.1278003.
Cliff, A. D. and J. K. Ord (1969). ‘The problem of Spatial Autocorrelation’. In: London Papers in Regional Science 1, Studies in Regional Science, pp. 25–55.
Diggle, P. J. (1985). ‘A kernel method for smoothing point process data’. In: Journal of the Royal Statistical Society, Series C 34, pp. 138–147.
Espa, G. et al. (2017). ‘Model-based variance estimation in two-dimensional systematic sampling’. In: Metron 75.3, pp. 265–275. doi: 10.1007/s40300-017-0125-z.
Horn, R. A. and C. R. Johnson (2013). Matrix Analysis. 2nd ed. Cambridge University Press, New York.
Illian, J. et al. (2008). Statistical Analysis and Modelling of Spatial Point Patterns. Wiley, Chichester (UK). isbn: 978-0-470-01491-2.
Kravets, N. and W. C. Hadden (2007). ‘The accuracy of address coding and the effects of coding errors’. In: Health & Place 13.1, pp. 293–298. doi: 10.1016/j.healthplace.2005.08.006.
Lee, L.-F. (2004). ‘Asymptotic Distributions of Quasi-Maximum Likelihood Estimators for Spatial Autoregressive Models’. In: Econometrica 72.6, pp. 1899–1925.
LeSage, J. P. and R. K. Pace (2009). Introduction to Spatial Econometrics. Chapman & Hall/CRC, Boca Raton (FL, USA).
Lu, T.-T. and S.-H. Shou (2002). ‘Inverses of 2 × 2 Block Matrices’. In: Computers and Mathematics with Applications 43, pp. 119–129.
Rubinstein, R. Y. and D. P. Kroese (2004). The Cross-Entropy Method. Springer, New York.
Zimmerman, D. L. (2008). ‘Estimating the Intensity of a Spatial Point Process from Locations Coarsened by Incomplete Geocoding’. In: Biometrics 64.1, pp. 262–270.
Zimmerman, D. L. and J. Li (2010). ‘The effects of local street network characteristics on the positional accuracy of automated geocoding for geographic health studies’. In: International Journal of Health Geographics 9.1.
