Approximate Variational Estimation for a Model of Network Formation

APPR O XIMA TE V ARIA TIONAL ESTIMA TION FOR A MODEL OF NETWORK FORMA TION ANGELO MELE AND LINGJIONG ZHU A B S T R AC T . W e de velop approximate estimation methods for exponential random graph models (ERGMs), whose likelihood is proportional to an intractable normalizing constant. The usual ap- proach approximates this constant with Monte Carlo simulations, howe ver con ver gence may be exponentially slow . W e propose a deterministic method, based on a variational mean-ﬁeld approxi- mation of the ERGM’ s normalizing constant. W e compute lower and upper bounds for the approx- imation error for any network size, adapting nonlinear large deviations results. This translates into bounds on the distance between true likelihood and mean-ﬁeld likelihood. Monte Carlo simulations suggest that in practice our deterministic method performs better than our conservati ve theoretical approximation bounds imply , for a large class of models. K e ywor ds: Networks, Micr oeconometrics, Lar ge Networks, V ariational Infer ence, Larg e devia- tions, Mean-F ield Appr oximations 1. I N T RO D U C T I O N This paper studies v ariational mean-ﬁeld methods to approximate the likelihood of exponential random graph models (ERGMs), a class of statistical network formation models that has become popular in sociology , machine learning, statistics and more recently economics. While a large part of the statistical network literature is de voted to models with unconditionally or conditionally independent links ( Graham , 2017 ; Airoldi et al. , 2008 ; Bickel et al. , 2013 ), ERGMs allow for conditional and unconditional dependence among links ( Snijders , 2002 ; W asserman and Pattison , 1996 ). These models hav e recently gained attention in economics, because sev eral works ha ve sho wn that ERGMs hav e a microeconomic foundation. In fact, the ERGM likelihood naturally Date : First version: June 15, 2015. This version: May 12, 2020. W e are grateful to the editor and three excellent referees for their suggestions. W e thank Anton Badev , V incent Boucher , Aureo DeP aula, Bryan Graham, Mert G ¨ urb ¨ uzbalaban, Matt Jackson, Hiro Kaido, Michael Leung, Xiaodong Liu and Demian Pouzo for comments on previous versions of this paper . The second author is partially supported by NSF Grant DMS-1613164. 1 2 ANGELO MELE AND LINGJIONG ZHU emerges as the stationary equilibrium of a potential game, where players eng age in a myopic best- response dynamics of link formation ( Blume , 1993 ; Mele , 2017 ; Bade v , 2013 ; Chandrasekhar , 2016 ; Chandrasekhar and Jackson , 2014 ; Boucher and Mouriﬁe , 2017 ), and in a large class of e volutionary games and social interactions models ( Blume , 1993 ; Durlauf and Ioannides , 2010 ). Estimation and inference for ERGMs are challenging, because the likelihood of the observed network is proportional to an intractable normalizing constant, that cannot be computed exactly , e ven in small networks. Therefore, exact Maximum Likelihood estimation (MLE) is infeasible. The usual estimation approach, the Markov Chain Monte Carlo MLE (MCMC-MLE), consists of simulating many networks using the model’ s conditional link probabilities and approximating the constant and the likelihood with Monte Carlo methods ( Snijders , 2002 ; K oskinen , 2004 ; Chatter- jee and Diaconis , 2013 ; Mele , 2017 ). Estimates of the MCMC-MLE con ver ge almost surely to the MLE if the likelihoods are well-behav ed ( Geyer and Thompson , 1992 ). Ho we v er , a recent literature has shown that the simulation methods used to compute the MCMC-MLE may have e x- ponential slo w con vergence, making estimation and approximation of the likelihood impractical or infeasible for a lar ge class of ERGMs ( Bhamidi et al. , 2011 ; Chatterjee and Diaconis , 2013 ; Mele , 2017 ). An alternativ e is the Maximum Pseudo-likelihood estimator (MPLE), that ﬁnds the param- eters that maximize the product of the conditional link probabilities of the model. While MPLE is simple and computationally fast, the properties of the estimator are not well understood, except in special cases, when some regularity conditions are satisﬁed ( Boucher and Mouriﬁe , 2017 ; Be- sag , 1974 ); in practice MPLE may gi v e misleading estimates when the dependence among links is strong ( Geyer and Thompson , 1992 ). Furthermore, since the ERGMs are exponential families, networks with the same sufﬁcient statistics will produce the same MLE, but may ha ve dif ferent MPLE. Our w ork departs from the standard methods of estimation, proposing deterministic approxima- tions of the likelihood, based on the approximated solution of a variational problem. Our strategy is to use a mean-ﬁeld algorithm to approximate the normalizing constant of the ERGM, at an y gi ven parameter value ( W ainwright and Jordan , 2008 ; Bishop , 2006 ; Chatterjee and Diaconis , 2013 ). W e then maximize the resulting approximate log-likelihood, with respect to the parameters. T o be APPR O XIMA TE V ARIA TION AL ESTIMA TION FOR A MODEL OF NETWORK FORMA TION 3 concrete, our approximation consists of using the likelihood of a simpler model with independent links to approximate the constant of the ERGM. The mean-ﬁeld approximation algorithm ﬁnds the likelihood with independent links that minimizes the Kullback-Leibler div ergence from the ERGM likelihood. Using this likelihood with independent links, we compute an approximate normalizing constant. W e then ev aluate the log-likelihood of our model, where the exact normalizing constant is replaced by its mean-ﬁeld approximation. Our main contribution is the computation of exact bounds for the approximation error of the normalizing constant’ s mean-ﬁeld estimate. Our proofs use the theoretical machinery of Chat- terjee and Dembo ( 2016 ) for non-linear large deviations in models with intractable normalizing constants. Using this powerful tool, we provide explicit lower and upper bounds to the error of approximation for the mean-ﬁeld normalizing constant. The bounds depend on the magnitude of the parameters of our model and the size of link externalities ( Mele , 2017 ; Boucher and Mouriﬁe , 2017 ; Chandrasekhar , 2016 ; DePaula , 2017 ). The result holds for dense and moderately sparse networks. Remarkably and con veniently the mean-ﬁeld error con verges to zero as the network becomes lar ge. This guarantees that for large networks, the log-normalizing constant of an ERGM is well approximated by our mean-ﬁeld log-normalizing constant. The main implication of our main result is that we can compute bounds to the distance between the log-likelihood of the ERGM and our approximate log-likelihood; these also con verge in sup- norm as the netw ork gro ws large. As a consequence, we can use the approximated likelihood for estimation in lar ge networks. If the likelihood is strictly conca ve, it is possible to sho w that our approximate estimator con ver ges to the maximum likelihood estimator as long as the network gro ws lar ge. Furthermore, because our bounds may not be sharp, in practice con ver gence could be faster than what is implied in these results. While our method is guaranteed to perform well in large graphs, many applications in volv e small networks. For example, the school netw orks data in the National Longitudinal Study of Adoloscent Health (Add Health) ( Boucher and Mouriﬁe , 2017 ; Moody , 2001 ; Bade v , 2013 ) or the Indian villages in Banerjee et al. ( 2013 ) include on av erage about 200-300 nodes. T o under - stand the performance of our estimator in practice, we perform simple Monte Carlo e xercises in 4 ANGELO MELE AND LINGJIONG ZHU networks with fe w hundreds nodes, comparing mean-ﬁeld estimates to MCMC-MLE and MPLE. Our Monte Carlo results sho w that in practice our estimator works better than the theoretical re- sults suggest, for networks with 50 to 1000 nodes. The median mean-ﬁeld approximation point estimates are close to the true parameters, but exhibit a small bias. Both MCMC-MLE and MPLE sho w a larger variability of point estimates for the two-stars and triangle parameters, measured as median absolute deviation. When we increase the network size, all three estimators impro ve, as expected. W e conclude that our method’ s performance is comparable to av ailable estimators in small networks. While our code can be made faster by exploiting ef ﬁcient matrix algebra libraries and parallelization, the CPU time for estimation is comparable to the estimators implemented in the ergm package in R for networks with less than 200 nodes. The main message of our theoretical results and Monte Carlo simulations is that the approxi- mate mean-ﬁeld approach is a valid alternativ e to existing methods for estimation of a large class of ERGMs. W e note that our theoretical bounds may not be sharp, and in practice the mean- ﬁeld algorithm may ha ve better performance than what is implied by our conservati ve results, as conﬁrmed by our Monte Carlo experiments. T o the best of our kno wledge, this paper is one of the ﬁrst works in economics to use mean- ﬁeld approximations for approximate estimation of comple x models. W e sho w that our application of variational approximations has theoretical guarantees, and we can bound the error of approxi- mation. While similar deterministic methods hav e been used to provide an approximation to the normalizing constant of the ERGM model ( Chatterjee and Diaconis , 2013 ; Amir et al. , 2012 ; Mele , 2017 ; He and Zheng , 2013 ; Aristoff and Zhu , 2018 ), we are the ﬁrst to characterize the v ariational approximation error for a model with cov ariates and its computational feasibility . Our technique can be applied to other models in economics and social sciences. F or example, models of social interactions with binary decisions like in Blume ( 1993 ), Badev ( 2013 ), Durlauf and Ioannides ( 2010 ), models for bundles ( Fox and Lazzati , 2017 ), and models of choices from menus ( Kosyak ov a et al. , 2018 ) hav e similar likelihoods with intractable normalizing constants . Therefore our method of approximation may allow estimation of these models for large sets of bundles or menu choices. APPR O XIMA TE V ARIA TION AL ESTIMA TION FOR A MODEL OF NETWORK FORMA TION 5 The rest of the paper is organized as follo ws. Section 2 presents the theoretical model and v ariational approximations. Section 3 contains the main theoretical results and the error bounds. Section 4 presents the Monte Carlo results and Section 5 concludes. All the proofs and additional Monte Carlo simulations are in the Appendix. Additional results and discussions are presented in the Online Appendix. 2. N E T W O R K F O R M A T I O N M O D E L A N D V A R I A T I O N A L M E T H O D S 2.1. Exponential random graph models. The class of exponential random graphs is an impor- tant generati v e model for networks and has been e xtensi vely used in applications in many disci- plines ( W asserman and Pattison , 1996 ; Jackson , 2010 ; DePaula , 2017 ; Mele , 2017 ; Moody , 2001 ; W immer and Lewis , 2010 ; Amir et al. , 2012 ). In this paper we consider a model with nodal co- v ariates, two-stars and triangles. Our model assumes that the network consists of n heterogeneous nodes, indexed by i = 1 , ..., n ; each node is characterized by a S -dimensional v ector of observ ed attributes τ i ∈ X := ⊗ S j =1 X j , i = 1 , ..., n . The sets X j can represent age, race, gender , income, etc. 1 Let α be a n × n symmetric matrix with elements α ij := ν ( τ i , τ j ) , where ν : X × X → R is a symmetric function and let β and γ be scalars. For ease of e xposition we focus on the case in which the attributes are discrete and ﬁnite, but our results hold when this assumption is relaxed and the number of attrib utes is allowed to increase with the size of the network. The likelihood π n ( g , α, β , γ ) of observing the adjacency matrix g depends on the composition of links, the number of two-stars and the number of triangles (2.1) π n ( g ; α, β , γ ) = exp [ Q n ( g ; α, β , γ )] P ω ∈G n exp [ Q n ( ω ; α, β , γ )] , where the function Q is called a potential function and takes the form (2.2) Q n ( g ; α, β , γ ) = n X i =1 n X j =1 α ij g ij + β 2 n n X i =1 n X j =1 n X k =1 g ij g j k + 2 γ 3 n n X i =1 n X j =1 n X k =1 g ij g j k g ki . 1 For instance, if we consider gender and income, then S = 2 , and we can take ⊗ 2 j =1 X j = { male,female } × { low , medium, high } . The sets X j can be both discrete and continuous. F or example, if we consider gender and income, we can also take ⊗ 2 j =1 X j = { male,female } × [$50,000,$200,000]. Below we restrict the covariates to be discrete, but we allo w the number of types to gro w with the size of the network. 6 ANGELO MELE AND LINGJIONG ZHU and c ( α, β , γ ) := P ω ∈G n exp [ Q n ( ω ; α, β , γ )] is a normalizing constant that guarantees that like- lihood ( 2.1 ) is a proper distribution. The second and third term of the potential function ( 2.2 ) are the counts of two-stars and triangles in the network, rescaled by n . W e re write ( 2.1 ) as (2.3) π n ( g ; α, β , γ ) = exp  n 2 [ T n ( g ; α, β , γ ) − ψ n ( α, β , γ )]  , where T n ( g ; α, β , γ ) = Q n ( g ; α, β , γ ) n − 2 is the potential scaled by n 2 and the log-normalizing constant (scaled by n 2 ) is , (2.4) ψ n ( α, β , γ ) = 1 n 2 log X ω ∈G n exp  n 2 T n ( ω ; α, β , γ )  , and G n := { ω = ( ω ij ) 1 ≤ i,j ≤ n : ω ij = ω j i ∈ { 0 , 1 } , ω ii = 0 , 1 ≤ i, j ≤ n } is the set of all binary matrices with n nodes. The re-scaling of the potential and the log-normalizing constant is necessary for the asymptotic results, to avoid the explosion of the potential function as the size of the network gro ws lar ge. 2.2. Microeconomic equilibrium f oundations. ERGMs caught the attention of economists be- cause recent works proves a beha vioral and equilibrium interpretation of likelihood ( 2.3 ). 2 In fact, these likelihoods naturally arise as equilibrium of best-response dynamics in potential games ( Blume , 1993 ; Monderer and Shapley , 1996 ; Butts , 2009 ; Mele , 2011 ). T o be concrete, let’ s consider the following game. Players’ payof fs are a function of the compo- sition of direct links, friends’ popularity and the number of common friends. The utility of network g for player i is giv en by (2.5) u i ( g , τ ) = n X j =1 α ij g ij + β n n X j =1 n X k =1 g ij g j k + γ n n X j =1 n X k =1 g ij g j k g ki , Each player forms links with other nodes, maximizing utility ( 2.5 ), b ut taking into account the strategies of other players. W e can show that this game of network formation con verges to an exponential random graph in a stationary equilibrium, under the follo wing assumptions: 3 (1) the 2 Butts ( 2009 ), Mele ( 2017 ), Chandrasekhar and Jackson ( 2014 ), Boucher and Mouriﬁe ( 2017 ), Badev ( 2013 ), DePaula ( 2017 ). 3 See Mele ( 2017 ) or Badev ( 2013 ) for more technical details and v ariants of these assumptions. See also Chan- drasekhar ( 2016 ), DePaula ( 2017 ), Chandrasekhar and Jackson ( 2014 ), Boucher and Mouriﬁe ( 2017 ). APPR O XIMA TE V ARIA TION AL ESTIMA TION FOR A MODEL OF NETWORK FORMA TION 7 network formation is sequential, with only two acti v e players in each period; (2) two players meet ov er time with probability ρ ij := ρ ( τ i , τ j , g − ij ) > 0 , where g − ij indicate the network g but link g ij ; and these meetings are i.i.d. over time; (3) before choosing whether to form or delete a link, players recei ve an i.i.d. logistic shock ( ε ij 1 , ε ij 0 ) . At time t , the link g t ij is formed if u i ( g t ij = 1 , g t − 1 − ij , τ ) + u j ( g t ij = 1 , g t − 1 − ij , τ ) + ε t ij 1 ≥ u i ( g t ij = 0 , g t − 1 − ij , τ ) + u j ( g t ij = 0 , g t − 1 − ij , τ ) + ε t ij 0 . Mele ( 2017 ) sho ws that such a model is a potential game ( Monderer and Shaple y , 1996 ) with potential function giv en by equation ( 2.2 ). The probability of observing network g in the long run is giv en by ( 2.3 ) (Theorem 1 in Mele ( 2017 )), thus ( 2.3 ) describes the stationary behavior of the model. In the long-run we observe with high probability the pairwise stable netw orks, where no pair of players want to form or delete a link. 4 2.3. V ariational A pproximations. The constant ψ n ( α, β , γ ) in ( 2.4 ) is intractable because it is a sum ov er all 2 ( n 2 ) possible networks with n nodes; if there are n = 10 nodes, the sum in volv es computation of 2 45 potential functions, which is infeasible. 5 In the literature on exponential family likelihoods with intractable normalizing constant, this problem is solved by approximating the normalizing constant using Markov Chain Monte Carlo ( Snijders , 2002 ; Mele , 2017 ; Goodreau et al. , 2009 ; Koskinen , 2004 ; Caimo and Friel , 2011 ; Murray et al. , 2006 ). Ho wev er , Bhamidi et al. ( 2011 ) has shown that such methods may have exponentially slow con v ergence for many ERGMs speciﬁcations. W e propose methods that av oid simulations and we ﬁnd an approximate likelihood q n ( g ) that minimizes the Kullback-Leibler di v ergence K L ( q n | π n ) between q n and the true likelihood π n : K L ( q n | π n ) = X ω ∈G n q n ( ω ) log  q n ( ω ) π n ( ω ; α, β )  = X ω ∈G n q n ( ω )  log q n ( ω ) − n 2 T n ( ω ; α, β , γ ) + n 2 ψ n ( α, β , γ )  ≥ 0 . (2.6) 4 In the Online Appendix E we provide more details about the microeconomic foundation of the model for the interested reader . 5 See Geyer and Thompson ( 1992 ), Murray et al. ( 2006 ), Snijders ( 2002 ) for e xamples. 8 ANGELO MELE AND LINGJIONG ZHU W ith some algebra we obtain a lo wer -bound for the constant ψ n ( α, β , γ ) ψ n ( α, β , γ ) ≥ E q n [ T n ( ω ; α, β , γ )] + 1 n 2 H ( q n ) := L ( q n ) , where H ( q n ) = − P ω ∈G n q n ( ω ) log q n ( ω ) is the entrop y of distrib ution q n , and E q n [ T n ( ω ; α, β , γ )] is the expected v alue of the re-scaled potential, computed according to the distrib ution q n . T o ﬁnd the best likelihood approximation we minimize K L ( q n | π n ) with respect to q n , which is equi valent to ﬁnding the supremum of the lo wer -bound L ( q n ) , i.e. (2.7) ψ n ( α, β , γ ) = sup q n ∈Q n L ( q n ) = sup q n ∈Q n  E q n [ T n ( ω ; α, β , γ )] + 1 n 2 H ( q n )  , where Q n is the set of all the probability distrib utions on G n . W e hav e transformed the problem of computing an intractable sum into a v ariational problem, i.e. a maximization problem. In general, problem ( 2.7 ) has no closed-form solution, thus the literature suggests to restrict Q n to be the set of all completely factorized distrib ution 6 (2.8) q n ( g ) = Y i,j µ g ij ij (1 − µ ij ) 1 − g ij , where µ ij = E q n ( g ij ) = P q n ( g ij = 1) . This approximation is called a mean-ﬁeld appr oximation of the discrete exponential f amily . Straightforward algebra sho ws that the entrop y of q n is additi ve 1 n 2 H ( q n ) = − 1 2 n 2 n X i =1 n X j =1 [ µ ij log µ ij + (1 − µ ij ) log (1 − µ ij )] , and the expected potential can be computed as E q n [ T n ( ω ; α, β , γ )] = P i P j α ij µ ij n 2 + β P i P j P k µ ij µ j k 2 n 3 + γ 2 P i P j P k µ ij µ j k µ ki 3 n 3 . 6 See W ainwright and Jordan ( 2008 ), Bishop ( 2006 ) APPR O XIMA TE V ARIA TION AL ESTIMA TION FOR A MODEL OF NETWORK FORMA TION 9 The mean-ﬁeld approximation leads to a lower bound of ψ n ( α, β , γ ) , because we restricted Q n , and the simpler v ariational problem is to ﬁnd a n × n symmetric matrix µ ( α, β , γ ) that solv es ψ n ( α, β , γ ) ≥ ψ M F n ( µ ( α, β , γ )) = sup µ ∈ [0 , 1] n 2 : µ ij = µ j i , ∀ i,j  1 n 2 X i,j α ij µ ij + β 2 n 3 X i,j,k µ ij µ j k + 2 γ 3 n 3 X i,j,k µ ij µ j k µ ki − 1 2 n 2 n X i =1 n X j =1 [ µ ij log µ ij + (1 − µ ij ) log (1 − µ ij )]  . (2.9) The mean-ﬁeld problem is in general noncon vex and the maximization can be performed using an y global optimization method, e.g. simulated annealing or Nelder -Mead. 7 3. T H E O R E T I C A L R E S U L T S 3.1. Con vergence of the variational mean-ﬁeld appr oximation. For ﬁnite n , the variational mean-ﬁeld approximation contains an error of approximation. In the next theorem, we provide a lo wer and upper bound to the error of approximation for our model. THEOREM 3.1. F or ﬁxed network size n , the appr oximation err or of the variational mean-ﬁeld pr oblem is bounded as (3.1) C 3 ( β , γ ) n ≤ ψ n ( α, β , γ ) − ψ M F n ( µ ( α, β , γ )) ≤ C 1 ( α, β , γ )  log n n  1 / 5 + C 2 ( α, β , γ ) n 1 / 2 , wher e C 1 ( α, β , γ ) , C 2 ( α, β , γ ) ar e constants depending on α , β and γ and C 3 ( β , γ ) is a constant depending only on β , γ : C 1 ( α, β , γ ) := c 1 ·  max i,j | α ij | + | β | 4 + | γ | 4 + 1  , C 2 ( α, β , γ ) := c 2 ·  max i,j | α ij | + | β | + | γ | + 1  1 / 2 · (1 + | β | 2 + | γ | 2 ) 1 / 2 , C 3 ( β , γ ) := | β | + 4 | γ | , wher e c 1 , c 2 > 0 ar e some univer sal constants. 7 See W ainwright and Jordan ( 2008 ) and Bishop ( 2006 ) for more details. 10 ANGELO MELE AND LINGJIONG ZHU The constants in Theorem 3.1 are functions of the parameters α , β and γ . The upper bound depends on the maximum payoff from direct links and the intensity of payof f from indirect con- nections. The lower bound only depends on the strength of indirect connections payoffs (popularity and common friends, that is β and γ ). One consequence is that our result holds when the network is dense, b ut also when it is moderately sparse, in the sense that | α ij | , | β | and | γ | can ha ve moderate gro wth in n instead of being bounded, and the dif ference of ψ n and ψ M F n goes to zero if C 1 ( α, β , γ ) gro ws slower than n 1 / 5 / (log n ) 1 / 5 and C 2 ( α, β , γ ) grows slower than n 1 / 2 as n → ∞ . For exam- ple, if max i,j | α ij | = O ( n δ 1 ) , | β | = O ( n δ 2 ) , | γ | = O ( n δ 3 ) where δ 1 < 1 5 and δ 2 , δ 3 < 1 20 , then ψ n − ψ M F n goes to zero as n → ∞ . On the other hand, if the graph is too sparse, e.g. | β | = Ω( n ) , | γ | = Ω( n ) , then ψ n cannot be approximated by ψ M F n . Our main Theorem 3.1 implies that we can appr oximate the log-likelihood of the ERGM using the mean-ﬁeld approximated constant. PR OPOSITION 3.1. Let ` n ( g n , α, β , γ ) be the lo g-likelihood of the ERGM ` n ( g n , α, β , γ ) := n − 2 log ( π n ( g n , α, β , γ )) = T n ( g n , α, β , γ ) − ψ n ( α, β , γ ) , and ` M F n ( g n , α, β , γ ) be the “mean-ﬁeld log-lik elihood” obtained by appr oximating ψ n with ψ M F n : ` M F n ( g n , α, β , γ ) := T n ( g n , α, β , γ ) − ψ M F n ( α, β , γ ) . Then for any compact parameter space Θ , (3.2) 0 ≤ sup α,β ,γ ∈ Θ  ` M F n − ` n  ≤ sup α,β ,γ ∈ Θ C 1 ( α, β , γ ) n − 1 / 5 (log n ) 1 / 5 + sup α,β ,γ ∈ Θ C 2 ( α, β , γ ) n − 1 / 2 . Proposition 3.1 shows that as the network size grows large, the mean-ﬁeld approximation of the log-likelihood ` M F n is arbitrarily close to the ERGM log-likelihood ` n . This approximation is similar in spirit to the MCMC-MLE method, where the log-normalizing constant is approximated via MCMC to obtain an approximated log-lik elihood ( Geyer and Thompson , 1992 ; Snijders , 2002 ; DePaula , 2017 ; Moller and W aagepetersen , 2004 ). The main difference is that our approximation is deterministic and does not require any simulation. APPR O XIMA TE V ARIA TION AL ESTIMA TION FOR A MODEL OF NETWORK FORMA TION 11 Note that ` M F n = T n − ψ M F n and ` n = T n − ψ n . If ` n con v erges to ` ∞ uniformly on a compact parameter space Θ , then so does ` M F n . If ` n , ` M F n and ` ∞ are continuous and strictly conca ve, ˆ θ n , ˆ θ M F n , the unique maximizers of ` n and ` M F n will con ver ge to the unique maximizer of ` ∞ and hence ˆ θ n − ˆ θ M F n will go to zero as n → ∞ . In the Online Appendix we provide further results on the behavior of the mean-ﬁeld approximation as n → ∞ , where we discuss the con v ergence of the log-constant. 8 The result in Proposition 3.1 can be used to bound the distance between the mean-ﬁeld estimate and the maximum likelihood estimate, for an y network size rather than for lar ge n . Howe v er , such bounds require additional and stronger assumptions on the shape of the likel ihood. Indeed, in Appendix B , we sho w that a sufﬁcient conditions for computing the bound is a strongly concav e likelihood. Under such assumption, we can use the bound in Proposition 3.1 for the log-likelihood to provide a bound on the distance between MLE and mean-ﬁeld estimator for an y network size n . Howe ver , these bounds may not be sharp, and therefore we consider them very conserv ati ve. In the next section we show via Monte Carlo simulation that in many cases our estimator performs better than the bounds would imply . 4. E S T I M A T I O N E X P E R I M E N T S T o understand the performance of the v ariational approximation in smaller networks, we per- form some Monte Carlo experiments. W e compare the mean-ﬁeld approximation with the stan- dard simulation-based MCMC-MLE Geyer and Thompson ( 1992 ); Snijders ( 2002 ) and the MPLE ( Besag , 1974 ). Our method con v erges in n 2 steps, while the MCMC-MLE may con ver ge in e n 2 steps. The MPLE usually con ver ges faster . 4.1. Appr oximation algorithm f or the normalizing constant. W e implemented our variational approximation for fe w models in the R package mfergm , av ailable in Github . 9 W e follow the statistical machine learning literature and use an iterativ e algorithm that is guaranteed to con ver ge 8 The strict concavity of the likelihood is closely related to the identiﬁcation of parameters in ERGM models, for which there is a lack of general results (see Mele ( 2017 ), Chatterjee and Diaconis ( 2013 ), Aristof f and Zhu ( 2018 ) for examples in special cases). 9 See https://github.com/meleangelo/mfergm , with instructions for installation and few e xamples. 12 ANGELO MELE AND LINGJIONG ZHU to a local maximum of the mean-ﬁeld problem ( W ainwright and Jordan , 2008 ; Bishop , 2006 ). The algorithm is deri ved from ﬁrst-order conditions of the v ariational mean-ﬁeld problem. Let µ ∗ be the matrix that solves the variational problem ( 2.9 ). If we take the deriv ati ve with respect to µ ij and equate to zero, we get (4.1) µ ∗ ij = ( 1 + exp " − 2 α ij − β n − 1 n X k =1  µ ∗ j k + µ ∗ ki  − 4 γ n − 1 n X k =1 µ ∗ j k µ ∗ ki #) − 1 The logit equation in ( 4.1 ) characterizes a system of equations, whose ﬁxed point is a solution of the mean-ﬁeld problem. W e can therefore start from a matrix µ and iterate the updates in ( 4.1 ) until we reach a ﬁxed point, as described in the follo wing algorithm. ALGORITHM 1. Approximation of log-normalizing constant . F ix par ameters α, β , γ and a r elatively small tolerance value  tol . Initialize the n × n matrix µ (0) as µ (0) ij iid ∼ U [0 , 1] , for all i, j . F ix the maximum number of iter ations as T . Then for eac h t = 0 , ..., T : Step 1. Update the entries of matrix µ ( t ) for all i, j = 1 , ..., n (4.2) µ ( t +1) ij = ( 1 + exp " − 2 α ij − β n − 1 n X k =1  µ ( t ) j k + µ ( t ) ki  − 4 γ n − 1 n X k =1 µ ( t ) j k µ ( t ) ki #) − 1 . Step 2. Compute the value of the variational mean-ﬁeld lo g-constant ψ M F ( t ) n as ψ M F ( t ) n = P i P j α ij µ ( t ) ij n 2 + β P i P j P k µ ( t ) ij µ ( t ) j k 2 n 3 + γ 2 P i P j P k µ ( t ) ij µ ( t ) j k µ ( t ) ki 3 n 3 − 1 2 n 2 n X i =1 n X j =1 h µ ( t ) ij log µ ( t ) ij + (1 − µ ( t ) ij ) log (1 − µ ( t ) ij ) i . Step 3. Stop at t ∗ ≤ T if: ψ M F ( t ∗ ) n − ψ M F ( t ∗ − 1) n ≤  tol . Otherwise go bac k to Step 1. The algorithm is initialized at a random uniform matrix µ (0) and iterati vely applies the update ( 4.1 ) to each entry of the matrix, until the increase in the objectiv e function is less than a tolerance le vel . Since the problem is concave in each µ ij , this iterativ e method is guaranteed to ﬁnd a local maximum of ( 2.9 ). 10 In our simulations we use a tolerance lev el of  tol = 0 . 0001 . T o improve 10 There are other alternati ves to the random uniform matrix. Indeed a simple starting value could be the set of conditional probabilities of the model at parameters α, β , γ . W e did not experiment with this alternativ e method. APPR O XIMA TE V ARIA TION AL ESTIMA TION FOR A MODEL OF NETWORK FORMA TION 13 con v ergence we can re-start the algorithm from dif ferent random matrices, as usually done with local optimizers. 11 This step is easily parallelizable, thus preserving the order n 2 con v ergence; while the standard MCMC-MLE is an intrinsically sequential algorithm and cannot be parallelized. 4.2. Monte Carlo design. All the computations in this section are performed on a PC Dell T6610 with 6 Quad-core Intel i7 (48 threads) and 64GB RAM. W e test our approximation using 1000 simulated netw orks. Each node i has a binary attrib ute x i , i.e. x i iid ∼ B er noull i (0 . 5) . Let z ij = 1 if x i = x j and z ij = 0 otherwise. t z ( g ) := 1 n 2 n X i =1 n X j =1 g ij z ij ; t − z ( g ) := 1 n 2 n X i =1 n X j =1 g ij (1 − z ij ) , (4.3) t e ( g ) := 1 n 2 n X i =1 n X j =1 g ij ; t s ( g ) := 1 n 3 n X i =1 n X j =1 n X k =1 g ij g j k ; t t ( g ) := 1 n 3 n X i =1 n X j =1 n X k =1 g ij g j k g ki , where t e ( g ) , t s ( g ) and t t ( g ) are the fraction of links, two-stars and triangles respectiv ely . And t z ( g ) and t − z ( g ) are the fractions of links of the same type and dif ferent type, respecti vely . The log-likelihood of the model ` n ( g ; α, β , γ ) is (4.4) ` n ( g , x ; α , β , γ ) = α 1 t z ( g ) + α 2 t − z ( g ) + ( β / 2) t s ( g ) + (2 γ / 3) t t ( g ) − ψ n ( α 1 , α 2 , β , γ ) . For computational con v enience we rewrite model ( 4.4 ) in a slightly different but equi v alent way (4.5) ` n ( g , x ; ˜ α, β , γ ) = ˜ α 1 t e ( g ) + ˜ α 2 t z ( g ) + ( β / 2) t s ( g ) + (2 γ / 3) t t ( g ) − ψ n ( α 1 , α 2 , β , γ ) , where we hav e deﬁned ˜ α 1 := α 2 and ˜ α 2 := α 1 − α 2 . W e use speciﬁcation ( 4.5 ) in our simulations. 12 T o generate the artiﬁcial netw orks, we draw i.i.d. attrib utes x i ∼ B er noull i (0 . 5) , initialize a network with n nodes as an Erdos-Renyi graph with probability p = e ˜ α 1 / (1 + e ˜ α 1 ) , and then run 11 In the Monte Carlo exercises we hav e experimented with dif ferent numbers of re-starts of the iterativ e algorithm. Howe v er , it is not clear what w ould be the optimal number of re-starts. A ﬁx ed number of restarts could be suboptimal. It seems reasonable to increase this number as the network gro ws larger . 12 There are other small dif ferences in how we ha ve speciﬁed the model and how we ha ve setup computations using the statnet package in R, that can af fect the comparability of the simulation results, in particular the normalizations of the suf ﬁcient statistics. This is handled by our mfergm package, to guarantee comparability of the estimates obtained with MCMC-MLE, MPLE and Mean-ﬁeld approximate inference. 14 ANGELO MELE AND LINGJIONG ZHU the Metropolis-Hastings network sampler using the simulate.ergm command in the R pack- age ergm to sample 1000 networks, each separated by 10 , 000 iterations, and after a burn-in of 10 million iterations. 13 The MCMC-MLE estimator is solv ed using the Stochastic approxima- tion method of Snijders ( 2002 ), where each simulation has a burnin of 100 , 000 iterations of the Metropolis-Hastings sampler and netw orks are sampled e very 1000 iterations. The other con ver - gence parameters are kept at default of the ergm package. The MPLE estimate is obtained using the default parameters in ergm . T o be sure that our results do not depend on the initialization of the parameters, we start each estimator at the true parameter value, thus decreasing the computational time required for con v ergence. All the code is av ailable in Github for replication. T A B L E 4 . 1 . Monte Carlo estimates, comparison of three methods. T rue parameter vector is ( ˜ α 1 , ˜ α 2 , β , γ ) = ( − 2 , 1 , 1 , 1) n = 50 MCMC-MLE MEAN-FIELD MPLE ˜ α 1 ˜ α 2 β γ ˜ α 1 ˜ α 2 β γ ˜ α 1 ˜ α 2 β γ median -2.002 1.024 0.716 -2.042 -2.000 0.998 1.000 0.999 -1.957 1.016 0.118 -0.584 mad 0.295 0.238 3.412 26.132 0.044 0.040 0.012 0.012 0.268 0.179 3.261 16.540 n = 100 MCMC-MLE MEAN-FIELD MPLE median -1.991 0.991 0.886 1.183 -2.002 0.995 1.001 0.999 -1.974 0.991 0.713 1.020 mad 0.197 0.117 2.324 16.150 0.020 0.017 0.005 0.005 0.178 0.085 2.237 10.478 n = 200 MCMC-MLE MEAN-FIELD MPLE median -2.000 1.000 1.043 0.438 -2.003 0.995 1.001 0.999 -1.990 1.000 0.853 0.657 mad 0.127 0.064 1.686 10.627 0.009 0.009 0.002 0.002 0.125 0.046 1.613 7.950 n = 500 MCMC-MLE MEAN-FIELD MPLE median -2.000 1.001 1.000 0.706 -2.002 0.994 1.016 0.992 -1.994 1.001 0.912 0.762 mad 0.084 0.033 1.090 6.962 0.007 0.008 0.023 0.011 0.074 0.023 0.945 4.691 Results of 1000 Monte Carlo estimates using three methods. MCMC-MLE is the Monte Carlo Maximum Likelihood estimator of Ge yer and Thompson ( 1992 ), as implemented in ergm in R , with a stochastic approximation algorithm Snijders ( 2002 ). MEAN-FIELD is our method. MPLE is the Maximum Pseudo-Likelihood Estimate. Each network is generated with a 10 million run of the Metropolis-Hastings sampler of the ergm command in R, sampling ev ery 10000 iterations. mad is the median absolute de viation. 4.3. Results. The ﬁrst model has true parameter v ector ( ˜ α 1 , ˜ α 2 , β , γ ) = ( − 2 , 1 , 1 , 1) and the sum- maries of point estimates are reported in T able 4.1 . W e show results for n = 50 , 100 , 200 and 500 ; reporting median and median absolute de viation (mad) of point estimates for each parameter . 13 The code is av ailable in the Github package mfergm , and the function is simulate.model # , where # stands for the model number: 2 is the model with γ = 0 , 3 is the model with β = 0 , and 4 is the model with β 6 = 0 and γ 6 = 0 . APPR O XIMA TE V ARIA TION AL ESTIMA TION FOR A MODEL OF NETWORK FORMA TION 15 The median estimates of the mean-ﬁeld approximation are quite stable and exhibit a small bias, as is well kno wn in the literature ( W ainwright and Jordan , 2008 ; Bishop , 2006 ). The median results for MCMC-MLE and MPLE are quite precise for ˜ α 1 and ˜ α 2 , but vary a lot for β and γ , as shown by the lar ge median absolute deviation. Nonetheless the median point estimates of β and γ are slo wly con v erging to the true parameter vector as n increases. 14 Therefore, the mean-ﬁeld approximation provides estimates in line with MPLE and MCMC-MLE, with more reliability for β and γ in these small sample estimation ex ercises. T A B L E 4 . 2 . Monte Carlo estimates, comparison of three methods. T rue parameter vector is ( ˜ α 1 , ˜ α 2 , β , γ ) = ( − 3 , 2 , 1 , 3) n = 50 MCMC-MLE MEAN-FIELD MPLE ˜ α 1 ˜ α 2 β γ ˜ α 1 ˜ α 2 β γ ˜ α 1 ˜ α 2 β γ median -3.041 2.064 0.743 -0.512 -3.007 1.993 1.000 3.000 -3.026 2.083 0.215 1.764 mad 0.476 0.424 3.811 25.109 0.026 0.026 0.013 0.014 0.514 0.401 3.593 16.538 n = 100 MCMC-MLE MEAN-FIELD MPLE median -3.006 2.015 0.932 0.587 -3.011 1.989 1.000 2.999 -2.991 2.018 0.682 1.773 mad 0.261 0.206 2.538 17.905 0.016 0.016 0.008 0.008 0.259 0.194 2.364 12.123 n = 200 MCMC-MLE MEAN-FIELD MPLE median -3.012 2.007 1.069 2.807 -3.011 1.988 1.000 2.999 -3.005 2.011 0.932 2.988 mad 0.158 0.117 1.822 11.360 0.008 0.008 0.004 0.004 0.156 0.109 1.714 8.144 n = 500 MCMC-MLE MEAN-FIELD MPLE median -2.998 2.000 0.951 3.047 -3.011 1.988 1.002 2.999 -2.998 2.001 0.921 3.117 mad 0.096 0.061 1.276 7.191 0.003 0.003 0.002 0.002 0.083 0.049 1.077 5.378 Notes: see notes for T able 4.1 . T A B L E 4 . 3 . Monte Carlo estimates, comparison of three methods. T rue parameter vector is ( ˜ α 1 , ˜ α 2 , β , γ ) = ( − 3 , 1 , 2 , 1) n = 500 MCMC-MLE MEAN-FIELD MPLE ˜ α 1 ˜ α 2 β γ ˜ α 1 ˜ α 2 β γ ˜ α 1 ˜ α 2 β γ median -3.001 0.998 2.028 -19.034 -3.000 1.000 2.000 1.000 -2.996 1.000 1.488 -7.923 mad 0.086 0.065 7.205 165.600 0.011 0.011 0.0001 0.0001 0.078 0.044 6.345 84.681 n = 1000 MCMC-MLE MEAN-FIELD MPLE ˜ α 1 ˜ α 2 β γ ˜ α 1 ˜ α 2 β γ ˜ α 1 ˜ α 2 β γ median -2.999 1.004 1.809 -0.716 -3.000 1.000 2.000 1.000 -2.999 1.002 1.757 0.540 mad 0.057 0.037 4.891 125.293 0.005 0.005 0.0001 0.0001 0.049 0.022 4.113 61.328 Notes: see notes for T able 4.1 . The case with n = 1000 contains only 500 monte carlo replications. 14 Some of the bias in the mean-ﬁeld approximation may be due to the fact that we only initialize µ once in these simulations. 16 ANGELO MELE AND LINGJIONG ZHU The second set of results is for a model with parameters ( ˜ α 1 , ˜ α 2 , β , γ ) = ( − 3 , 2 , 1 , 3) , see T able 4.2 . The pattern is similar to T able 4.1 . Indeed the mean-ﬁeld estimator seems to work relativ ely well in most cases, especially for the estimates of β and γ . For parameters ˜ α 1 , ˜ α 2 our mean-ﬁeld estimator (median) bias persists as n increases. Finally , we also report a simulation with a larger network with n = 500 , 1000 in T able 4.3 . The results are the same as the other tables and the mean-ﬁeld approximation is robustly close to the true parameter v alues in most simulations. These Monte Carlo experiments suggests that our approximation method performs well in prac- tice. W e conclude that in most cases the mean-ﬁeld approximation algorithm w orks better than our conserv ativ e theoretical results suggest. 15 5. C O N C L U S I O N S A N D F U T U R E W O R K W e have shown that for a lar ge class of exponential random graph models (ERGM), we can approximate the normalizing constant of the likelihood using a mean-ﬁeld variational approxima- tion algorithm ( W ainwright and Jordan , 2008 ; Bishop , 2006 ; Chatterjee and Diaconis , 2013 ; Mele , 2017 ). Our theoretical results use nonlinear lar ge de viations methods ( Chatterjee and Dembo , 2016 ) to bound the error of approximation, showing that it con ver ges to zero as the network gro ws. Our estimation method consists of replacing the log-normalizing constant in the log-likelihood of the ERGM with the value approximated by the mean-ﬁeld algorithm; we then ﬁnd the param- eters that maximize such approximate log-likelihood. Since our approximated constant con verges to the true constant in large netw orks, the approximate log-likelihood con verges to the correct log- likelihood in sup-norm, as the network becomes lar ge. If the likelihoods are well-behav ed and not too ﬂat around the maximizers, we can also sho w that our estimate con ver ges to MLE. Using an iterativ e procedure to ﬁnd the approximate mean-ﬁeld constant, we compare our method to MCMC-MLE and MPLE ( Snijders , 2002 ; Boucher , 2015 ; Besag , 1974 ; DeP aula , 2017 ) in a simple Monte Carlo study for small networks. The mean-ﬁeld approximation exhibits a small bias, b ut the median estimates are similar to MCMC-MLE and MPLE. Theoretically , our method 15 While these results are encouraging, in Appendix we report some example of non-con ver gence of the mean-ﬁeld algorithm, mostly due to our iterativ e algorithm getting trapped in a local maximum in some simulations. APPR O XIMA TE V ARIA TION AL ESTIMA TION FOR A MODEL OF NETWORK FORMA TION 17 con v erges in a number of steps proportional to the number of potential links of a network, while MCMC-MLE could be exponentially slo w . While these results are encouraging, there are se v eral open problems and possible research di- rections. First, it is not clear that the mean-ﬁeld estimates are consistent. Our small Monte Carlo seem to indicate that there is a persistent bias term, but there is no general proof in this setting along the lines of Bickel et al. ( 2013 ) for stochastic block models. Second, it is not clear that the ERGM model is identiﬁed for all parameter values. Indeed some results in this literature suggest otherwise ( Chatterjee and Diaconis , 2013 ; Mele , 2017 ; Boucher and Mouriﬁe , 2017 ). A promis- ing research a v enue for the future is the use of the lar ge n mean-ﬁeld approximation to understand identiﬁcation, similarly to what has been done with graph limits in Chatterjee and Diaconis ( 2013 ). Third, while the mean-ﬁeld approximation is simple and we are able to compute the approximation errors, our lo wer and upper bounds may not be sharp. This raises the question of whether there is another factorization (like in structured mean-ﬁeld) that leads to better approximations and faster con v ergence ( Xing et al. , 2003 ). W e hope that our work will stimulate additional research and more applications of this class of approximations. 18 ANGELO MELE AND LINGJIONG ZHU R E F E R E N C E S Airoldi, Edoardo M., David Blei, Stephen E. Fienberg and Eric P . Xing (2008), ‘Mixed member- ship stochastic blockmodels’, J ournal of Machine Learning Resear ch 9 , 1981–2014. Amir , Eyal, W en Pu and Dorothy Espelage (2012), Approximating partition functions for exponential-f amily random graph models, in ‘ Advances in Neural Information Processing Sys- tems (NIPS)’. Aristof f, Da vid and Lingjiong Zhu (2018), ‘On the phase transition curve in a directed e xponential random graph model’, Advances in Applied Pr obability 50 , 272–301. Bade v , Anton (2013), Discrete games in endogenous networks: Theory and policy . Banerjee, Abhijit, Arun G. Chandrasekhar , Esther Duﬂo and Matthew O. Jackson (2013), ‘The dif fusion of microﬁnance’, Science 341 (6144). Besag, Julian (1974), ‘Spatial interaction and the statistical analysis of lattice systems’, J ournal of the Royal Statistical Society Series B (Methodological) 36 (2), 192–236. Bhamidi, Shankar , Guy Bresler and Allan Sly (2011), ‘Mixing time of exponential random graphs’, The Annals of Applied Pr obability 21 (6), 2146–2170. Bickel, Peter , David Choi, Xiangyu Chang and Hai Zhang (2013), ‘ Asymptotic normality of max- imum likelihood and its v ariational approximation for stochastic blockmodels’, Ann. Statist. 41 (4), 1922–1943. Bishop, Christopher (2006), P attern Reco gnition and Machine Learning , Springer , New Y ork. Blume, Lawrence E. (1993), ‘The statistical mechanics of strategic interaction’, Games and Eco- nomic Behavior 5 (3), 387–424. Borgs, C., J.T . Chayes, L. Lov ´ asz, V .T . S ´ os and K. V esztergombi (2008), ‘Con vergent sequences of dense graphs i: Subgraph frequencies, metric properties and testing’, Advances in Mathematics 219 (6), 1801 – 1851. Boucher , V incent (2015), ‘Structural homophily’, International Economic Review 56 (1), 235–264. Boucher , V incent and Ismael Mouriﬁe (2017), ‘My friends far far away: A random ﬁeld approach to exponential random graph models’, Econometrics J ournal 20 (3), S14–S46. Butts, Carter (2009), Using potential games to parameterize ERG models. working paper . APPR O XIMA TE V ARIA TION AL ESTIMA TION FOR A MODEL OF NETWORK FORMA TION 19 Caimo, Alberto and Nial Friel (2011), ‘Bayesian inference for exponential random graph models’, Social Networks 33 (1), 41–55. Chandrasekhar , Arun (2016), in Y .Bramoulle, A.Galeotti and B. W .Rogers, eds, ‘The Oxford Handbook of the Economics of Netw orks’, Oxford Univ ersity Press, chapter Econometrics of Network F ormation. Chandrasekhar , Arun and Matthew Jackson (2014), T ractable and consistent exponential random graph models. working paper . Chatterjee, Soura v and Amir Dembo (2016), ‘Nonlinear large deviations’, Advances in Mathemat- ics 299 , 396–450. Chatterjee, Sourav and Persi Diaconis (2013), ‘Estimating and understanding exponential random graph models’, The Annals of Statistics 41 (5). Chatterjee, Sourav and S. R. S. V aradhan (2011), ‘The large deviation principle for the Erdos-R ´ enyi random graph’, Eur opean J ournal of Combinatorics 32 (7), 1000 – 1017. DePaula, Aureo (2017), Econometrics of network models, in B.Honore, A.Pakes, M.Piazzesi and L.Samuelson, eds, ‘ Adv ances in Economics and Econometrics: Elev enth W orld Congress’, Cambridge Uni versity Press. DePaula, Aureo, Seth Richards-Shubik and Elie T amer (2018), ‘Identifying preferences in net- works with bounded de gree’, Econometrica 86 (1), 263–288. Durlauf, Stev en N. and Y annis M. Ioannides (2010), ‘Social interactions’, Annual Review of Eco- nomics 2 (1), 451–478. Fox, Jeremy T . and Natalia Lazzati (2017), ‘ A note on identiﬁcation of discrete choice models for bundles and binary g ames’, Quantitative Economics 8 (3), 1021–1036. Geyer , Charles and Elizabeth Thompson (1992), ‘Constrained Monte Carlo maximum likeli- hood for dependent data’, J ournal of the Royal Statistical Society , Series B (Methodological) 54 (3), 657–699. Goodreau, S. M., Kitts J. A. and Morris M. (2009), ‘Birds of a feather , or friend of a friend? using exponential random graph models to inv estigate adolescent social networks’, Demography 46 (1), 103–125. 20 ANGELO MELE AND LINGJIONG ZHU Graham, Bryan (2017), ‘ An empirical model of network formation: with degree heterogeneity’, Econometrica 85 (4), 1033–1063. He, Ran and T ian Zheng (2013), Estimation of exponential random graph models for large social networks via graph limits, in ‘Proceedings of the 2013 IEEE/A CM International Conference on Advances in Social Networks Analysis and Mining’, ASONAM ’13, A CM, New Y ork, NY , USA, pp. 248–255. Iijima, Ryota and Y uichiro Kamada (2014), Social distance and network structures. W orking Paper . Jackson, Matthe w O. (2010), Social and Economics Networks , Princeton Univ ersity Press. K oskinen, Johan (2004), Bayesian analysis of exponential random graphs - estimation of parame- ters and model selection, Research report 2004:2, Department of Statistics, Stockholm Uni ver - sity . K osyako v a, T etyana, Thomas Otter , Sanjog Misra and Christian Neuerburg (2018), Exact MCMC for choices from menus - measuring substitution and complementarity among menu items. Lov asz, L. (2012), Lar ge Networks and Graph Limits , American Mathematical Society colloquium publications, American Mathematical Society . Mele, Angelo (2011), Se gregation in social networks: Monte Carlo maximum likelihood estima- tion. W orking Paper . Mele, Angelo (2017), ‘ A structural model of dense network formation’, Econometrica 85 , 825– 850. Moller , Jesper and Rasmus Plenge W aagepetersen (2004), Statistical inference and simulation for spatial point pr ocesses , Monographs on Statistics and Applied Probability 100, Chapman and Hall. Monderer , Dov and Lloyd Shaple y (1996), ‘Potential games’, Games and Economic Behavior 14 , 124–143. Moody , James (2001), ‘Race, school integration, and friendship segre gation in America’, American J ournal of Sociology 103 (7), 679–716. Murray , Iain A., Zoubin Ghahramani and David J. C. MacKay (2006), MCMC for doubly- intractable distributions, in ‘Proceedings of the T wenty-Second Conference on Uncertainty in APPR O XIMA TE V ARIA TION AL ESTIMA TION FOR A MODEL OF NETWORK FORMA TION 21 Artiﬁcial Intelligence’, pp. 359–366. Radin, Charles and Mei Y in (2013), ‘Phase transitions in e xponential random graphs’, The Annals of Applied Pr obability 23 (6), 2458–2471. Sheng, Shuyang (2012), Identiﬁcation and estimation of network formation g ames. working paper . Snijders, T om A.B (2002), ‘Markov chain Monte Carlo estimation of e xponential random graph models’, J ournal of Social Structur e 3 (2). T rain, K enneth (2009), Discr ete Choice Methods with Simulation , Cambridge Uni versity Press. W ainwright, M.J. and M.l. Jordan (2008), ‘Graphical models, e xponential f amilies, and v ariational inference’, F oundations and T r ends@ in Machine Learning 1 (1-2), 1–305. W asserman, Stanley and Philippa Pattison (1996), ‘Logit models and logistic re gressions for social networks: I. An introduction to Markov graphs and p*’, Psyc hometrika 61 (3), 401–425. W immer , Andreas and Ke vin Lewis (2010), ‘Beyond and belo w racial homophily: ERG models of a friendship network documented on Facebook’, American Journal of Sociology 116 (2), 583– 642. Xing, Eric P ., Michael I. Jordan and Stuart Russell (2003), A generalized mean ﬁeld algorithm for v ariational inference in e xponential families, in ‘Proceedings of the Nineteenth Conference on Uncertainty in Artiﬁcial Intelligence’, U AI’03, pp. 583–591. 22 ANGELO MELE AND LINGJIONG ZHU A P P E N D I X A.1. Proof of Theorem 3.1 . In this proof we will try to follo w closely the notation in Chatterjee and Dembo ( 2016 ). Suppose that f : [0 , 1] N → R is twice continuously differentiable in (0 , 1) N , so that f and all its ﬁrst and second order deri vati v es extend continuously to the boundary . Let k f k denote the supremum norm of f : [0 , 1] N → R . For each i and j , denote (A.1) f i := ∂ f ∂ x i , f ij := ∂ 2 f ∂ x i ∂ x j , and let (A.2) a := k f k , b i := k f i k , c ij := k f ij k . Gi ven  > 0 , D (  ) is the ﬁnite subset of R N so that for any x ∈ { 0 , 1 } N , there exists d = ( d 1 , . . . , d N ) ∈ D (  ) such that (A.3) N X i =1 ( f i ( x ) − d i ) 2 ≤ N  2 . Let us deﬁne (A.4) F := log X x ∈{ 0 , 1 } N e f ( x ) , and for any x = ( x 1 , . . . , x N ) ∈ [0 , 1] N , (A.5) I ( x ) := N X i =1 [ x i log x i + (1 − x i ) log (1 − x i )] . In the proof we rely on Theorem 1.5 in Chatterjee and Dembo ( 2016 ) that we reproduce in Theorem A.1 to help the reader: THEOREM A.1 ( Chatterjee and Dembo ( 2016 )) . F or any  > 0 , (A.6) sup x ∈ [0 , 1] N { f ( x ) − I ( x ) } − 1 2 N X i =1 c ii ≤ F ≤ sup x ∈ [0 , 1] N { f ( x ) − I ( x ) } + E 1 + E 2 , APPR O XIMA TE V ARIA TION AL ESTIMA TION FOR A MODEL OF NETWORK FORMA TION 23 wher e (A.7) E 1 := 1 4 N N X i =1 b 2 i ! 1 / 2  + 3 N  + log |D (  ) | , and E 2 := 4  P N i =1 ( ac ii + b 2 i ) + 1 4 P N i,j =1 ( ac 2 ij + b i b j c ij + 4 b i c ij )  1 / 2 (A.8) + 1 4  P N i =1 b 2 i  1 / 2  P N i =1 c 2 ii  1 / 2 + 3 P N i =1 c ii + log 2 . W e will use the Theorem A.1 to deri ve the lo wer and upper bound of the mean-ﬁeld approxima- tion problem. Our results extend Theorem 1.7. in Chatterjee and Dembo ( 2016 ) from the ERGM with two-stars and triangles to the model that allo ws nodal cov ariates. Notice that in our case the N of the theorem is the number of links, i.e. N =  n 2  . Let (A.9) Z n := X x ij ∈{ 0 , 1 } ,x ij = x j i , 1 ≤ i 0 is some univ ersal constant. T o see why we can choose C 1 ( α, β , γ ) as in ( A.26 ) so that ( A.25 ) holds, we ﬁrst notice that it follows from ( A.25 ) that we can choose C 1 ( α, β , γ ) such that C 1 ( α, β , γ ) ≥ max { ˜ c 1 max ij | α ij | + ˜ c 2 | β | + ˜ c 3 | γ | + ˜ c 4 , ˜ c 5 β 4 , ˜ c 6 γ 4 } , where ˜ c 1 , ˜ c 2 , ˜ c 3 , ˜ c 4 , ˜ c 5 , ˜ c 6 > 0 are some univ ersal constants. Note that max { ˜ c 1 max ij | α ij | + ˜ c 2 | β | + ˜ c 3 | γ | + ˜ c 4 , ˜ c 5 β 4 , ˜ c 6 γ 4 } ≤ ˜ c 1 max ij | α ij | + ˜ c 2 | β | + ˜ c 3 | γ | + ˜ c 4 + ˜ c 5 β 4 + ˜ c 6 γ 4 ≤ c 1 (max i,j | α ij | + | β | 4 + | γ | 4 + 1) for some uni versal constant c 1 > 0 . Thus, we can take C 1 ( α, β , γ ) as in ( A.26 ). W e can also compute from ( A.8 ) that E 2 = 4  X 1 ≤ i 0 is some uni versal constant. Finally , to get lo wer bound, notice that (A.28) 1 2 X 1 ≤ i 0 and µ M F n > 0 . Then (B.1) k ˆ θ n − ˆ θ M F n k ≤ 2 ( µ n + µ M F n ) 1 2 " sup α,β ,γ ∈ Θ C 1 2 1 ( α, β , γ )  log n n  1 10 + sup α,β ,γ ∈ Θ C 1 2 2 ( α, β , γ ) n − 1 4 # , wher e C 1 and C 2 ar e deﬁned in Theor em 3.1 and k · k denotes the Euclidean norm. In Proposition B.1 , if µ n and µ M F n goes to zero sufﬁciently fast as n goes zero, then the bound in ( B.1 ) may not go to zero as n goes to zero. If for e xample µ n , µ M F n are uniformly bounded from belo w , and both sup α,β ,γ ∈ Θ C 1 ( α, β , γ ) and sup α,β ,γ ∈ Θ C 2 ( α, β , γ ) are O (1) , then k ˆ θ n − ˆ θ M F n k = O ( n − 1 / 10 (log n ) 1 / 10 ) . B.1. Proof of Proposition B.1 . W e assume that ψ n (resp. ψ M F n ) is dif ferentiable and µ n -strongly con v ex (resp. µ M F n -strongly con v ex) in θ := ( α, β , γ ) ∈ Θ . Note that ` n = T n − ψ n , ` M F n = T n − ψ M F n , and T n is linear in θ = ( α , β , γ ) , we ha ve that ` n (resp. ` M F n ) is differentiable and µ n -strongly concav e in θ := ( α , β , γ ) ∈ Θ so that for any x, y ∈ Θ , (B.2) ` n ( y ) ≤ ` n ( x ) + ∇ ` n ( x ) T ( y − x ) − µ n 2 k y − x k 2 , 17 Geyer and Thompson ( 1992 ) mentions similar problems arise for the MCMC-MLE. Indeed, as mentioned above, the MLE may not exist. For example, if the number of triangles is zero in the data, it will be impossible to estimate γ and the MCMC-MLE may giv e an approximation with solution that tends to inﬁnity . 30 ANGELO MELE AND LINGJIONG ZHU and in particular , ` n ( ˆ θ M F n ) ≤ ` n ( ˆ θ n ) + ∇ ` n ( ˆ θ n ) T ( ˆ θ M F n − ˆ θ n ) − µ n 2 k ˆ θ M F n − ˆ θ n k 2 (B.3) = ` n ( ˆ θ n ) − µ n 2 k ˆ θ M F n − ˆ θ n k 2 , and similarly , for any x, y ∈ Θ , (B.4) ` M F n ( y ) ≤ ` M F n ( x ) + ∇ ` M F n ( x ) T ( y − x ) − µ n 2 k y − x k 2 , and in particular , ` M F n ( ˆ θ n ) ≤ ` M F n ( ˆ θ M F n ) + ∇ ` M F n ( ˆ θ M F n ) T ( ˆ θ n − ˆ θ M F n ) − µ M F n 2 k ˆ θ n − ˆ θ M F n k 2 (B.5) = ` M F n ( ˆ θ M F n ) − µ M F n 2 k ˆ θ n − ˆ θ M F n k 2 . Adding the inequalities ( B.3 ) and ( B.5 ), we get k ˆ θ n − ˆ θ M F n k 2 ≤ 2 µ M F n + µ n h ` M F n ( ˆ θ M F n ) − ` n ( ˆ θ M F n )  +  ` n ( ˆ θ n ) − ` M F n ( ˆ θ n ) i ≤ 4 µ M F n + µ n sup θ ∈ Θ | ` M F n ( θ ) − ` n ( θ ) | . By applying Theorem 3.1 , we get k ˆ θ n − ˆ θ M F n k ≤ 2 ( µ n + µ M F n ) 1 2  sup α,β ,γ ∈ Θ C 1 ( α, β , γ ) n − 1 5 (log n ) 1 5 + sup α,β ,γ ∈ Θ C 2 ( α, β , γ ) n − 1 2  1 2 ≤ 2 ( µ n + µ M F n ) 1 2  sup α,β ,γ ∈ Θ C 1 2 1 ( α, β , γ ) n − 1 10 (log n ) 1 10 + sup α,β ,γ ∈ Θ C 1 2 2 ( α, β , γ ) n − 1 4  , where the last step is due to the inequality √ x + y ≤ √ x + √ y for an y x, y ≥ 0 . The proof is complete. A P P E N D I X C. A D D I T I O N A L S I M U L A T I O N R E S U L T S C.1. No covariates, edges and tw o-stars model. W e hav e estimated a model with no co v ariates. This corresponds to a model in which ˜ α 2 = 0 or α 1 = α 2 = α . The results of our simulations for small networks are in T able C.1 . Our method performs relativ ely well in this simpler case. Indeed APPR O XIMA TE V ARIA TION AL ESTIMA TION FOR A MODEL OF NETWORK FORMA TION 31 in this case there are results that would allow us to solv e the variational problem in closed form for large n ( Chatterjee and Diaconis , 2013 ; Mele , 2017 ; Aristoff and Zhu , 2018 ; Radin and Y in , 2013 ). The MPLE and MCMC-MLE median estimate seems to con v erge to the true v alue as we increase n , but our approximation seems to perform slightly better here. T A B L E C . 1 . Monte Carlo estimates, comparison of three methods. T rue parameter vector is ( ˜ α 1 , ˜ α 2 , β ) = ( − 2 , 0 , 1) n = 50 MCMC-MLE MEAN-FIELD MPLE ˜ α 1 ˜ α 2 β ˜ α 1 ˜ α 2 β ˜ α 1 ˜ α 2 β median -2.063 0.016 -0.324 -2.021 0.007 0.999 -1.983 0.018 -1.006 0.05 -2.692 -0.614 -23.828 -2.412 -0.372 0.975 -2.439 -0.368 -34.177 0.95 -1.363 0.657 22.738 -1.783 0.413 1.015 -1.449 0.401 14.465 n = 100 MCMC-MLE MEAN-FIELD MPLE ˜ α 1 ˜ α 2 β ˜ α 1 ˜ α 2 β ˜ α 1 ˜ α 2 β median -1.970 -0.042 0.221 -1.981 -0.017 1.000 -1.949 -0.023 -1.231 0.05 -2.241 -0.333 -13.226 -2.101 -0.194 0.993 -2.168 -0.196 -14.402 0.95 -1.602 0.249 16.316 -1.874 0.134 1.012 -1.643 0.142 9.328 n = 200 MCMC-MLE MEAN-FIELD MPLE ˜ α 1 ˜ α 2 β ˜ α 1 ˜ α 2 β ˜ α 1 ˜ α 2 β median -2.012 -0.005 1.483 -1.998 0.002 1.000 -2.003 -0.001 1.225 0.05 -2.214 -0.184 -9.515 -2.067 -0.093 0.997 -2.160 -0.095 -9.682 0.95 -1.796 0.161 12.179 -1.935 0.091 1.003 -1.790 0.095 8.784 Notes. See notes for T able 4.1 . C.2. Model with 2-stars. In this subsection we report estimates of a model where the triangle term is excluded from the speciﬁcation ( γ = 0 in log-likelihood ( 4.5 )). In T able C.2 we report results for 100 simulations of a model with ( ˜ α 1 , ˜ α 2 , β ) = ( − 2 , 1 , 2) . W e run simulations for networks of size n = 50 , 100 , 200 , to show how our method compares to MCMC-MLE and MPLE when the size of the network gro ws. In general, we expect more precise results as n grows lar ge. The results are encouraging and the mean-ﬁeld approximation seems to beha ve as expected. Indeed, the median estimate is very close to the true parameters that generate the data. As the size of the network grows from n = 50 to n = 200 , both MCMC-MLE and MPLE also improv e in precision. The f astest method in terms of computational time is the MPLE. This is because the MPLE’ s speed depends on the number of parameters. Our mean-ﬁeld approximation is as fast as the MCMC-MLE. 32 ANGELO MELE AND LINGJIONG ZHU T A B L E C . 2 . Monte Carlo estimates, comparison of three methods. T rue parameter vector is ( ˜ α 1 , ˜ α 2 , β ) = ( − 2 , 1 , 2) n = 50 MCMC-MLE MEAN-FIELD MPLE ˜ α 1 ˜ α 2 β ˜ α 1 ˜ α 2 β ˜ α 1 ˜ α 2 β median -2.015 0.999 2.303 -1.993 1.000 2.004 -1.996 0.998 1.820 0.05 -2.433 0.641 -1.085 -2.060 0.885 1.916 -2.325 0.780 -2.556 0.95 -1.666 1.337 6.118 -1.905 1.090 2.087 -1.573 1.273 4.783 n = 100 MCMC-MLE MEAN-FIELD MPLE ˜ α 1 ˜ α 2 β ˜ α 1 ˜ α 2 β ˜ α 1 ˜ α 2 β median -1.995 1.012 1.932 -1.980 1.011 2.011 -1.980 1.010 1.783 0.05 -2.189 0.861 0.701 -2.032 0.969 1.992 -2.175 0.901 0.329 0.95 -1.833 1.157 3.314 -1.944 1.044 2.088 -1.816 1.141 2.867 n = 200 MCMC-MLE MEAN-FIELD MPLE ˜ α 1 ˜ α 2 β ˜ α 1 ˜ α 2 β ˜ α 1 ˜ α 2 β median -2.000 1.009 1.938 -1.986 1.005 2.016 -1.997 1.007 1.930 0.05 -2.182 0.925 0.843 -2.004 0.932 1.999 -2.176 0.950 0.592 0.95 -1.882 1.087 4.119 -1.935 1.028 2.214 -1.847 1.069 3.541 Notes. See notes for T able 4.1 . The second set of Monte Carlo experiments is reported in T able C.3 , where the data are generated by parameter vector ( ˜ α 1 , ˜ α 2 , β ) = ( − 2 , 1 , 3) . The pattern is similar to the previous table, but the mean ﬁeld estimates exhibit a little more bias. T A B L E C . 3 . Monte Carlo estimates, comparison of three methods. T rue parameter vector is ( ˜ α 1 , ˜ α 2 , β ) = ( − 2 , 1 , 3) n = 50 MCMC-MLE MEAN-FIELD MPLE ˜ α 1 ˜ α 2 β ˜ α 1 ˜ α 2 β ˜ α 1 ˜ α 2 β median -1.978 1.010 2.742 -1.958 1.026 3.025 -1.921 1.016 2.357 0.05 -2.308 0.745 1.342 -2.045 0.878 2.938 -2.201 0.823 -0.742 0.95 -1.689 1.229 4.466 -1.811 1.141 3.468 -1.547 1.202 4.288 n = 100 MCMC-MLE MEAN-FIELD MPLE ˜ α 1 ˜ α 2 β ˜ α 1 ˜ α 2 β ˜ α 1 ˜ α 2 β median -2.005 1.002 3.022 -1.851 1.091 3.166 -1.997 1.001 3.009 0.05 -2.116 0.892 2.665 -2.274 0.866 2.998 -2.098 0.924 2.514 0.95 -1.902 1.110 3.414 -1.670 1.861 4.092 -1.895 1.096 3.425 n = 200 MCMC-MLE MEAN-FIELD MPLE ˜ α 1 ˜ α 2 β ˜ α 1 ˜ α 2 β ˜ α 1 ˜ α 2 β median -2.003 1.000 2.959 -1.923 1.030 3.107 -1.984 1.000 2.847 0.05 -2.151 0.934 2.314 -2.059 0.922 3.000 -2.104 0.951 2.096 0.95 -1.902 1.064 3.944 -1.836 1.164 4.222 -1.861 1.039 3.666 Notes. See notes for T able 4.1 . APPR O XIMA TE V ARIA TION AL ESTIMA TION FOR A MODEL OF NETWORK FORMA TION 33 C.3. Model with triangles. The second set of simulations in volv es a model with no two-stars, that is β = 0 , in T able C.4 . In this speciﬁcation our mean-ﬁeld approximation seems to do better than the other estimators, at least for this small networks. T A B L E C . 4 . Monte Carlo estimates, comparison of three methods. T rue parameter vector is ( ˜ α 1 , ˜ α 2 , γ ) = ( − 2 , 1 , − 2) n = 50 MCMC-MLE MEAN-FIELD MPLE ˜ α 1 ˜ α 2 γ ˜ α 1 ˜ α 2 γ ˜ α 1 ˜ α 2 γ median -2.024 1.026 -13.959 -2.000 1.005 -2.000 -2.031 1.012 -9.804 0.05 -2.384 0.622 -60.419 -2.321 0.168 -6.425 -2.398 0.758 -45.881 0.95 -1.689 1.457 49.585 -0.739 2.246 -1.777 -1.809 1.394 21.696 n = 100 MCMC-MLE MEAN-FIELD MPLE ˜ α 1 ˜ α 2 γ ˜ α 1 ˜ α 2 γ ˜ α 1 ˜ α 2 γ median -2.006 1.019 -6.053 -1.967 1.035 -2.007 -2.002 1.015 -4.980 0.05 -2.164 0.832 -35.171 -3.472 0.951 -7.368 -2.124 0.876 -23.937 0.95 -1.824 1.183 27.361 -1.388 3.763 -1.910 -1.890 1.153 13.519 n = 200 MCMC-MLE MEAN-FIELD MPLE ˜ α 1 ˜ α 2 γ ˜ α 1 ˜ α 2 γ ˜ α 1 ˜ α 2 γ median -2.007 1.001 -1.002 -1.972 1.031 -2.006 -2.003 1.000 -1.913 0.05 -2.083 0.901 -23.049 -2.014 1.008 -2.115 -2.061 0.929 -15.721 0.95 -1.931 1.095 16.760 -1.473 1.636 -1.983 -1.952 1.072 9.153 Notes. See notes for T able 4.1 . 34 ANGELO MELE AND LINGJIONG ZHU T A B L E C . 5 . Monte Carlo estimates, comparison of three methods. T rue parameter vector is ( ˜ α 1 , ˜ α 2 , β , γ ) = ( − 2 , 1 , − 1 , − 1) n = 50 MCMC-MLE MEAN-FIELD MPLE ˜ α 1 ˜ α 2 β γ ˜ α 1 ˜ α 2 β γ ˜ α 1 ˜ α 2 β γ median -2.008 1.023 -1.256 -4.943 -1.977 1.030 -1.018 -1.002 -1.959 1.015 -2.032 -3.296 mad 0.320 0.267 4.898 43.074 0.153 0.165 0.144 0.154 0.307 0.191 4.532 24.826 n = 100 MCMC-MLE MEAN-FIELD MPLE median -1.996 1.004 -1.138 -3.173 -1.932 1.177 -1.057 -1.021 -1.974 1.006 -1.566 -1.489 mad 0.219 0.133 3.364 28.410 0.567 0.553 0.335 0.346 0.207 0.093 3.119 16.695 n = 200 MCMC-MLE MEAN-FIELD MPLE median -1.995 1.007 -1.155 -0.980 -1.603 1.645 -1.317 -1.078 -1.987 1.003 -1.340 -1.308 mad 0.133 0.069 2.098 18.167 0.559 0.794 0.656 0.558 0.127 0.047 2.064 11.196 n = 500 MCMC-MLE MEAN-FIELD MPLE median -1.998 1.002 -1.070 -1.315 -1.682 1.836 -1.431 -1.155 -1.991 1.000 -1.113 -1.227 mad 0.084 0.033 1.496 10.897 0.805 0.849 0.776 0.883 0.079 0.020 1.340 7.036 Notes: see notes for T able 4.1 . T A B L E C . 6 . Monte Carlo estimates, comparison of three methods. T rue parameter vector is ( ˜ α 1 , ˜ α 2 , β , γ ) = ( − 2 , 1 , − 2 , 3) n = 50 MCMC-MLE MEAN-FIELD MPLE ˜ α 1 ˜ α 2 β γ ˜ α 1 ˜ α 2 β γ ˜ α 1 ˜ α 2 β γ median -2.005 1.024 -2.368 -4.197 -1.955 1.037 -2.022 2.998 -1.958 1.017 -3.006 -0.198 mad 0.349 0.292 5.767 46.688 0.095 0.085 0.088 0.082 0.307 0.196 4.707 26.076 n = 100 MCMC-MLE MEAN-FIELD MPLE median -2.000 0.995 -2.333 1.560 -1.909 1.082 -2.100 2.983 -1.972 0.997 -2.708 2.617 mad 0.216 0.145 3.429 31.810 0.151 0.147 0.199 0.130 0.195 0.099 3.221 17.184 n = 200 MCMC-MLE MEAN-FIELD MPLE median -1.998 0.997 -2.062 1.847 -1.593 1.512 -2.849 2.711 -1.985 0.999 -2.321 2.326 mad 0.129 0.073 2.302 22.032 0.565 0.677 1.195 0.594 0.124 0.049 2.167 13.057 n = 500 MCMC-MLE MEAN-FIELD MPLE median -2.004 1.002 -1.944 2.531 -1.523 1.605 -3.493 2.557 -2.002 1.002 -2.059 2.786 mad 0.091 0.038 1.579 11.813 0.782 0.726 1.472 0.982 0.080 0.024 1.472 8.068 Notes: see notes for T able 4.1 . C.4. Some examples of noncon ver gence. In T ables C.5 and C.6 we sho w e xamples in which our mean-ﬁeld approximation performs worse than the alternati ve estimators. There are sev eral possi- ble explanations for this poor conv er gence. First, it may be that we are not ﬁnding the maximizer of the approximation v ariational problem ( 2.9 ), gi ven the local nature of updates ( 4.1 ). In these simulations we do not start the matrix µ (0) at dif ferent initial values, therefore we con v erge to a local maximum that may not be global. Our package mfergm allo ws the researcher to initialize APPR O XIMA TE V ARIA TION AL ESTIMA TION FOR A MODEL OF NETWORK FORMA TION 35 µ (0) at dif ferent random starting points. This can impro ve con v ergence. In principle we should increase the number of re-starts as n gro ws, as it is known that these models may ha ve multiple modes. Ideally , one can use a Nelder-Mead or Simulated Annealing algorithm to ﬁnd the maxi- mizer of the v ariational problem, but this is more time-consuming. All these ideas lead to simple parallelization of our package’ s functions that are beyond the scope of the present work. Second, the tolerance lev el that we use  tol = 0 . 0001 may be too large. Third, the likelihood may exhibit a phase transition and thus a small difference in parameters may cause a large change in the behavior of the model. W e conjecture that some of these issues are related to identiﬁcation and we plan to explore this in future w ork. C.5. A note on computational speed. In our Monte Carlo ex ercises, we note that the computa- tional speed of the three estimators is similar for small networks. F or n = 100 , the mean-ﬁeld approximation takes about 3 . 5 s to estimate the model, while an MCMC-MLE with a burnin of 100 , 000 and sampling ev ery 1000 iterations takes approximately 5 . 5 s and the MPLE takes about 1 . 7 s . F or n = 50 the estimates tak e 1 . 6 s for mean-ﬁeld, 4 s for MCMC-MLE and 1 . 2 s for MPLE. Ho wev er , for larger networks, our code is computationally inef ﬁcient and results in much larger computational time than using the built-in functions in the ergm package in R for MCMC-MLE and MPLE. W e ha ve experimented with faster iterati ve routines that could speed up the approx- imate solution of the variational mean-ﬁeld problem, but these are not fully stable. Additionally our code does not make ef ﬁcient use of the memory , as the matrix µ is dense and we are not using ef ﬁcient matrix algebra libraries to speed up the computation. W e believ e that such improv ement in our benchmark code will make computational time comparable to MPLE. 36 ANGELO MELE AND LINGJIONG ZHU ONLINE APPENDIX - NO T FOR PUBLICA TION A P P E N D I X D. A S Y M P T OT I C R E S U LT S In this section we consider the model as n → ∞ . W e ha ve seen previously that the log normal- izing constant ψ n ( α, β , γ ) can be approximated by ψ M F n ( µ ( α, β , γ )) by the mean-ﬁeld approxima- tion, where µ ( α, β , γ ) solves the optimization problem in ( 2.9 ) and ψ M F n ( µ ( α, β , γ )) is its optimal v alue, where we recall that ψ M F n ( µ ( α, β , γ )) = sup µ ∈ [0 , 1] n 2 : µ ij = µ j i , ∀ i,j ( 1 n 2 X i,j α ij µ ij + β 2 n 3 X i,j,k µ ij µ j k + 2 γ 3 n 3 X i,j,k µ ij µ j k µ ki − 1 2 n 2 X i,j [ µ ij log µ ij + (1 − µ ij ) log (1 − µ ij )] ) , W e will study the limit as n → ∞ . Before we proceed, we need a representation of the v ector α in the inﬁnite network. The following assumption guarantee that we can switch from the discrete to the continuum. ASSUMPTION D.1. Assume that α ij = α ( i/n, j /n ) , wher e α ( x, y ) : [0 , 1] 2 → R is a deterministic exo genous function that is symmetric, i.e ., α ( x, y ) = α ( y , x ) . 18 Since we have n players, the number of types for the players must be ﬁnite, although it may gro w as n grows. α ij are symmetric, and can take at most n ( n +1) 2 v alues. As n → ∞ , the number of types can become inﬁnite and α ( x, y ) may take inﬁnitely man y v alues. On the other hand, in terms of practical applications, ﬁnitely many v alues often suf ﬁce 19 . 18 T o ease the notations, we project ⊗ S j =1 X j onto [0 , 1] and the function α ( τ i , τ j ) deﬁned previously is no w re-deﬁned from [0 , 1] 2 to R . 19 If an entry of the v ector τ i is continuous, we can alw ays transform the v ariable in a discrete vector using thresholds. For example, if X j = [$50,000,$200,000], we can bucket the incomes into three levels, low: [$50,000,$100,000), medium [$100,000,$150,000) and high: [$150,000, $200,000]. APPR O XIMA TE V ARIA TION AL ESTIMA TION FOR A MODEL OF NETWORK FORMA TION 37 F I G U R E D . 1 . Examples of function α ( x, y ) . 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 x y α 1 α 2 α 3 α 4 α 5 α 6 α 7 α 8 α 9 α 10 α 11 α 12 α 13 α 14 α 15 α 16 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 x y α 1 α 2 α 3 α 4 α 5 α 6 α 7 α 8 α 9 α 10 α 11 α 12 α 13 α 14 α 15 α 16 (A) (B) 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 x y α mm α fm α mf α ff 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 x y α 1 − K − K − K − K α 2 − K − K − K − K α 3 − K − K − K − K α 4 (C) (D) The ﬁgure pro vides se v eral examples of possible partitions of the net beneﬁt function α ( x, y ) with ﬁnite cov ariates. The asymptotic version of this function is deﬁned ov er the unit square. ASSUMPTION D.2. W e assume that α ( x, y ) is uniformly bounded in x and y : (D.1) sup ( x,y ) ∈ [0 , 1] 2 | α ( x, y ) | < ∞ . As a simple example, let us consider gender: the population consists of males and female agents. For example, half of the nodes (population) are males, say i = 1 , 2 , . . . , n 2 and the other half are females, i = n 2 + 1 , n 2 + 2 , . . . , n . 20 That means, α ( x, y ) takes three values according to the three 20 Here, we assume without loss of generality that n is an ev en number . 38 ANGELO MELE AND LINGJIONG ZHU regions:  ( x, y ) : 0 < x, y < 1 2  ,  ( x, y ) : 1 2 < x, y < 1  ,  ( x, y ) : 0 < x < 1 2 < y < 1  S  ( x, y ) : 0 < y < 1 2 < x < 1  , and these three regions correspond precisely to pairs: male-male, female-female, and male-female. This example is represented in Figure D.1 (C). The work of Chatterjee and Diaconis ( 2013 ) sho w that the v ariational problem in ( 2.7 ) translates into an analogous variational problem for the graph limit. 21 In the special case α ( x, y ) ≡ α , it is sho wn in Chatterjee and Diaconis ( 2013 ) that as n → ∞ the log-constant of the ERGM con verges to the solution of the v ariational problem ( D.3 ), that is (D.2) ψ n ( α, β , γ ) → ψ ( α, β , γ ) , where ψ ( α, β , γ ) = sup h ∈W  α ˆ 1 0 ˆ 1 0 h ( x, y ) dxdy + β 2 ˆ 1 0 ˆ 1 0 ˆ 1 0 h ( x, y ) h ( y , z ) dxdy dz (D.3) + 2 γ 3 ˆ 1 0 ˆ 1 0 ˆ 1 0 h ( x, y ) h ( y , z ) h ( z , x ) dxdy dz − 1 2 ˆ 1 0 ˆ 1 0 I ( h ( x, y )) dxdy  , where (D.4) W :=  h : [0 , 1] 2 → [0 , 1] , h ( x, y ) = h ( y , x ) , 0 ≤ x, y ≤ 1  , and we deﬁne the entropy function: I ( x ) := x log x + (1 − x ) log (1 − x ) , 0 ≤ x ≤ 1 , with I (0) = I (1) = 0 . 21 See also Mele ( 2017 ) for similar results in a directed network. APPR O XIMA TE V ARIA TION AL ESTIMA TION FOR A MODEL OF NETWORK FORMA TION 39 In essence the ﬁrst three terms in ( D.3 ) correspond to the expected potential function in the continuum, while the last term in ( D.3 ) corresponds to the entropy of the graph limit. W e will sho w that ( D.2 ) holds with ψ ( α, β , γ ) = sup h ∈W  ˆ 1 0 ˆ 1 0 α ( x, y ) h ( x, y ) dxdy + β 2 ˆ 1 0 ˆ 1 0 ˆ 1 0 h ( x, y ) h ( y , z ) dxdy dz (D.5) + 2 γ 3 ˆ 1 0 ˆ 1 0 ˆ 1 0 h ( x, y ) h ( y , z ) h ( z , x ) dxdy dz − 1 2 ˆ 1 0 ˆ 1 0 I ( h ( x, y )) dxdy  , The function h in the expressions abov e is kno wn as the graphon from the graph limits literature 22 , large de viations literature for random graphs 23 and analysis of the resulting variational prob- lem. 24 and it is a representation of an inﬁnite network, where h is a simple symmetric function h : [0 , 1] 2 → [0 , 1] , and h ( x, y ) = h ( y , x ) . Note that our goal is to approximate ψ M F n and hence ψ n by ψ , whose deﬁnition in v olves the function h , and we call such a function a graphon in the rest of the paper , to be consistent with the literature, while we are not attempting here to establish a theory of graph limits to allow nodal cov ariates. That is an interesting research direction worth in v estigating in the future, but is out of the scope of the current paper . The follo wing proposition shows that for a model with ﬁnitely man y types the variational ap- proximation is asymptotically exact. PR OPOSITION D.1. Under Assumptions D.1 and D.2 , as n → ∞ ψ n ( α, β , γ ) → ψ ( α, β , γ ) , wher e ψ ( α, β , γ ) is deﬁned in ( D.5 ) . Pr oof . It follows directly from Theorem 3.1 and ψ M F n ( µ ( α, β , γ )) → ψ ( α, β , γ ) , as n → ∞ .  The proposition states that as n becomes lar ge, we can approximate the exponential random graph using a model with independent links (conditional on ﬁnitely many types). This is a v ery 22 See Lov asz ( 2012 ), Borgs et al. ( 2008 ) 23 See Chatterjee and V aradhan ( 2011 ), Chatterjee and Diaconis ( 2013 ) 24 See Aristoff and Zhu ( 2018 ), Radin and Y in ( 2013 ) among others. 40 ANGELO MELE AND LINGJIONG ZHU useful result because the latter approximation is simple and tractable, while the exponential ran- dom graph model contains complex dependence patterns that make estimation computationally expensi v e. D.1. Appr oximation of the limit log normalizing constant. W e can analyze and pro vide an ap- proximation of the log-constant in the large network limit. The variational formula for ψ ( α, β , γ ) is an inﬁnite-dimensional problem which is intractable in most cases. Ne vertheless, we can al- ways bound the inﬁnite dimensional problem with ﬁnite dimensional ones (both lower and upper bounds), at least in the absence of transiti vity . For details, see Proposition F .2 in the Online Ap- pendix. The lower -bound in Proposition F .2 coincides with the structured mean-ﬁeld approach of Xing et al. ( 2003 ). In a model with homogeneous players, the lo wer-bound corresponds to the computational approximation of graph limits implemented in He and Zheng ( 2013 ). In the case of extreme homophily , we can also obtain ﬁnite-dimensional approximation, see Proposition F .1 in the Online Appendix. D.2. Characterization of the variational pr oblem. W e recall that the log normalizing constant in the n → ∞ limit is gi ven by the v ariational problem: ψ ( α, β , γ ) = sup h ∈W  ˆ 1 0 ˆ 1 0 α ( x, y ) h ( x, y ) dxdy + β 2 ˆ 1 0 ˆ 1 0 ˆ 1 0 h ( x, y ) h ( y , z ) dxdy dz (D.6) + 2 γ 3 ˆ 1 0 ˆ 1 0 ˆ 1 0 h ( x, y ) h ( y , z ) h ( z , x ) dxdy dz − 1 2 ˆ 1 0 ˆ 1 0 [ h ( x, y ) log h ( x, y ) + (1 − h ( x, y )) log(1 − h ( x, y ))] dxdy  . PR OPOSITION D.2. The optimal graphon h that solves the variational pr oblem ( D.6 ) satisﬁes the Euler-La grang e equation: (D.7) 2 α ( x, y ) + β ˆ 1 0 h ( x, y ) dx + β ˆ 1 0 h ( x, y ) dy + 4 γ ˆ 1 0 h ( x, z ) h ( y , z ) dz = log  h ( x, y ) 1 − h ( x, y )  . APPR O XIMA TE V ARIA TION AL ESTIMA TION FOR A MODEL OF NETWORK FORMA TION 41 Pr oof . The proof follows from the same argument as in Theorem 6.1. in Chatterjee and Diaconis ( 2013 ).  COR OLLAR Y 1. If α ( x, y ) is not a constant function, then the optimal gr aphon h that solves the variational pr oblem ( D.6 ) is not a constant function. Pr oof . If the optimal graphon h is a constant function, then ( D.7 ) implies that α is a constant function. Contradiction.  In general, if a graphon satisﬁes the Euler-Lagrange equation, that only indicates that the graphon is a stationary point, and it is not clear if the graphon is the local maximizer , local minimizer or neither . In the next result, we will sho w that when β is negati v e, any graphon satisfying the Euler - Lagrange equation in our model is indeed a local maximizer . PR OPOSITION D.3. Assume that β < 0 and γ = 0 . If h is a graphon that satisﬁes the Euler - Lagr ange equation ( D.7 ) , then h is a local maximizer of the variational pr oblem ( D.6 ) . Pr oof . Let us deﬁne Λ[ h ] := ˆ 1 0 ˆ 1 0 α ( x, y ) h ( x, y ) dxdy + β 2 ˆ 1 0 ˆ 1 0 ˆ 1 0 h ( x, y ) h ( y , z ) dxdy dz (D.8) − 1 2 ˆ 1 0 ˆ 1 0 [ h ( x, y ) log h ( x, y ) + (1 − h ( x, y )) log(1 − h ( x, y ))] dxdy . Let h satisfy ( D.7 ) and for any symmetric function g and  > 0 suf ﬁciently small, we ha ve Λ[ h + g ] − Λ[ h ] (D.9) =  2 " β 2 ˆ 1 0  ˆ 1 0 g ( x, y ) dy  2 dx − 1 4 ˆ 1 0 ˆ 1 0 I 00 ( h ( x, y )) g 2 ( x, y ) dxdy # + O (  3 ) =  2 " β 2 ˆ 1 0  ˆ 1 0 g ( x, y ) dy  2 dx − 1 4 ˆ 1 0 ˆ 1 0 g 2 ( x, y ) h ( x, y )(1 − h ( x, y )) dxdy # + O (  3 ) , and since β < 0 , we conclude that h is a local maximizer in ( D.6 ).  42 ANGELO MELE AND LINGJIONG ZHU Remark D.1. In gener al, the variational pr oblem for the graphons and the corresponding Euler- Lagr ange equation ( D.7 ) does not yield a closed form solution. In the special case β = γ = 0 , (D.10) ψ ( α, 0 , 0) = sup h ∈W  ¨ [0 , 1] 2 α ( x, y ) h ( x, y ) dxdy − 1 2 ¨ [0 , 1] 2 I ( h ( x, y )) dxdy  , wher e I ( x ) := x log x + (1 − x ) log(1 − x ) and it is easy to see that the optimal graphon h ( x, y ) is given by h ( x, y ) = e 2 α ( x,y ) e 2 α ( x,y ) +1 , and ther efor e , ψ ( α, 0 , 0) = 1 2 ˜ [0 , 1] 2 log(1 + e 2 α ( x,y ) ) dxdy . A P P E N D I X E. D E TA I L S O F E Q U I L I B R I U M E C O N O M I C F O U N DAT I O N S E.1. Setup and pr eferences. Consider a population of n heterogeneous players (the nodes), each characterized by an e xogenous type τ i ∈ ⊗ S j =1 X j , i = 1 , ..., n . The attrib ute τ i is an S -dimensional vector and the sets X j can represent age, race, gender , income, etc. 25 W e collect all τ i ’ s in an n × S matrix τ . The network’ s adjacency matrix g has entries g ij = 1 if i and j are linked; and g ij = 0 otherwise. The network is undirected, i.e. g ij = g j i , and g ii = 0 , for all i ’ s. 26 The utility of player i is (E.1) u i ( g , τ ) = n X j =1 α ij g ij + β n n X j =1 n X k =1 g ij g j k , where α ij := ν ( τ i , τ j ) are symmetric functions ν : ⊗ S j =1 X j × ⊗ S j =1 X j → R and ν ( τ i , τ j ) = ν ( τ j , τ i ) for all i, j ; and β is a scalar . The utility of player i depends on the number of direct links, each weighted according to a function ν of the types τ . This payof f structure implies that the net beneﬁt of forming a direct connection depends on the characteristics of the two indi viduals in v olved in the link. Players also care about the number of links that each of their direct contacts hav e formed. 27 For example, when β > 0 , there is an incenti ve to form links to people that ha ve many friends, e.g. popular kids in school. On the other hand, when β < 0 the incentive is re versed. For e xample, one 25 For instance, if we consider gender and income, then S = 2 , and we can take ⊗ 2 j =1 X j = { male,female } × { low , medium, high } . The sets X j can be both discrete and continuous. F or example, if we consider gender and income, we can also take ⊗ 2 j =1 X j = { male,female } × [$50,000,$200,000]. Below we restrict the covariates to be discrete, but we allo w the number of types to gro w with the size of the network. 26 Extensions to directed networks are straightforward (see Mele ( 2017 )). 27 The normalization of β by n is necessary for the asymptotic analysis. APPR O XIMA TE V ARIA TION AL ESTIMA TION FOR A MODEL OF NETWORK FORMA TION 43 can think that forming links to a person with many connections could decrease our visibility and decrease the effecti v eness of interactions. Similar utility functions hav e been used extensiv ely in the empirical network formation literature. 28 The preferences in ( E.1 ) include only direct links and friends’ populatity . Howe v er , we can also include other types of link externalities. For example, in man y applications the researcher is inter - ested in estimating preferences for common neighbors. This is an important network statistics to measure transitity and clustering in networks. In our model we can easily add an utility component to capture these ef fects. (E.2) u i ( g , τ ) = n X j =1 α ij g ij + β n n X j =1 n X k =1 g ij g j k + γ n n X j =1 n X k =1 g ij g j k g ki , These preferences include an additional parameter γ that measures the effect of common neighbors. The potential function for this model is (E.3) Q n ( g ; α, β ) = n X i =1 n X j =1 α ij g ij + β 2 n n X i =1 n X j =1 n X k =1 g ij g j k + 2 γ 3 n n X j =1 n X k =1 g ij g j k g ki . In general, all the results that we show belo w extend to more general utility functions that include payof fs for link externalities similar to ( 2.5 ). The probability that i and j meet can depend on their networks: it could be a function of their common neighbors, or a function of their degrees and centralities, for example. In Assumption E.1 , we assume that the existence of a link between i and j does not af fect their probability of meeting. This is because we pro ve the existence and functional form of the stationary distrib ution ( 2.3 ) using the detailed balance condition, which is not satisﬁed if we allo w the meeting probabilities to depend on the link between i and j . The model can easily be e xtended to directed networks and the results on equilibria and long- run stationary distrib ution will hold. The results about the approximations of the likelihood sho wn belo w will also hold for directed networks, with minimal modiﬁcations of the proofs. 28 See Mele ( 2017 ), Sheng ( 2012 ), DePaula et al. ( 2018 ), Chandrasekhar and Jackson ( 2014 ), Bade v ( 2013 ), Butts ( 2009 ). 44 ANGELO MELE AND LINGJIONG ZHU Finally , while our model generates dense graphs, the approximations using variational methods and nonlinear large de viations that we dev elop in the rest of the paper also work in moderately sparse graphs. More precisely , the utility of player i is giv en by (E.4) u i ( g , τ ) = n X j =1 α ( n ) ij g ij + β ( n ) n n X j =1 n X k =1 g ij g j k + γ ( n ) n n X j =1 n X k =1 g ij g j k g ki , where | α ( n ) ij | , | β ( n ) | and | γ ( n ) | can have moderate gro wth in n instead of being bounded. W e will gi ve more details later in our paper . 29 Example E.1. (Homophily) Consider a model with ν ( τ i , τ j ) = V − c ( τ i , τ j ) , wher e V > 0 is the beneﬁt of a link and c ( τ i , τ j ) ( = c ( τ j , τ i ) ) is the cost of the link between i and j . T o model homophily in this frame work let the cost function be (E.5) c ( τ i , τ j ) =        c if τ i = τ j , C if τ i 6 = τ j . F or example , consider the parameterization 0 < c < V < C and β = 0 , γ = 0 . In this case the players have no incentive to form links with agents of other gr oups. On the other hand, if we have 0 < c < V < C and β , γ > 0 , also links acr oss gr oups will be formed, as long as β , γ ar e sufﬁciently lar ge. Example E.2. (Social Distance Model) Let the payoff fr om dir ect links be a function of the social distance among the individuals. F ormally , let ν ( τ i , τ j ) := η d ( τ i , τ j ) − c , wher e d ( τ i , τ j ) is a distance function, η is a parameter that determines the sensitivity to the social distance and c > 0 is the cost of forming a link. 30 The case with η < 0 r epr esents a world where individuals pr efer linking to similar a gents and η > 0 r epr esents a world wher e individuals pr efer linking with people at lar ger social distance. Note that even when η < 0 , if we have β , γ > 0 suf ﬁciently lar ge , individuals may still have an incentive to form links with people at lar ger social distance . 29 See Chatterjee and Dembo ( 2016 ) for additional applications of nonlinear large de viations. 30 See Iijima and Kamada ( 2014 ) for a more general example of such model. APPR O XIMA TE V ARIA TION AL ESTIMA TION FOR A MODEL OF NETWORK FORMA TION 45 E.2. Meetings and equilibrium. The network formation process follows a stochastic best-response dynamics: 31 in each period t , two random players meet with probability ρ ij ; upon meeting they hav e the opportunity to form a link (or delete it, if already in place). Players are myopic: when they form a new link, they do not consider how the new link will affect the incentiv es of the other player in the future e volution of the netw ork. ASSUMPTION E.1. The meeting pr ocess is a function of types and the network. Let g − ij indicate the network g without considering the link g ij . Then the pr obability that i and j meet is (E.6) ρ ij := ρ ( τ i , τ j , g − ij ) > 0 for all pairs i and j , and i.i.d. o ver time. Assumption E.1 implies that the meeting process can depend on cov ariates and the state of the network. For example, if two players have many friends in common they may meet with high probability; or people that share some demographics may meet more often. Crucially , e very pair of players has a strictly positiv e probability of meeting. This guarantees that each link of the network has the opportunity of being re vised. Upon meetings, players decide whether to form or delete a link by maximizing the sum of their current utilities, i.e. the total surplus generated by the relationship. W e are implicitly assuming that indi viduals can transfer utilities. When deciding whether to form a ne w link or deleting an existing link, players recei v e a random matching shock ε ij that shifts their preferences. At time t , the links g ij is formed if u i ( g ij = 1 , g − ij , τ )+ u j ( g ij = 1 , g − ij , τ )+ ε ij (1) ≥ u i ( g ij = 0 , g − ij , τ )+ u j ( g ij = 0 , g − ij , τ )+ ε ij (0) . W e mak e the following assumptions on the matching v alue. ASSUMPTION E.2. Individuals r eceive a lo gistic shock befor e they decide whether to form a link (i.i.d. o ver time and players). 31 See Blume ( 1993 ), Mele ( 2017 ), Badev ( 2013 ). 46 ANGELO MELE AND LINGJIONG ZHU The logistic assumption is standard in many discrete choice models in economics and statistics ( T rain ( 2009 )). W e can no w characterize the equilibria of the model, follo wing Mele ( 2017 ) and Chandrasekhar and Jackson ( 2014 ). In particular , we can show that the network formation is a potential game ( Monderer and Shapley ( 1996 )). PR OPOSITION E.1. The network formation is a potential game , and ther e e xists a potential function Q n ( g ; α, β ) that c haracterizes the incentives of all the player s in any state of the network (E.7) Q n ( g ; α, β ) = n X i =1 n X j =1 α ij g ij + β 2 n n X i =1 n X j =1 n X k =1 g ij g j k + 2 γ 3 n n X i =1 n X j =1 n X k =1 g ij g j k g ki . Pr oof . The proposition follows the same lines as Proposition 1 in Mele ( 2017 ) and it is omitted for bre vity .  The potential function Q n ( g ; α, β ) is such that, for an y g ij Q n ( g ; α, β ) − Q n ( g − ij ; α , β ) = u i ( g ) + u j ( g ) − [ u i ( g − ij ) + u j ( g − ij )] . Thus we can keep track of all players’ incentiv es using the scalar Q n ( g ; α, β ) . It is easy to show that all the pairwise stable (with transfers) networks are the local maxima of the potential function. 32 The sequential network formation follo ws a Glauber dynamics, therefore con ver ging to a unique stationary distribution. THEOREM E.1. In the long run, the model con ver ges to the stationary distrib ution π n , deﬁned as (E.8) π n ( g ; α, β ) = exp [ Q n ( g ; α, β )] P ω ∈G exp [ Q n ( ω ; α, β )] = exp  n 2 [ T n ( g ; α, β ) − ψ n ( α, β )]  , wher e T n ( g ; α, β ) = n − 2 Q n ( g ; α, β ) , (E.9) ψ n ( α, β ) = 1 n 2 log X ω ∈G exp  n 2 T n ( ω ; α, β )  , 32 A network g is pairwise stable with transfers if: (1) g ij = 1 ⇒ u i ( g , τ ) + u j ( g , τ ) ≥ u i ( g − ij , τ ) + u j ( g − ij , τ ) and (2) g ij = 0 ⇒ u i ( g , τ ) + u j ( g , τ ) ≥ u i ( g + ij , τ ) + u j ( g + ij , τ ) ; where g + ij represents network g with the addition of link g ij and network g − ij represents network g without link g ij . See Jackson ( 2010 ) for more details. APPR O XIMA TE V ARIA TION AL ESTIMA TION FOR A MODEL OF NETWORK FORMA TION 47 and G := { ω = ( ω ij ) 1 ≤ i,j ≤ n : ω ij = ω j i ∈ { 0 , 1 } , ω ii = 0 , 1 ≤ i, j ≤ n } . Pr oof . The proof is an extension of Theorem 1 in Mele ( 2017 ). See also Chandrasekhar and Jackson ( 2014 ) and Butts ( 2009 ).  Notice that the likelihood ( 2.3 ) corresponds to an ERGM model with heterogeneous nodes and two-stars. As a consequence our model inherits all the estimation and identiﬁcation challenges of the ERGM model. A P P E N D I X F. S P E C I A L C A S E : T H E E D G E - S TA R M O D E L The general solution of the variational problem ( D.3 ) is complicated. Ho we ver , there are some special cases where we can characterize the solution with e xtreme detail. These examples show ho w we can solv e the variational approximation in stylized settings, and we use them to e xplain ho w the method works in practice. In this section, we consider the special case in the absence of transiti vity , i.e. γ = 0 and we get further results for the edge-star model. F .1. Extreme homophily . W e can e xploit homophily to obtain a tractable approximation. Sup- pose that there are M types in the population. The cost of forming links among individuals of the same group is ﬁnite, b ut there is a lar ge cost of forming links among people of dif ferent groups (potentially inﬁnite). W e show that in this case the normalizing constant can be approximated by solving M independent univ ariate maximization problems. In the special case of extreme ho- mophily , our model con v erges to a block-diagonal model. PR OPOSITION F .1. Let 0 = a 0 < a 1 < · · · < a M = 1 be a given sequence . Assume that (F .1) α ( x, y ) = α mm , if a m − 1 < x, y < a m , m = 1 , 2 , . . . , M . and α ( x, y ) ≤ − K otherwise is a given function. Let ψ ( α, β , 0; − K ) be the variational pr oblem for the graphons and ψ ( α , β , 0; −∞ ) = lim K →∞ ψ ( α, β , 0; − K ) . Then, we have (F .2) ψ ( α, β , 0; −∞ ) = M X m =1 ( a m − a m − 1 ) 2 sup 0 ≤ x ≤ 1  α mm x + β 2 x 2 − 1 2 I ( x )  . 48 ANGELO MELE AND LINGJIONG ZHU Pr oof . First, observe that ψ ( α, β , 0; −∞ ) (F .3) = sup h ∈W −  M X i =1 α i ¨ [ a i − 1 ,a i ] 2 h ( x, y ) dxdy + β 2 ˆ 1 0 ˆ 1 0 h ( x, y ) h ( y , z ) dxdy dz − 1 2 M X i =1 ¨ [ a i − 1 ,a i ] 2 I ( h ( x, y )) dxdy  = sup h ∈W −  M X i =1 α i ¨ [ a i − 1 ,a i ] 2 h ( x, y ) dxdy + β 2 M X i =1 ˆ a i a i − 1  ˆ a i a i − 1 h ( x, y ) dy  2 dx − 1 2 M X i =1 ¨ [ a i − 1 ,a i ] 2 I ( h ( x, y )) dxdy  = M X i =1 sup h :[ a i − 1 ,a i ] 2 → [0 , 1] h ( x,y )= h ( y,x )  α i ¨ [ a i − 1 ,a i ] 2 h ( x, y ) dxdy + β 2 ˆ a i a i − 1  ˆ a i a i − 1 h ( x, y ) dy  2 dx − 1 2 ¨ [ a i − 1 ,a i ] 2 I ( h ( x, y )) dxdy  , where (F .4) W − := ( h ∈ W : h ( x, y ) = 0 for an y ( x, y ) / ∈ M [ i =1 [ a i − 1 , a i ] 2 ) . By taking h to be a constant on [ a i − 1 , a i ] 2 , it is clear that (F .5) ψ ( α, β , 0; −∞ ) ≥ M X i =1 ( a i − a i − 1 ) 2 sup 0 ≤ x ≤ 1  α i x + β 2 x 2 − 1 2 I ( x )  . APPR O XIMA TE V ARIA TION AL ESTIMA TION FOR A MODEL OF NETWORK FORMA TION 49 By Jensen’ s inequality ψ ( α, β , 0; −∞ ) ≤ M X i =1 sup h :[ a i − 1 ,a i ] 2 → [0 , 1] h ( x,y )= h ( y,x )  α i ˆ a i a i − 1  ˆ a i a i − 1 h ( x, y ) dy  dx (F .6) + β 2 ˆ a i a i − 1  ˆ a i a i − 1 h ( x, y ) dy  2 dx − 1 2 ( a i − a i − 1 ) ˆ a i a i − 1 I  1 a i − a i − 1 ˆ a i a i − 1 h ( x, y ) dy  dx  ≤ M X i =1 ( a i − a i − 1 ) 2 sup 0 ≤ x ≤ 1  α i x + β 2 x 2 − 1 2 I ( x )  .  The net beneﬁt function α ( x, y ) assumed in the Proposition is shown in Figure D.1 (D). Essen- tially this result means that with extreme homophily , we can approximate the model, assuming perfect segregation: thus we can independently solve the v ariational problem of each type. This approach is computationally very simple, since each variational problem becomes a uni v ariate maximization problem. The solution of such uni v ariate problem has been studied and characterized in previous work by Chatterjee and Diaconis ( 2013 ), Radin and Y in ( 2013 ), Aristoff and Zhu ( 2018 ) and Mele ( 2017 ). It can be sho wn that the solutions µ ∗ m , where m = 1 , .., M , are the ﬁxed point of equations (F .7) µ m = exp [ α mm + β µ m ] 1 + exp [ α mm + β µ m ] , for each group m , and β µ ∗ m (1 − µ ∗ m ) < 1 . The global maximizer µ ∗ m is unique except on a phase transition curve { ( α mm , β ) : α mm + β = 0 , α mm < − 1 } , see e.g. Radin and Y in ( 2013 ); Aristoff and Zhu ( 2018 ). It is shown in Chatterjee and Diaconis ( 2013 ) that the network of each group corresponds to an Erd ˝ os-R ´ enyi graph with probability of a link equal to µ ∗ m . F .2. Analytically T ractable Bounds. In this section, for the edge-star model, we provide analyt- ically tractable bounds for ψ ( α, β , γ ) when γ = 0 . 50 ANGELO MELE AND LINGJIONG ZHU PR OPOSITION F .2. Let γ = 0 and 0 = a 0 < a 1 < · · · < a M − 1 < a M = 1 be a given sequence. Let us assume that α ( x, y ) = α ml , if a m − 1 < x < a m and a l − 1 < y < a l , wher e 1 ≤ m, l ≤ M . Then, we have sup 0 ≤ u ml ≤ 1 u ml = u lm , 1 ≤ m,l ≤ M M X m =1 ( a m − a m − 1 )  M X l =1 ( a l − a l − 1 ) α ml u ml + β 2 M X l =1 ( a l − a l − 1 ) u ml ! 2 − 1 2 M X l =1 ( a l − a l − 1 ) I ( u ml )  ≤ ψ ( α, β , 0) ≤ M X m =1 ( a m − a m − 1 ) sup 0 ≤ u ml ≤ 1 1 ≤ l ≤ M  M X l =1 ( a l − a l − 1 ) α ml u ml + β 2 M X l =1 ( a l − a l − 1 ) u ml ! 2 − 1 2 M X l =1 ( a l − a l − 1 ) I ( u ml )  . Pr oof . T o compute the lo wer and upper bounds, let us deﬁne (F .8) u ij ( x ) = 1 a j − a j − 1 ˆ a j a j − 1 h ( x, y ) dy , for any a i − 1 < x < a i . W e can compute that (F .9) ¨ [0 , 1] 2 α ( x, y ) h ( x, y ) dxdy = M X i =1 M X j =1 ( a j − a j − 1 ) ˆ a i a i − 1 α ij u ij ( x ) dx. Moreov er , β 2 ˆ 1 0 ˆ 1 0 ˆ 1 0 h ( x, y ) h ( y , z ) dxdy dz = β 2 ˆ 1 0  ˆ 1 0 h ( x, y ) dy  2 dx (F .10) = β 2 M X i =1 ˆ a i a i − 1 M X j =1 ( a j − a j − 1 ) u ij ( x ) ! 2 dx. APPR O XIMA TE V ARIA TION AL ESTIMA TION FOR A MODEL OF NETWORK FORMA TION 51 By Jensen’ s inequality , we can also compute that 1 2 ˆ 1 0 ˆ 1 0 I ( h ( x, y )) dxdy = 1 2 M X i =1 ˆ a i a i − 1 " M X j =1 ˆ a j a j − 1 I ( h ( x, y )) dy # dx (F .11) = 1 2 M X i =1 ˆ a i a i − 1 " M X j =1 ( a j − a j − 1 ) 1 a j − a j − 1 ˆ a j a j − 1 I ( h ( x, y )) dy # dx ≥ 1 2 M X i =1 ˆ a i a i − 1 " M X j =1 ( a j − a j − 1 ) I 1 a j − a j − 1 ˆ a j a j − 1 h ( x, y ) dy !# dx = 1 2 M X i =1 ˆ a i a i − 1 M X j =1 ( a j − a j − 1 ) I ( u ij ( x )) dx Hence, by ( F .9 ), ( F .10 ), ( F .11 ), we get ψ ( α, β , 0) ≤ M X i =1 M X j =1 ( a j − a j − 1 ) ˆ a i a i − 1 α ij u ij ( x ) dx + β 2 M X i =1 ˆ a i a i − 1 M X j =1 ( a j − a j − 1 ) u ij ( x ) ! 2 dx − 1 2 M X i =1 ˆ a i a i − 1 M X j =1 ( a j − a j − 1 ) I ( u ij ( x )) dx ≤ M X i =1 ( a i − a i − 1 ) sup 0 ≤ u ij ≤ 1 1 ≤ j ≤ M  M X j =1 ( a j − a j − 1 ) α ij u ij + β 2 M X j =1 ( a j − a j − 1 ) u ij ! 2 − 1 2 M X j =1 ( a j − a j − 1 ) I ( u ij )  On the other hand, by restricting the supremum ov er the graphons h ( x, y ) (F .12) h ( x, y ) = u ij , if a i − 1 < x < a i and a j − 1 < y < a j , where 1 ≤ i, j ≤ M , 52 ANGELO MELE AND LINGJIONG ZHU where ( u ij ) 1 ≤ i,j ≤ M is a symmetric matrix of the constants, and optimize ov er all the possible values 0 ≤ u ij ≤ 1 , we get the lower bound: ψ ( α, β , 0) ≥ sup 0 ≤ u ij ≤ 1 u ij = u j i , 1 ≤ i,j ≤ M M X i =1 ( a i − a i − 1 )  M X j =1 ( a j − a j − 1 ) α ij u ij (F .13) + β 2 M X j =1 ( a j − a j − 1 ) u ij ! 2 − 1 2 M X j =1 ( a j − a j − 1 ) I ( u ij )  .  C A R E Y B U S I N E S S S C H O O L , J O H N S H O P K I N S U N I V E R S I T Y , 1 0 0 I N T E R N A T I O N A L D R , B A LT I M O R E , M D 2 1 2 0 2 D E PA RT M EN T O F M AT H E M A T I C S , F L O R I DA S T A T E U N I V E R S I T Y , 2 0 8 L O V E B U I L D I N G , 1 0 1 7 A C A D E M I C W AY , T A L LA H A S S E E , F L 3 2 3 0 6

Approximate Variational Estimation for a Model of Network Formation

Original Paper

Comments & Academic Discussion

Leave a Comment

Original Paper

Related Papers

Comments & Academic Discussion

Leave a Comment