Online Learning of Dynamic Parameters in Social Networks


Authors: Shahin Shahrampour, Alexander Rakhlin, Ali Jadbabaie

Shahin Shahrampour (1), Alexander Rakhlin (2), Ali Jadbabaie (1)
(1) Department of Electrical and Systems Engineering, (2) Department of Statistics
University of Pennsylvania, Philadelphia, PA 19104 USA
(1) {shahin,jadbabai}@seas.upenn.edu, (2) rakhlin@wharton.upenn.edu

Abstract

This paper addresses the problem of online learning in a dynamic setting. We consider a social network in which each individual observes a private signal about the underlying state of the world and communicates with her neighbors at each time period. Unlike many existing approaches, the underlying state is dynamic, and evolves according to a geometric random walk. We view the scenario as an optimization problem where agents aim to learn the true state while suffering the smallest possible loss. Based on the decomposition of the global loss function, we introduce two update mechanisms, each of which generates an estimate of the true state. We establish a tight bound on the rate of change of the underlying state, under which individuals can track the parameter with a bounded variance. Then, we characterize explicit expressions for the steady-state mean-square deviation (MSD) of the estimates from the truth, per individual. We observe that only one of the estimators recovers the optimal MSD, which underscores the impact of the objective function decomposition on the learning quality. Finally, we provide an upper bound on the regret of the proposed methods, measured as an average of errors in estimating the parameter in a finite time.

1 Introduction

In recent years, distributed estimation, learning, and prediction have attracted considerable attention in a wide variety of disciplines, with applications ranging from sensor networks to social and economic networks [1-6].
In this broad class of problems, agents aim to learn the true value of a parameter, often called the underlying state of the world. The state could represent a product, an opinion, a vote, or a quantity of interest in a sensor network. Each agent observes a private signal about the underlying state at each time period, and communicates with her neighbors to augment her imperfect observations. Despite the wealth of research in this area when the underlying state is fixed (see e.g. [1-3, 7]), the state is often subject to change over time (e.g., the price of stocks) [8-11]. Therefore, it is more realistic to study models which allow the parameter of interest to vary. In the non-distributed context, such models have been studied in the classical literature on time-series prediction, and, more recently, in the literature on online learning under relaxed assumptions about the nature of sequences [12]. In this paper we study the sequential prediction problem in the context of a social network with noisy feedback to agents. We consider a stochastic optimization framework to describe an online social learning problem in which the underlying state of the world varies over time. Our motivation for the current study is the results of [8] and [9], where the authors propose a social learning scheme in which the underlying state follows a simple random walk. However, unlike [8] and [9], we assume a geometric random walk evolution with an associated rate of change. This enables us to investigate the interplay of social learning, network structure, and the rate of state change, especially in the interesting case where the rate is greater than unity. We then pose social learning as an optimization problem in which individuals aim to suffer the smallest possible loss as they observe the stream of signals. Of particular relevance to this work is the work of Duchi et al.
in [13], where the authors develop a distributed method based on dual averaging of subgradients to converge to the optimal solution. In this paper, we restrict our attention to quadratic loss functions regularized by a quadratic proximal function, but there is no fixed optimal solution since the underlying state is dynamic. In this direction, the key observation is the decomposition of the global loss function into local loss functions. We consider two decompositions of the global objective, each of which gives rise to a single-consensus-step belief update mechanism. The first method incorporates the averaged prior beliefs among neighbors with the new private observation, while the second one takes into account the observations in the neighborhood as well. In both scenarios, we establish that the estimates are eventually unbiased, and we characterize an explicit expression for the mean-square deviation (MSD) of the beliefs from the truth, per individual. Interestingly, this quantity relies on the whole spectrum of the communication matrix, which exhibits the formidable role of the network structure in asymptotic learning. We observe that the estimators outperform the upper bound provided for the MSD in previous work [8]. Furthermore, only one of the two proposed estimators can compete with the centralized optimal Kalman filter [14] in certain circumstances. This fact underscores the dependence of optimality on the decomposition of the global loss function. We further highlight the influence of connectivity on learning by quantifying the ratio of the MSD for a complete versus a disconnected network. We see that this ratio is always less than unity, and it can get arbitrarily close to zero under some constraints.
Our next contribution is an upper bound on the regret of the proposed methods, defined as an average of errors in estimating the parameter up to a given time, minus the long-run expected loss due to noise and dynamics alone. This finite-time regret analysis is based on recently developed concentration inequalities for matrices, and it complements the asymptotic statements about the behavior of the MSD. Finally, we examine the trade-off between network sparsity and learning quality at a microscopic level. Under mild technical constraints, we see that losing each connection has a detrimental effect on learning, as it monotonically increases the MSD. On the other hand, capturing agents' communications with a graph, we introduce the notion of the optimal edge as the edge whose addition has the most effect on learning in the sense of MSD reduction. We prove that such a friendship is likely to occur between a pair of individuals with high self-reliance that have the fewest common neighbors.

2 Preliminaries

2.1 State and Observation Model

We consider a network consisting of a finite number of agents V = {1, 2, ..., N}. The agents, indexed by i ∈ V, seek the underlying state of the world, x_t ∈ R, which varies over time and evolves according to

    x_{t+1} = a x_t + r_t,                                          (1)

where r_t is a zero-mean innovation, independent over time with finite variance E[r_t^2] = σ_r^2, and a ∈ R is the expected rate of change of the state of the world, assumed to be available to all agents, and potentially greater than unity. We assume the initial state x_0 is a finite random variable drawn independently by nature.
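As a quick illustration (not part of the paper's analysis), the state model (1) can be simulated in a few lines; the rate a = 1.05, innovation variance, and horizon below are assumed values chosen only for the demo.

```python
import numpy as np

# Minimal sketch of the state evolution x_{t+1} = a * x_t + r_t of Eq. (1).
# The rate a, innovation variance sigma_r^2, and horizon T are illustrative
# assumptions, not values used in the paper.
rng = np.random.default_rng(0)
a, sigma_r, T = 1.05, np.sqrt(0.1), 200

x = np.empty(T + 1)
x[0] = rng.normal()                                  # finite random initial state x_0
for t in range(T):
    x[t + 1] = a * x[t] + sigma_r * rng.normal()     # geometric random walk step

print(x[:3])
```

With |a| > 1 the trajectory grows geometrically in expectation, which is exactly the regime that makes tracking with noisy observations hard.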
At each time period t, agent i receives a private signal y_{i,t} ∈ R, which is a noisy version of x_t, described by the linear equation

    y_{i,t} = x_t + w_{i,t},                                        (2)

where w_{i,t} is a zero-mean observation noise with finite variance E[w_{i,t}^2] = σ_w^2; it is assumed to be independent over time and agents, and uncorrelated with the innovation noise. Each agent i forms an estimate, or belief, about the true value of x_t at time t conforming to an update mechanism that will be discussed later. Much of the difficulty of this problem stems from the hardness of tracking a dynamic state with noisy observations, especially when |a| > 1; communication mitigates the difficulty by reducing the effective noise.

2.2 Communication Structure

Agents communicate with each other to update their beliefs about the underlying state of the world. The interaction between agents is captured by an undirected graph G = (V, E), where V is the set of agents, and if there is a link between agent i and agent j, then {i, j} ∈ E. We let N̄_i = {j ∈ V : {i, j} ∈ E} be the set of neighbors of agent i, and N_i = N̄_i ∪ {i}. Each agent i can only communicate with her neighbors, and assigns a weight p_ij > 0 to any j ∈ N̄_i. We also let p_ii ≥ 0 denote the self-reliance of agent i.

Assumption 1. The communication matrix P = [p_ij] is symmetric and doubly stochastic, i.e., it satisfies p_ij ≥ 0, p_ij = p_ji, and

    Σ_{j ∈ N_i} p_ij = Σ_{j=1}^N p_ij = 1.

We further assume the eigenvalues of P are in descending order and satisfy

    −1 < λ_N(P) ≤ ... ≤ λ_2(P) < λ_1(P) = 1.

2.3 Estimate Updates

The goal of the agents is to learn x_t in a collaborative manner by making sequential predictions.
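A matrix satisfying Assumption 1 is easy to construct and verify numerically; the sketch below uses Metropolis-Hastings weights on an illustrative 4-agent cycle, which is one standard construction but not one the paper prescribes.

```python
import numpy as np

# Sketch: build a symmetric, doubly stochastic communication matrix P for a
# 4-agent cycle via Metropolis-Hastings weights, then verify Assumption 1.
# The graph and weighting rule are illustrative assumptions.
N = 4
adj = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}   # cycle graph C_4

P = np.zeros((N, N))
for i, nbrs in adj.items():
    for j in nbrs:
        P[i, j] = 1.0 / (1 + max(len(adj[i]), len(adj[j])))  # p_ij = p_ji > 0
    P[i, i] = 1.0 - P[i].sum()                               # self-reliance p_ii

eig = np.sort(np.linalg.eigvalsh(P))[::-1]           # eigenvalues, descending
assert np.allclose(P, P.T)                           # symmetric
assert np.allclose(P.sum(axis=1), 1.0)               # doubly stochastic
assert np.isclose(eig[0], 1.0) and eig[-1] > -1.0    # lambda_1 = 1 > lambda_N > -1
print(eig)
```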
From an optimization perspective, this can be cast as a quest for online minimization of the separable, global, time-varying cost function

    min_{x̄ ∈ R} f_t(x̄) = (1/N) Σ_{i=1}^N [ f̂_{i,t}(x̄) ≜ (1/2) E[(y_{i,t} − x̄)^2] ]
                         = (1/N) Σ_{i=1}^N [ f̃_{i,t}(x̄) ≜ Σ_{j=1}^N p_ij f̂_{j,t}(x̄) ],        (3)

at each time period t. One approach to the stochastic learning problem formulated above is distributed dual averaging regularized by a quadratic proximal function [13]. To this end, if agent i exploits f̂_{i,t} as the local loss function, she updates her belief as

    x̂_{i,t+1} = a ( Σ_{j ∈ N_i} p_ij x̂_{j,t} + α (y_{i,t} − x̂_{i,t}) ),                        (4)

while using f̃_{i,t} as the local loss function results in the update

    x̃_{i,t+1} = a ( Σ_{j ∈ N_i} p_ij x̃_{j,t} + α (Σ_{j ∈ N_i} p_ij y_{j,t} − x̃_{i,t}) ).       (5)

In both updates, the first term is the consensus update and the second is the innovation update, where α ∈ (0, 1] is a constant step size that agents place on the innovation update; we refer to it as the signal weight. Equations (4) and (5) are distinct, single-consensus-step estimators differing in the choice of the local loss function: (4) uses only private observations, while (5) averages observations over the neighborhood. We analyze both classes of estimators, noting that one might expect (5) to perform better than (4) due to greater information availability. Note that the choice of a constant step size provides insight into the interplay between persistent innovation and the learning abilities of the network. We remark that agents can easily learn the fixed rate of change a by taking ratios of observations, and we assume that this has already been performed by the agents in the past. The case of a changing a is beyond the scope of the present paper.
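In stacked form the two updates read x̂_{t+1} = a(P x̂_t + α(y_t − x̂_t)) for (4) and x̃_{t+1} = a(P x̃_t + α(P y_t − x̃_t)) for (5), which makes them a few lines of code. The 3-agent communication matrix, noise levels, and horizon below are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

# Sketch of belief updates (4) and (5) in stacked (vector) form:
#   (4): xh <- a * (P @ xh + alpha * (y - xh))       private observation only
#   (5): xt <- a * (P @ xt + alpha * (P @ y - xt))   neighborhood-averaged signals
# All parameter values are illustrative assumptions.
rng = np.random.default_rng(1)
P = np.array([[0.50, 0.25, 0.25],
              [0.25, 0.50, 0.25],
              [0.25, 0.25, 0.50]])        # symmetric, doubly stochastic
a, alpha, sigma_r, sigma_w, T = 1.02, 0.5, 0.05, 0.3, 500
N = P.shape[0]

x = 1.0                                   # true state x_t
xh = np.zeros(N)                          # estimates under update (4)
xt = np.zeros(N)                          # estimates under update (5)
for _ in range(T):
    y = x + sigma_w * rng.standard_normal(N)         # private signals, Eq. (2)
    xh = a * (P @ xh + alpha * (y - xh))             # update (4)
    xt = a * (P @ xt + alpha * (P @ y - xt))         # update (5)
    x = a * x + sigma_r * rng.standard_normal()      # state evolution, Eq. (1)

print(np.abs(xh - x).mean(), np.abs(xt - x).mean())
```

Here ρ(Q) = ρ(a(P − αI)) ≈ 0.51 < 1, so both error processes remain bounded even though the state itself grows geometrically.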
We also point out that the real-valued (rather than vector-valued) nature of the state is a simplification that forms a clean playground for the study of the effects of social learning, effects of friendships, and other properties of the problem.

2.4 Error Process

Defining the local error processes ξ̂_{i,t} and ξ̃_{i,t} at time t for agent i as

    ξ̂_{i,t} ≜ x̂_{i,t} − x_t    and    ξ̃_{i,t} ≜ x̃_{i,t} − x_t,

and stacking the local errors in vectors ξ̂_t, ξ̃_t ∈ R^N, respectively, such that

    ξ̂_t ≜ [ξ̂_{1,t}, ..., ξ̂_{N,t}]^T    and    ξ̃_t ≜ [ξ̃_{1,t}, ..., ξ̃_{N,t}]^T,        (6)

one can show that the collective error processes above can be described as a linear dynamical system.

Lemma 2. Given Assumption 1, the collective error processes ξ̂_t and ξ̃_t defined in (6) satisfy

    ξ̂_{t+1} = Q ξ̂_t + ŝ_t    and    ξ̃_{t+1} = Q ξ̃_t + s̃_t,        (7)

respectively, where

    Q = a (P − α I_N),        (8)

and

    ŝ_t = (αa) [w_{1,t}, ..., w_{N,t}]^T − r_t 1_N    and    s̃_t = (αa) P [w_{1,t}, ..., w_{N,t}]^T − r_t 1_N,        (9)

with 1_N being the vector of all ones.

Throughout the paper, we let ρ(Q) denote the spectral radius of Q, which is equal to the largest singular value of Q due to symmetry.

3 Social Learning: Convergence of Beliefs and Regret Analysis

In this section, we study the behavior of estimators (4) and (5) in the mean and mean-square sense, and we provide the regret analysis. In the following proposition, we establish a tight bound on a, under which agents can achieve asymptotically unbiased estimates using a proper signal weight.

Proposition 3 (Unbiased Estimates). Given the network G with corresponding communication matrix P satisfying Assumption 1, the rate of change of the social network in (4) and (5) must respect the constraint

    |a| < 2 / (1 − λ_N(P)),

to allow agents to form asymptotically unbiased estimates of the underlying state.
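The bound of Proposition 3 is easy to check numerically: at the signal weight α* = (1 + λ_N(P))/2 (the optimizer derived in the supplementary proof), ρ(Q) = |a|(1 − λ_N(P))/2, which crosses 1 exactly at the stated threshold. The 3-agent P below is an illustrative assumption.

```python
import numpy as np

# Numerical check of Proposition 3 on an illustrative doubly stochastic P:
# with alpha* = (1 + lambda_N(P)) / 2, the error dynamics matrix
# Q = a (P - alpha* I) satisfies rho(Q) = |a| (1 - lambda_N(P)) / 2, so
# rho(Q) < 1 exactly when |a| < 2 / (1 - lambda_N(P)).
P = np.array([[0.6, 0.4, 0.0],
              [0.4, 0.2, 0.4],
              [0.0, 0.4, 0.6]])                 # symmetric, doubly stochastic
lam = np.sort(np.linalg.eigvalsh(P))            # ascending; lam[0] = lambda_N(P)
alpha_star = (1 + lam[0]) / 2
a_max = 2 / (1 - lam[0])                        # threshold from Proposition 3

def rho(a):
    """Spectral radius of Q = a (P - alpha* I_N)."""
    return np.abs(np.linalg.eigvalsh(a * (P - alpha_star * np.eye(3)))).max()

print(rho(0.99 * a_max), rho(1.01 * a_max))     # just below 1, just above 1
```

For this P, λ_N(P) = −0.2, so states changing at any rate |a| < 5/3 can be tracked with a suitable signal weight.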
Proposition 3 determines the trade-off between the rate of change and the network structure. In other words, as long as the state changes more slowly than the rate given in the statement of the proposition, individuals can always track x_t with bounded variance by selecting an appropriate signal weight. However, the proposition makes no statement about the learning quality. To capture that, we define the steady-state mean-square deviation (MSD) of the network from the truth as follows.

Definition 4 ((Steady-State) Mean-Square Deviation). Given the network G with a rate of change which allows unbiased estimation, the steady state of the error processes in (7) is defined as

    Σ̂ ≜ lim_{t→∞} E[ξ̂_t ξ̂_t^T]    and    Σ̃ ≜ lim_{t→∞} E[ξ̃_t ξ̃_t^T].

Hence, the (steady-state) mean-square deviation of the network is the deviation from the truth in the mean-square sense, per individual, and it is defined as

    MSD̂ ≜ (1/N) Tr(Σ̂)    and    MSD̃ ≜ (1/N) Tr(Σ̃).

Theorem 5 (MSD). Given the error processes (7) with ρ(Q) < 1, the steady-state MSD for (4) and (5) is a function of the communication matrix P and the signal weight α, as follows:

    MSD̂(P, α) = R_MSD(α) + Ŵ_MSD(P, α),
    MSD̃(P, α) = R_MSD(α) + W̃_MSD(P, α),        (10)

where

    R_MSD(α) ≜ σ_r^2 / (1 − a^2 (1 − α)^2),        (11)

and

    Ŵ_MSD(P, α) ≜ (1/N) Σ_{i=1}^N a^2 α^2 σ_w^2 / (1 − a^2 (λ_i(P) − α)^2),
    W̃_MSD(P, α) ≜ (1/N) Σ_{i=1}^N a^2 α^2 σ_w^2 λ_i^2(P) / (1 − a^2 (λ_i(P) − α)^2).        (12)

Theorem 5 shows that the steady-state MSD is governed by all eigenvalues of P, which contribute to the W_MSD term pertaining to the observation noise, while R_MSD is the penalty incurred due to the innovation noise. Moreover, (5) outperforms (4) due to richer information diffusion, which stresses the importance of the global loss function decomposition.
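Expressions (10)-(12) depend on P only through its spectrum, so both MSDs can be evaluated directly from the eigenvalues; since λ_i^2(P) ≤ 1, the expression for (5) never exceeds the one for (4) term by term. The parameter values below are illustrative assumptions.

```python
import numpy as np

# Sketch: evaluate the steady-state MSD expressions (10)-(12) of Theorem 5
# from the spectrum of P. Parameter values are illustrative assumptions.
def msd(P, a, alpha, sigma_r2, sigma_w2):
    lam = np.linalg.eigvalsh(P)
    assert (np.abs(a * (lam - alpha)) < 1).all()            # rho(Q) < 1
    r_msd = sigma_r2 / (1 - a**2 * (1 - alpha)**2)          # Eq. (11)
    denom = 1 - a**2 * (lam - alpha)**2
    w_hat = np.mean(a**2 * alpha**2 * sigma_w2 / denom)     # Eq. (12), estimator (4)
    w_til = np.mean(a**2 * alpha**2 * sigma_w2 * lam**2 / denom)  # estimator (5)
    return r_msd + w_hat, r_msd + w_til

P = np.array([[0.50, 0.25, 0.25],
              [0.25, 0.50, 0.25],
              [0.25, 0.25, 0.50]])
msd_hat, msd_til = msd(P, a=1.02, alpha=0.5, sigma_r2=0.05**2, sigma_w2=0.3**2)
print(msd_hat, msd_til)     # (5) is at most (4), since lambda_i(P)^2 <= 1
```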
One might conjecture that a complete network, where all individuals can communicate with each other, achieves a lower steady-state MSD in the learning process, since it provides the most information diffusion among all networks. This intuitive idea is discussed in the following corollary, alongside a few examples.

Corollary 6. Denote the complete, star, and cycle graphs on N vertices by K_N, S_N, and C_N, respectively, and their corresponding Laplacians by L_{K_N}, L_{S_N}, and L_{C_N}. Under the conditions of Theorem 5:

(a) For P = I − ((1 − α)/N) L_{K_N}, we have

    lim_{N→∞} MSD̂_{K_N} = R_MSD(α) + a^2 α^2 σ_w^2.        (13)

(b) For P = I − ((1 − α)/N) L_{S_N}, we have

    lim_{N→∞} MSD̂_{S_N} = R_MSD(α) + a^2 α^2 σ_w^2 / (1 − a^2 (1 − α)^2).        (14)

(c) For P = I − β L_{C_N}, where β must preserve unbiasedness, we have

    lim_{N→∞} MSD̂_{C_N} = R_MSD(α) + ∫_0^{2π} [ a^2 α^2 σ_w^2 / (1 − a^2 (1 − β(2 − 2cos τ) − α)^2) ] dτ/(2π).        (15)

(d) For P = I − (1/N) L_{K_N}, we have

    lim_{N→∞} MSD̃_{K_N} = R_MSD(α).        (16)

Proof. Noting that the spectra of L_{K_N}, L_{S_N}, and L_{C_N} are, respectively [15],

    {λ_N = 0, λ_{N−1} = N, ..., λ_1 = N},    {λ_N = 0, λ_{N−1} = 1, ..., λ_2 = 1, λ_1 = N},    and    {λ_i = 2 − 2cos(2πi/N)}_{i=0}^{N−1},

substituting each case into (10) and taking the limit over N, the proof follows immediately.

To study the effect of communication, let us consider the estimator (4). In view of Theorem 5 and Corollary 6, the ratio of the steady-state MSD for a complete network (13) versus a fully disconnected network (P = I_N) can be computed as

    lim_{N→∞} MSD̂_{K_N} / MSD̂_disconnected = (σ_r^2 + a^2 α^2 σ_w^2 (1 − a^2 (1 − α)^2)) / (σ_r^2 + a^2 α^2 σ_w^2) ≈ 1 − a^2 (1 − α)^2,

for σ_r^2 ≪ σ_w^2. The ratio above can get arbitrarily close to zero, which indeed highlights the influence of communication on the learning quality.
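This ratio is straightforward to verify numerically from Theorem 5, since P = I − ((1 − α)/N) L_{K_N} has spectrum {1, α, ..., α} and the disconnected network has P = I_N. The parameter values below are illustrative assumptions.

```python
import numpy as np

# Check of the complete-vs-disconnected MSD ratio for estimator (4): for
# large N and sigma_r^2 << sigma_w^2, it approaches 1 - a^2 (1 - alpha)^2.
# All parameter values are illustrative assumptions.
def msd_hat(lam, a, alpha, s_r2, s_w2):
    """Steady-state MSD of update (4), Eqs. (10)-(12) of Theorem 5."""
    r = s_r2 / (1 - a**2 * (1 - alpha)**2)
    return r + np.mean(a**2 * alpha**2 * s_w2 / (1 - a**2 * (lam - alpha)**2))

N, a, alpha = 200, 0.9, 0.6
s_r2, s_w2 = 1e-6, 1.0                                 # sigma_r^2 << sigma_w^2
lam_complete = np.array([1.0] + [alpha] * (N - 1))     # P = I - (1-alpha)/N * L_{K_N}
lam_disconnected = np.ones(N)                          # P = I_N

ratio = msd_hat(lam_complete, a, alpha, s_r2, s_w2) \
        / msd_hat(lam_disconnected, a, alpha, s_r2, s_w2)
print(ratio, 1 - a**2 * (1 - alpha)**2)                # close for large N
```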
We now consider the Kalman filter (KF) [14] as the optimal centralized counterpart of (5). It is well known that the steady-state KF satisfies a Riccati equation, and when the parameter of interest is scalar, the Riccati equation simplifies to a quadratic with the positive root

    Σ_KF = [ a^2 σ_w^2 − σ_w^2 + N σ_r^2 + sqrt( (a^2 σ_w^2 − σ_w^2 + N σ_r^2)^2 + 4 N σ_w^2 σ_r^2 ) ] / (2N).

Therefore, comparing with the complete graph (16), we have

    lim_{N→∞} Σ_KF = σ_r^2 ≤ σ_r^2 / (1 − a^2 (1 − α)^2),

and the upper bound can be made tight by choosing α = 1 when |a| < 1/|λ_N(P) − 1|. If |a| ≥ 1/|λ_N(P) − 1|, we must choose α < 1 to preserve unbiasedness as well.

On the other hand, to evaluate the performance of estimator (4), we consider the upper bound

    MSD_Bound = (σ_r^2 + α^2 σ_w^2) / α,        (17)

derived in [8] for a = 1 via a distributed estimation scheme. For simplicity, we assume σ_w^2 = σ_r^2 = σ^2, and let β in (15) be any diminishing function of N. Optimizing (13), (14), (15), and (17) over α, we obtain

    lim_{N→∞} MSD̂_{K_N} ≈ 1.55 σ^2 < lim_{N→∞} MSD̂_{S_N} = lim_{N→∞} MSD̂_{C_N} ≈ 1.62 σ^2 < MSD_Bound = 2 σ^2,

which suggests a noticeable improvement in learning even in the star and cycle networks, where the numbers of individuals and connections are of the same order.

Regret Analysis

We now turn to the finite-time regret analysis of our methods. The average loss of all agents in predicting the state, up until time T, is

    (1/T) Σ_{t=1}^T (1/N) Σ_{i=1}^N (x̂_{i,t} − x_t)^2 = (1/T) Σ_{t=1}^T (1/N) Tr(ξ̂_t ξ̂_t^T).

As motivated earlier, it is not possible, in general, to drive this average loss to zero, and we need to subtract off the limit. We thus define the regret as

    R_T ≜ (1/T) Σ_{t=1}^T (1/N) Tr(ξ̂_t ξ̂_t^T) − (1/T) Σ_{t=1}^T (1/N) Tr(Σ̂) = (1/N) Tr( (1/T) Σ_{t=1}^T ξ̂_t ξ̂_t^T − Σ̂ ),

where Σ̂ is from Definition 4.
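As a sanity check on the scalar steady-state KF expression Σ_KF given above, the sketch below verifies that the closed-form root satisfies the standard scalar Riccati fixed point with effective observation variance σ_w^2 / N (treating the N conditionally independent signals as one pooled observation is an assumption of this check, not a statement from the paper), and that Σ_KF approaches σ_r^2 as N grows.

```python
import numpy as np

# Sanity check: the closed-form root Sigma_KF should satisfy the scalar
# Riccati fixed point
#   Sigma = a^2 * Sigma * sR / (Sigma + sR) + sigma_r^2,   sR = sigma_w^2 / N,
# under the (assumed) reduction of N independent signals to one observation
# with effective noise variance sigma_w^2 / N.
def sigma_kf(a, sigma_w2, sigma_r2, N):
    b = (a**2 - 1) * sigma_w2 + N * sigma_r2
    return (b + np.sqrt(b**2 + 4 * N * sigma_w2 * sigma_r2)) / (2 * N)

a, sigma_w2, sigma_r2 = 1.1, 1.0, 0.2
for N in (1, 10, 1000):
    s = sigma_kf(a, sigma_w2, sigma_r2, N)
    sR = sigma_w2 / N
    resid = s - (a**2 * s * sR / (s + sR) + sigma_r2)
    print(N, s, resid)          # residual ~ 0; s approaches sigma_r^2 as N grows
```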
We then have, for the spectral norm ‖·‖, that

    R_T ≤ ‖ (1/T) Σ_{t=1}^T ξ_t ξ_t^T − Σ ‖,        (18)

where we drop the notation distinguishing the two estimators, since the analysis works for both. We first state a technical lemma from [16] that we invoke later for bounding the quantity R_T. For simplicity, we assume that the magnitudes of both the innovation and observation noise are bounded.

Lemma 7. Let {s_t}_{t=1}^T be an independent family of vector-valued random variables, and let H be a function that maps T variables to a self-adjoint matrix of dimension N. Consider a sequence {A_t}_{t=1}^T of fixed self-adjoint matrices that satisfy

    ( H(ω_1, ..., ω_t, ..., ω_T) − H(ω_1, ..., ω'_t, ..., ω_T) )^2 ⪯ A_t^2,

where ω_i and ω'_i range over all possible values of s_i for each index i. Letting Var = ‖ Σ_{t=1}^T A_t^2 ‖, for all c ≥ 0 we have

    P( ‖ H(s_1, ..., s_T) − E[H(s_1, ..., s_T)] ‖ ≥ c ) ≤ N e^{−c^2 / (8 Var)}.

Theorem 8. Under the conditions of Theorem 5, together with boundedness of the noise, max_{t≤T} ‖s_t‖ ≤ s for some s > 0, the regret function defined in (18) satisfies

    R_T ≤ (1/T) [ ‖ξ_0‖^2 / (1 − ρ^2(Q)) ] + (1/T) [ 2 s ‖ξ_0‖ / (1 − ρ(Q))^2 ] + (1/T) [ s^2 / (1 − ρ^2(Q))^2 ]
          + (1/√T) [ 8 s^2 sqrt(2 log(N/δ)) / (1 − ρ(Q))^2 ],        (19)

with probability at least 1 − δ.

We mention that results similar in spirit have been studied for general unbounded stationary ergodic time series in [17-19] by employing techniques from the online learning literature. On the other hand, our problem has the network structure and the specific evolution of the hidden state, not present in the above works.

4 The Impact of New Friendships on Social Learning

In the social learning model we proposed, agents are cooperative and aim to accomplish a global objective. In this direction, the network structure contributes substantially to the learning process.
In this section, we restrict our attention to estimator (5), and characterize the intuitive idea that making (losing) friendships can influence the quality of learning by decreasing (increasing) the steady-state MSD of the network. To commence, letting e_i denote the i-th unit vector in the standard basis of R^N, we exploit the negative semi-definite edge function matrix

    ΔP(i, j) ≜ −(e_i − e_j)(e_i − e_j)^T,        (20)

for edge addition to (removal from) the graph. Essentially, if there is no connection between agents i and j,

    P_ε ≜ P + ε ΔP(i, j),        (21)

for ε < min{p_ii, p_jj}, corresponds to a new communication matrix that adds the edge {i, j} with weight ε to the network G, subtracting ε from the self-reliance of agents i and j.

Proposition 9. Let G⁻ be the network resulting from removing the bidirectional edge {i, j} with weight ε from the network G, and let P_ε⁻ and P denote the communication matrices associated with G⁻ and G, respectively. Given Assumption 1, for a fixed signal weight α the following relationship holds:

    MSD̃(P, α) ≤ MSD̃(P_ε⁻, α),        (22)

as long as P is positive semi-definite and |a| < 1/|α|.

Under this mild technical assumption, Proposition 9 suggests that losing connections monotonically increases the MSD, and individuals tend to maintain their friendships to obtain a lower MSD as a global objective. However, this does not elaborate on the existence of individuals with whom losing or making connections could have an immense impact on learning. We bring this concept to light in the following proposition by finding a so-called optimal edge, which provides the most MSD reduction when added to the network graph.

Proposition 10.
Given Assumption 1, a positive semi-definite P, and |a| < 1/|α|, to find the optimal edge with a pre-assigned weight ε ≪ 1 to add to the network G, we need to solve the following optimization problem:

    min_{{i,j} ∉ E} Σ_{k=1}^N [ h_k(i, j) ≜ z_k(i, j) ( 2(1 − α^2 a^2) λ_k(P) + 2 a^2 α λ_k^2(P) ) / ( 1 − a^2 (λ_k(P) − α)^2 )^2 ],        (23)

where

    z_k(i, j) ≜ ( v_k^T ΔP(i, j) v_k ) ε,        (24)

and {v_k}_{k=1}^N are the right eigenvectors of P. In addition, letting ζ_max = max_{k>1} |λ_k(P) − α|,

    min_{{i,j} ∉ E} Σ_{k=1}^N h_k(i, j) ≥ min_{{i,j} ∉ E} −2ε [ (1 − α^2 a^2)(p_ii + p_jj) + a^2 α ([P^2]_ii + [P^2]_jj − 2[P^2]_ij) ] / ( 1 − a^2 ζ_max^2 )^2.        (25)

Proof. Representing the first-order approximation of λ_k(P_ε) using the definition of z_k(i, j) in (24), we have λ_k(P_ε) ≈ λ_k(P) + z_k(i, j) for ε ≪ 1. Based on Theorem 5, we now derive

    MSD̃(P_ε, α) − MSD̃(P, α)
    ∝ Σ_{k=1}^N ( λ_k(P_ε) − λ_k(P) ) [ (1 − α^2 a^2)(λ_k(P_ε) + λ_k(P)) + 2 a^2 α λ_k(P) λ_k(P_ε) ] / [ (1 − a^2 (λ_k(P) − α)^2)(1 − a^2 (λ_k(P_ε) − α)^2) ]
    ≈ Σ_{k=1}^N z_k(i, j) [ 2(1 − α^2 a^2) λ_k(P) + 2 a^2 α λ_k^2(P) + (1 − α^2 a^2 + 2 a^2 α λ_k(P)) z_k(i, j) ] / [ (1 − a^2 (λ_k(P) − α)^2)(1 − a^2 (λ_k(P) − α + z_k(i, j))^2) ]
    = Σ_{k=1}^N z_k(i, j) [ 2(1 − α^2 a^2) λ_k(P) + 2 a^2 α λ_k^2(P) ] / ( 1 − a^2 (λ_k(P) − α)^2 )^2 + O(ε^2),

noting that z_k(i, j) is O(ε) from the definition (24). Minimizing MSD̃(P_ε, α) − MSD̃(P, α) is, hence, equivalent to the optimization (23) when ε ≪ 1.
Taking into account that P is positive semi-definite, that z_k(i, j) ≤ 0 for k ≥ 2, and that v_1 = 1_N/√N, which implies z_1(i, j) = 0, we proceed to the lower-bound proof using the definitions of h_k(i, j) and ζ_max in the statement of the proposition, as follows:

    Σ_{k=1}^N h_k(i, j) = Σ_{k=2}^N z_k(i, j) [ 2(1 − α^2 a^2) λ_k(P) + 2 a^2 α λ_k^2(P) ] / ( 1 − a^2 (λ_k(P) − α)^2 )^2
                        ≥ [ 1 / (1 − a^2 ζ_max^2)^2 ] Σ_{k=2}^N z_k(i, j) [ 2(1 − α^2 a^2) λ_k(P) + 2 a^2 α λ_k^2(P) ].

Substituting z_k(i, j) from (24) into the above, we have

    Σ_{k=1}^N h_k(i, j) ≥ [ 2ε / (1 − a^2 ζ_max^2)^2 ] Σ_{k=1}^N ( v_k^T ΔP(i, j) v_k ) [ (1 − α^2 a^2) λ_k(P) + a^2 α λ_k^2(P) ]
                        = [ 2ε / (1 − a^2 ζ_max^2)^2 ] Tr( ΔP(i, j) Σ_{k=1}^N [ (1 − α^2 a^2) λ_k(P) + a^2 α λ_k^2(P) ] v_k v_k^T )
                        = [ 2ε / (1 − a^2 ζ_max^2)^2 ] Tr( ΔP(i, j) [ (1 − α^2 a^2) P + a^2 α P^2 ] ).

Using the facts that Tr(ΔP(i, j) P) = −p_ii − p_jj + 2p_ij and Tr(ΔP(i, j) P^2) = −[P^2]_ii − [P^2]_jj + 2[P^2]_ij, according to the definition of ΔP(i, j) in (20), and that p_ij = 0 since we are adding a non-existent edge {i, j}, the lower bound (25) is derived.

Besides posing the optimal edge problem as an optimization, Proposition 10 also provides an upper bound on the best improvement that making a friendship brings to the network. In view of (25), forming a connection between two agents with more self-reliance and fewer common neighbors minimizes the lower bound, which offers the most room for MSD reduction.

5 Conclusion

We studied a distributed online learning problem over a social network. The goal of the agents is to estimate the underlying state of the world, which follows a geometric random walk. Each individual receives a noisy signal about the underlying state at each time period, so she communicates with her neighbors to recover the true state.
We viewed the problem through an optimization lens, where agents want to minimize a global loss function in a collaborative manner. To estimate the true state, we proposed two methodologies derived from different decompositions of the global objective. Given the structure of the network, we provided a tight upper bound on the rate of change of the parameter which allows agents to follow the state with bounded variance. Moreover, we computed the averaged, steady-state, mean-square deviation of the estimates from the true state. The key observation was the optimality of one of the estimators, indicating the dependence of the learning quality on the decomposition. Furthermore, defining the regret as the average of errors in the process of learning during a finite time T, we demonstrated that the regret function of the proposed algorithms decays at a rate O(1/√T). Finally, under mild technical assumptions, we characterized the influence of the network pattern on learning by observing that each connection brings a monotonic decrease in the MSD.

Acknowledgments

We gratefully acknowledge the support of AFOSR MURI CHASE, the ONR BRC Program on Decentralized, Online Optimization, NSF grants CAREER DMS-0954737 and CCF-1116928, as well as the Dean's Research Fund.

References

[1] M. H. DeGroot, "Reaching a consensus," Journal of the American Statistical Association, vol. 69, no. 345, pp. 118-121, 1974.
[2] A. Jadbabaie, P. Molavi, A. Sandroni, and A. Tahbaz-Salehi, "Non-Bayesian social learning," Games and Economic Behavior, vol. 76, no. 1, pp. 210-225, 2012.
[3] E. Mossel and O. Tamuz, "Efficient Bayesian learning in social networks with Gaussian estimators," arXiv preprint arXiv:1002.0747, 2010.
[4] O. Dekel, R. Gilad-Bachrach, O. Shamir, and L.
Xiao, "Optimal distributed online prediction using mini-batches," The Journal of Machine Learning Research, vol. 13, pp. 165-202, 2012.
[5] L. Xiao, S. Boyd, and S. Lall, "A scheme for robust distributed sensor fusion based on average consensus," in Fourth International Symposium on Information Processing in Sensor Networks. IEEE, 2005, pp. 63-70.
[6] S. Kar, J. M. Moura, and K. Ramanan, "Distributed parameter estimation in sensor networks: Nonlinear observation models and imperfect communication," IEEE Transactions on Information Theory, vol. 58, no. 6, pp. 3575-3605, 2012.
[7] S. Shahrampour and A. Jadbabaie, "Exponentially fast parameter estimation in networks using distributed dual averaging," arXiv preprint arXiv:1309.2350, 2013.
[8] D. Acemoglu, A. Nedic, and A. Ozdaglar, "Convergence of rule-of-thumb learning rules in social networks," in 47th IEEE Conference on Decision and Control, 2008, pp. 1714-1720.
[9] R. M. Frongillo, G. Schoenebeck, and O. Tamuz, "Social learning in a changing world," in Internet and Network Economics. Springer, 2011, pp. 146-157.
[10] U. A. Khan, S. Kar, A. Jadbabaie, and J. M. Moura, "On connectivity, observability, and stability in distributed estimation," in 49th IEEE Conference on Decision and Control, 2010, pp. 6639-6644.
[11] R. Olfati-Saber, "Distributed Kalman filtering for sensor networks," in 46th IEEE Conference on Decision and Control, 2007, pp. 5492-5498.
[12] N. Cesa-Bianchi and G. Lugosi, Prediction, Learning, and Games. Cambridge University Press, 2006.
[13] J. C. Duchi, A. Agarwal, and M. J. Wainwright, "Dual averaging for distributed optimization: convergence analysis and network scaling," IEEE Transactions on Automatic Control, vol. 57, no. 3, pp. 592-606, 2012.
[14] R. E. Kalman et al., "A new approach to linear filtering and prediction problems," Journal of Basic Engineering, vol. 82, no. 1, pp.
35-45, 1960.
[15] M. Mesbahi and M. Egerstedt, Graph Theoretic Methods in Multiagent Networks. Princeton University Press, 2010.
[16] J. A. Tropp, "User-friendly tail bounds for sums of random matrices," Foundations of Computational Mathematics, vol. 12, no. 4, pp. 389-434, 2012.
[17] G. Biau, K. Bleakley, L. Györfi, and G. Ottucsák, "Nonparametric sequential prediction of time series," Journal of Nonparametric Statistics, vol. 22, no. 3, pp. 297-317, 2010.
[18] L. Györfi and G. Ottucsák, "Sequential prediction of unbounded stationary time series," IEEE Transactions on Information Theory, vol. 53, no. 5, pp. 1866-1872, 2007.
[19] L. Györfi, G. Lugosi et al., Strategies for Sequential Prediction of Stationary Time Series. Springer, 2000.

Supplementary Material

Proof of Lemma 2. We subtract (1) from (4) to get

    x̂_{i,t+1} − x_{t+1} = a ( Σ_{j ∈ N_i} p_ij x̂_{j,t} − x_t + α (y_{i,t} − x̂_{i,t}) ) − r_t
                         = a ( Σ_{j ∈ N_i} p_ij (x̂_{j,t} − x_t) + α (y_{i,t} − x̂_{i,t}) ) − r_t,

where we used Assumption 1 in the latter step. Replacing y_{i,t} from (2) above, and simplifying using the definition of ξ̂_{i,t}, yields

    ξ̂_{i,t+1} = a ( Σ_{j ∈ N_i} p_ij ξ̂_{j,t} + α (y_{i,t} − x̂_{i,t}) ) − r_t
               = a ( Σ_{j ∈ N_i} p_ij ξ̂_{j,t} − α ξ̂_{i,t} ) + (aα) w_{i,t} − r_t.

Using definition (6) to write the above in matrix form completes the proof for ξ̂_t. The proof for ξ̃_t follows precisely in the same fashion.

Proof of Proposition 3. We start from the fact that the innovation and observation noise are zero mean, so (7) implies E[ξ̂_{t+1}] = Q E[ξ̂_t] and E[ξ̃_{t+1}] = Q E[ξ̃_t]. Therefore, for mean stability of the linear equations, the spectral radius of Q must be less than unity. Considering the expression for Q from (8), for a fixed α we must have

    |a| < 1/ρ(P − α I_N) = 1 / max{1 − α, |α − λ_N(P)|}.
To maximize the right-hand side over $\alpha$, we need to solve the min-max problem
$$\min_{\alpha}\Big(\max\{1-\alpha,\ |\alpha - \lambda_N(P)|\}\Big).$$
Noting that $1-\alpha$ and $\alpha - \lambda_N(P)$ are straight lines with negative and positive slopes, respectively, the minimum occurs at the intersection of the two lines. Evaluating the right-hand side of (26) at the intersection point $\alpha^* = \frac{1+\lambda_N(P)}{2}$ completes the proof.

Proof of Theorem 5. We present the proof for $\widetilde{\mathrm{MSD}}(P,\alpha)$ by observing from (7) that
$$\mathbb{E}[\tilde{\xi}_{t+1}\tilde{\xi}_{t+1}^T] = Q\,\mathbb{E}[\tilde{\xi}_t\tilde{\xi}_t^T]\,Q^T + \mathbb{E}[\tilde{s}_t\tilde{s}_t^T],$$
since the innovation and observation noise are zero mean and uncorrelated. Therefore, letting $\tilde{S} = \mathbb{E}[\tilde{s}_t\tilde{s}_t^T]$, and since $\rho(Q) < 1$ by hypothesis, the steady state satisfies the Lyapunov equation
$$\tilde{\Sigma} = Q\tilde{\Sigma}Q^T + \tilde{S}.$$
Let $Q = Q^T = U\Lambda U^T$ denote the eigendecomposition of $Q$, and let $u_i$ denote the $i$-th eigenvector of $Q$ corresponding to eigenvalue $\lambda_i$. Under stability of $Q$, the solution of the Lyapunov equation is
$$\tilde{\Sigma} = \sum_{\tau=0}^{\infty} Q^\tau \tilde{S} Q^\tau = \sum_{\tau=0}^{\infty}\sum_{i=1}^{N}\sum_{j=1}^{N} \lambda_i^\tau u_i u_i^T\, \tilde{S}\, \lambda_j^\tau u_j u_j^T = \sum_{i=1}^{N}\sum_{j=1}^{N} u_i u_i^T \tilde{S} u_j u_j^T \sum_{\tau=0}^{\infty}\lambda_i^\tau\lambda_j^\tau = \sum_{i=1}^{N}\sum_{j=1}^{N} \frac{u_i u_i^T \tilde{S} u_j u_j^T}{1 - \lambda_i\lambda_j}.$$
Therefore, the $\widetilde{\mathrm{MSD}}$ defined in Definition 4 can be computed as
$$\widetilde{\mathrm{MSD}} = \frac{1}{N}\mathrm{Tr}\Big(\sum_{i=1}^{N}\sum_{j=1}^{N} \frac{u_i u_i^T \tilde{S} u_j u_j^T}{1-\lambda_i\lambda_j}\Big) = \frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{N} \frac{(u_j^T u_i)(u_i^T \tilde{S} u_j)}{1-\lambda_i\lambda_j} = \frac{1}{N}\sum_{i=1}^{N} \frac{u_i^T \tilde{S} u_i}{1-\lambda_i^2},$$
where we used the fact that $u_j^T u_i = 0$ for $i \neq j$, and $u_i^T u_i = 1$ for any $i \in \mathcal{V}$.
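The series solution of the Lyapunov equation and its eigendecomposition form can be sanity-checked numerically. The sketch below (not part of the original proof; the symmetric stable $Q$ and covariance $S$ are synthetic stand-ins) verifies that the truncated series satisfies $\Sigma = Q\Sigma Q^T + S$ and that $\frac{1}{N}\sum_i u_i^T S u_i/(1-\lambda_i^2)$ equals $\mathrm{Tr}(\Sigma)/N$:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 5

# Synthetic symmetric Q, rescaled so that rho(Q) = 1/2 < 1.
A = rng.standard_normal((N, N))
Q = (A + A.T) / 2
Q /= 2 * np.abs(np.linalg.eigvalsh(Q)).max()

# Arbitrary positive-semidefinite noise covariance S.
B = rng.standard_normal((N, N))
S = B @ B.T

# Truncated series sum_tau Q^tau S Q^tau; it should solve Sigma = Q Sigma Q^T + S.
Sigma = np.zeros((N, N))
Qp = np.eye(N)
for _ in range(200):
    Sigma += Qp @ S @ Qp.T
    Qp = Q @ Qp
assert np.allclose(Sigma, Q @ Sigma @ Q.T + S)

# Eigendecomposition form: (1/N) sum_i u_i^T S u_i / (1 - lambda_i^2).
lam, U = np.linalg.eigh(Q)
msd_eig = sum(U[:, i] @ S @ U[:, i] / (1 - lam[i] ** 2) for i in range(N)) / N
assert np.isclose(msd_eig, np.trace(Sigma) / N)
```

With $\rho(Q) = 1/2$, two hundred terms of the series are far more than enough for the truncation error to vanish at machine precision.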
Taking into account that $Q = a(P - \alpha I_N)$ and $\tilde{S} = \sigma_r^2(\mathbf{1}_N\mathbf{1}_N^T) + (a^2\alpha^2\sigma_w^2)P^2$, we derive
$$\widetilde{\mathrm{MSD}} = \frac{1}{N}\sum_{i=1}^{N} \frac{u_i^T\big(\sigma_r^2(\mathbf{1}_N\mathbf{1}_N^T) + (a^2\alpha^2\sigma_w^2)P^2\big)u_i}{1-\lambda_i^2} = \frac{1}{N}\sum_{i=1}^{N} \frac{(u_i^T\mathbf{1}_N)^2\sigma_r^2}{1-\lambda_i^2} + \frac{1}{N}\sum_{i=1}^{N} \frac{a^2\alpha^2\sigma_w^2\lambda_i^2(P)}{1-\lambda_i^2} = \frac{\sigma_r^2}{1-a^2(1-\alpha)^2} + \frac{1}{N}\sum_{i=1}^{N} \frac{a^2\alpha^2\sigma_w^2\lambda_i^2(P)}{1-a^2(\lambda_i(P)-\alpha)^2},$$
where the last step is due to the facts that $\lambda_i = a(\lambda_i(P)-\alpha)$ and that $\mathbf{1}_N/\sqrt{N}$ is an eigenvector of $Q$ with corresponding eigenvalue $a(1-\alpha)$, so it is orthogonal to the other eigenvectors, i.e., $u_i^T\mathbf{1}_N = 0$ for $u_i \neq \mathbf{1}_N/\sqrt{N}$. The proof for $\widehat{\mathrm{MSD}}$ follows in the same fashion.

Proof of Theorem 8. The closed-form solution of the error process (7) is
$$\xi_{t+1} = Q^{t+1}\xi_0 + \sum_{\tau=0}^{t} Q^{t-\tau}s_\tau,$$
which implies
$$\xi_{t+1}\xi_{t+1}^T = Q^{t+1}\xi_0\xi_0^T Q^{t+1} + Q^{t+1}\xi_0\Big(\sum_{\tau=0}^{t} Q^{t-\tau}s_\tau\Big)^T + \Big(\sum_{\tau=0}^{t} Q^{t-\tau}s_\tau\Big)\xi_0^T Q^{t+1} + \Big(\sum_{\tau=0}^{t} Q^{t-\tau}s_\tau\Big)\Big(\sum_{\tau=0}^{t} Q^{t-\tau}s_\tau\Big)^T, \quad (27)$$
since $Q$ is symmetric. One can see that
$$\Big\|\frac{1}{T}\sum_{t=1}^{T} Q^t\xi_0\xi_0^T Q^t\Big\| \le \frac{1}{T}\Big(\frac{\|\xi_0\|^2}{1-\rho^2(Q)}\Big),$$
and
$$\Big\|\frac{1}{T}\sum_{t=0}^{T-1} Q^{t+1}\xi_0\Big(\sum_{\tau=0}^{t} Q^{t-\tau}s_\tau\Big)^T\Big\| \le \frac{\|\xi_0\|\,s}{T}\sum_{t=0}^{T-1}\rho(Q)^{t+1}\sum_{\tau=0}^{t}\rho(Q)^{t-\tau} \le \frac{1}{T}\Big(\frac{s\,\|\xi_0\|}{(1-\rho(Q))^2}\Big).$$
On the other hand, as we saw in the proof of Theorem 5, letting $S = \mathbb{E}[s_\tau s_\tau^T]$, we have $\Sigma = \sum_{\tau=0}^{\infty} Q^\tau S Q^\tau$. Based on definition (18), equation (27), and the bounds above, we derive
$$R(T) \le \frac{1}{T}\Big(\frac{\|\xi_0\|^2}{1-\rho^2(Q)}\Big) + \frac{1}{T}\Big(\frac{2s\,\|\xi_0\|}{(1-\rho(Q))^2}\Big) + \frac{1}{T}\Big\|\sum_{t=0}^{T-1}\Big[\Big(\sum_{\tau=0}^{t} Q^{t-\tau}s_\tau\Big)\Big(\sum_{\tau=0}^{t} Q^{t-\tau}s_\tau\Big)^T - \sum_{\tau=0}^{t} Q^\tau S Q^\tau\Big]\Big\| + \Big\|\frac{1}{T}\sum_{t=0}^{T-1}\sum_{\tau=t+1}^{\infty} Q^\tau S Q^\tau\Big\|. \quad (28)$$
Let
$$H(s_0,\ldots,s_{T-1}) = \sum_{t=0}^{T-1}\Big(\sum_{\tau=0}^{t} Q^{t-\tau}s_\tau\Big)\Big(\sum_{\tau=0}^{t} Q^{t-\tau}s_\tau\Big)^T,$$
and observe that $\mathbb{E}[H(s_0,\ldots,s_{T-1})] = \sum_{t=0}^{T-1}\sum_{\tau=0}^{t} Q^\tau S Q^\tau$.
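The closed-form solution $\xi_{t+1} = Q^{t+1}\xi_0 + \sum_{\tau=0}^{t} Q^{t-\tau}s_\tau$ used above is just the unrolled linear recursion $\xi_{t+1} = Q\xi_t + s_t$, which can be confirmed numerically (a sketch with a synthetic stable $Q$, initial condition, and noise sequence; all values arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
N, T = 4, 30

# Synthetic Q with rho(Q) = 1/2, arbitrary xi_0 and noise sequence s_0..s_{T-1}.
Q = rng.standard_normal((N, N))
Q /= 2 * np.abs(np.linalg.eigvals(Q)).max()
xi0 = rng.standard_normal(N)
s = rng.standard_normal((T, N))

# Run the recursion xi_{t+1} = Q xi_t + s_t directly for T steps.
xi = xi0.copy()
for t in range(T):
    xi = Q @ xi + s[t]

# Closed form: xi_T = Q^T xi_0 + sum_{tau=0}^{T-1} Q^{T-1-tau} s_tau.
closed = np.linalg.matrix_power(Q, T) @ xi0 + sum(
    np.linalg.matrix_power(Q, T - 1 - tau) @ s[tau] for tau in range(T)
)
assert np.allclose(xi, closed)
```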
It can be verified that for any $0 \le t < T$,
$$\big\|H(s_0,\ldots,s_t,\ldots,s_{T-1}) - H(s_0,\ldots,s'_t,\ldots,s_{T-1})\big\| \le \frac{4s^2}{(1-\rho(Q))^2}.$$
Thus, letting $\mathrm{Var} = \frac{16\,T s^4}{(1-\rho(Q))^4}$, and appealing to Lemma 7, we get
$$\mathbb{P}\bigg(\Big\|\sum_{t=0}^{T-1}\Big[\Big(\sum_{\tau=0}^{t} Q^{t-\tau}s_\tau\Big)\Big(\sum_{\tau=0}^{t} Q^{t-\tau}s_\tau\Big)^T - \sum_{\tau=0}^{t} Q^\tau S Q^\tau\Big]\Big\| \ge c\bigg) \le N e^{-c^2/8\mathrm{Var}}.$$
Setting the probability above equal to $\delta$, this implies that with probability at least $1-\delta$, we have
$$\frac{1}{T}\Big\|\sum_{t=0}^{T-1}\Big[\Big(\sum_{\tau=0}^{t} Q^{t-\tau}s_\tau\Big)\Big(\sum_{\tau=0}^{t} Q^{t-\tau}s_\tau\Big)^T - \sum_{\tau=0}^{t} Q^\tau S Q^\tau\Big]\Big\| \le \frac{1}{\sqrt{T}}\,\frac{8s^2\sqrt{2\log\frac{N}{\delta}}}{(1-\rho(Q))^2}.$$
Moreover, we evidently have
$$\Big\|\frac{1}{T}\sum_{t=0}^{T-1}\sum_{\tau=t+1}^{\infty} Q^\tau S Q^\tau\Big\| \le \frac{1}{T}\Big(\frac{s^2}{(1-\rho^2(Q))^2}\Big).$$
Plugging the two bounds above into (28) completes the proof.

Proof of Proposition 9. Considering the expression for MSD in Theorem 5, we have
$$\widetilde{\mathrm{MSD}}(P,\alpha) - \widetilde{\mathrm{MSD}}(P_\epsilon,\alpha) = \widetilde{W}_{\mathrm{MSD}}(P,\alpha) - \widetilde{W}_{\mathrm{MSD}}(P_\epsilon,\alpha) \propto \sum_{i=1}^{N} \frac{\big(\lambda_i(P)-\lambda_i(P_\epsilon)\big)\Big[(1-\alpha^2 a^2)\big(\lambda_i(P_\epsilon)+\lambda_i(P)\big) + 2a^2\alpha\,\lambda_i(P)\lambda_i(P_\epsilon)\Big]}{\big(1-a^2(\lambda_i(P)-\alpha)^2\big)\big(1-a^2(\lambda_i(P_\epsilon)-\alpha)^2\big)}.$$
Based on definitions (20) and (21), it follows from Weyl's eigenvalue inequality that
$$\lambda_k(P) - \lambda_k(P_\epsilon) \le \lambda_1\big(\epsilon\,\Delta_P(i,j)\big) = 0,$$
for any $k \in \mathcal{V}$. Combined with the assumptions $P \ge 0$ and $|a\alpha| < 1$, this implies that the numerator of the expression above is always non-positive. The denominator is always positive due to the stability of the error process $\tilde{\xi}_t$ in (7), and hence, $\widetilde{\mathrm{MSD}}(P,\alpha) \le \widetilde{\mathrm{MSD}}(P_\epsilon,\alpha)$.
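The monotonicity conclusion can be illustrated numerically with the explicit MSD formula from Theorem 5. The sketch below is not from the paper: the network is a lazy random walk on a cycle (symmetric, doubly stochastic, positive semidefinite), the perturbed matrix `P_eps` weakens one edge while preserving stochasticity, and the values of `a`, `alpha`, and the noise variances are arbitrary choices satisfying the stability conditions:

```python
import numpy as np

def msd(P, a, alpha, sr2=1.0, sw2=1.0):
    """Steady-state MSD via the eigenvalue formula of Theorem 5."""
    lam = np.linalg.eigvalsh(P)
    net = sr2 / (1 - a**2 * (1 - alpha) ** 2)
    obs = np.mean(a**2 * alpha**2 * sw2 * lam**2 / (1 - a**2 * (lam - alpha) ** 2))
    return net + obs

N = 6
# Lazy random walk on a cycle: P = 0.8 I + 0.1 (shift + shift^T).
P = 0.8 * np.eye(N)
for i in range(N):
    P[i, (i + 1) % N] = P[i, (i - 1) % N] = 0.1

# Weaken edge (0, 1): P_eps = P + eps (e_0 - e_1)(e_0 - e_1)^T stays doubly stochastic.
e = np.zeros(N)
e[0], e[1] = 1.0, -1.0
P_eps = P + 0.05 * np.outer(e, e)

a, alpha = 1.2, 0.5  # |a| < 1/rho(P - alpha I) holds for this spectrum

# Weyl: the perturbation is PSD, so every eigenvalue can only increase...
assert np.all(np.linalg.eigvalsh(P) <= np.linalg.eigvalsh(P_eps) + 1e-12)
# ...and therefore the MSD can only increase when the edge is weakened.
assert msd(P, a, alpha) <= msd(P_eps, a, alpha)
```

This matches the proposition's message: reducing connectivity raises the eigenvalues of the combination matrix and thereby degrades (increases) the steady-state mean-square deviation.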
