Exponentially Fast Parameter Estimation in Networks Using Distributed Dual Averaging


Authors: Shahin Shahrampour, Ali Jadbabaie

Abstract: In this paper we present an optimization-based view of distributed parameter estimation and observational social learning in networks. Agents receive a sequence of random, independent and identically distributed (i.i.d.) signals, each of which individually may not be informative about the underlying true state, but the signals together are globally informative enough to make the true state identifiable. Using an optimization-based characterization of Bayesian learning as proximal stochastic gradient descent (with Kullback-Leibler divergence from a prior as a proximal function), we show how to efficiently use a distributed, online variant of Nesterov's dual averaging method to solve the estimation problem with purely local information. When the true state is globally identifiable and the network is connected, we prove that agents eventually learn the true parameter using a randomized gossip scheme. We demonstrate that with high probability the convergence is exponentially fast, with a rate dependent on the KL divergence of observations under the true state from observations under the second likeliest state. Furthermore, our work also highlights the possibility of learning under continuous adaptation of the network, which is a consequence of employing a constant, unit stepsize for the algorithm.

† This work was supported by AFOSR MURI CHASE and the ONR BRC Program on Decentralized, Online Optimization.
‡ The authors are with the Department of Electrical and Systems Engineering and the General Robotics, Automation, Sensing and Perception (GRASP) Laboratory, University of Pennsylvania, Philadelphia, PA 19104-6228 USA. {shahin,jadbabai}@seas.upenn.edu

I. INTRODUCTION

Distributed estimation, detection, and observational social learning have been an intense focus of research over the past three decades [1]–[9], with applications ranging from sensor networks to social and economic networks. In these scenarios, agents in a network need to learn the value of a parameter, which might represent a state or decision (often called the state of the world), but each individual agent lacks the information necessary to estimate the state on its own. Instead, the global spread of information in the network provides agents with adequate data for recovering the true state, and as a result agents iteratively exchange information with their neighbors. In distributed sensor and robotic networks, agents use local diffusion to augment their imperfect observations with information from their neighbors [6], [10]–[14]. On the other hand, recent developments in distributed optimization have led to many advances and interesting decentralized algorithms, generalizing these results and at the same time opening new venues for the development of principled distributed estimation algorithms. Examples of such papers include the works of Nedić and Ozdaglar [15], Lobel and Ozdaglar [16], Ram, Nedić and Veeravalli [17], Nedić, Olshevsky, Ozdaglar and Tsitsiklis [18], and Lopes and Sayed [19], as well as the more recent results of Duchi, Agarwal and Wainwright [20]. Of particular importance to the work in this paper is the work of Duchi et al. in [20], where the authors develop a distributed method based on dual averaging of subgradients.
Using a proper diminishing stepsize rule, their algorithm converges to the optimal solution under deterministic as well as stochastic network changes. The goal of this paper is to provide an optimization-based formulation of parameter estimation and social learning and to develop a link between the two. Our motivation for the current study is the recent results of [7] and [8], in which the authors develop non-Bayesian learning schemes to circumvent the complexities associated with fully Bayesian estimation [1], as well as the results of [20] on distributed optimization. The proposed algorithms in [7] and [8] involve agents that repeatedly receive heterogeneous, private, random i.i.d. signals generated from a global likelihood function, and the goal of the agents is to learn the true state of the world using local marginals. Both papers show that under mild assumptions all agents eventually estimate the true parameter correctly. In [8], agents update their prior beliefs using private observations and then compute a weighted average of their beliefs with those of their neighbors, while in [7], agents update the logarithms of their beliefs using the local log-likelihood function. In both cases, under mild assumptions, agents eventually learn the true state. We show that the results of [7] have a very interesting optimization-based interpretation. Exploring this connection and building an optimization-based rationale helps us quantify the pros and cons of different approaches to the problem.

A key unifying observation that links both recursive Bayesian learning and Maximum Likelihood Estimation (MLE) problems to online optimization (even in the centralized setting) is the view of MLE in a Bayesian framework as an optimization in which the inner product of the belief vector and the global log-likelihood function (represented as a vector) is maximized, subject to the belief vector being a probability distribution over the space of parameters. Perhaps less well known is the fact that Bayesian parameter estimation can be derived from the exact same setup if the Kullback-Leibler divergence from a prior belief is added to the optimization cost or used as a proximal function. We show an efficient distributed counterpart of this idea using a stochastic variant of Nesterov's projected dual averaging [21]. Aggregating their private log-likelihood functions, agents average their local information and, in the same time step, update estimates of the centralized beliefs in a step akin to applying Bayes' rule on the aggregated log-likelihoods. When the true state is globally identifiable and the network is connected, we show that agents reach consensus on the beliefs in probability. More specifically, we prove that with high probability the convergence is exponentially fast, with a rate dependent on the average expected discrimination information for the true state over the second likeliest state, captured by the KL divergence of the observations under the two aforementioned states. We further show that there is indeed no need for a diminishing stepsize rule as in general subgradient approaches, and a fixed stepsize of 1 can be used. Interestingly, the method recovers the distributed MAP algorithm proposed by [7] as a special case. The rest of the paper is organized as follows.
In the next section, we introduce the model under which agents interact, define our learning problem, and formulate it as a constrained maximization. In Section III, we recover Bayesian estimation with dual averaging. In Section IV, we show that applying gossip-based distributed dual averaging under a constant, unit stepsize rule results in learning in the probability sense, and that the convergence is exponentially fast. Section V concludes.

II. PRELIMINARIES

A. Agents and Observations

We consider a network consisting of a finite number of agents $\mathcal{V} = \{1, 2, \ldots, n\}$. The agents indexed by $i \in \mathcal{V}$ seek a fixed, unique, true state of the world $\theta^* \in \Theta$, with $\Theta = \{\theta_1, \theta_2, \ldots, \theta_m\}$ denoting a finite set of possible states. At each time $t \geq 0$, the belief of agent $i$ is denoted by $\mu_{i,t}(\theta) \in \Delta\Theta$, where $\Delta\Theta$ is the probability simplex over the set $\Theta$. In particular, $\mu_{i,0}(\theta) \in \Delta\Theta$ denotes the prior belief of agent $i$ about the states of the world. For each agent $i$, we assume the prior $\mu_{i,0}$ is in the interior of the probability simplex and as a result has no zero elements.¹

¹ We will see that this assumption is needed only to deal with log-likelihood functions and technical issues; otherwise, we only need strict positivity of beliefs over the true state.

The learning model is given by a conditional likelihood function $\ell(s_t \mid \theta_j)$, governed by a state of the world $\theta_j \in \Theta$. The signal $s_t = (s_1^t, s_2^t, \ldots, s_n^t) \in \mathcal{S}_1 \times \cdots \times \mathcal{S}_n$ is generated at each time $t$, where $s_i^t \in \mathcal{S}_i$ denotes the signal privately observed by agent $i$ at time $t$ and $\mathcal{S}_i$ is the signal space of agent $i$. $\ell_i(\cdot \mid \theta_j)$ represents the $i$-th marginal of $\ell(\cdot \mid \theta_j)$, and we let the vector $\ell_i(\cdot \mid \theta) = [\ell_i(\cdot \mid \theta_1), \ldots, \ell_i(\cdot \mid \theta_m)]^T$ for any $i \in \mathcal{V}$, where $\ell_i(\cdot \mid \theta_j) > 0$ for all signals at all times. Agent $i$ at time $t$ has access to the parametrized likelihood of the realized private signal $s_i^t$, i.e., it knows the value of $\ell_i(s_i^t \mid \theta)$, but it does not have access to the likelihood functions of other agents, i.e., it does not know $\ell_j(\cdot \mid \theta)$ for any $j \neq i$. Generated signals are i.i.d. over time and independent across agents.

We also define $\bar{\Theta}_i$ as the set of states that are observationally equivalent to $\theta^*$ for agent $i$; in other words, $\bar{\Theta}_i = \{\theta_j \in \Theta : \ell_i(s_i \mid \theta_j) = \ell_i(s_i \mid \theta^*) \ \forall s_i \in \mathcal{S}_i\}$ with probability one. Let $\bar{\Theta} = \cap_{i=1}^n \bar{\Theta}_i$ be the set of states that are observationally equivalent to $\theta^*$ from all agents' perspective. We assume:

A1. The true state is globally identifiable, and hence $\bar{\Theta} = \{\theta^*\}$.

A2. Each log-marginal $\log \ell_i(\cdot \mid \theta_j)$ has a bounded variance.

The probability triple $(\Omega, \mathcal{F}, \mathbb{P}^{\theta^*})$ is defined such that $\Omega = (\otimes_{i=1}^n \mathcal{S}_i)^{\mathbb{N}}$, $\mathcal{F}_t$ is the smallest $\sigma$-field containing the information about all agents up to time $t$, and $\mathbb{P}^{\theta^*}$ is the true probability measure on $\Omega$, with $\mathbb{E}^*$ denoting the corresponding expectation operator. Here $\mathbb{N}$ represents the natural numbers and $\mathcal{F} = \cup_{t=1}^{\infty} \mathcal{F}_t$.

Definition 1: Agent $i \in \mathcal{V}$ asymptotically learns the true parameter $\theta^*$ on a path $\{s_t\}_{t=1}^{\infty}$ if, along that path, $\mu_{i,t}(\theta^*) \to 1$ as $t \to \infty$.

The definition is intuitive, as learning occurs when agents assign probability one to the unique true parameter.
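To make the observation model concrete, here is a minimal Python sketch (ours, not the paper's; the agent count, binary signal spaces, and likelihood table are illustrative assumptions). It samples i.i.d. signals from per-agent Bernoulli marginals and computes the observational-equivalence sets $\bar{\Theta}_i$: no agent identifies $\theta^*$ alone, yet the intersection $\bar{\Theta}$ is the singleton $\{\theta^*\}$, so assumption A1 holds.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: n = 3 agents, m = 3 states, binary signal spaces S_i = {0, 1}.
# lik[i, j] = P(s_i = 1 | theta_j) is the i-th Bernoulli marginal under state theta_j.
n, m = 3, 3
lik = np.array([[0.7, 0.7, 0.3],   # agent 1 cannot tell theta_1 from theta_2
                [0.6, 0.2, 0.2],   # agent 2 cannot tell theta_2 from theta_3
                [0.5, 0.5, 0.5]])  # agent 3's signals are entirely uninformative
true_state = 0                     # index of theta^* = theta_1

def sample_signals(theta):
    """Draw one i.i.d. signal profile s_t = (s_1, ..., s_n) under state theta."""
    return (rng.random(n) < lik[:, theta]).astype(int)

# Observational equivalence: for Bernoulli marginals, theta_j is equivalent to
# theta^* for agent i exactly when the success probabilities match.
equiv = [{j for j in range(m) if np.isclose(lik[i, j], lik[i, true_state])}
         for i in range(n)]
print(equiv)                       # per-agent equivalence sets (the Theta_bar_i)
print(set.intersection(*equiv))    # {0}: assumption A1 holds for the group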
B. Time Model and Communication Structure

The interaction between agents is captured by an undirected graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ is the set of agents, and the pair $\{i, j\}$ belongs to the set $\mathcal{E}$ if there is a link between agent $i$ and agent $j$. We let $\mathcal{N}_i = \{j \in \mathcal{V} : \{i, j\} \in \mathcal{E}\}$ be the set of neighbors of agent $i$. Agent communication conforms to a time-invariant gossip algorithm [22], wherein each node has a clock that ticks according to a rate-1 Poisson process. Equivalently, there is a single global clock that ticks according to a rate-$n$ Poisson process at times $T_t$, where $\{T_t - T_{t-1}\}$ are i.i.d. exponential random variables with rate $n$. In the analysis, we use the index $t$ to refer to the $t$-th time slot $[T_{t-1}, T_t)$, $t \geq 0$. At each tick $T_t$ of the global clock, an agent $I_t \in \mathcal{V}$ is picked uniformly at random. It then contacts a neighbor $J_t \in \mathcal{V}$ with probability $P_{I_t J_t}$, and they update their beliefs. Denoting the communication matrix by $W(t)$, this amounts to $W(t)$ taking the form

$$W(t) = I - \frac{(e_{I_t} - e_{J_t})(e_{I_t} - e_{J_t})^T}{2}$$

with probability $\frac{1}{n} P_{I_t J_t}$, where $e_i$ is the $i$-th unit vector in the standard basis of $\mathbb{R}^n$. Hence, the matrix $P = [P_{ij}]$ has nonnegative entries, and $P_{ij} > 0$ only if $\{i, j\} \in \mathcal{E}$. By definition, $P$ is row stochastic with largest eigenvalue 1. We assume:

A3. The network is connected, i.e., there exists a path from any agent $i$ to any agent $j$, and the second largest eigenvalue of $\mathbb{E}[W(t)]$ is strictly less than one in magnitude.

The connectivity constraint in assumption (A3) guarantees the flow of information in the network. The assumption holds, for instance, if the underlying structure of the network is connected and nonbipartite.
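The gossip model above is easy to simulate. The following sketch (illustrative; the four-agent ring and contact matrix $P$ are our assumptions) draws one realization of $W(t)$, verifies that it is doubly stochastic, and checks assumption A3 via the closed form of $\mathbb{E}[W(t)]$ from [22], which also reappears in the proof of Lemma 4.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 4-agent ring; P is row stochastic with P[i, j] > 0 only on edges.
n = 4
P = np.array([[0.0, 0.5, 0.0, 0.5],
              [0.5, 0.0, 0.5, 0.0],
              [0.0, 0.5, 0.0, 0.5],
              [0.5, 0.0, 0.5, 0.0]])

def sample_gossip_matrix(P):
    """One global-clock tick: agent i wakes uniformly at random, contacts a
    neighbor j with probability P[i, j], and the realized communication matrix
    is W(t) = I - (e_i - e_j)(e_i - e_j)^T / 2."""
    n = len(P)
    i = rng.integers(n)
    j = rng.choice(n, p=P[i])
    e = np.zeros(n)
    e[i], e[j] = 1.0, -1.0          # e = e_i - e_j
    return np.eye(n) - np.outer(e, e) / 2.0

W = sample_gossip_matrix(P)
assert np.allclose(W.sum(axis=0), 1.0) and np.allclose(W.sum(axis=1), 1.0)

# Assumption A3: the second largest eigenvalue of E[W(t)] is below one in
# magnitude; E[W(t)] = I - D/(2n) + (P + P^T)/(2n) with D_ii = sum_j P_ij + P_ji.
D = np.diag((P + P.T).sum(axis=1))
EW = np.eye(n) - D / (2 * n) + (P + P.T) / (2 * n)
print(np.sort(np.abs(np.linalg.eigvals(EW)))[-2])  # 0.75 < 1: A3 holds here
```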
C. Problem Setup and Formulation

The MLE problem of finding the likeliest true state can be formulated in terms of a belief vector $\mu$ as the following optimization:

$$\max_{\mu \in \Delta\Theta} \Big\{ f(\mu) \triangleq \mu^T \sum_{i=1}^n \mathbb{E}^*_{s_i}[\log \ell_i(s_i \mid \theta)] \Big\}, \tag{1}$$

with $\mu^*$ being its optimal solution. In the next section we discuss how a regularization term can be added to the objective function of (1) or used as a common proximal function among the agents. Alternatively, one might cast (1) as a quest for the MLE solution

$$\theta^* = \operatorname*{argmax}_{\theta_j \in \Theta} \ \mathbb{E}^*_s[\log \ell(s \mid \theta_j)]. \tag{2}$$

The equivalence of (1) and (2) follows immediately from the independence of the agents' observations and the global identifiability of $\theta^*$ (assumption A1), which guarantees that (1) has a unique maximizer. In the sequel, without loss of generality, we assume the components of the vector $\mathbb{E}^*_s[\log \ell(s \mid \theta)]$ are in descending order, i.e.,

$$\mathbb{E}^*_s[\log \ell(s \mid \theta_1)] > \mathbb{E}^*_s[\log \ell(s \mid \theta_2)] \geq \cdots \geq \mathbb{E}^*_s[\log \ell(s \mid \theta_m)], \tag{3}$$

where the strict inequality on the left-hand side of (3) is due to the uniqueness of $\theta^* = \theta_1$. Hence, $\theta_1$ is the unique true state to be recovered, and $\mu^* = e_1$.

III. BAYESIAN ESTIMATION VIA NESTEROV'S DUAL AVERAGING

The learning problem formulated in (1) is a maximization over a closed convex set, so the structure of the problem allows us to apply a distributed generalization of the centralized dual averaging method proposed in [21]. First, however, we show how Bayesian learning can be viewed through an optimization lens.

A common approach to tackling problem (1) is to take the empirical average as the cost function and solve the resulting online stochastic learning problem. To this end, we employ a regularized dual averaging scheme generating a sequence of iterates $\{\mu_t, z_t\}_{t=0}^{\infty}$, where $\mu_t \in \Delta\Theta$ and $z_t \in \mathbb{R}^m$. At time period $t$, the algorithm receives $g_t$, the stochastic gradient of the objective function, and performs the following set of centralized updates:

$$z_{t+1} = z_t + g_t \quad \text{and} \quad \mu_{t+1} = \Pi^{\psi}_{\Delta\Theta}(z_{t+1}, \alpha_t), \tag{4}$$

where $\{\alpha_t\}_{t=0}^{\infty}$ is a non-increasing sequence of positive stepsizes, $\psi(\cdot)$ is a so-called proximal function, and

$$\Pi^{\psi}_{\Delta\Theta}(z, \alpha) \triangleq \operatorname*{argmin}_{x \in \Delta\Theta} \Big\{ -\langle z, x \rangle + \frac{1}{\alpha}\,\psi(x) \Big\}, \tag{5}$$

with $\langle z, x \rangle$ being the standard inner product in $\mathbb{R}^m$. The dual update on $z$ essentially integrates the stochastic gradients, while the second update projects the integrated gradients onto the feasible set, regularizing the projection with the proximal function. A particularly relevant example of a proximal function is the Kullback-Leibler (KL) divergence (also known as relative entropy) from an initial belief $\mu_0$, defined as [23]

$$\psi(x) = D_{KL}(x \,\|\, \mu_0) \triangleq \sum_{i=1}^m [x]_i \log \frac{[x]_i}{\mu_0(\theta_i)} \tag{6}$$

for any $x \in \Delta\Theta$, where $[x]_i$ is the $i$-th component of the vector $x$. It is straightforward to verify that the KL divergence from $\mu_0$ is strongly convex with respect to the $\ell_1$-norm on the probability simplex $\{x \mid x \geq 0, \sum_{i=1}^m [x]_i = 1\}$.²

² At the origin we take the limit; in other words, we define $0 \log 0 = 0$.

The following proposition shows how the set of updates (4), equipped with the KL divergence, can be viewed as an optimization counterpart of Bayes' rule.

Proposition 2: Given the update rules (4) with stepsize sequence $\{\alpha_t = 1\}_{t=0}^{\infty}$, using the KL divergence as the proximal function, following the stochastic gradient at each time period $t$, and letting $z_0 = 0$, we obtain Bayes' rule as

$$\mu_t(\theta) = \frac{\mu_{t-1}(\theta) \odot \ell(s_t \mid \theta)}{\sum_{j=1}^m \mu_{t-1}(\theta_j)\, \ell(s_t \mid \theta_j)}, \tag{7}$$

where $\odot$ is component-wise multiplication.

Proof: To solve (1) with updates (4), since the stochastic gradient is $g_t = \sum_{i=1}^n \log \ell_i(s_i^t \mid \theta)$, performing the first update we have

$$z_t = \sum_{i=1}^n \sum_{\tau=0}^{t-1} \log \ell_i(s_i^\tau \mid \theta) = \sum_{\tau=0}^{t-1} \log \ell(s_\tau \mid \theta).$$

Using $\psi(x) = \sum_{j=1}^m [x]_j \log \frac{[x]_j}{\mu_0(\theta_j)}$ as the proximal function, we need to solve

$$\mu_t = \operatorname*{argmin}_{x \in \Delta\Theta} \Big\{ -x^T z_t + \sum_{j=1}^m [x]_j \log \frac{[x]_j}{\mu_0(\theta_j)} \Big\}. \tag{8}$$

Leaving the positivity constraint implicit, we can write (8) as the maximization of the following Lagrangian:

$$L(x, \lambda) = x^T \sum_{\tau=0}^{t-1} \log \ell(s_\tau \mid \theta) - \sum_{j=1}^m [x]_j \log \frac{[x]_j}{\mu_0(\theta_j)} + \lambda (x^T \mathbf{1} - 1), \tag{9}$$

where $\mathbf{1}$ is the vector of all ones. Differentiating (9), we get

$$\frac{\partial}{\partial [x]_j} L(x, \lambda) = \sum_{\tau=0}^{t-1} \log \ell(s_\tau \mid \theta_j) - \log [x]_j + \log \mu_0(\theta_j) - 1 + \lambda, \qquad \frac{\partial}{\partial \lambda} L(x, \lambda) = x^T \mathbf{1} - 1.$$

Setting the above equations to zero, we get

$$x = \exp(\lambda - 1)\, \mu_0 \odot \prod_{\tau=0}^{t-1} \ell(s_\tau \mid \theta), \tag{10}$$

$$x^T \mathbf{1} = 1, \tag{11}$$

and replacing $x$ in (11) by (10), we have

$$\exp(\lambda - 1) = \frac{1}{\sum_{j=1}^m \mu_0(\theta_j) \prod_{\tau=0}^{t-1} \ell(s_\tau \mid \theta_j)}. \tag{12}$$

Hence, by (10) and (12) we have

$$\mu_t(\theta) = \frac{\mu_0(\theta) \odot \prod_{\tau=0}^{t-1} \ell(s_\tau \mid \theta)}{\sum_{j=1}^m \mu_0(\theta_j) \prod_{\tau=0}^{t-1} \ell(s_\tau \mid \theta_j)},$$

and (7) follows since this closed form is exactly the $t$-fold composition of the one-step update (7).
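As a numerical sanity check of Proposition 2, the sketch below (our illustration; the state count, prior, and randomly drawn likelihood vectors are assumptions) runs Bayes' rule (7) alongside the dual averaging recursion (4) with unit stepsize and KL proximal function, evaluating the projection through its Gibbs closed form (10)-(12); the two trajectories coincide at every period.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical check: m = 3 states, an interior prior, and at each period a
# randomly drawn likelihood vector lik_t[j] = l(s_t | theta_j) standing in for
# the likelihood of whatever signal profile was realized.
m, T = 3, 25
mu0 = np.array([0.5, 0.3, 0.2])
mu_bayes = mu0.copy()              # iterates of Bayes' rule (7)
z = np.zeros(m)                    # dual variable of the updates (4)

for t in range(T):
    lik_t = rng.uniform(0.1, 1.0, size=m)
    # Bayes' rule (7): reweight by the likelihood and renormalize.
    mu_bayes = mu_bayes * lik_t / np.dot(mu_bayes, lik_t)
    # Dual averaging (4): integrate the stochastic gradient g_t = log l(s_t | theta);
    # with alpha_t = 1 and KL prox, the projection (5) has the closed form (10)-(12),
    # i.e. a Gibbs distribution over log-prior plus accumulated log-likelihoods.
    z += np.log(lik_t)
    w = np.log(mu0) + z
    mu_da = np.exp(w - w.max())    # subtract the max for numerical stability
    mu_da /= mu_da.sum()
    assert np.allclose(mu_bayes, mu_da)

print(np.round(mu_da, 4))          # identical trajectories, as Proposition 2 states
```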
We have derived a closed-form solution for $\mu_t(\theta)$ that essentially performs the Bayesian update: the estimator aggregates information up to time $t$ and then infers the posterior from the prior. One can prove the almost sure convergence of $\mu_t(\theta)$ by combining the arguments in [24] and [8]. However, we are interested in solving (1) in a decentralized manner, and we only use a generalized version of Proposition 2 later.

IV. DISTRIBUTED STOCHASTIC LEARNING

We now show that the centralized optimization studied in the previous section can be distributed over the network. Contrary to the centralized algorithm, each agent $i \in \mathcal{V}$ at the $t$-th slot observes only $g_{i,t}$, the stochastic gradient of its associated log-likelihood function, and does not have access to the signals of other agents. The communication structure is based on a randomized gossip scheme. Let the global Poisson clock at the beginning of the $t$-th slot tick for agent $i$ (with probability $\frac{1}{n}$), and let agent $i$ contact a neighboring node $j$ (with probability $P_{ij}$). Then agents $i$ and $j$ average their accumulated observations from previous slots and add their new stochastic gradients to form the following online updates:

$$z_{i,t+1} = \frac{z_{i,t} + z_{j,t}}{2} + g_{i,t} \quad \text{and} \quad z_{j,t+1} = \frac{z_{i,t} + z_{j,t}}{2} + g_{j,t}, \tag{13}$$

while in that slot any other agent $k \notin \{i, j\}$ does not contact its neighbors and only follows its own stochastic gradient $g_{k,t}$, so we have

$$z_{k,t+1} = z_{k,t} + g_{k,t}. \tag{14}$$

Having updated their observations, all agents calculate their estimates

$$\mu_{i,t+1}(\theta) = \Pi^{\psi}_{\Delta\Theta}(z_{i,t+1}, \alpha_t), \tag{15}$$

where $\Pi^{\psi}_{\Delta\Theta}(z, \alpha)$ is defined in (5). Letting

$$Z_t = \begin{bmatrix} z_{1,t} \\ z_{2,t} \\ \vdots \\ z_{n,t} \end{bmatrix} \quad \text{and} \quad G_t = \begin{bmatrix} g_{1,t} \\ g_{2,t} \\ \vdots \\ g_{n,t} \end{bmatrix},$$

the set of updates (13) and (14) can be represented in matrix form as

$$Z_{t+1} = \tilde{W}(t) Z_t + G_t, \tag{16}$$

where $\tilde{W}(t) = W(t) \otimes I_{m \times m}$, and the random matrix $W(t)$ with probability $\frac{1}{n} P_{ij}$ takes the form

$$W(t) = I - \frac{(e_i - e_j)(e_i - e_j)^T}{2}. \tag{17}$$

We use the above distributed stochastic scheme to optimize (1) equipped with the KL divergence defined in (6). It is noteworthy that in the distributed setting, by employing the KL divergence from the initial belief, each agent exhibits inertia toward a default opinion over the states. We prove in the next lemma that $\mu_{i,t}(\theta)$ preserves a Bayes-like evolution.

Lemma 3: Given the set of update rules (13)-(14)-(15) with stepsize sequence $\{\alpha_t = 1\}_{t=0}^{\infty}$, following its stochastic gradient $g_{i,t} = \log \ell_i(s_i^t \mid \theta)$ at the $t$-th time period, if we let $z_{i,0} = 0$,³ agent $i$'s estimator evolves as

$$\mu_{i,t}(\theta) = \frac{\mu_{i,0}(\theta) \odot \exp[t\, \Phi_{i,t}(\theta)]}{\sum_{j=1}^m \mu_{i,0}(\theta_j) \exp[t\, \Phi_{i,t}(\theta_j)]}, \tag{18}$$

where

$$\Phi_{i,t}(\theta) = \frac{1}{t} \sum_{\tau=0}^{t-1} \sum_{k=1}^n \Big[ \prod_{\rho=1}^{t-1-\tau} W(t - \rho) \Big]_{ik} \log \ell_k(s_k^\tau \mid \theta). \tag{19}$$

Proof: The discrete-time linear system (16) has the closed-form solution

$$Z_t = \Big[ \prod_{\rho=1}^{t} \tilde{W}(t - \rho) \Big] Z_0 + \sum_{\tau=0}^{t-1} \Big[ \prod_{\rho=1}^{t-\tau-1} \tilde{W}(t - \rho) \Big] G_\tau.$$

Letting $Z_0 = 0$, since $\tilde{W}(t) = W(t) \otimes I_{m \times m}$, basic properties of the Kronecker product let us extract $z_{i,t}$ from $Z_t$ for each $i$:

$$z_{i,t} = \sum_{\tau=0}^{t-1} \sum_{k=1}^n \Big[ \prod_{\rho=1}^{t-1-\tau} W(t - \rho) \Big]_{ik} \log \ell_k(s_k^\tau \mid \theta) = t\, \Phi_{i,t}(\theta).$$

We now need to solve (15) to complete the proof, and the argument follows in the same fashion as Proposition 2: forming the Lagrangian as in (9), writing the first-order conditions using the $z_{i,t}$ derived above, and following the same steps as in (10), (11), and (12), the closed-form solution (18) follows immediately.

³ Regardless of the zero initial value, all the results hold asymptotically; this condition only simplifies our derivation.
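The per-slot mechanics of (13)-(15) translate directly into code. The helpers below are a minimal sketch (function names and array shapes are ours): one gossip exchange on the dual variables followed by the Gibbs-form belief update (18), computed with the usual max-subtraction for numerical stability. The simulation after Theorem 5 inlines the same steps end to end.

```python
import numpy as np

rng = np.random.default_rng(3)

def gossip_slot(Z, G, P):
    """Updates (13)-(14) for one slot: a uniformly chosen agent i contacts a
    neighbor j with probability P[i, j]; the pair averages its dual variables,
    then every agent adds its own stochastic gradient G[k] = log l_k(s_k^t | theta)."""
    n = Z.shape[0]
    i = rng.integers(n)
    j = rng.choice(n, p=P[i])
    Z = Z.copy()
    Z[i] = Z[j] = (Z[i] + Z[j]) / 2.0
    return Z + G

def beliefs(Z, mu0):
    """Update (15) with unit stepsize: the KL-prox projection of each z_i is the
    Gibbs distribution (18) over log-prior plus accumulated log-likelihoods."""
    w = np.log(mu0) + Z                    # Z has shape (n, m); mu0 broadcasts
    w = w - w.max(axis=1, keepdims=True)   # stabilize the exponentials
    mu = np.exp(w)
    return mu / mu.sum(axis=1, keepdims=True)
```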
Equation (18) shows that at each time period $t \geq 0$, the set of distributed update rules (13)-(14)-(15) constructs a Gibbs distribution over the states. As we shall see in the next subsection, $\Phi_{i,t}(\theta)$ plays a key role in the convergence of the sequence of distributions generated over time.

A. Convergence Analysis

We now show that by aggregating information over time, agents attain arbitrarily close opinions in a connected network. This is captured by the fact that the limit of (19) is independent of the agent index $i$. In this direction, we study the limiting behavior of (19) in the following lemma.

Lemma 4: Under assumptions (A2) and (A3), the vector $\Phi_{i,t}(\theta)$ defined in (19) converges in probability as follows:

$$\Phi_{i,t}(\theta) \xrightarrow{p} \Phi_\infty(\theta) = \frac{1}{n} \sum_{k=1}^n \mathbb{E}^*_{s_k}[\log \ell_k(s_k \mid \theta)].$$

Proof: The sequence $\{W(t)\}_{t=0}^{\infty}$ is doubly stochastic, and the product term in $\Phi_{i,t}(\theta)$ preserves double stochasticity, so for any $j \in \{1, 2, \ldots, m\}$ we have

$$\operatorname{var}[\Phi_{i,t}(\theta_j)] = \frac{1}{t^2} \sum_{\tau=0}^{t-1} \sum_{k=1}^n \Big[\prod_{\rho=1}^{t-1-\tau} W(t-\rho)\Big]_{ik}^2 \operatorname{var}[\log \ell_k(s_k \mid \theta_j)] \leq \frac{1}{t} \sum_{k=1}^n \operatorname{var}[\log \ell_k(s_k \mid \theta_j)].$$

Hence, the bounded-variance assumption (A2) guarantees $\Phi_{i,t}(\theta) - \mathbb{E}[\Phi_{i,t}(\theta)] \xrightarrow{p} 0$. It can be shown [22] that

$$\mathbb{E}[W(t)] = \mathbb{E}[W(0)] = I - \frac{1}{2n} D + \frac{P + P^T}{2n},$$

where the diagonal matrix $D$ has entries $D_{ii} = \sum_{j=1}^n [P_{ij} + P_{ji}]$. The fact that the sequence $\{W(t)\}_{t=0}^{\infty}$ is i.i.d. and doubly stochastic, together with the second largest eigenvalue of $\mathbb{E}[W(t)]$ being less than one in magnitude (A3), entails [25]

$$W(t)\, W(t-1) \cdots W(1) \longrightarrow \frac{1}{n} \mathbf{1}\mathbf{1}^T$$

almost surely, which results in

$$\mathbb{E}[\Phi_{i,t}(\theta)] = \sum_{k=1}^n \frac{1}{t} \sum_{\tau=0}^{t-1} \Big[\prod_{\rho=1}^{t-1-\tau} W(t-\rho)\Big]_{ik} \mathbb{E}^*_{s_k}[\log \ell_k(s_k \mid \theta)] \longrightarrow \frac{1}{n} \sum_{k=1}^n \mathbb{E}^*_{s_k}[\log \ell_k(s_k \mid \theta)],$$

where we used the fact that the Cesàro mean preserves the limit.

There is an interesting connection between the previous lemma and the distributed MAP algorithm proposed in [7], where the authors establish that the point maximizer of $\Phi_{i,t}(\theta)$ over $\Theta$ converges in probability to $\theta^*$ in a strongly connected network. However, we still need to demonstrate that the estimator $\mu_{i,t}(\theta)$ is weakly consistent for every agent $i$. To this end, we prove that, applying a Bayes-like update on the same time scale as receiving $z_{i,t}$, the belief vector $\mu_{i,t}(\theta)$ converges in probability to the unique maximizer of (1), which is a Dirac distribution over $\theta^*$.

Theorem 5: Given the conditions in Lemmas 3 and 4, agent $i$'s estimator is weakly consistent; that is, $\mu_{i,t}(\theta) \xrightarrow{p} \mu^* = e_1$ as $t \to \infty$.

Proof: We have the explicit form of the estimator $\mu_{i,t}$ from Lemma 3. Therefore,

$$\mu_{i,t}(\theta^* = \theta_1) = \frac{\mu_{i,0}(\theta_1) \exp[t\,\Phi_{i,t}(\theta_1)]}{\sum_{j=1}^m \mu_{i,0}(\theta_j) \exp[t\,\Phi_{i,t}(\theta_j)]} = \Big(1 + \sum_{j \geq 2} \frac{\mu_{i,0}(\theta_j)}{\mu_{i,0}(\theta_1)} \exp\big[t\,\Phi_{i,t}(\theta_j) - t\,\Phi_{i,t}(\theta_1)\big]\Big)^{-1}.$$

In view of Lemma 4 and equation (3), $[\Phi_{i,t}(\theta_j) - \Phi_{i,t}(\theta_1)]$ converges to a negative number for any $j \geq 2$, and hence $\mu_{i,t}(\theta_1) \xrightarrow{p} 1$. The fact that $\mu_{i,t}(\theta) \in \Delta\Theta$ then implies $\mu_{i,t}(\theta_j) \xrightarrow{p} 0$ for all $j \geq 2$, so $\mu_{i,t}(\theta) \xrightarrow{p} \mu^* = e_1$.
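To see Theorem 5 at work, here is a self-contained simulation (assumptions: the same hypothetical three-agent Bernoulli table used earlier, a complete graph with uniform contact probabilities, uniform priors, and a fixed horizon). Signals are generated under $\theta^* = \theta_1$, and after the gossip updates every agent's belief (18) concentrates on the first state, even though no agent could identify it in isolation.

```python
import numpy as np

rng = np.random.default_rng(4)

# Same hypothetical 3-agent, 3-state Bernoulli table as before; theta^* = theta_1.
n, m, T = 3, 3, 4000
lik = np.array([[0.7, 0.7, 0.3],
                [0.6, 0.2, 0.2],
                [0.5, 0.5, 0.5]])                 # lik[i, j] = P(s_i = 1 | theta_j)
P = (np.ones((n, n)) - np.eye(n)) / (n - 1)       # complete graph, uniform contacts
mu0 = np.full(m, 1.0 / m)                         # uniform interior priors
Z = np.zeros((n, m))

for t in range(T):
    s = (rng.random(n) < lik[:, 0]).astype(bool)  # signals generated under theta^*
    G = np.where(s[:, None], np.log(lik), np.log(1.0 - lik))  # g_{k,t} per state
    i = rng.integers(n)
    j = rng.choice(n, p=P[i])
    Z[i] = Z[j] = (Z[i] + Z[j]) / 2.0             # gossip averaging in (13)
    Z += G                                        # gradient steps in (13)-(14)

w = np.log(mu0) + Z                               # Gibbs form (18) of update (15)
mu = np.exp(w - w.max(axis=1, keepdims=True))
mu /= mu.sum(axis=1, keepdims=True)
print(np.round(mu, 4))  # each row is close to [1, 0, 0]: consensus on theta_1
```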
Theorem 5 also underscores the trade-off between adaptation and learning in the network. In many distributed optimization settings the stepsize sequence must vanish to allow nodes to reach consensus. However, the result of Theorem 5 holds for a unit stepsize sequence, which guarantees learning even under continuous injection of information into the network. This stems from the fact that the algorithm allows $z_{i,t}$ to grow unboundedly in each direction, while letting the true state be the dominant component by giving it the largest exponential rate in the generated Gibbs distribution.

B. Learning Rate Analysis

In this section we characterize the convergence rate of the estimator $\mu_{i,t}(\theta)$. More specifically, we prove that convergence occurs exponentially fast, with a rate dependent on the average expected discrimination information for $\theta_1 = \theta^*$ over $\theta_2$, where $\theta_2$ is the state with the second largest expected log-likelihood (3).

Definition 6: The expected discrimination information of agent $i$ for $\theta_1 = \theta^*$ over any $\theta_j$ is

$$D_{KL}\big(\ell_i(\cdot \mid \theta_1) \,\|\, \ell_i(\cdot \mid \theta_j)\big) = \mathbb{E}^*_{s_i}\Big[\log \frac{\ell_i(s_i \mid \theta_1)}{\ell_i(s_i \mid \theta_j)}\Big].$$

Denoting it by $D(\theta_j)$, the average expected discrimination information for $\theta_1 = \theta^*$ over $\theta_j$ is defined as

$$D(\theta_j) \triangleq \frac{1}{n} \sum_{i=1}^n D_{KL}\big(\ell_i(\cdot \mid \theta_1) \,\|\, \ell_i(\cdot \mid \theta_j)\big). \tag{20}$$

As an immediate consequence of the definition above, one can see from Lemma 4 that $D(\theta_j) = \Phi_\infty(\theta_1) - \Phi_\infty(\theta_j)$ for any $j \geq 1$, and $D(\theta_1) = 0$.

Theorem 7: Given the conditions in Lemmas 3 and 4, for any $\epsilon > 0$ and $t$ large enough, the estimator $\mu_{i,t}(\theta_1)$ can be bounded as

$$\big|\mu_{i,t}(\theta_1) - 1\big| \leq K \exp[(-D(\theta_2) + \epsilon)\,t] \tag{21}$$

with probability at least $1 - \delta(\epsilon, t)$, where $K$ is a constant.

Proof: Following the lines of the proof of Theorem 5, we have

$$\mu_{i,t}(\theta_1) = \Big(1 + \sum_{j \geq 2} \frac{\mu_{i,0}(\theta_j)}{\mu_{i,0}(\theta_1)} \exp\big[t\,\Phi_{i,t}(\theta_j) - t\,\Phi_{i,t}(\theta_1)\big]\Big)^{-1} \geq 1 - \sum_{j \geq 2} \frac{\mu_{i,0}(\theta_j)}{\mu_{i,0}(\theta_1)} \exp\big[t\,\Phi_{i,t}(\theta_j) - t\,\Phi_{i,t}(\theta_1)\big],$$

where in the last step we used the inequality $(1 + \lambda)^{-1} \geq 1 - \lambda$ for all $\lambda \geq 0$. Letting $b_j \triangleq \Phi_{i,t}(\theta_j) - \Phi_\infty(\theta_j)$, we derive

$$\big|\mu_{i,t}(\theta_1) - 1\big| \leq \sum_{j \geq 2} \frac{\mu_{i,0}(\theta_j)}{\mu_{i,0}(\theta_1)} \exp\big[t\,\Phi_{i,t}(\theta_j) - t\,\Phi_{i,t}(\theta_1)\big] \leq \max_k \frac{\mu_{i,0}(\theta_k)}{\mu_{i,0}(\theta_1)} \sum_{j \geq 2} \exp\big[(-D(\theta_j) + b_j - b_1)\,t\big].$$

One can see from the proof of Lemma 4 that $\operatorname{var}[\Phi_{i,t}(\theta_j)]$ decays at a rate $C/t$ for some constant $C > 0$; hence, for any $\epsilon > 0$ and $j \geq 1$, Chebyshev's inequality gives

$$\mathbb{P}(|b_j| \geq \epsilon) \leq \frac{C}{\epsilon^2 t}.$$

Combining this with $D(\theta_m) \geq \cdots \geq D(\theta_2) > D(\theta_1) = 0$, which holds by (3), for any $\epsilon > 0$ and $t$ large enough we have

$$\big|\mu_{i,t}(\theta_1) - 1\big| \leq \max_k \frac{\mu_{i,0}(\theta_k)}{\mu_{i,0}(\theta_1)} \sum_{j \geq 2} \exp\big[(-D(\theta_2) + 2\epsilon)\,t\big] = (m-1) \max_k \frac{\mu_{i,0}(\theta_k)}{\mu_{i,0}(\theta_1)} \exp\big[(-D(\theta_2) + 2\epsilon)\,t\big]$$

with probability at least $1 - \frac{C}{\epsilon^2 t}$. Hence, the constants in (21) are determined as

$$K = (m-1) \max_k \frac{\mu_{i,0}(\theta_k)}{\mu_{i,0}(\theta_1)} \quad \text{and} \quad \delta(\epsilon, t) = \frac{4C}{\epsilon^2 t},$$

and we are done.

Theorem 7 shows that the proposed distributed stochastic learning method in (13)-(14)-(15) converges exponentially fast with high probability. Moreover, agents learn the true state at a rate dependent on the KL divergence of observations under the true state from observations under the second likeliest state. This, indeed, stresses the efficiency of the algorithm.
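For the running Bernoulli example, the rate constant in (21) can be computed directly. The sketch below (illustrative; the likelihood table is the same assumed one) evaluates the average expected discrimination information $D(\theta_j)$ of (20); by the ordering (3), the smallest positive entry is $D(\theta_2)$, the exponential rate at which $|\mu_{i,t}(\theta_1) - 1|$ shrinks with high probability.

```python
import numpy as np

# Same hypothetical Bernoulli table; theta^* = theta_1 is column 0.
lik = np.array([[0.7, 0.7, 0.3],
                [0.6, 0.2, 0.2],
                [0.5, 0.5, 0.5]])

def kl_bern(p, q):
    """KL divergence D_KL(Bernoulli(p) || Bernoulli(q))."""
    return p * np.log(p / q) + (1.0 - p) * np.log((1.0 - p) / (1.0 - q))

n, m = lik.shape
D = np.array([kl_bern(lik[:, 0], lik[:, j]).mean() for j in range(m)])
print(np.round(D, 4))  # [0, 0.1273, 0.2403]: D(theta_1) = 0 and D(theta_2) ~ 0.13,
                       # so |mu_{i,t}(theta_1) - 1| decays roughly like exp(-0.13 t)
```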
V. CONCLUSION

We studied a distributed parameter estimation problem over networks in which agents receive a sequence of i.i.d. signals that are not individually informative enough to identify the true parameter. Using randomized gossip dual averaging, agents aggregate local log-likelihood functions and then perform a Bayes-like update on the averaged information to collectively recover the truth. Assuming connectivity of the network and global identifiability of the true state, we showed that agents' beliefs reach consensus and collapse to a degenerate distribution over the true parameter, and that with high probability the convergence is exponentially fast. We also proved that the exponential rate depends on the KL divergence of observations under the true state from observations under the second likeliest state. As a salient feature of the algorithm, we showed that, contrary to other stochastic gradient descent methods, the stepsize can be chosen to be fixed and set to 1. Future directions include adding dynamics to the parameter, relaxing the independence conditions on observations, and specializing to the Gaussian case, where one only needs to update the mean and variance.

ACKNOWLEDGMENTS

The authors would like to thank Robin Pemantle for many helpful comments and discussions.

REFERENCES

[1] V. Borkar and P. Varaiya, "Asymptotic agreement in distributed estimation," IEEE Transactions on Automatic Control, vol. 27, no. 3, pp. 650–655, 1982.
[2] J. N. Tsitsiklis, "Decentralized detection by a large number of sensors," Mathematics of Control, Signals, and Systems, vol. 1, no. 2, pp. 167–182, 1988.
[3] J. N. Tsitsiklis et al., "Decentralized detection," Advances in Statistical Signal Processing, vol. 2, pp. 297–344, 1993.
[4] E. Mossel and O. Tamuz, "Efficient Bayesian learning in social networks with Gaussian estimators," arXiv preprint arXiv:1002.0747, 2010.
[5] U. A. Khan, S. Kar, A. Jadbabaie, and J. Moura, "On connectivity, observability, and stability in distributed estimation," in 49th IEEE Conference on Decision and Control (CDC). IEEE, 2010, pp. 6639–6644.
[6] S. Kar, J. Moura, and K. Ramanan, "Distributed parameter estimation in sensor networks: Nonlinear observation models and imperfect communication," IEEE Transactions on Information Theory, vol. 58, no. 6, pp. 3575–3605, 2012.
[7] K. Rad and A. Tahbaz-Salehi, "Distributed parameter estimation in networks," in 49th IEEE Conference on Decision and Control (CDC). IEEE, 2010, pp. 5050–5055.
[8] A. Jadbabaie, P. Molavi, A. Sandroni, and A. Tahbaz-Salehi, "Non-Bayesian social learning," Games and Economic Behavior, vol. 76, no. 1, pp. 210–225, 2012.
[9] O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao, "Optimal distributed online prediction using mini-batches," The Journal of Machine Learning Research, vol. 13, pp. 165–202, 2012.
[10] J. Tsitsiklis, "Problems in decentralized decision making and computation," DTIC Document, Tech. Rep., 1984.
[11] A. Jadbabaie, J. Lin, and A. Morse, "Coordination of groups of mobile autonomous agents using nearest neighbor rules," IEEE Transactions on Automatic Control, vol. 48, no. 6, pp. 988–1001, 2003.
[12] M. Mesbahi and M. M. Egerstedt, Graph Theoretic Methods in Multiagent Networks. Princeton University Press, 2010.
[13] F. Bullo, J. Cortés, and S. Martínez, Distributed Control of Robotic Networks: A Mathematical Approach to Motion Coordination Algorithms. Princeton University Press, 2009.
[14] R. Olfati-Saber and J. Shamma, "Consensus filters for sensor networks and distributed sensor fusion," in 44th IEEE Conference on Decision and Control, Seville, Spain, Dec. 2005, pp. 6698–6703.
[15] A. Nedić and A. Ozdaglar, "Distributed subgradient methods for multi-agent optimization," IEEE Transactions on Automatic Control, vol. 54, no. 1, pp. 48–61, 2009.
[16] I. Lobel and A. Ozdaglar, "Distributed subgradient methods over random networks," in Proc. Allerton Conf. Commun., Control, Comput., 2008.
[17] S. Ram, A. Nedić, and V. Veeravalli, "Distributed stochastic subgradient projection algorithms for convex optimization," Journal of Optimization Theory and Applications, vol. 147, no. 3, pp. 516–545, 2010.
[18] A. Nedić, A. Olshevsky, A. Ozdaglar, and J. Tsitsiklis, "On distributed averaging algorithms and quantization effects," IEEE Transactions on Automatic Control, vol. 54, no. 11, pp. 2506–2517, 2009.
[19] C. Lopes and A. Sayed, "Incremental adaptive strategies over distributed networks," IEEE Transactions on Signal Processing, vol. 55, no. 8, pp. 4064–4077, 2007.
[20] J. Duchi, A. Agarwal, and M. Wainwright, "Dual averaging for distributed optimization: convergence analysis and network scaling," IEEE Transactions on Automatic Control, pp. 592–607, March 2012.
[21] Y. Nesterov, "Primal-dual subgradient methods for convex problems," Mathematical Programming, vol. 120, no. 1, 2009.
[22] S. Boyd, A. Ghosh, B. Prabhakar, and D. Shah, "Randomized gossip algorithms," IEEE Transactions on Information Theory, vol. 52, no. 6, pp. 2508–2530, 2006.
[23] T. M. Cover and J. A. Thomas, Elements of Information Theory. John Wiley & Sons, 2012.
[24] D. Blackwell and L. Dubins, "Merging of opinions with increasing information," The Annals of Mathematical Statistics, vol. 33, no. 3, pp. 882–886, 1962.
[25] A. Tahbaz-Salehi and A. Jadbabaie, "Consensus over ergodic stationary graph processes," IEEE Transactions on Automatic Control, vol. 55, no. 1, pp. 225–230, 2010.
