Distributed Detection: Finite-time Analysis and Impact of Network Topology


Authors: Shahin Shahrampour, Alexander Rakhlin, Ali Jadbabaie

Abstract—This paper addresses the problem of distributed detection in multi-agent networks. Agents receive private signals about an unknown state of the world. The underlying state is globally identifiable, yet informative signals may be dispersed throughout the network. Using an optimization-based framework, we develop an iterative local strategy for updating individual beliefs. In contrast to the existing literature, which focuses on asymptotic learning, we provide a finite-time analysis. Furthermore, we introduce a Kullback-Leibler cost to compare the efficiency of the algorithm to its centralized counterpart. Our bounds on the cost are expressed in terms of network size, spectral gap, centrality of each agent, and relative entropy of agents' signal structures. A key observation is that distributing more informative signals to central agents results in a faster learning rate. Furthermore, by optimizing the weights, we can speed up learning by improving the spectral gap. We also quantify the effect of link failures on learning speed in symmetric networks. We finally provide numerical simulations which verify our theoretical results.

Shahin Shahrampour and Ali Jadbabaie are with the Department of Electrical and Systems Engineering, University of Pennsylvania, Philadelphia, PA 19104 USA (e-mail: shahin@seas.upenn.edu; jadbabai@seas.upenn.edu). Alexander Rakhlin is with the Department of Statistics, University of Pennsylvania, Philadelphia, PA 19104 USA (e-mail: rakhlin@wharton.upenn.edu).

I. INTRODUCTION

Recent years have witnessed an intense interest in distributed detection, estimation, prediction and optimization [1]–[7]. Decentralizing the computational burden among agents has been widely considered in networks ranging from sensor and robot networks to social and economic networks [8]–[11]. In this broad class of problems, agents in a network need to perform a global task for which they only have partial information. Therefore, they recursively exchange information with their neighbors, and the global dispersion of information in the network provides them with adequate data to accomplish the task. In the big picture, many of these schemes can also be embedded in the context of consensus protocols, which have gained growing popularity over the past three decades [12]–[14].

Earlier works on decentralized detection have considered scenarios where each agent sends its observations to a fusion center that decides on the true value of a parameter [1], [2], [8]. In these situations, the fusion center faces a classical hypothesis testing (centralized detection) problem after collecting the data from agents. Recently, another model of learning and detection has been proposed by Jadbabaie et al. [15]. In this framework, the world is governed by a fixed true state or hypothesis that is to be recovered by a network of agents. The state belongs to a finite set, and might represent a decision, an opinion, the price of a product or any quantity of interest. Each agent observes a stream of private signals generated by a marginal of the global likelihood conditioned on the true state. However, the signals might not be informative enough for the agent to distinguish the underlying state of the world.
Therefore, agents use local diffusion to compensate for their imperfect knowledge about the environment. In the literature, a host of schemes build on this model to describe distributed learning [15]–[18]. Despite the wealth of results on the asymptotic behavior of these methods, the finite-time analysis remains elusive. In [15], a non-Bayesian update rule is proposed in the context of social networks. Each individual averages her Bayesian posterior belief with the opinions of her neighbors, and the beliefs tend to the truth under mild technical assumptions. Following up on the work of Duchi et al. [19] on distributed dual averaging, an optimization-based algorithm is developed in [16]. The authors demonstrate that the belief sequence generated according to their method is weakly consistent in undirected networks. Lalitha et al. [17] introduce another strategy which puts exponential weights on a linear combination of Bayesian log-posteriors. The convergence conditions of their method are similar to those of [15]. On the other hand, Rahnama Rad et al. [18] present a distributed algorithm for a continuous state space, and prove its convergence. In [15]–[17], the convergence occurs exponentially fast, and the asymptotic rate is characterized in terms of the relative entropy of individuals' signal structures and their eigenvector centralities (see [20] for the rate analysis of [15]). As an important consequence, the rate in [16] only recovers the empirical average of relative entropies, since the method is restricted to undirected networks.

The asymptotic analysis presented in the above-discussed papers only describes the dominant factors that influence learning in the long run. In real-world applications, however, the decision on the true state has to be made in finite time. Therefore, it is crucial to study the finite-time variant of these schemes to gain insight into the interplay of the network parameters which affect learning. To this end, we extend the work of Shahrampour et al. [16] to directed networks where agents are not equally central. Moreover, we introduce the notion of Kullback-Leibler (KL) cost to measure the learning rate of an individual agent versus an expert who has all available information for learning. The KL decentralization cost simply compares the performance of the distributed algorithm to its centralized counterpart. We derive an upper bound on the cost which shows that the spectral gap of the network matters in addition to agents' centralities. It turns out that the upper bound scales inversely in the spectral gap, and logarithmically with the network size, the number of states and the time horizon. The rate also scales with the inverse of the relative entropy of the conditional marginals. More specifically, the KL cost grows when signals do not provide enough evidence in favor of the true state versus some other state of the world. Assuming that the network is realized with a default communication structure, each agent is endowed with a centrality. We establish that allocating more informative signals to more central agents can expedite learning. More interestingly, the importance of the spectral gap opens new venues for optimal network design to facilitate agents' interactions. Each agent assigns different weights to its neighbors' information while communicating with them. We demonstrate how agents can modify these weights to achieve a faster learning rate.
The key idea is to find the Markov chain with the best mixing behavior that is consistent with the network structure and agents' centralities. On the other hand, as a natural conjecture, we expect a more rapid learning rate in well-connected networks. We study the ramification of link failures in the network, and prove that in symmetric networks, less connectivity amounts to a sluggish learning process. We further apply our results to star, cycle and two-dimensional grid networks. We observe that in each case the effect of the spectral gap can be translated to the network diameter. Intuitively, a larger diameter makes information propagation around the network difficult. Finally, we present numerical experiments which match our theoretical findings.

The rest of the paper is organized as follows: we describe the formal statement of the problem and flesh out the distributed detection scheme in Section II. Section III is devoted to the finite-time analysis of the algorithm, whereas Section IV elaborates on the impact of network characteristics on the convergence rate. We briefly discuss applications of the model and provide our numerical experiments in Section V. Section VI concludes.

Notation: We adhere to the following notation in the exposition of our results:

$[n]$ : the set $\{1, 2, \ldots, n\}$ for any integer $n$
$x^T$ : transpose of the vector $x$
$x(k)$ : the $k$-th element of vector $x$
$x_{[k]}$ : the $k$-th largest element of vector $x$
$I_m$ : identity matrix of size $m$
$\Delta_m$ : the $m$-dimensional probability simplex
$e_k$ : delta distribution on the $k$-th component
$\langle \cdot, \cdot \rangle$ : standard inner product operator
$\|\cdot\|_p$ : $p$-norm operator
$\mathbf{1}$ : vector of all ones
$\|\mu - \pi\|_{TV}$ : total variation distance between $\mu, \pi \in \Delta_m$
$D_{KL}(\mu \| \pi)$ : KL-divergence of $\pi \in \Delta_m$ from $\mu \in \Delta_m$
$\lambda_i(W)$ : the $i$-th largest eigenvalue of matrix $W$

For any $f \in \mathbb{R}^m$ and $\mu \in \Delta_m$, we let $\mathbb{E}_\mu[\cdot]$ (respectively, $\mathrm{Var}_\mu[\cdot]$) represent the expectation (respectively, variance) of $f$ under the measure $\mu$, i.e., we have

$$\mathbb{E}_\mu[f] = \sum_{j=1}^{m} \mu(j) f(j), \qquad \mathrm{Var}_\mu[f] = \sum_{j=1}^{m} \mu(j)\big(f(j) - \mathbb{E}_\mu[f]\big)^2.$$

II. THE PROBLEM DESCRIPTION AND ALGORITHM

In this section, we describe the observation and network models, and outline the centralized setting for the problem. Then, we provide a formal statement of the distributed setting, and characterize the decentralization cost.

A. Observation Model

We consider an environment in which $\Theta = \{\theta_1, \theta_2, \ldots, \theta_m\}$ denotes a finite set of states of the world. We have a network of $n$ agents that seek the unique, true state of the world $\theta_1 \in \Theta$. At each time $t \in [T]$, the belief of agent $i$ is denoted by $\mu_{i,t} \in \Delta_m$, where $\Delta_m$ is the probability simplex over the set $\Theta$. In particular, $\mu_{i,0} \in \Delta_m$ denotes the prior belief of agent $i \in [n]$ about the states of the world, assumed to be uniform with no loss of generality (the uniform prior only avoids notational clutter; the analysis holds for any prior with full support). The learning model is given by a conditional likelihood function $\ell(\cdot|\theta_k)$ which is governed by a state of the world $\theta_k \in \Theta$. For each $i \in [n]$, let $\ell_i(\cdot|\theta_k)$ denote the $i$-th marginal of $\ell(\cdot|\theta_k)$, and we use the vector representation $\ell_i(\cdot|\theta) = [\ell_i(\cdot|\theta_1), \ldots, \ell_i(\cdot|\theta_m)]^T$ to stack all states. At each time $t \in [T]$, the signal $s_t = (s_{1,t}, s_{2,t}, \ldots, s_{n,t}) \in \mathcal{S}_1 \times \cdots \times \mathcal{S}_n$ is generated based on the true state $\theta_1$.
Therefore, for each $i \in [n]$, the signal $s_{i,t} \in \mathcal{S}_i$ is a sample drawn according to the marginal likelihood $\ell_i(\cdot|\theta_1)$, where $\mathcal{S}_i$ is the sample space. The signals are i.i.d. over time, and the marginals are independent, i.e., $\ell(\cdot|\theta_k) = \prod_{i=1}^{n} \ell_i(\cdot|\theta_k)$ for any $k \in [m]$. For the sake of convenience, we define $\psi_{i,t} \triangleq \log \ell_i(s_{i,t}|\theta)$, which is a sample corresponding to $\Psi_i \triangleq \log \ell_i(\cdot|\theta)$ for any $i \in [n]$.

A1. We assume that all log-marginals are uniformly bounded such that $\|\psi_{i,t}\|_\infty \leq B$ for any $s_{i,t} \in \mathcal{S}_i$, i.e., we have $|\log \ell_i(\cdot|\theta_k)| \leq B$ for any $i \in [n]$ and $k \in [m]$.

Assumption A1 is made for technical reasons, but such a bound holds, for instance, when the signal space is discrete and every distribution has full support. Let us define $\bar{\Theta}_i$ as the set of states that are observationally equivalent to $\theta_1$ for agent $i \in [n]$; in other words, $\bar{\Theta}_i = \{\theta_k \in \Theta : \ell_i(s_i|\theta_k) = \ell_i(s_i|\theta_1) \ \forall s_i \in \mathcal{S}_i\}$ with probability one. As evident from the definition, any state $\theta_k \neq \theta_1$ in the set $\bar{\Theta}_i$ is not distinguishable from the true state by observation of samples from the $i$-th marginal. Let $\bar{\Theta} = \cap_{i=1}^{n} \bar{\Theta}_i$ be the set of states that are observationally equivalent to $\theta_1$ from all agents' perspective.

A2. We assume that no state of the world is observationally equivalent to the true state from the standpoint of the network, i.e., the true state is globally identifiable, and we have $\bar{\Theta} = \{\theta_1\}$.

Assumption A2 guarantees that the global likelihood provides sufficient information to make the true state uniquely identifiable. Let $\mathcal{F}_t$ be the smallest $\sigma$-field containing the information about all agents up to time $t$. Then, when the learning process continues for $T$ rounds, the probability triple $(\Omega, \mathcal{F}, \mathbb{P})$ is defined as follows: the sample space is $\Omega = \otimes_{t=1}^{T}(\otimes_{i=1}^{n} \mathcal{S}_i)$, the $\sigma$-field is $\mathcal{F} = \cup_{t=1}^{T} \mathcal{F}_t$, and the true probability measure is $\mathbb{P} = \otimes_{t=1}^{T} \ell(\cdot|\theta_1)$. Finally, the operator $\mathbb{E}$ denotes the expectation with respect to $\mathbb{P}$.

B. Network Model

The interaction between agents is captured by a directed graph $\mathcal{G} = ([n], E)$, where $[n]$ is the set of nodes corresponding to agents, and $E$ is the set of edges. Agent $i$ receives information from $j$ only if the pair $(i,j) \in E$. We let $\mathcal{N}_i = \{j \in [n] : (i,j) \in E\}$ be the set of neighbors of agent $i$. Throughout the learning process, agents truthfully report their information to their neighbors. We represent by $[W]_{ii} \geq 0$ the self-reliance of agent $i$, and by $[W]_{ij} > 0$ the weight that agent $i$ assigns to information received from agent $j$ in its neighborhood. The matrix $W$ is constructed such that $[W]_{ij}$ denotes the entry in its $i$-th row and $j$-th column. Therefore, $W$ has nonnegative entries, and $[W]_{ij} > 0$ only if $(i,j) \in E$. For normalization purposes, we further assume that $W$ is stochastic; hence,

$$\sum_{j=1}^{n} [W]_{ij} = \sum_{j \in \mathcal{N}_i} [W]_{ij} = 1.$$

A3. We assume that the network is strongly connected, i.e., there exists a directed path from any agent $i \in [n]$ to any agent $j \in [n]$. We further assume for simplicity that $W$ is diagonalizable (diagonalizability is not necessary; it only forms a clean playground for the technical analysis by avoiding Jordan blocks).

The strong-connectivity constraint in assumption A3 guarantees information flow in the network. The assumption implies that $\lambda_1(W) = 1$ is unique, and the other eigenvalues of $W$ are strictly less than one in magnitude [21].
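For concreteness, the following minimal Python sketch (ours, not part of the paper; the uniform weighting rule is an illustrative assumption, not the paper's choice) builds a row-stochastic matrix $W$ of the kind described above and checks the spectral consequence of assumption A3:

```python
import numpy as np

def row_stochastic(adj):
    """Build a row-stochastic W from a directed adjacency matrix.

    Each agent splits its weight uniformly over itself and its neighbors;
    this uniform rule is an illustrative assumption, not the paper's choice.
    """
    A = np.asarray(adj, dtype=float) + np.eye(len(adj))  # add self-reliance [W]_ii > 0
    return A / A.sum(axis=1, keepdims=True)              # normalize rows: W 1 = 1

# A strongly connected 3-agent ring: agent i listens to agent i+1 (mod 3).
adj = np.array([[0, 1, 0],
                [0, 0, 1],
                [1, 0, 0]])
W = row_stochastic(adj)

moduli = np.sort(np.abs(np.linalg.eigvals(W)))[::-1]
print(np.isclose(moduli[0], 1.0))   # True: W is stochastic, so lambda_1(W) = 1
print(moduli[1] < 1.0)              # True: all other eigenvalues lie strictly inside the unit circle
```

Any other weighting that keeps $W$ stochastic with positive entries on the edges of a strongly connected graph behaves the same way.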
Given the matrix of social interactions $W$, the eigenvector centrality is a nonnegative vector $\pi$ with $\|\pi\|_1 = 1$ such that for all $i \in [n]$,

$$\pi(i) = \sum_{j=1}^{n} [W]_{ji}\, \pi(j). \tag{1}$$

Then $\pi(i)$, the $i$-th element of $\pi$, is the eigenvector centrality of agent $i$. In matrix form, the preceding relation reads $\pi^T W = \pi^T$, which means $\pi$ is the stationary distribution of $W$. Assumption A3 entails that the Markov chain $W$ is irreducible and aperiodic, and the unique stationary distribution $\pi$ has strictly positive components [21].

C. Centralized Detection

To motivate the development of the distributed scheme, we commence by introducing centralized detection (the method can be cast as a special case of Follow the Regularized Leader [22] and Mirror Descent [23]). In this case, the scenario can be described as a two-player repeated game between Nature and a centralized agent (expert) that has global information to learn the true state. More specifically, the expert observes the sequence of signals $\{s_t\}_{t=1}^{T}$ that are in turn revealed by Nature, and knows the entire network characteristics. At any round $t \in [T]$, the expert accumulates a weighted average of log-marginals, and forms the belief $\mu_t \in \Delta_m$ about the states, where $\Delta_m = \{\mu \in \mathbb{R}^m \mid \mu \succeq 0, \sum_{k=1}^{m} \mu(k) = 1\}$ denotes the $m$-dimensional probability simplex. Letting

$$\psi_t \triangleq \sum_{i=1}^{n} \pi(i)\, \psi_{i,t} = \sum_{i=1}^{n} \pi(i) \log \ell_i(s_{i,t}|\theta), \tag{2}$$

the sequence of interactions can be depicted in the form of the following algorithm:

Centralized Detection
Input: a uniform prior belief $\mu_0$, a learning rate $\eta > 0$.
Initialize: let $\phi_0(k) = 0$ for all $k \in [m]$.
At time $t = 1, \ldots, T$: observe the signal $s_t = (s_{1,t}, s_{2,t}, \ldots, s_{n,t})$, update the vector function $\phi_t$, and form the belief $\mu_t$ as follows:

$$\phi_t = \phi_{t-1} + \psi_t \quad \text{and} \quad \mu_t = \operatorname*{argmin}_{\mu \in \Delta_m} \left\{ -\mu^T \phi_t + \frac{1}{\eta} D_{KL}(\mu \| \mu_0) \right\}. \tag{3}$$

At each time $t \in [T]$, the expert's goal is to maximize the expected log-marginals while sticking to the default belief $\mu_0$, i.e., minimizing the divergence. The trade-off between the two behaviors is tuned with the learning rate $\eta$. Let us note that according to Jensen's inequality for the concave function $\log(\cdot)$, we have for every $i \in [n]$ and $k \in [m]$ that

$$-D_{KL}\big(\ell_i(\cdot|\theta_1) \,\|\, \ell_i(\cdot|\theta_k)\big) = \mathbb{E}\left[\log \frac{\ell_i(\cdot|\theta_k)}{\ell_i(\cdot|\theta_1)}\right] \leq \log \mathbb{E}\left[\frac{\ell_i(\cdot|\theta_k)}{\ell_i(\cdot|\theta_1)}\right] = 0,$$

where the inequality turns to equality if and only if $\ell_i(\cdot|\theta_1) = \ell_i(\cdot|\theta_k)$, i.e., iff $\theta_k \in \bar{\Theta}_i$. Therefore, it holds that $\mathbb{E}[\log \ell_i(\cdot|\theta_k)] \leq \mathbb{E}[\log \ell_i(\cdot|\theta_1)]$, and recalling that the stationary distribution $\pi$ consists of positive elements, we have for any $k \neq 1$ that

$$\mathbb{E}\left[\sum_{i=1}^{n} \pi(i)\Psi_i(k)\right] = \mathbb{E}\left[\sum_{i=1}^{n} \pi(i)\log \ell_i(\cdot|\theta_k)\right] < \mathbb{E}\left[\sum_{i=1}^{n} \pi(i)\log \ell_i(\cdot|\theta_1)\right] = \mathbb{E}\left[\sum_{i=1}^{n} \pi(i)\Psi_i(1)\right],$$

where the strict inequality is due to uniqueness of the true state $\theta_1$ and the fact that $\bar{\Theta} = \cap_{i=1}^{n} \bar{\Theta}_i = \{\theta_1\}$ by assumption A2. In the sequel, without loss of generality, we assume the following descending order:

$$\mathbb{E}\left[\sum_{i=1}^{n} \pi(i)\Psi_i(1)\right] > \mathbb{E}\left[\sum_{i=1}^{n} \pi(i)\Psi_i(2)\right] \geq \cdots \geq \mathbb{E}\left[\sum_{i=1}^{n} \pi(i)\Psi_i(m)\right]. \tag{4}$$
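To make updates (2) and (3) concrete, here is a minimal sketch of one round of Centralized Detection; it assumes the explicit softmax solution derived in Lemma 1 below, and the likelihood arrays and their indexing are hypothetical placeholders:

```python
import numpy as np

def centralized_step(phi, signals, liks, pi, eta):
    """One round of Centralized Detection, Eq. (3).

    phi     : length-m score vector phi_{t-1}
    signals : tuple s_t = (s_{1,t}, ..., s_{n,t}) of signal indices
    liks    : liks[i][s] is the length-m vector [l_i(s|theta_1), ..., l_i(s|theta_m)]
    pi      : eigenvector centralities (stationary distribution of W)
    eta     : learning rate
    Returns (phi_t, mu_t), with mu_t given by the softmax form of Lemma 1 below.
    """
    psi = sum(pi[i] * np.log(liks[i][s]) for i, s in enumerate(signals))  # Eq. (2)
    phi = phi + psi                                   # phi_t = phi_{t-1} + psi_t
    w = np.exp(eta * phi - np.max(eta * phi))         # numerically stable softmax
    return phi, w / w.sum()                           # belief mu_t in Delta_m
```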
D. Distributed Detection

We now extend the previous section to the distributed setting modeled on a network of agents. In the distributed scheme, each agent $i \in [n]$ only observes the stream of private signals $\{s_{i,t}\}_{t=1}^{T}$ generated based on the parametrized likelihood $\ell_i(\cdot|\theta_1)$. That is, agent $i \in [n]$ does not directly observe $s_{j,t}$ for any $j \neq i$. As a result, it gathers the local information by averaging the log-likelihoods in its neighborhood, and forms the belief $\mu_{i,t} \in \Delta_m$ at round $t \in [T]$ as follows:

Distributed Detection
Input: a uniform prior belief $\mu_{i,0}$, a learning rate $\eta > 0$.
Initialize: let $\phi_{i,0}(k) = 0$ for all $k \in [m]$ and $i \in [n]$.
At time $t \in [T]$: observe the signal $s_{i,t}$, update the function $\phi_{i,t}$, and form the belief $\mu_{i,t}$ as follows:

$$\phi_{i,t} = \sum_{j \in \mathcal{N}_i} [W]_{ij}\, \phi_{j,t-1} + \psi_{i,t} \quad \text{and} \quad \mu_{i,t} = \operatorname*{argmin}_{\mu \in \Delta_m} \left\{ -\mu^T \phi_{i,t} + \frac{1}{\eta} D_{KL}(\mu \| \mu_{i,0}) \right\}. \tag{5}$$

As outlined above, each agent updates its belief using purely local diffusion. We are interested in measuring the efficiency of the distributed algorithm via a metric comparing it to its centralized counterpart. At any round $t \in [T]$, let us postulate that the cost which agent $i \in [n]$ needs to pay to have the same opinion as the expert is $D_{KL}(\mu_{i,t} \| \mu_t)$; then, the total decentralization cost that the agent incurs after $T$ rounds is

$$\text{Cost}_{i,T} \triangleq \sum_{t=1}^{T} D_{KL}(\mu_{i,t} \| \mu_t) = \sum_{t=1}^{T} \mathbb{E}_{\mu_{i,t}}\left[\log \frac{\mu_{i,t}}{\mu_t}\right]. \tag{6}$$

This function quantifies the difference between the agent that observes the private signals $\{s_{i,t}\}_{t=1}^{T}$ and an expert that has $\{s_t\}_{t=1}^{T}$ and $\pi$ available. Note, importantly, that $\text{Cost}_{i,T}$ is a random quantity, since the expectation is not taken with respect to the randomness of signals. We conclude this section with the following lemma, which shows that both algorithms are reminiscent of the well-known Exponential Weights algorithm.

Lemma 1: The update rules (3) and (5) have the explicit-form solutions

$$\mu_t(k) = \frac{\exp\{\eta\, \phi_t(k)\}}{\langle \mathbf{1}, \exp\{\eta\, \phi_t\} \rangle} \quad \text{and} \quad \mu_{i,t}(k) = \frac{\exp\{\eta\, \phi_{i,t}(k)\}}{\langle \mathbf{1}, \exp\{\eta\, \phi_{i,t}\} \rangle},$$

respectively, for any $i \in [n]$ and $k \in [m]$. Moreover,

$$\phi_{i,t} = \sum_{\tau=1}^{t} \sum_{j=1}^{n} \left[W^{t-\tau}\right]_{ij} \psi_{j,\tau}.$$
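A minimal sketch of one synchronous round of Distributed Detection (5) for all agents at once, together with one summand of the decentralization cost (6), might look as follows; the array layout and function names are illustrative choices of ours:

```python
import numpy as np

def distributed_step(Phi, Psi, W, eta):
    """One synchronous round of Distributed Detection, Eq. (5), for all agents.

    Phi : (n, m) array; row i holds phi_{i,t-1}
    Psi : (n, m) array; row i holds the fresh log-marginal psi_{i,t}
    W   : (n, n) row-stochastic communication matrix
    Returns (Phi_t, Mu_t), where Mu_t[i] is the belief mu_{i,t} of Lemma 1.
    """
    Phi = W @ Phi + Psi                               # neighborhood mixing + innovation
    E = np.exp(eta * Phi - eta * Phi.max(axis=1, keepdims=True))
    return Phi, E / E.sum(axis=1, keepdims=True)      # softmax of Lemma 1, per agent

def kl_cost_increment(mu_i, mu_c, eps=1e-300):
    """One summand D_KL(mu_{i,t} || mu_t) of the decentralization cost (6)."""
    return float(np.sum(mu_i * (np.log(mu_i + eps) - np.log(mu_c + eps))))
```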
We will now state the main results of the paper along with the intuition behind them. Some proofs are deferred to the appendix.

III. FINITE-TIME ANALYSIS OF BELIEFS AND COST FUNCTIONS

In this section, we investigate the convergence of agents' beliefs to the true state in the network. Agents exchange information over time, and reach consensus about the true state. The connectivity of the network plays an important role in learning, as $W^t \to \mathbf{1}\pi^T$ when $t \to \infty$. To examine the learning rate, we need knowledge of the mixing behavior of the Markov chain $W$. The following lemma sheds light on the mixing rate, and we invoke it later in the technical analysis.

Lemma 2: Given strong connectivity of the network (assumption A3), the stochastic matrix $W$ satisfies

$$\sum_{\tau=1}^{t} \sum_{j=1}^{n} \left| \left[W^{t-\tau}\right]_{ij} - \pi(j) \right| \leq \frac{4 \log n}{1 - \lambda_{\max}(W)},$$

for any $i \in [n]$, where $\lambda_{\max}(W) \triangleq \max\{|\lambda_n(W)|, |\lambda_2(W)|\}$.

We now establish that agents have arbitrarily close opinions in a connected network. Furthermore, the convergence rate is governed by the cardinality of the state space and the network characteristics.

Lemma 3: Let the sequence of beliefs $\{\mu_{i,t}\}_{t=1}^{T}$ for each agent $i \in [n]$ be generated by the Distributed Detection algorithm with learning rate $\eta$. Given bounded log-marginals (assumption A1), global identifiability of the true state (assumption A2), and strong connectivity of the network (assumption A3), for each individual agent $i \in [n]$ it holds that

$$\frac{1}{\eta} \log \|\mu_{i,t} - e_1\|_{TV} \leq -\mathcal{I}(\theta_1, \theta_2)\, t + \sqrt{2B^2 t \log \frac{m}{\delta}} + \frac{8B \log n}{1 - \lambda_{\max}(W)} + \frac{\log m}{\eta},$$

with probability at least $1 - \delta$, where for $k \geq 2$

$$\mathcal{I}(\theta_1, \theta_k) \triangleq \sum_{i=1}^{n} \pi(i)\, D_{KL}\big(\ell_i(\cdot|\theta_1) \,\|\, \ell_i(\cdot|\theta_k)\big).$$

Lemma 3 verifies that the belief $\mu_{i,t}$ of each agent $i \in [n]$ is strongly consistent, i.e., it converges almost surely to a delta distribution on the true state. The claim follows immediately by letting $\delta = 1/t^2$ and applying the Borel-Cantelli lemma. However, we are interested in the interplay of parameters in finite time, and in particular in the behavior of the decentralization cost function in (6). Let us now proceed to the next lemma to derive a variance-type bound on the cost.

Lemma 4: The decentralization cost function (6) associated with the Distributed Detection algorithm with learning rate $\eta$ satisfies

$$\text{Cost}_{i,T} \leq 2\eta^2 \sum_{t=1}^{T} \mathrm{Var}_{\mu_t}[q_{i,t}],$$

so long as $\eta \|q_{i,t}\|_\infty \leq 1/4$ at each round, where $q_{i,t} \triangleq \phi_{i,t} - \phi_t$.

The bound in Lemma 4 is evocative of numerous regret bounds developed for the well-known problem of prediction with expert advice, which corresponds to centralized detection in an adversarial setting [24], [25]. However, such bounds are in terms of the second moment, rather than the variance, which is a smaller quantity (see, e.g., the bound in Lemma 3 of [25] derived in terms of local norms). The following theorem illuminates how the variance bound comes in handy by concentrating the measure around the true distribution.

Theorem 5: Let the sequence of beliefs $\{\mu_{i,t}\}_{t=1}^{T}$ for each agent $i \in [n]$ be generated by the Distributed Detection algorithm with the choice of learning rate $\eta = \frac{1 - \lambda_{\max}(W)}{16 B \log n}$. Given bounded log-marginals (assumption A1), global identifiability of the true state (assumption A2), and strong connectivity of the network (assumption A3), we have

$$\text{Cost}_{i,T} \leq \max\left\{ \frac{8B^2}{\mathcal{I}^2(\theta_1, \theta_2)} \log\left(\frac{mT}{\delta}\right),\; \frac{4B \log n}{\mathcal{I}(\theta_1, \theta_2)} \cdot \frac{\log [mT]}{1 - \lambda_{\max}(W)} \right\} + 1,$$

with probability at least $1 - \delta$.

Proof: We recall that $q_{i,t}$ in the statement of Lemma 4 satisfies

$$\|q_{i,t}\|_\infty = \bigg\|\sum_{\tau=1}^{t}\sum_{j=1}^{n}\Big(\big[W^{t-\tau}\big]_{ij} - \pi(j)\Big)\psi_{j,\tau}\bigg\|_\infty \leq B\sum_{\tau=1}^{t}\sum_{j=1}^{n}\Big|\big[W^{t-\tau}\big]_{ij} - \pi(j)\Big| \leq \frac{4B\log n}{1-\lambda_{\max}(W)},$$

due to Lemma 2 and assumption A1. Therefore, the choice of $\eta = \frac{1-\lambda_{\max}(W)}{16B\log n}$ guarantees that $\eta\|q_{i,t}\|_\infty \leq 1/4$ for all $t \in [T]$. We now explicitly calculate the variance of $q_{i,t}$ under the measure $\mu_t$, and apply Hölder's inequality for primal-dual norm pairs to bound it:

$$\begin{aligned}
\mathrm{Var}_{\mu_t}[q_{i,t}] &= \sum_{k=1}^{m}\mu_t(k)\big(q_{i,t}(k) - \mathbb{E}_{\mu_t}[q_{i,t}]\big)^2 = \sum_{k=1}^{m}\mu_t(k)\big(\langle q_{i,t}, e_k\rangle - \langle q_{i,t}, \mu_t\rangle\big)^2 \\
&\leq \langle q_{i,t}, e_1 - \mu_t\rangle^2 + \sum_{k=2}^{m}\mu_t(k)\langle q_{i,t}, e_k - \mu_t\rangle^2 \\
&\leq \|q_{i,t}\|_\infty^2\,\|e_1 - \mu_t\|_1^2 + \sum_{k=2}^{m}\mu_t(k)\,\|q_{i,t}\|_\infty^2\,\|e_k - \mu_t\|_1^2 \\
&\leq \|q_{i,t}\|_\infty^2\,\|e_1 - \mu_t\|_1^2 + 4\|q_{i,t}\|_\infty^2\sum_{k=2}^{m}\mu_t(k) \\
&= 4\|q_{i,t}\|_\infty^2\,\|e_1 - \mu_t\|_{TV}^2 + 4\|q_{i,t}\|_\infty^2\,\|e_1 - \mu_t\|_{TV},
\end{aligned}$$

where in the last line we used the fact that $\|e_k - \mu_t\|_1 = 2\|e_k - \mu_t\|_{TV} \leq 2$ for any $k \in [m]$. Taking into account the condition $\eta\|q_{i,t}\|_\infty \leq 1/4$, we obtain

$$2\eta^2\,\mathrm{Var}_{\mu_t}[q_{i,t}] \leq \frac{1}{2}\Big(\|e_1 - \mu_t\|_{TV}^2 + \|e_1 - \mu_t\|_{TV}\Big) \leq \|e_1 - \mu_t\|_{TV}. \tag{7}$$

Following exactly the same steps as in the proof of Lemma 3, it can be verified that for any $t \in [T]$ the centralized algorithm yields

$$\frac{1}{\eta}\log\|\mu_t - e_1\|_{TV} \leq -\mathcal{I}(\theta_1,\theta_2)\,t + \sqrt{32B^2 t\log\frac{m}{\delta}} + \frac{\log m}{\eta},$$

with probability at least $1-\delta$. To have this bound hold for every $t \in [T]$ simultaneously with probability at least $1-\delta$, we take a union bound over all $t \in [T]$, which changes the parameter $\delta$ to $\delta/T$ on the right-hand side of the preceding relation. To avoid notational clutter, define $a \triangleq \mathcal{I}(\theta_1,\theta_2)$ and $b \triangleq (32B^2\log[mT/\delta])^{1/2}$. Then, in view of the bound above, with probability at least $1-\delta$ we can bound (7) as

$$2\eta^2\,\mathrm{Var}_{\mu_t}[q_{i,t}] \leq m\exp\big\{-a\eta t + b\eta\sqrt{t}\big\} \leq m\exp\Big\{-\frac{a}{2}\eta t\Big\} \ \text{ for } t \geq t_1 \triangleq \Big(\frac{2b}{a}\Big)^2, \qquad \leq \frac{1}{T} \ \text{ for } t \geq t_2 \triangleq \frac{2}{a\eta}\log[mT].$$

Let $t_0 = \max\{t_1, t_2\}$, and consider the relation above together with the condition $\eta\|q_{i,t}\|_\infty \leq 1/4$ to observe

$$2\sum_{t=1}^{T}\eta^2\,\mathrm{Var}_{\mu_t}[q_{i,t}] = 2\sum_{t=1}^{t_0}\eta^2\,\mathrm{Var}_{\mu_t}[q_{i,t}] + 2\sum_{t=t_0+1}^{T}\eta^2\,\mathrm{Var}_{\mu_t}[q_{i,t}] \leq 2\sum_{t=1}^{t_0}\mathbb{E}_{\mu_t}[\eta^2 q_{i,t}^2] + \sum_{t=t_0+1}^{T}\frac{1}{T} \leq 2\sum_{t=1}^{t_0}\frac{1}{16} + 1 = \frac{t_0}{8} + 1,$$

with probability at least $1-\delta$. Plugging this bound into Lemma 4 completes the proof. ∎
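The learning rate prescribed by Theorem 5 is computable directly from the weight matrix; the following sketch (function names ours, for illustration) extracts $\lambda_{\max}(W)$ and $\eta$:

```python
import numpy as np

def lambda_max(W):
    """lambda_max(W) = max{|lambda_2(W)|, |lambda_n(W)|}: the second largest
    eigenvalue modulus of a stochastic, strongly connected W (the largest is 1)."""
    lam = np.sort(np.abs(np.linalg.eigvals(W)))[::-1]
    return lam[1]

def theorem5_learning_rate(W, B):
    """eta = (1 - lambda_max(W)) / (16 * B * log n), the choice in Theorem 5."""
    n = W.shape[0]
    return (1.0 - lambda_max(W)) / (16.0 * B * np.log(n))
```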
Regarding Theorem 5, the following comments are in order. The rate is related to the inverse of $\mathcal{I}(\theta_1, \theta_2)$, which is a weighted average of the KL-divergence of observations under $\theta_2$ (the second-best alternative) from observations under $\theta_1$ (the true state). Also, from the definition of $\mathcal{I}(\theta_1, \theta_2)$ in Lemma 3, the weights turn out to be agents' centralities. Intuitively, when signals hardly reveal the difference between the two best candidates for the true state, agents must make more effort to distinguish the two. In turn, this results in suffering a larger cost caused by slower learning. The decentralization cost always scales logarithmically with the number of states $m$. Now define

$$\gamma(W) \triangleq 1 - \lambda_{\max}(W), \tag{8}$$

the spectral gap of the network. Then, Theorem 5 suggests that for large networks, the cost scales inversely in the spectral gap, and logarithmically with the network size $n$. Finally, the detection cost with respect to the time horizon is $O(\log T)$, which is sub-linear. Therefore, the average cost (per-iteration cost) asymptotically tends to zero. Moreover, such dependence is quite natural, as even an expert incurs an $O(\log T)$ regret to detect the true state [24].
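As a worked illustration of the quantity $\mathcal{I}(\theta_1, \theta_2)$ driving the bound, the sketch below computes it for a hypothetical binary signal structure; the centralities and probabilities are made up for illustration:

```python
import numpy as np

def kl_bernoulli(p, q):
    """D_KL(Bernoulli(p) || Bernoulli(q))."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def innovation_rate(pi, p1, p2):
    """I(theta_1, theta_2) = sum_i pi(i) D_KL(l_i(.|theta_1) || l_i(.|theta_2))
    for binary signals: agent i sees a 1 w.p. p1[i] under theta_1, p2[i] under theta_2."""
    return float(sum(pi[i] * kl_bernoulli(p1[i], p2[i]) for i in range(len(pi))))

# Three agents; only agent 0 distinguishes theta_1 from theta_2.
pi = np.array([0.5, 0.25, 0.25])
p1 = np.array([0.9, 0.5, 0.5])
p2 = np.array([0.4, 0.5, 0.5])
print(innovation_rate(pi, p1, p2))   # grows when the informative marginal sits on a central agent
```

Consistent with Section IV-A below, placing the informative marginal (agent 0 here) on the most central agent maximizes $\mathcal{I}(\theta_1, \theta_2)$ and hence shrinks the bound of Theorem 5.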
IV. THE IMPACT OF NETWORK TOPOLOGY

The results of the previous section verify that network characteristics govern the learning process. We now discuss the role of agents' centralities and of the network spectral gap.

A. Effect of Agent Centrality

To examine centrality, let us return to the definition of $\mathcal{I}(\theta_1, \theta_2)$ in Lemma 3, and imagine that the network is collaborative in the sense that the network designer wants to expedite learning. Then, to have the best information dispersion, the marginal which collects the most evidence in favor of $\theta_1$ against $\theta_2$ should be allocated to the most central agent. By the same token, in an adversarial network where Nature aims to delay the learning process, such a marginal should be assigned to the least central agent. To sum up, let us put forth the concept of network regularity as defined in [20] in the context of social learning. Recalling the definition of eigenvector centrality (1), we say a network $\mathcal{G}$ is more regular than $\mathcal{G}'$ if $\pi'$ majorizes $\pi$, i.e., if for all $j \in [n]$

$$\sum_{i=1}^{j} \pi_{[i]} \leq \sum_{i=1}^{j} \pi'_{[i]}, \tag{9}$$

where $\pi_{[i]}$ denotes the $i$-th largest element of $\pi$. Letting

$$u \triangleq \big[D_{KL}\big(\ell_1(\cdot|\theta_1)\,\|\,\ell_1(\cdot|\theta_2)\big), \ldots, D_{KL}\big(\ell_n(\cdot|\theta_1)\,\|\,\ell_n(\cdot|\theta_2)\big)\big]^T,$$

it is a straightforward consequence of Lemma 1 proved in [20] that

$$\sum_{i=1}^{n} \pi_{[i]} u_{[i]} \leq \sum_{i=1}^{n} \pi'_{[i]} u_{[i]},$$

when $\pi'$ majorizes $\pi$. Therefore, spreading more informative signals among central agents speeds up the learning procedure.

B. Optimizing the Spectral Gap

We now turn our attention to the spectral gap of the network (8). Suppose that agents are given a default communication matrix $W$ which determines their neighborhood and centrality. The problem is to find the optimal spectral gap assuming that the neighborhood and centrality of each agent are fixed. The key idea is to change the mixing behavior of the Markov chain $W$. It is well known, for instance, that we can do so using lazy random walks [26], which replace $W$ with $\frac{1}{2}(W + I_n)$. To generalize the idea, let us define a modified communication matrix

$$W' \triangleq \alpha W + (1 - \alpha) I_n, \qquad \alpha \in [0, 1], \tag{10}$$

which has the same eigenstructure as $W$. Then, the eigenvalues of $W'$ are weighted averages of the eigenvalues of $W$ with one. From the standpoint of network design, one can exploit the freedom in choosing $\alpha$ to optimize the spectral gap.

Proposition 6: The optimal spectral gap of the modified communication matrix $W'$ in (10) is

$$\gamma^* = \frac{2 - 2\lambda_2(W)}{2 - \lambda_n(W) - \lambda_2(W)} \quad \text{for} \quad \alpha^* = \frac{2}{2 - \lambda_n(W) - \lambda_2(W)},$$

when $\lambda_n(W) + \lambda_2(W) < 0$.

Proof: To optimize the spectral gap, we need to minimize the second largest eigenvalue of $W'$ in magnitude, that is, to solve the min-max problem

$$\min_{\alpha \in [0,1]} \lambda_{\max}(W') = \min_{\alpha \in [0,1]} \max\big\{|\alpha\lambda_2(W) + 1 - \alpha|, \, |\alpha\lambda_n(W) + 1 - \alpha|\big\}. \tag{11}$$

Drawing the plots of $|\alpha\lambda_2(W) + 1 - \alpha|$ and $|\alpha\lambda_n(W) + 1 - \alpha|$ as functions of $\alpha$ verifies that the minimum occurs at the intersection of the lines $\alpha\lambda_2(W) + 1 - \alpha = -\alpha\lambda_n(W) + \alpha - 1$, yielding

$$\alpha^* = \frac{2}{2 - \lambda_n(W) - \lambda_2(W)}.$$

Plugging $\alpha^*$ into the min-max problem (11), we calculate the optimal value $\lambda^*_{\max}$ as

$$\lambda^*_{\max} = \frac{\lambda_2(W) - \lambda_n(W)}{2 - \lambda_n(W) - \lambda_2(W)},$$

and since $\gamma^* = 1 - \lambda^*_{\max}$, the proof follows immediately. ∎
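In code, Proposition 6 amounts to a one-line choice of $\alpha$; the sketch below (ours) assumes a symmetric $W$ so that its eigenvalues are real, although the proposition itself only needs diagonalizability:

```python
import numpy as np

def optimize_lazy_mixing(W):
    """Proposition 6: choose alpha in W' = alpha*W + (1 - alpha)*I_n to maximize
    the spectral gap; assumes W is symmetric so its eigenvalues are real."""
    lam = np.sort(np.linalg.eigvalsh(W))              # ascending real eigenvalues
    lam2, lamn = lam[-2], lam[0]                      # lambda_2(W), lambda_n(W)
    if lam2 + lamn >= 0:                              # laziness cannot help: alpha = 1 is optimal
        return 1.0, 1.0 - max(abs(lam2), abs(lamn))
    alpha = 2.0 / (2.0 - lamn - lam2)
    gamma = (2.0 - 2.0 * lam2) / (2.0 - lamn - lam2)  # optimal spectral gap gamma*
    return alpha, gamma
```

For the lazy random walk $\frac{1}{2}(W + I_n)$ mentioned above, this corresponds to fixing $\alpha = 1/2$ rather than optimizing it.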
C. Sensitivity to Link Failure

It is intuitive that in a network with more links, agents are offered more opportunities for communication. Adding links provides more avenues for spreading information, and improves the learning quality. We study this phenomenon for symmetric networks, where each pair of agents assign equal weights to each other, i.e., $W^T = W$. In particular, we explore the connection between the spectral gap and link failure. In this regard, let us introduce the following positive semi-definite matrix

$$\Delta W(i,j) \triangleq (e_i - e_j)(e_i - e_j)^T, \tag{12}$$

where $e_i$ is the $i$-th unit vector in the standard basis of $\mathbb{R}^n$. Then, for $i, j \in [n]$ the matrix

$$\bar{W}(i,j) \triangleq W + [W]_{ij}\, \Delta W(i,j) \tag{13}$$

corresponds to a new communication matrix that removes the edges $(i,j)$ and $(j,i)$ from the network, and adds $[W]_{ij} = [W]_{ji}$ to the self-reliance of agent $i$ and agent $j$.

Proposition 7: Consider the communication matrix $\bar{W}(i,j)$ in (13). Then, for any $i, j \in [n]$ the following inequality holds:

$$\lambda_{\max}(W) \leq \lambda_{\max}\big(\bar{W}(i,j)\big),$$

so long as $W$ is positive semi-definite.

Proof: We recall that $\Delta W(i,j)$ in (12) is positive semi-definite with $\lambda_n(\Delta W(i,j)) = 0$. Applying Weyl's eigenvalue inequality to (13), we obtain for any $k \in [n]$

$$\lambda_k(W) \leq \lambda_k\big(\bar{W}(i,j)\big),$$

which holds in particular for $k = 2$. On the other hand, the matrix $W$ is positive semi-definite, so we have $\lambda_{\max}(W) = \lambda_2(W)$. Combining this with the fact that $\bar{W}(i,j)$ is symmetric and positive semi-definite completes the proof. ∎

The proposition immediately implies that removing a link reduces the spectral gap. In this case, in view of the bound in Theorem 5, the decentralization cost has more latitude to vary. Therefore, to keep the costs small, agents tend to maintain their connections. Let us take note of the delicate point that a monotone increase in the upper bound does not necessarily imply a monotone increase in the cost; however, one can roughly expect such behavior. We elaborate on this issue in the numerical experiments. Finally, notice that the positive semi-definiteness constraint on $W$ is not restrictive, since it can easily be satisfied by replacing $W$ with the lazy random walk $\frac{1}{2}(W + I_n)$.
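The surgery in (12)–(13) is easy to express directly; the following sketch (function names ours) removes a symmetric link and evaluates the spectral gap (8) before and after:

```python
import numpy as np

def remove_link(W, i, j):
    """Eq. (13): delete the symmetric edge (i, j) and fold its weight [W]_ij
    into the self-reliance of agents i and j; W is assumed symmetric."""
    d = np.zeros(W.shape[0])
    d[i], d[j] = 1.0, -1.0
    return W + W[i, j] * np.outer(d, d)   # W + [W]_ij (e_i - e_j)(e_i - e_j)^T

def spectral_gap(W):
    """gamma(W) = 1 - lambda_max(W), Eq. (8)."""
    lam = np.sort(np.abs(np.linalg.eigvals(W)))[::-1]
    return 1.0 - lam[1]

# Proposition 7 (for positive semi-definite W) predicts
#     spectral_gap(remove_link(W, i, j)) <= spectral_gap(W).
```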
D. Star, Cycle and Grid Networks

We now examine the impact of the spectral gap for some interesting networks (Fig. 1), and derive explicit bounds on the decentralization cost. As one of the famous examples in computer networks, we start with the star network. Regardless of the network size, the existence of one central agent always preserves the network diameter, and therefore we expect a benign scaling with network size. On the other side of the spectrum lies the cycle network, where the diameter grows linearly with the network size. We should, hence, observe how the poor communication in the cycle network affects the learning rate. Finally, as a possible model for sensor networks, we study the grid network, where the network size scales quadratically with the diameter.

Fig. 1: Illustration of networks: star, cycle and grid networks with $n$ agents. For each network, each individual agent possesses a self-reliance of $\omega \in (0,1)$.

Corollary 8: Under the conditions of Theorem 5 and the choice of learning rate $\eta = \frac{\gamma(\cdot)}{16B\log n}$, for $n$ large enough we have the following bounds on the decentralization cost:

(a) For the star network in Fig. 1,

$$\text{Cost}_{i,T} \leq O\left( \frac{\log[nmT]}{\min\{1 - \omega, \, 1 - |2\omega - 1|\}} \right).$$

(b) For the cycle network in Fig. 1,

$$\text{Cost}_{i,T} \leq O\left( \frac{\log[nmT]}{\min\left\{1 - |2\omega - 1|, \, 2(1-\omega)\sin^2\frac{\pi}{n}\right\}} \right).$$

(c) For the grid network in Fig. 1,

$$\text{Cost}_{i,T} \leq O\left( \frac{\log[nmT]}{\min\left\{1 - |2\omega - 1|, \, 2(1-\omega)\sin^2\frac{\pi}{\sqrt{n}}\right\}} \right).$$

Proof: The spectra of the Laplacians of the star and cycle graphs are well known [27]. The eigenvalue sets corresponding to the communication matrices of the star and cycle graphs are

$$\big\{1, \omega, \ldots, \omega, 2\omega - 1\big\} \quad \text{and} \quad \left\{\omega + (1-\omega)\cos\frac{2\pi i}{n}\right\}_{i=0}^{n-1},$$

respectively. Therefore, the proofs of (a) and (b) follow immediately. The grid graph is the Cartesian product of two rings of size $\sqrt{n}$ (due to wraparounds at the edges), and hence its eigenvalues are derived by summing the eigenvalues of two $\sqrt{n}$-rings [27]. Therefore, the eigenvalue set takes the form

$$\left\{\omega + (1-\omega)\cos\frac{\pi(i+j)}{\sqrt{n}}\cos\frac{\pi(i-j)}{\sqrt{n}}\right\}_{i,j=0}^{\sqrt{n}-1},$$

and the proof of (c) is completed. ∎

Let us use the notation $\tilde{O}(\cdot)$ to hide polylogarithmic factors. Then, the bounds derived in Corollary 8 indicate that the algorithm requires $\tilde{O}(1)$ iterations to achieve a near-optimal log-distance from the true state in the star network. However, the rate deteriorates to $\tilde{O}(n^2)$ (respectively, $\tilde{O}(n)$) in the cycle (respectively, grid) network. In all cases, the rate is proportional to the diameter of the network, which is a natural indicator of information dissemination quality.

V. NUMERICAL EXPERIMENT: BINARY SIGNAL DETECTION

We now discuss distributed detection of signals transmitted through noisy channels. We first particularize the model to binary signals, and then present our simulation results in that context.

A. Signal Detection in Communication Channels

In information theory, data transmission can be modeled via a sender, a receiver and a channel. The channel is used to convey information from one end to another. In general, faulty communication is possible, and it might be caused by channel noise or imperfect modulation or demodulation (see, e.g., [28], [29]). In what follows, we exemplify this point, and employ distributed detection to resolve it.

Fig. 2: A communication channel which transmits digital data. Each receiver cannot distinguish the message based on its own signals, so it communicates with the other receiver to identify the message.

Suppose a 2-digit binary number is to be transmitted over a communication channel as depicted in Fig. 2. Sender I and sender II broadcast $T$ copies of the first and second digit, respectively. Receiver I (agent I) can recognize the first digit accurately (to satisfy assumption A1, one can think of accurate transmission as success with probability $1 - \varepsilon$ for some small $\varepsilon > 0$), while the second digit is distorted with probability 1/2. On the other hand, receiver II (agent II) collects the exact value of the second digit at the terminal, and observes a misrepresented first digit with probability 1/2. In this example, the state space is $\Theta = \{\theta_1 = 00, \theta_2 = 01, \theta_3 = 10, \theta_4 = 11\}$, and let the true state be $\theta_1 = 00$. We can see that neither of the receivers can solely establish a reliable communication with the senders, as each of them has difficulty inferring one digit. More formally, it is straightforward to calculate that

$$\ell_1(s_1|00) = \ell_1(s_1|01) \ \ \forall s_1 \in \{0,1\}^2 \quad \text{and} \quad \ell_2(s_2|00) = \ell_2(s_2|10) \ \ \forall s_2 \in \{0,1\}^2,$$

which simply means $\bar{\Theta}_1 = \{\theta_1, \theta_2\}$ and $\bar{\Theta}_2 = \{\theta_1, \theta_3\}$. However, global identifiability of the true state holds, as we have $\bar{\Theta} = \bar{\Theta}_1 \cap \bar{\Theta}_2 = \{\theta_1\}$. Therefore, according to Lemma 3, by exchanging information with each other, the receivers are able to decipher the message transmitted by the senders.
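The two-receiver channel above can be encoded in a few lines to verify the observational-equivalence classes; this sketch (ours) takes the exact-read limit $\varepsilon = 0$ for simplicity, whereas assumption A1 would require a success probability $1 - \varepsilon$:

```python
import numpy as np

# States and signals are 2-digit strings; theta_1 = '00' is the true state.
states = ['00', '01', '10', '11']
signals = ['00', '01', '10', '11']

def lik_receiver(s, theta, exact_pos):
    """l(s | theta) when digit exact_pos is read exactly and the other digit
    arrives as a fair coin flip (the exact-read limit, eps = 0)."""
    exact_ok = 1.0 if s[exact_pos] == theta[exact_pos] else 0.0
    return exact_ok * 0.5                       # the noisy digit is uniform on {0, 1}

l1 = {th: np.array([lik_receiver(s, th, 0) for s in signals]) for th in states}  # receiver I
l2 = {th: np.array([lik_receiver(s, th, 1) for s in signals]) for th in states}  # receiver II

print(np.array_equal(l1['00'], l1['01']))   # True:  Theta_bar_1 = {00, 01}
print(np.array_equal(l2['00'], l2['10']))   # True:  Theta_bar_2 = {00, 10}
print(np.array_equal(l1['00'], l1['10']))   # False: receiver I resolves digit 1
```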
B. Convergence of Beliefs

For the purpose of simulation, we generate a strongly connected network of $n = 50$ agents with a default communication matrix $W$. Assume that there exist $m = 51$ states of the world, and agents are to discover the true state $\theta_1$. At time $t \in [T]$, a signal $s_{i,t} \in \{0,1\}$ is generated based on the true state such that $\ell_i(\cdot|\theta_1) = \ell_i(\cdot|\theta_{i+1})$. In other words, for agent $i \in [n]$ we have $\bar{\Theta}_i = \{\theta_1, \theta_{i+1}\}$, and $\theta_{i+1}$ is observationally equivalent to the true state. Therefore, each agent $i \in [n]$ fails to distinguish $\theta_1$ from $\theta_{i+1}$ when relying only on its private signals. However, since we have $\bar{\Theta} = \cap_{i=1}^{n}\bar{\Theta}_i = \{\theta_1\}$, the true state is globally identifiable. Consequently, in view of Lemma 3, we expect all agents to reach a consensus on the true state (Fig. 3), and learn the truth exponentially fast.

Fig. 3 (belief on the true state vs. iteration): The belief evolution for all 50 agents in the network. The global identifiability of the true state and strong connectivity of the network result in learning.

C. Optimizing the Spectral Gap

We now turn to optimizing the spectral gap to speed up learning. We proved in Proposition 6 that every default communication matrix can be adjusted to a matrix $W'$ which has the optimal spectral gap when centralities are fixed. Setting the parameter $\alpha$ in (10) equal to the $\alpha^*$ derived in Proposition 6, we obtain the optimal network. The dependence of the decentralization cost on the spectral gap was theoretically established in Theorem 5. Applying the result of Proposition 6 verifies that in the optimal network, agents suffer a lower decentralization cost compared to the default network (Fig. 4).

Fig. 4 (decentralization cost vs. time horizon): The decentralization cost for agents 1, 14, 28 and 42 in the network. The cost in the network with the optimal spectral gap (green) is always less than in the network with default weights (blue).

D. Sensitivity to Link Failure

Let us symmetrize the network of the previous section such that $[W]_{ij} = [W]_{ji}$. In this case every agent is equally central, and we have $\pi = \mathbf{1}/n$. To study the impact of link failure, we sequentially select a random pair of agents in the network, and remove their connection. Each time a link is discarded, we compute the decentralization cost in the new network at iteration $T = 300$, and continue the process until 50 bi-directional edges are eliminated from the network. In view of Proposition 7, we expect a monotone decrease in the spectral gap, which amounts to a larger decentralization cost. We plot the cost for four agents in the network, and observe that the behavior is almost (though not quite) monotonic (Fig. 5). The monotone dependence of the upper bound on the spectral gap (Theorem 5) does not necessarily guarantee a monotone relationship between the cost and the spectral gap. Therefore, we can only roughly expect such behavior.

Fig. 5 (decentralization cost vs. number of removed edges): The decentralization cost at round $T = 300$ for agents 8, 19, 22 and 36 in the network. Removing links causes poor communication among agents and increases the decentralization cost.
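An end-to-end simulation in the spirit of the experiments of this section might look as follows; the graph, signal structure and parameters are illustrative stand-ins of ours, scaled down from the paper's $n = 50$, $m = 51$ setup:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, T, eta = 10, 11, 300, 0.5      # scaled down from n = 50, m = 51

# Row-stochastic W on a ring plus random chords (strongly connected via the ring).
A = np.eye(n) + np.roll(np.eye(n), 1, axis=1) + (rng.random((n, n)) < 0.2)
W = A / A.sum(axis=1, keepdims=True)

# Binary signal structure: under theta_k agent i sees a 1 w.p. p[i, k-1]; the
# column copy makes theta_{i+1} observationally equivalent to theta_1 for agent i.
p = 0.25 + 0.5 * rng.random((n, m))
p[np.arange(n), np.arange(n) + 1] = p[:, 0]

Phi = np.zeros((n, m))
for t in range(T):
    s = (rng.random(n) < p[:, 0]).astype(float)              # signals drawn under theta_1
    Psi = np.where(s[:, None] == 1.0, np.log(p), np.log(1 - p))
    Phi = W @ Phi + Psi                                      # update (5)
E = np.exp(eta * Phi - eta * Phi.max(axis=1, keepdims=True))
Mu = E / E.sum(axis=1, keepdims=True)
print(Mu[:, 0].min())   # every agent's belief on theta_1 should be close to 1
```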
VI. CONCLUSION

We considered a distributed detection model where a network of agents aims to learn the underlying state of the world. The private signals do not provide agents with enough information about the true state. Hence, agents engage in local communication to compensate for their imperfect knowledge. Each agent iteratively forms a belief about the state space using the data collected in its neighborhood. We analyzed the learning procedure over a finite time horizon. To study the efficiency of our algorithm versus its centralized counterpart, we put forward the idea of the KL cost. It turned out that network size, spectral gap, centrality of each agent and relative entropy of agents' signal structures are the key parameters that affect distributed detection. We established that allocating more informative signals to central agents, as well as optimizing the spectral gap, can speed up learning. We also proved that the learning rate deteriorates in the case of link failures, which can be seen as a side effect of poor communication.

Finally, we would like to address a few issues in future work. In this paper, we discussed a communication model in which agents exchange information at every round. In some networks, all-time communication is potentially costly or unnecessary. Alternatively, agents could contact each other only when their signals are not informative enough about the true state. As another direction, we can consider scenarios where the signal distributions are not stationary. This generalizes the model to dynamic parameters, where we can investigate detection robustness in changing environments.

APPENDIX: PROOFS

Proof of Lemma 1. The proof is elementary, and is given only to keep the paper self-contained. We write the Lagrangian associated with the update (3) as

$$\mathcal{L}(\mu, \lambda) = -\mu^T \phi_t + \frac{1}{\eta}\left\langle \mu, \log\frac{\mu}{\mu_0} \right\rangle + \lambda \mu^T \mathbf{1} - \lambda,$$

where we leave the positivity constraint implicit. Differentiating with respect to $\mu$ and $\lambda$, and setting the derivatives equal to zero, we get

$$\mu_t(k) = \mu_0(k) \exp\{\eta\,\phi_t(k) - \lambda - 1\} \quad \text{and} \quad \mu_t^T \mathbf{1} = 1,$$

respectively, for any $k \in [m]$. Combining the equations above and noting that $\mu_0$ is uniform, we have

$$\frac{1}{m}\exp\{-\lambda - 1\}\sum_{k=1}^{m}\exp\{\eta\,\phi_t(k)\} = 1,$$

which allows us to solve for $\lambda$ and calculate the optimal solution $\mu_t$ as

$$\mu_t(k) = \frac{\exp\{\eta\,\phi_t(k)\}}{\sum_{k'=1}^{m}\exp\{\eta\,\phi_t(k')\}}.$$

The proof for $\mu_{i,t}$ follows in precisely the same fashion. To calculate $\phi_{i,t}$, notice that in view of the first update in (5) we have

$$\begin{bmatrix}\phi_{1,t}\\ \phi_{2,t}\\ \vdots\\ \phi_{n,t}\end{bmatrix} = (W \otimes I_m)\begin{bmatrix}\phi_{1,t-1}\\ \phi_{2,t-1}\\ \vdots\\ \phi_{n,t-1}\end{bmatrix} + \begin{bmatrix}\psi_{1,t}\\ \psi_{2,t}\\ \vdots\\ \psi_{n,t}\end{bmatrix},$$

where $\otimes$ denotes the Kronecker product. The equation above represents a discrete-time linear system. Given that $\phi_{i,0}(k) = 0$ for all $k \in [m]$ and $i \in [n]$, the closed-form solution of the system takes the form

$$\begin{bmatrix}\phi_{1,t}\\ \phi_{2,t}\\ \vdots\\ \phi_{n,t}\end{bmatrix} = \sum_{\tau=1}^{t}(W \otimes I_m)^{t-\tau}\begin{bmatrix}\psi_{1,\tau}\\ \psi_{2,\tau}\\ \vdots\\ \psi_{n,\tau}\end{bmatrix} = \sum_{\tau=1}^{t}\left(W^{t-\tau}\otimes I_m\right)\begin{bmatrix}\psi_{1,\tau}\\ \psi_{2,\tau}\\ \vdots\\ \psi_{n,\tau}\end{bmatrix}.$$

Therefore, extracting $\phi_{i,t}$ for each $i \in [n]$ from the preceding relation completes the proof. ∎
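The closed-form expression for $\phi_{i,t}$ can be checked numerically against the recursion; a small sketch (ours) with random inputs:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, t = 4, 3, 6
W = rng.random((n, n))
W /= W.sum(axis=1, keepdims=True)
Psi = rng.standard_normal((t, n, m))      # psi_{i,tau} stacked over tau = 1..t

# Recursion (5): Phi_tau = W @ Phi_{tau-1} + Psi_tau, starting from zero.
Phi = np.zeros((n, m))
for tau in range(t):
    Phi = W @ Phi + Psi[tau]

# Closed form of Lemma 1: phi_{i,t} = sum_tau sum_j [W^{t-tau}]_{ij} psi_{j,tau}.
Closed = sum(np.linalg.matrix_power(W, t - 1 - tau) @ Psi[tau] for tau in range(t))

print(np.allclose(Phi, Closed))   # True
```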
Proof of Lemma 2. Since the network is strongly connected and the corresponding $W$ is irreducible and aperiodic, by standard properties of stochastic matrices (see, e.g., [21]), the diagonalizable matrix $W$ satisfies

$$\left\| e_i^T W^t - \pi^T \right\|_1 \leq n\, \lambda_{\max}(W)^t, \tag{14}$$

for any $i \in [n]$, where $\pi$ is the stationary distribution of the Markov chain with transition kernel $W$. Let us observe the inequality

$$n\lambda_{\max}(W)^{t-\tau} \leq 2 \quad \text{for} \quad t - \tau \geq \tilde{t} \triangleq \frac{\log(n/2)}{\log \lambda_{\max}(W)^{-1}},$$

and recall that the bound $\|e_i^T W^{t-\tau} - \pi^T\|_1 \leq 2$ always holds, since any power of $W$ is stochastic. With that in mind, we use (14) to break the sum into two parts:

$$\sum_{\tau=1}^{t}\sum_{j=1}^{n}\left|\left[W^{t-\tau}\right]_{ij} - \pi(j)\right| = \sum_{\tau=1}^{t}\left\|e_i^T W^{t-\tau} - \pi^T\right\|_1 = \sum_{\tau=1}^{t-\tilde{t}}\left\|e_i^T W^{t-\tau} - \pi^T\right\|_1 + \sum_{\tau=t-\tilde{t}+1}^{t}\left\|e_i^T W^{t-\tau} - \pi^T\right\|_1 \leq \sum_{\tau=1}^{t-\tilde{t}} n\lambda_{\max}(W)^{t-\tau} + 2\tilde{t} \leq \frac{n\lambda_{\max}(W)^{\tilde{t}}}{1 - \lambda_{\max}(W)} + 2\tilde{t} = \frac{2}{1-\lambda_{\max}(W)} + \frac{2\log(n/2)}{\log \lambda_{\max}(W)^{-1}},$$

for any $i \in [n]$. Noting that $1 - \lambda_{\max}(W) \leq \log \lambda_{\max}(W)^{-1}$, we have

$$\sum_{\tau=1}^{t}\sum_{j=1}^{n}\left|\left[W^{t-\tau}\right]_{ij} - \pi(j)\right| \leq \frac{2 + 2\log(n/2)}{1 - \lambda_{\max}(W)} \leq \frac{4\log n}{1-\lambda_{\max}(W)}. \qquad ∎$$

We use the following inequality from [30] in the proof of Lemma 3.

Lemma 9 (McDiarmid's inequality): Let $X_1, \ldots, X_N \in \chi$ be independent random variables, and consider the mapping $H : \chi^N \to \mathbb{R}$. If, for $i \in \{1, \ldots, N\}$ and every sample $x_1, \ldots, x_N, x_i' \in \chi$, the function $H$ satisfies

$$|H(x_1, \ldots, x_{i-1}, x_i, x_{i+1}, \ldots, x_N) - H(x_1, \ldots, x_{i-1}, x_i', x_{i+1}, \ldots, x_N)| \leq c_i,$$

then for all $\varepsilon > 0$,

$$\mathbb{P}\Big(H(X_1, \ldots, X_N) - \mathbb{E}[H(X_1, \ldots, X_N)] \geq \varepsilon\Big) \leq \exp\left(\frac{-2\varepsilon^2}{\sum_{i=1}^{N} c_i^2}\right).$$

Proof of Lemma 3. According to Lemma 1, we have

$$\mu_{i,t}(1) = \frac{\exp\{\eta\,\phi_{i,t}(1)\}}{\sum_{k=1}^{m}\exp\{\eta\,\phi_{i,t}(k)\}} = \left(1 + \sum_{k=2}^{m}\exp\{\eta\,\phi_{i,t}(k) - \eta\,\phi_{i,t}(1)\}\right)^{-1} \geq 1 - \sum_{k=2}^{m}\exp\{\eta\,\phi_{i,t}(k) - \eta\,\phi_{i,t}(1)\}, \tag{15}$$

where we used the fact that $(1+x)^{-1} \geq 1 - x$ for any $x \geq 0$. Since we know

$$\|\mu_{i,t} - e_1\|_{TV} = \frac{1}{2}\left(1 - \mu_{i,t}(1) + \sum_{k=2}^{m}\mu_{i,t}(k)\right) = 1 - \mu_{i,t}(1),$$

we can simplify (15) as follows:

$$\|\mu_{i,t} - e_1\|_{TV} \leq \sum_{k=2}^{m}\exp\{\eta\,\phi_{i,t}(k) - \eta\,\phi_{i,t}(1)\}. \tag{16}$$

For any $k \in [m]$, define

$$\Phi_{i,t}(k) \triangleq \sum_{\tau=1}^{t}\sum_{j=1}^{n}\left[W^{t-\tau}\right]_{ij}\log\ell_j(\cdot|\theta_k),$$

and note that $\Phi_{i,t}(k)$ is a function of $nt$ random variables. As required by McDiarmid's inequality in Lemma 9, set $H = \Phi_{i,t}(k)$, fix the samples of $nt - 1$ random variables, and draw two different samples $s_{j,\tau}$ and $s_{j,\tau}'$ for some $j \in [n]$ and some $\tau \in [t]$. The fixed samples simply cancel in the subtraction, and we have

$$|H(\ldots, s_{j,\tau}, \ldots) - H(\ldots, s_{j,\tau}', \ldots)| = \left[W^{t-\tau}\right]_{ij}\left|\log\ell_j(s_{j,\tau}|\theta_k) - \log\ell_j(s_{j,\tau}'|\theta_k)\right| \leq \left[W^{t-\tau}\right]_{ij} 2B,$$

where we used assumption A1. Since any power of $W$ is stochastic, summing over $j \in [n]$ and $\tau \in [t]$, we get

$$\sum_{\tau=1}^{t}\sum_{j=1}^{n}\left(\left[W^{t-\tau}\right]_{ij}2B\right)^2 \leq 4B^2 t.$$
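As an aside, the bounded-differences constants computed above are easy to confirm numerically; a small sketch (ours) with a random stochastic matrix:

```python
import numpy as np

rng = np.random.default_rng(2)
n, t, B, i = 5, 50, 1.0, 0
W = rng.random((n, n))
W /= W.sum(axis=1, keepdims=True)

# Squared bounded-differences constants c_{j,tau}^2 = ([W^{t-tau}]_{ij} * 2B)^2.
total, P = 0.0, np.eye(n)
for _ in range(t):                        # P runs over W^0, ..., W^{t-1}
    total += np.sum((P[i] * 2.0 * B) ** 2)
    P = P @ W
print(total <= 4.0 * B**2 * t)            # True: each row of W^s lies on the simplex
```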
We now apply McDiarmid's inequality in Lemma 9 to obtain

$$\mathbb{P}\left(\phi_{i,t}(k) - \phi_{i,t}(1) > \mathbb{E}[\Phi_{i,t}(k)] - \mathbb{E}[\Phi_{i,t}(1)] + \varepsilon\right) \leq \exp\left(\frac{-\varepsilon^2}{2B^2 t}\right),$$

for $k = 2, \ldots, m$. Setting the probability above to $\delta/m$ and taking a union bound over all states, we have for any $k = 2, \ldots, m$

$$\mathbb{P}\left(\phi_{i,t}(k) - \phi_{i,t}(1) \leq \mathbb{E}[\Phi_{i,t}(k)] - \mathbb{E}[\Phi_{i,t}(1)] + \sqrt{2B^2 t\log\frac{m}{\delta}}\right) \geq 1 - \delta. \tag{17}$$

On the other hand, in view of assumption A1, we have

$$\begin{aligned}
\mathbb{E}[\Phi_{i,t}(k) - \Phi_{i,t}(1)] &= \sum_{\tau=1}^{t}\sum_{j=1}^{n}\left[W^{t-\tau}\right]_{ij}\mathbb{E}\big[\log\ell_j(\cdot|\theta_k) - \log\ell_j(\cdot|\theta_1)\big] \\
&= \sum_{\tau=1}^{t}\sum_{j=1}^{n}\left(\left[W^{t-\tau}\right]_{ij} - \pi(j)\right)\mathbb{E}\big[\log\ell_j(\cdot|\theta_k) - \log\ell_j(\cdot|\theta_1)\big] + \sum_{\tau=1}^{t}\sum_{j=1}^{n}\pi(j)\,\mathbb{E}\big[\log\ell_j(\cdot|\theta_k) - \log\ell_j(\cdot|\theta_1)\big] \\
&\leq 2B\sum_{\tau=1}^{t}\sum_{j=1}^{n}\left|\left[W^{t-\tau}\right]_{ij} - \pi(j)\right| - t\sum_{j=1}^{n}\pi(j)\,D_{KL}\big(\ell_j(\cdot|\theta_1)\,\|\,\ell_j(\cdot|\theta_k)\big) \\
&= 2B\sum_{\tau=1}^{t}\sum_{j=1}^{n}\left|\left[W^{t-\tau}\right]_{ij} - \pi(j)\right| - \mathcal{I}(\theta_1, \theta_k)\, t \\
&\leq \frac{8B\log n}{1 - \lambda_{\max}(W)} - \mathcal{I}(\theta_1, \theta_k)\, t,
\end{aligned}$$

where we applied Lemma 2 in the last step. Using (4), we simplify the above to get

$$\mathbb{E}[\Phi_{i,t}(k) - \Phi_{i,t}(1)] \leq \frac{8B\log n}{1-\lambda_{\max}(W)} - \mathcal{I}(\theta_1, \theta_2)\, t, \tag{18}$$

for any $k = 2, \ldots, m$. Plugging (18) into (17) and combining with (16), we have

$$\|\mu_{i,t} - e_1\|_{TV} \leq \sum_{k=2}^{m}\exp\left\{-\eta\,\mathcal{I}(\theta_1,\theta_2)\,t + \eta\sqrt{2B^2 t\log\frac{m}{\delta}} + \frac{8\eta B\log n}{1-\lambda_{\max}(W)}\right\} \leq m\exp\left\{-\eta\,\mathcal{I}(\theta_1,\theta_2)\,t + \eta\sqrt{2B^2 t\log\frac{m}{\delta}} + \frac{8\eta B\log n}{1-\lambda_{\max}(W)}\right\},$$

with probability at least $1 - \delta$, thereby completing the proof. ∎

Proof of Lemma 4. We recall from the statement of the lemma that $q_{i,t}(k) = \phi_{i,t}(k) - \phi_t(k)$, and calculate the ratio $\mu_{i,t}(k)/\mu_t(k)$ for any $k \in [m]$ as follows:

$$\frac{\mu_{i,t}(k)}{\mu_t(k)} = \exp\{\eta q_{i,t}(k)\}\frac{\mathbb{E}_{\mu_0}[\exp\{\eta\phi_t\}]}{\mathbb{E}_{\mu_0}[\exp\{\eta\phi_{i,t}\}]} = \exp\{\eta q_{i,t}(k)\}\frac{\mathbb{E}_{\mu_0}[\exp\{\eta\phi_t\}]}{\mathbb{E}_{\mu_0}[\exp\{\eta\phi_t\}\exp\{\eta q_{i,t}\}]} = \frac{\exp\{\eta q_{i,t}(k)\}}{\mathbb{E}_{\mu_0}\left[\frac{\exp\{\eta\phi_t\}}{\mathbb{E}_{\mu_0}[\exp\{\eta\phi_t\}]}\exp\{\eta q_{i,t}\}\right]} = \frac{\exp\{\eta q_{i,t}(k)\}}{\mathbb{E}_{\mu_0}\left[\frac{\mu_t}{\mu_0}\exp\{\eta q_{i,t}\}\right]} = \frac{\exp\{\eta q_{i,t}(k)\}}{\mathbb{E}_{\mu_t}[\exp\{\eta q_{i,t}\}]}.$$

This entails

$$\frac{1}{\eta}\mathbb{E}_{\mu_{i,t}}\left[\log\frac{\mu_{i,t}}{\mu_t}\right] = \mathbb{E}_{\mu_{i,t}}[q_{i,t}] - \frac{1}{\eta}\log\mathbb{E}_{\mu_t}[\exp\{\eta q_{i,t}\}] \leq \mathbb{E}_{\mu_{i,t}}[q_{i,t}] - \mathbb{E}_{\mu_t}[q_{i,t}],$$

where we used Jensen's inequality for the convex function $-\log(\cdot)$. Rewriting the right-hand side under the measure $\mu_t$, and recalling the ratio $\mu_{i,t}/\mu_t$ from above, we conclude that

$$\frac{1}{\eta}\mathbb{E}_{\mu_{i,t}}\left[\log\frac{\mu_{i,t}}{\mu_t}\right] \leq \mathbb{E}_{\mu_t}\left[\frac{\mu_{i,t}}{\mu_t}q_{i,t}\right] - \mathbb{E}_{\mu_t}[q_{i,t}] = \mathbb{E}_{\mu_t}\left[\left(\frac{\exp\{\eta q_{i,t}\}}{\mathbb{E}_{\mu_t}[\exp\{\eta q_{i,t}\}]} - 1\right)q_{i,t}\right] = \mathbb{E}_{\mu_t}\left[\left(\frac{\exp\{\eta q_{i,t}\}}{\mathbb{E}_{\mu_t}[\exp\{\eta q_{i,t}\}]} - 1\right)\Big(q_{i,t} - \mathbb{E}_{\mu_t}[q_{i,t}]\Big)\right] \leq \sqrt{\mathbb{E}_{\mu_t}\left[\left(\frac{\exp\{\eta q_{i,t}\}}{\mathbb{E}_{\mu_t}[\exp\{\eta q_{i,t}\}]} - 1\right)^2\right]\mathrm{Var}_{\mu_t}[q_{i,t}]}, \tag{19}$$

where we applied the Cauchy-Schwarz inequality in the last line. Then, we appeal to Jensen's inequality again to get

$$\mathbb{E}_{\mu_t}\left[\left(\frac{\exp\{\eta q_{i,t}\}}{\mathbb{E}_{\mu_t}[\exp\{\eta q_{i,t}\}]} - 1\right)^2\right] = \mathbb{E}_{\mu_t}\left[\left(\frac{\exp\{\eta q_{i,t}\}}{\mathbb{E}_{\mu_t}[\exp\{\eta q_{i,t}\}]}\right)^2\right] - 1 \leq \mathbb{E}_{\mu_t}\left[\left(\frac{\exp\{\eta q_{i,t}\}}{\exp\{\mathbb{E}_{\mu_t}[\eta q_{i,t}]\}}\right)^2\right] - 1 = \mathbb{E}_{\mu_t}\left[\exp\Big\{2\eta\big(q_{i,t} - \mathbb{E}_{\mu_t}[q_{i,t}]\big)\Big\}\right] - 1.$$

Note that the function $g(z) = (\exp\{z\} - 1 - z)/z^2$ is nondecreasing over the reals, and let $z = 2\eta(q_{i,t} - \mathbb{E}_{\mu_t}[q_{i,t}])$ in $g(z)$.
The condition $\eta\|q_{i,t}\|_\infty \leq 1/4$ immediately implies that $z \leq 1$, so recalling that $z$ is the argument of the exponential above, we bound the right-hand side as

$$\mathbb{E}_{\mu_t}\left[\exp\Big\{2\eta\big(q_{i,t} - \mathbb{E}_{\mu_t}[q_{i,t}]\big)\Big\}\right] - 1 \leq 4(\exp(1) - 2)\,\mathrm{Var}_{\mu_t}[\eta q_{i,t}] \leq 4\,\mathrm{Var}_{\mu_t}[\eta q_{i,t}].$$

Plugging the bound above into (19) results in

$$\frac{1}{\eta}\mathbb{E}_{\mu_{i,t}}\left[\log\frac{\mu_{i,t}}{\mu_t}\right] \leq \sqrt{4\,\mathrm{Var}_{\mu_t}[q_{i,t}]\,\mathrm{Var}_{\mu_t}[\eta q_{i,t}]} = 2\eta\,\mathrm{Var}_{\mu_t}[q_{i,t}].$$

Summing over $t \in [T]$ and recalling (6) concludes the proof. ∎

REFERENCES

[1] R. R. Tenney and N. R. Sandell Jr., "Detection with distributed sensors," IEEE Transactions on Aerospace and Electronic Systems, vol. 17, pp. 501–510, 1981.
[2] J. N. Tsitsiklis et al., "Decentralized detection," Advances in Statistical Signal Processing, vol. 2, no. 2, pp. 297–344, 1993.
[3] V. Borkar and P. P. Varaiya, "Asymptotic agreement in distributed estimation," IEEE Transactions on Automatic Control, vol. 27, no. 3, pp. 650–655, 1982.
[4] S. Kar, J. M. Moura, and K. Ramanan, "Distributed parameter estimation in sensor networks: Nonlinear observation models and imperfect communication," IEEE Transactions on Information Theory, vol. 58, no. 6, pp. 3575–3605, 2012.
[5] O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao, "Optimal distributed online prediction using mini-batches," The Journal of Machine Learning Research, vol. 13, no. 1, pp. 165–202, 2012.
[6] A. Nedic and A. Ozdaglar, "Distributed subgradient methods for multi-agent optimization," IEEE Transactions on Automatic Control, vol. 54, no. 1, pp. 48–61, 2009.
[7] A. Nedic, A. Olshevsky, A. Ozdaglar, and J. N. Tsitsiklis, "On distributed averaging algorithms and quantization effects," IEEE Transactions on Automatic Control, vol. 54, no. 11, pp. 2506–2517, 2009.
[8] J.-F. Chamberland and V. V. Veeravalli, "Decentralized detection in sensor networks," IEEE Transactions on Signal Processing, vol. 51, no. 2, pp. 407–416, 2003.
[9] F. Bullo, J. Cortés, and S. Martinez, Distributed Control of Robotic Networks: A Mathematical Approach to Motion Coordination Algorithms. Princeton University Press, 2009.
[10] N. A. Atanasov, J. Le Ny, and G. J. Pappas, "Distributed algorithms for stochastic source seeking with mobile robot networks," Journal of Dynamic Systems, Measurement, and Control, 2014.
[11] S. Shahrampour, S. Rakhlin, and A. Jadbabaie, "Online learning of dynamic parameters in social networks," in Advances in Neural Information Processing Systems, 2013.
[12] J. N. Tsitsiklis, "Problems in decentralized decision making and computation," DTIC Document, Tech. Rep., 1984.
[13] A. Jadbabaie, J. Lin, and A. S. Morse, "Coordination of groups of mobile autonomous agents using nearest neighbor rules," IEEE Transactions on Automatic Control, vol. 48, no. 6, pp. 988–1001, 2003.
[14] R. Olfati-Saber and R. M. Murray, "Consensus problems in networks of agents with switching topology and time-delays," IEEE Transactions on Automatic Control, vol. 49, no. 9, pp. 1520–1533, 2004.
[15] A. Jadbabaie, P. Molavi, A. Sandroni, and A. Tahbaz-Salehi, "Non-Bayesian social learning," Games and Economic Behavior, vol. 76, no. 1, pp. 210–225, 2012.
[16] S. Shahrampour and A. Jadbabaie, "Exponentially fast parameter estimation in networks using distributed dual averaging," in IEEE Conference on Decision and Control (CDC), 2013, pp. 6196–6201.
[17] A. Lalitha, A. Sarwate, and T. Javidi, "Social learning and distributed hypothesis testing," in International Symposium on Information Theory (ISIT), 2014, pp. 551–555.
[18] K. Rahnama Rad and A. Tahbaz-Salehi, "Distributed parameter estimation in networks," in IEEE Conference on Decision and Control (CDC), 2010, pp. 5050–5055.
[19] J. C. Duchi, A. Agarwal, and M. J. Wainwright, "Dual averaging for distributed optimization: convergence analysis and network scaling," IEEE Transactions on Automatic Control, vol. 57, no. 3, pp. 592–606, 2012.
[20] A. Jadbabaie, P. Molavi, and A. Tahbaz-Salehi, "Information heterogeneity and the speed of learning in social networks," Columbia Business School Research Paper, no. 13-28, 2013.
[21] J. S. Rosenthal, "Convergence rates for Markov chains," SIAM Review, vol. 37, no. 3, pp. 387–405, 1995.
[22] J. D. Abernethy, E. Hazan, and A. Rakhlin, "Interior-point methods for full-information and bandit online learning," IEEE Transactions on Information Theory, vol. 58, no. 7, pp. 4164–4175, 2012.
[23] A. Nemirovskii and D. Yudin, Problem Complexity and Method Efficiency in Optimization. Wiley, 1983.
[24] N. Cesa-Bianchi and G. Lugosi, Prediction, Learning, and Games. Cambridge University Press, 2006.
[25] A. Rakhlin and K. Sridharan, "Online learning with predictable sequences," in Conference on Learning Theory, 2013, pp. 993–1019.
[26] D. A. Levin, Y. Peres, and E. L. Wilmer, Markov Chains and Mixing Times. American Mathematical Society, 2009.
[27] F. R. Chung, Spectral Graph Theory. American Mathematical Society, 1997, vol. 92.
[28] T. M. Cover and J. A. Thomas, Elements of Information Theory. John Wiley & Sons, 2012.
[29] H. V. Poor, An Introduction to Signal Detection and Estimation. Springer, 1994.
[30] C. McDiarmid, "Concentration," in Probabilistic Methods for Algorithmic Discrete Mathematics. Springer, 1998, pp. 195–248.
