Decentralized Collaborative Learning of Personalized Models over Networks
Authors: Paul Vanhaesebrouck, Aurélien Bellet, Marc Tommasi
Paul Vanhaesebrouck¹, Aurélien Bellet¹ and Marc Tommasi² — ¹INRIA, ²Université de Lille

Abstract

We consider a set of learning agents in a collaborative peer-to-peer network, where each agent learns a personalized model according to its own learning objective. The question addressed in this paper is: how can agents improve upon their locally trained model by communicating with other agents that have similar objectives? We introduce and analyze two asynchronous gossip algorithms running in a fully decentralized manner. Our first approach, inspired by label propagation, aims to smooth pre-trained local models over the network while accounting for the confidence that each agent has in its initial model. In our second approach, agents jointly learn and propagate their model by making iterative updates based on both their local dataset and the behavior of their neighbors. To optimize this challenging objective, our decentralized algorithm is based on ADMM.

1 Introduction

Increasing amounts of data are being produced by interconnected devices such as mobile phones, connected objects, sensors, etc. For instance, history logs are generated when a smartphone user browses the web, gives product ratings and executes various applications. The currently dominant approach to extracting useful information from such data is to collect all users' personal data on a server (or a tightly coupled system hosted in a data center) and to apply centralized machine learning and data mining techniques.
However, this centralization poses a number of issues, such as the need for users to "surrender" their personal data to the service provider without much control over how the data will be used, while incurring potentially high bandwidth and device battery costs. Even when the learning algorithm can be distributed in a way that keeps data on users' devices, a central entity is often still required for aggregation and coordination (see e.g., McMahan et al., 2016). In this paper, we envision an alternative setting where many users (agents) with local datasets collaborate to learn models by engaging in a fully decentralized peer-to-peer network. Unlike existing work focusing on problems where agents seek to agree on a global consensus model (see e.g., Nedic and Ozdaglar, 2009; Wei and Ozdaglar, 2012; Duchi et al., 2012), we study the case where each agent learns a personalized model according to its own learning objective. We assume that the network graph is given and reflects a notion of similarity between agents (two agents are neighbors in the network if they have a similar learning objective), but each agent is only aware of its direct neighbors. An agent can then learn a model from its (typically scarce) personal data, but also from interactions with its neighborhood. As a motivating example, consider a decentralized recommender system (Boutet et al., 2013, 2014) in which each user rates a small number of movies on a smartphone application and expects personalized recommendations of new movies. In order to train a reliable recommender for each user, one should rely not only on the user's limited data but also on information brought by users with similar tastes/profiles.
The peer-to-peer communication graph could be established when some users go to the same movie theater or attend the same cultural event, and similarity weights between users could be computed based on historical data (e.g., counting how many times people have met in such locations).

Our contributions are as follows. After formalizing the problem of interest, we propose two asynchronous and fully decentralized algorithms for collaborative learning of personalized models. They belong to the family of gossip algorithms (Shah, 2009; Dimakis et al., 2010): agents only communicate with a single neighbor at a time, which makes our algorithms suitable for deployment in large real peer-to-peer networks. Our first approach, called model propagation, is inspired by the graph-based label propagation technique of Zhou et al. (2004). In a first phase, each agent learns a model based on its local data only, without communicating with others. In a second phase, the model parameters are regularized so as to be smooth over the network graph. We introduce confidence values to account for potential discrepancies in the agents' training set sizes, and derive a novel asynchronous gossip algorithm which is simple and efficient. We prove that this algorithm converges to the optimal solution of the problem. Our second approach, called collaborative learning, is more flexible as it interweaves learning and propagation in a single process. Specifically, it optimizes a trade-off between the smoothness of the model parameters over the network on the one hand, and the models' accuracy on the local datasets on the other hand. For this formulation, we propose an asynchronous gossip algorithm based on a decentralized version of the Alternating Direction Method of Multipliers (ADMM) (Boyd et al., 2011).
Finally, we evaluate the performance of our methods on two synthetic collaborative tasks: mean estimation and linear classification. Our experiments show the superiority of the proposed approaches over baseline strategies, and confirm the efficiency of our decentralized algorithms.

The rest of the paper is organized as follows. Section 2 formally describes the problem of interest and discusses some related work. Our model propagation approach is introduced in Section 3, along with our decentralized algorithm. Section 4 describes our collaborative learning approach, and derives an equivalent formulation which is amenable to optimization using decentralized ADMM. Finally, Section 5 presents our numerical results, and we conclude in Section 6.

2 Preliminaries

2.1 Notations and Problem Setting

We consider a set of n agents V = [n], where [n] := {1, ..., n}. Given a convex loss function ℓ : R^p × X × Y → R, the goal of agent i is to learn a model θ_i ∈ R^p whose expected loss E_{(x_i, y_i) ∼ μ_i} ℓ(θ_i; x_i, y_i) is small with respect to an unknown and fixed distribution μ_i over X × Y. Each agent i has access to a set of m_i ≥ 0 i.i.d. training examples S_i = {(x_i^j, y_i^j)}_{j=1}^{m_i} drawn from μ_i. We allow the training set size to vary widely across agents (some may even have no data at all). This is important in practice, as some agents may be more "active" than others, may have recently joined the service, etc. In isolation, an agent i can learn a "solitary" model θ_i^sol by minimizing the loss over its local dataset S_i:

    θ_i^sol ∈ argmin_{θ ∈ R^p} L_i(θ), where L_i(θ) = Σ_{j=1}^{m_i} ℓ(θ; x_i^j, y_i^j).    (1)

The goal for the agents is to improve upon their solitary model by leveraging information from other users in the network.
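As an illustrative sketch (not part of the paper), the solitary model (1) can be obtained by plain gradient descent on the local empirical loss; the function and parameter names below are our own, and we only assume that a per-sample gradient of ℓ is available.

```python
import numpy as np

def solitary_model(samples, grad_loss, lr=0.1, n_steps=500, p=1):
    """Sketch of learning the solitary model (1): gradient descent on the
    local empirical loss L_i(theta) = sum_j l(theta; sample_j).
    `grad_loss(theta, sample)` returns the gradient of one loss term."""
    theta = np.zeros(p)
    for _ in range(n_steps):
        grad = sum(grad_loss(theta, s) for s in samples)
        theta = theta - lr * grad / len(samples)  # averaged gradient step
    return theta
```

For the quadratic loss ℓ(θ; x) = ‖θ − x‖² used later in Section 5.1, the minimizer of (1) is simply the local sample mean.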
Formally, we consider a weighted connected graph G = (V, E) over the set V of agents, where E ⊆ V × V is the set of undirected edges. We denote by W ∈ R^{n×n} the symmetric nonnegative weight matrix associated with G, where W_ij gives the weight of edge (i, j) ∈ E and, by convention, W_ij = 0 if (i, j) ∉ E or i = j. We assume that the weights represent the underlying similarity between the agents' objectives: W_ij should tend to be large (resp. small) when the objectives of agents i and j are similar (resp. dissimilar). While we assume in this paper that the weights are given, in practical scenarios one could for instance use auxiliary information such as users' profiles (when available) and/or prediction disagreement to estimate the weights. For notational convenience, we define the diagonal matrix D ∈ R^{n×n} where D_ii = Σ_{j=1}^n W_ij. We will also denote by N_i = {j ≠ i : W_ij > 0} the set of neighbors of agent i. We assume that the agents only have a local view of the network: they know their neighbors and the associated weights, but not the global topology or how many agents participate in the network. Our goal is to propose decentralized algorithms allowing agents to collaboratively improve upon their solitary models by leveraging information from their neighbors.

2.2 Related Work

Several peer-to-peer algorithms have been developed for decentralized averaging (Kempe et al., 2003; Boyd et al., 2006; Colin et al., 2015) and optimization (Nedic and Ozdaglar, 2009; Ram et al., 2010; Duchi et al., 2012; Wei and Ozdaglar, 2012, 2013; Iutzeler et al., 2013; Colin et al., 2016). These approaches solve a consensus problem of the form

    min_{θ ∈ R^p} Σ_{i=1}^n L_i(θ),    (2)

resulting in a global solution common to all agents (e.g., a classifier minimizing the prediction error over the union of all datasets).
This is unsuitable for our setting, where agents have personalized objectives. Our problem is reminiscent of Multi-Task Learning (MTL) (Caruana, 1997), where one jointly learns models for related tasks. Yet, there are several differences with our setting. In MTL, the number of tasks is often small, training sets are well-balanced across tasks, and all tasks are usually assumed to be positively related (a popular assumption is that all models share a common subspace). Lastly, the algorithms are centralized, aside from the distributed MTL of Wang et al. (2016), which is synchronous and relies on a central server.

3 Model Propagation

In this section, we present our model propagation approach. We first introduce a global optimization problem, and then propose and analyze an asynchronous gossip algorithm to solve it.

3.1 Problem Formulation

In this formulation, we assume that each agent i has learned a solitary model θ_i^sol by minimizing its local loss, as in (1). This can be done without any communication between agents. Our goal here consists in adapting these models by making them smoother over the network graph. In order to account for the fact that the solitary models were learned on training sets of different sizes, we will use c_i ∈ (0, 1] to denote the confidence we put in the model θ_i^sol of user i ∈ {1, ..., n}, and C = diag(c_1, ..., c_n) the associated diagonal matrix. The c_i's should be proportional to the number of training points m_i — one may for instance set c_i = m_i / max_j m_j (plus some small constant in the case where m_i = 0). Denoting Θ = [θ_1; ...; θ_n] ∈ R^{n×p}, the objective function we aim to minimize is as follows:

    Q_MP(Θ) = (1/2) ( Σ_{i<j} W_ij ‖θ_i − θ_j‖² + μ Σ_{i=1}^n D_ii c_i ‖θ_i − θ_i^sol‖² ),    (3)

where μ > 0 is a trade-off parameter and ‖·‖ denotes the Euclidean norm.
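For concreteness, the smoothness-plus-confidence objective (3) can be evaluated directly; this is a small sketch with our own function and argument names.

```python
import numpy as np

def q_mp(theta, theta_sol, W, c, mu):
    """Evaluate the model propagation objective (3).
    theta, theta_sol: (n, p) stacked models; W: (n, n) symmetric weights;
    c: length-n confidence values; mu: trade-off parameter."""
    n = W.shape[0]
    D = W.sum(axis=1)  # degrees D_ii
    # smoothness term: sum over unordered pairs i < j
    smooth = sum(W[i, j] * np.sum((theta[i] - theta[j]) ** 2)
                 for i in range(n) for j in range(i + 1, n))
    # confidence-weighted anchoring to the solitary models
    anchor = sum(D[i] * c[i] * np.sum((theta[i] - theta_sol[i]) ** 2)
                 for i in range(n))
    return 0.5 * (smooth + mu * anchor)
```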
The first term on the right-hand side of (3) is a classic quadratic form used to smooth the models within neighborhoods: the distance between the new models of agents i and j is encouraged to be small when the weight W_ij is large. The second term prevents models with large confidence from diverging too much from their original values, so that they can propagate useful information to their neighborhood. On the other hand, models with low confidence are allowed large deviations: in the extreme case where agent i has very little or even no data (i.e., c_i is negligible), its model is fully determined by the neighboring models. The presence of D_ii in the second term is simply for normalization. We have the following result (the proof is in Appendix A).

Proposition 1 (Closed-form solution). Let P = D^{-1}W be the stochastic similarity matrix associated with the graph G and Θ^sol = [θ_1^sol; ...; θ_n^sol] ∈ R^{n×p}. The solution Θ* = argmin_{Θ ∈ R^{n×p}} Q_MP(Θ) is given by

    Θ* = ᾱ (I − ᾱ(I − C) − αP)^{-1} C Θ^sol,    (4)

with α ∈ (0, 1) such that μ = (1 − α)/α, and ᾱ = 1 − α.

Our formulation is a generalization of the semi-supervised label propagation technique of Zhou et al. (2004), which can be recovered by setting C = I (same confidence for all nodes). Note that it is strictly more general: we can see from (4) that unless the confidence values are equal for all agents, the confidence information cannot be incorporated by using different solitary models Θ^sol or by considering a different graph (because ᾱ(I − C) + αP is not stochastic). The asynchronous gossip algorithm we present below thus applies to label propagation, for which, to the best of our knowledge, no such algorithm was previously known.
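As a centralized reference implementation (a sketch with our own naming, not the paper's code), the closed form (4) amounts to one linear solve:

```python
import numpy as np

def model_propagation_closed_form(W, C, theta_sol, alpha):
    """Closed-form solution (4) of model propagation.
    W: (n, n) symmetric nonnegative weights, zero diagonal.
    C: (n, n) diagonal confidence matrix with entries in (0, 1].
    theta_sol: (n, p) stacked solitary models.
    alpha in (0, 1), tied to the trade-off by mu = (1 - alpha) / alpha."""
    n = W.shape[0]
    D = np.diag(W.sum(axis=1))          # degree matrix
    P = np.linalg.inv(D) @ W            # stochastic similarity matrix P = D^{-1} W
    a_bar = 1.0 - alpha
    M = np.eye(n) - a_bar * (np.eye(n) - C) - alpha * P
    return a_bar * np.linalg.solve(M, C @ theta_sol)
```

Note that M = αI + ᾱC − αP is strictly diagonally dominant (P has zero diagonal and unit row sums), so the solve is well posed.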
Computing the closed-form solution (4) requires knowledge of the global network and of all solitary models, which are unknown to the agents. Our starting point for the derivation of an asynchronous gossip algorithm is the following iterative form: for any t ≥ 0,

    Θ(t + 1) = (αI + ᾱC)^{-1} (αP Θ(t) + ᾱC Θ^sol).    (5)

The sequence (Θ(t))_{t ∈ N} can be shown to converge to (4) regardless of the choice of initial value Θ(0); see Appendix B for details. An interesting observation about this recursion is that it can be decomposed into agent-centric updates which only involve neighborhoods. Indeed, for any agent i and any t ≥ 0:

    θ_i(t + 1) = (1 / (α + ᾱc_i)) ( α Σ_{j ∈ N_i} (W_ij / D_ii) θ_j(t) + ᾱc_i θ_i^sol ).

The iteration (5) can thus be understood as a decentralized but synchronous process where, at each step, every agent communicates with all its neighbors to collect their current model parameters and uses this information to update its model. Even assuming that the agents do have access to a global clock to synchronize the updates (which is unrealistic in many practical scenarios), synchronization incurs large delays since all agents must finish the update at step t before anyone starts step t + 1. The fact that agents must contact all their neighbors at each iteration further hinders the efficiency of the algorithm. To avoid these limitations, we propose below an asynchronous gossip algorithm.

3.2 Asynchronous Gossip Algorithm

In the asynchronous setting, each agent has a local clock ticking at the times of a rate-1 Poisson process, and wakes up when it ticks. As local clocks are i.i.d., this is equivalent to activating a single agent uniformly at random at each time step (Boyd et al., 2006).¹ The idea behind our algorithm is the following. At any time t ≥ 0, each agent i maintains a (possibly outdated) knowledge of its neighbors' models.
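As a point of comparison for the asynchronous algorithm developed in this section, one synchronous pass of the agent-centric update derived from (5) can be sketched as follows (a centralized simulation; function and variable names are ours):

```python
import numpy as np

def synchronous_step(theta, theta_sol, W, c, alpha):
    """One synchronous pass of the per-agent update derived from (5):
    each agent averages its neighbors' current models and mixes in its
    solitary model, weighted by its confidence."""
    D = W.sum(axis=1)
    a_bar = 1.0 - alpha
    new = np.empty_like(theta)
    for i in range(theta.shape[0]):
        neigh = alpha * (W[i] / D[i]) @ theta  # alpha * sum_j (W_ij / D_ii) theta_j
        new[i] = (neigh + a_bar * c[i] * theta_sol[i]) / (alpha + a_bar * c[i])
    return new
```

Iterating this map from any starting point converges to the minimizer of (3), since the linear part is a contraction (its row sums equal α/(α + ᾱc_i) < 1).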
For mathematical convenience, we will consider a matrix Θ̃_i(t) ∈ R^{n×p} whose i-th row Θ̃_i^i(t) ∈ R^p is agent i's model at time t and whose j-th row Θ̃_i^j(t) ∈ R^p, for j ≠ i, is agent i's last knowledge of the model of agent j. For any j ∉ N_i ∪ {i} and any t ≥ 0, we maintain Θ̃_i^j(t) = 0. Let Θ̃ = [Θ̃_1^⊤, ..., Θ̃_n^⊤]^⊤ ∈ R^{n²×p} be the stacking of all the Θ̃_i's. If agent i wakes up at time step t, two consecutive actions are performed:

- communication step: agent i selects a random neighbor j ∈ N_i with probability π_i^j, and both agents update their knowledge of each other's model:

    Θ̃_i^j(t + 1) = Θ̃_j^j(t)  and  Θ̃_j^i(t + 1) = Θ̃_i^i(t);

- update step: agents i and j update their own models based on their current knowledge. For l ∈ {i, j}:

    Θ̃_l^l(t + 1) = (α + ᾱc_l)^{-1} ( α Σ_{k ∈ N_l} (W_lk / D_ll) Θ̃_l^k(t + 1) + ᾱc_l θ_l^sol ).    (6)

All other variables in the network remain unchanged. In the communication step above, π_i^j corresponds to the probability that agent i selects agent j. For any i ∈ [n], we have π_i ∈ [0, 1]^n such that Σ_{j=1}^n π_i^j = 1 and π_i^j > 0 if and only if j ∈ N_i. Our algorithm belongs to the family of gossip algorithms, as each agent communicates with at most one neighbor at a time. Gossip algorithms are known to be very effective for decentralized computation in peer-to-peer networks (see Dimakis et al., 2010; Shah, 2009). Thanks to its asynchronous updates, our algorithm has the potential to be much faster than a synchronous version when executed in a large peer-to-peer network. The main result of this section shows that our algorithm converges to a state where all agents have their optimal model (and those of their neighbors).

Theorem 1 (Convergence).
Let Θ̃(0) ∈ R^{n²×p} be some arbitrary initial value and (Θ̃(t))_{t ∈ N} be the sequence generated by our algorithm. Let Θ* = argmin_{Θ ∈ R^{n×p}} Q_MP(Θ) be the optimal solution to model propagation. For any i ∈ [n], we have:

    lim_{t → ∞} E[Θ̃_i^j(t)] = Θ*_j  for j ∈ N_i ∪ {i}.

¹ Our analysis straightforwardly extends to the case where agents have clocks ticking at different rates.

Sketch of proof. The first step of the proof is to rewrite the algorithm as an equivalent random iterative process over Θ̃ ∈ R^{n²×p} of the form Θ̃(t + 1) = A(t)Θ̃(t) + b(t), for any t ≥ 0. Then, we show that the spectral radius of E[A(t)] is smaller than 1, which allows us to exhibit convergence to the desired quantity. The proof can be found in Appendix C.

4 Collaborative Learning

In the approach presented in the previous section, models are learned locally by each agent and then propagated through the graph. In this section, we allow the agents to simultaneously learn their model and propagate it through the network. In other words, agents iteratively update their models based on both their local dataset and the behavior of their neighbors. While in general this is computationally more costly than merely propagating pre-trained models, we can expect significant improvements in terms of accuracy. As in the case of model propagation, we first introduce the global objective function and then propose an asynchronous gossip algorithm, which is based on the general paradigm of ADMM (Boyd et al., 2011).

4.1 Problem Formulation

In contrast to model propagation, the objective function to minimize here takes into account the loss of each personal model on the local dataset, rather than simply the distance to the solitary model:

    Q_CL(Θ) = Σ_{i<j} W_ij ‖θ_i − θ_j‖² + μ Σ_{i=1}^n D_ii L_i(θ_i),    (7)

where μ > 0 is a trade-off parameter. The associated optimization problem is Θ* = argmin_{Θ ∈ R^{n×p}} Q_CL(Θ).
The first term on the right-hand side of (7) is the same as in the model propagation objective (3) and tends to favor models that are smooth on the graph. However, while in model propagation enforcing smoothness on the models may translate into a significant decrease of accuracy on the local datasets (even for relatively small changes in parameter values with respect to the solitary models), here the second term prevents this. It allows more flexibility in settings where very different parameter values define models which actually give very similar predictions. Note that the confidence is built into the second term, as L_i is a sum over the local dataset of agent i. In general, there is no closed-form expression for Θ*, but we can solve the problem with a decentralized iterative algorithm, as shown in the rest of this section.

4.2 Asynchronous Gossip Algorithm

We propose an asynchronous decentralized algorithm for minimizing (7) based on the Alternating Direction Method of Multipliers (ADMM). This general method is a popular way to solve consensus problems of the form (2) in the distributed and decentralized settings (see e.g., Boyd et al., 2011; Wei and Ozdaglar, 2012, 2013; Iutzeler et al., 2013). In our setting, we do not seek a consensus in the classic sense of (2), since our goal is to learn a personalized model for each agent. However, we show below that we can reformulate (7) as an equivalent partial consensus problem which is amenable to decentralized optimization with ADMM.

Problem reformulation. Let Θ_i be the set of |N_i| + 1 variables θ_j ∈ R^p for j ∈ N_i ∪ {i}, and denote θ_j by Θ_i^j. This is similar to the notation used in Section 3, except that here we consider Θ_i as living in R^{(|N_i|+1)×p}.
We now define

    Q_CL^i(Θ_i) = (1/2) Σ_{j ∈ N_i} W_ij ‖θ_i − θ_j‖² + μ D_ii L_i(θ_i),

so that we can rewrite our problem (7) as min_{Θ ∈ R^{n×p}} Σ_{i=1}^n Q_CL^i(Θ_i). In this formulation, the objective functions associated with the agents are dependent, as they share some decision variables in Θ. In order to apply decentralized ADMM, we need to decouple the objectives. The idea is to introduce a local copy Θ̃_i ∈ R^{(|N_i|+1)×p} of the decision variables Θ_i for each agent i, and to impose the equality constraints Θ̃_i^i = Θ̃_j^i for all i ∈ [n], j ∈ N_i. This partial consensus can be seen as requiring that two neighboring agents agree on each other's personalized model. We further introduce 4 secondary variables Z^i_{e,i}, Z^j_{e,i}, Z^i_{e,j} and Z^j_{e,j} for each edge e = (i, j), which can be viewed as estimates of the models Θ̃_i and Θ̃_j known by each end of e, and will allow an efficient decomposition of the ADMM updates. Formally, denoting Θ̃ = [Θ̃_1^⊤, ..., Θ̃_n^⊤]^⊤ ∈ R^{(2|E|+n)×p} and Z ∈ R^{4|E|×p}, we introduce the formulation

    min_{Θ̃ ∈ R^{(2|E|+n)×p}, Z ∈ C_E}  Σ_{i=1}^n Q_CL^i(Θ̃_i)
    s.t. for all e = (i, j) ∈ E:  Z^i_{e,i} = Θ̃_i^i,  Z^j_{e,i} = Θ̃_i^j,  Z^j_{e,j} = Θ̃_j^j,  Z^i_{e,j} = Θ̃_j^i,    (8)

where C_E = {Z ∈ R^{4|E|×p} : Z^i_{e,i} = Z^i_{e,j} and Z^j_{e,j} = Z^j_{e,i} for all e = (i, j) ∈ E}. It is easy to see that Problem (8) is equivalent to the original problem (7) in the following sense: the minimizer Θ̃* of (8) satisfies (Θ̃*)_i^j = Θ*_j for all i ∈ [n] and j ∈ N_i ∪ {i}. Further observe that the set of constraints involving Θ̃ can be written DΘ̃ + HZ = 0, where H = −I of dimension 4|E| × 4|E| is diagonal invertible and D of dimension 4|E| × (2|E| + n) contains exactly one entry equal to 1 in each row. The assumptions of Wei and Ozdaglar (2013) are thus met, and we can apply asynchronous decentralized ADMM.
Before presenting the algorithm, we derive the augmented Lagrangian associated with Problem (8). Let Λ^j_{e,i} be the dual variables associated with the constraints involving Θ̃ in (8). For convenience, we denote by Z_i ∈ R^{2|N_i|×p} the set of secondary variables {Z^i_{e,i}} ∪ {Z^j_{e,i}} for e = (i, j) ∈ E associated with agent i. Similarly, we denote by Λ_i ∈ R^{2|N_i|×p} the set of dual variables {Λ^i_{e,i}} ∪ {Λ^j_{e,i}} for e = (i, j) ∈ E. The augmented Lagrangian is given by

    L_ρ(Θ̃, Z, Λ) = Σ_{i=1}^n L_ρ^i(Θ̃_i, Z_i, Λ_i),

where ρ > 0 is a penalty parameter, Z ∈ C_E and

    L_ρ^i(Θ̃_i, Z_i, Λ_i) = Q_CL^i(Θ̃_i) + Σ_{j : e=(i,j) ∈ E} [ Λ^i_{e,i} · (Θ̃_i^i − Z^i_{e,i}) + Λ^j_{e,i} · (Θ̃_i^j − Z^j_{e,i}) + (ρ/2) ( ‖Θ̃_i^i − Z^i_{e,i}‖² + ‖Θ̃_i^j − Z^j_{e,i}‖² ) ].

Algorithm. ADMM consists in approximately minimizing the augmented Lagrangian L_ρ(Θ̃, Z, Λ) by alternating minimization with respect to the primal variable Θ̃ and the secondary variable Z, together with an iterative update of the dual variable Λ. We first briefly discuss how to instantiate the initial values Θ̃(0), Z(0) and Λ(0). The only constraint on these initial values is to have Z(0) ∈ C_E, so a simple option is to initialize all variables to 0. That said, it is typically advantageous to use a warm-start strategy. For instance, each agent i can send its solitary model θ_i^sol to its neighbors, and then set Θ̃_i^i = θ_i^sol, Θ̃_i^j = θ_j^sol for all j ∈ N_i, Z^i_{e,i} = Θ̃_i^i and Z^j_{e,i} = Θ̃_i^j for all e = (i, j) ∈ E, and Λ(0) = 0. Alternatively, one can initialize the algorithm with the model propagation solution obtained using the method of Section 3. Recall from Section 3.2 that in the asynchronous setting, a single agent wakes up at each time step and selects one of its neighbors. Assume that agent i wakes up at some iteration t ≥ 0 and selects j ∈ N_i.
Denoting e = (i, j), the iteration goes as follows:

1. Agent i updates its primal variables:

    Θ̃_i(t + 1) = argmin_{Θ ∈ R^{(|N_i|+1)×p}} L_ρ^i(Θ, Z_i(t), Λ_i(t)),

and sends Θ̃_i^i(t + 1), Θ̃_i^j(t + 1), Λ^i_{e,i}(t), Λ^j_{e,i}(t) to agent j. Agent j executes the same steps with respect to i.

2. Using Θ̃_j^j(t + 1), Θ̃_j^i(t + 1), Λ^j_{e,j}(t), Λ^i_{e,j}(t) received from j, agent i updates its secondary variables:

    Z^i_{e,i}(t + 1) = (1/2) [ (1/ρ) (Λ^i_{e,i}(t) + Λ^i_{e,j}(t)) + Θ̃_i^i(t + 1) + Θ̃_j^i(t + 1) ],
    Z^j_{e,i}(t + 1) = (1/2) [ (1/ρ) (Λ^j_{e,j}(t) + Λ^j_{e,i}(t)) + Θ̃_j^j(t + 1) + Θ̃_i^j(t + 1) ].

Agent j updates its secondary variables symmetrically, so by construction we have Z(t + 1) ∈ C_E.

3. Agent i updates its dual variables:

    Λ^i_{e,i}(t + 1) = Λ^i_{e,i}(t) + ρ ( Θ̃_i^i(t + 1) − Z^i_{e,i}(t + 1) ),
    Λ^j_{e,i}(t + 1) = Λ^j_{e,i}(t) + ρ ( Θ̃_i^j(t + 1) − Z^j_{e,i}(t + 1) ).

Agent j updates its dual variables symmetrically. All other variables in the network remain unchanged.

Step 1 has a simple solution for some loss functions commonly used in machine learning (such as the quadratic and L1 losses), and when this is not the case, ADMM is typically robust to approximate solutions of the corresponding subproblems (obtained for instance after a few steps of gradient descent); see Boyd et al. (2011) for examples and further practical considerations. Asynchronous ADMM converges almost surely to an optimal solution at a rate of O(1/t) for convex objective functions (see Wei and Ozdaglar, 2013).

5 Experiments

In this section, we provide numerical experiments to evaluate the performance of our decentralized algorithms with respect to accuracy, convergence rate and the amount of communication. To this end, we introduce two synthetic collaborative tasks: mean estimation and linear classification.
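To make the ADMM updates of Section 4.2 concrete on the first of these tasks, here is a self-contained sketch for a toy instance with two agents estimating means under the quadratic loss (a single edge of weight w); all names are ours, and the closed-form primal step below is specific to the quadratic loss.

```python
import numpy as np

def collaborative_mean_admm(x1, x2, w=1.0, mu=0.5, rho=1.0, n_iter=2000):
    """Two-agent ADMM sketch for objective (7) with quadratic loss
    L_i(theta) = sum_k (theta - x_k)^2 (mean estimation, one edge e = (1, 2))."""
    m = [len(x1), len(x2)]                       # local dataset sizes
    s = [float(np.sum(x1)), float(np.sum(x2))]   # local sample sums
    theta = np.zeros(2)     # theta[i]: agent i's own model
    copy = np.zeros(2)      # copy[i]: agent i's copy of the other agent's model
    lam_own = np.zeros(2)   # dual for the constraint z[i] = theta[i]
    lam_copy = np.zeros(2)  # dual for the constraint z[1-i] = copy[i]
    z = np.zeros(2)         # z[k]: shared secondary variable estimating model k

    for _ in range(n_iter):
        # step 1: primal update, in closed form since the loss is quadratic
        for i in range(2):
            j = 1 - i
            A = np.array([[w + 2 * mu * w * m[i] + rho, -w],
                          [-w, w + rho]])
            b = np.array([2 * mu * w * s[i] - lam_own[i] + rho * z[i],
                          -lam_copy[i] + rho * z[j]])
            theta[i], copy[i] = np.linalg.solve(A, b)
        # step 2: secondary variables; both ends of the edge agree by construction
        for k in range(2):
            z[k] = 0.5 * ((lam_own[k] + lam_copy[1 - k]) / rho
                          + theta[k] + copy[1 - k])
        # step 3: dual ascent
        for i in range(2):
            lam_own[i] += rho * (theta[i] - z[i])
            lam_copy[i] += rho * (copy[i] - z[1 - i])
    return theta
```

At a fixed point, the dual updates force theta[i] = z[i] and copy[i] = theta[1−i], and the stationarity conditions reduce to the optimality conditions of (7) for this two-agent instance.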
5.1 Collaborative Mean Estimation

We first introduce a simple task in which the goal of each agent is to estimate the mean of a one-dimensional distribution. To this end, we adapt the two intertwining moons dataset popular in semi-supervised learning (Zhou et al., 2004). We consider a set of 300 agents, together with auxiliary information about each agent i in the form of a vector v_i ∈ R². The true distribution μ_i of an agent i is either N(1, 40) or N(−1, 40), depending on whether v_i belongs to the upper or lower moon; see Figure 1(a). Each agent i receives m_i samples x_i^1, ..., x_i^{m_i} ∈ R from its distribution μ_i. Its solitary model is then given by θ_i^sol = (1/m_i) Σ_{j=1}^{m_i} x_i^j, which corresponds to the use of the quadratic loss function ℓ(θ; x_i) = ‖θ − x_i‖². Finally, the graph over agents is the complete graph, where the weight between agents i and j is given by a Gaussian kernel on the agents' auxiliary information, W_ij = exp(−‖v_i − v_j‖² / 2σ²), with σ = 0.1 for appropriate scaling. In all experiments, the parameter α of model propagation was set to 0.99, which gave the best results on a held-out set of random problem instances. We first use this mean estimation task to illustrate the importance of considering confidence values in our model propagation formulation, and then to evaluate the efficiency of our asynchronous decentralized algorithm.

Relevance of confidence values. Our goal here is to show that introducing confidence values into the model propagation approach can significantly improve the overall accuracy, especially when the agents receive unbalanced amounts of data. In this experiment, we only compare model propagation with and without confidence values, so we compute the optimal solutions directly using the closed-form solution (4).
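The similarity graph of this setup (Gaussian kernel on the agents' auxiliary vectors) can be generated as follows; this is a small sketch with our own naming.

```python
import numpy as np

def gaussian_kernel_weights(V, sigma=0.1):
    """Complete similarity graph from auxiliary vectors v_i:
    W_ij = exp(-||v_i - v_j||^2 / (2 sigma^2)), with a zero diagonal."""
    sq_dists = np.sum((V[:, None, :] - V[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq_dists / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)  # no self-edges, by the convention of Section 2.1
    return W
```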
We generate several problem instances with varying standard deviation for the confidence values c_i. More precisely, we sample c_i for each agent i from a uniform distribution centered at 1/2 with width ε ∈ [0, 1]. The number of samples m_i given to agent i is then set to m_i = ⌈100 c_i⌉. The larger ε, the more variance in the sizes of the local datasets. Figures 1(b)-1(d) give a visualization of the models before and after propagation on a problem instance for the hardest setting ε = 1. Figure 2 (left-middle) shows results averaged over 1,000 random problem instances for several values of ε. As expected, when the local dataset sizes are well-balanced (small ε), model propagation performs the same with or without the use of confidence values. Indeed, both have similar L2 error with respect to the target mean, and the win ratio is about 0.5. However, the performance gap in favor of using confidence values increases sharply with ε. For ε = 1, the win ratio in favor of using confidence values is about 0.85. Strikingly, the error of model propagation with confidence values remains constant as ε increases. These results empirically confirm the relevance of introducing confidence values into the objective function.

Asynchronous algorithm. In this second experiment, we compare asynchronous model propagation with the synchronous variant given by (5). We are interested in the average L2 error of the models as a function of the number of pairwise communications (number of exchanges from one agent to another). Note that a single iteration of the synchronous (resp. asynchronous) algorithm corresponds to 2|E| (resp. 2) communications.
[Figure 1: Illustration of the collaborative mean estimation task, where each point represents an agent and its 2D coordinates the associated auxiliary information. (a) Ground-truth models (blue for mean 1 and red for mean −1). (b) Solitary models (local averages) for an instance where ε = 1. (c)-(d) Models after propagation, without/with the use of confidence values.]

[Figure 2: Results on the mean estimation task. (Left-middle) Model propagation with and without confidence values w.r.t. the unbalancedness of the local datasets: the left panel shows the L2 errors, while the middle one shows the percentage of wins in favor of using confidence values. (Right) L2 error of the synchronous and asynchronous model propagation algorithms with respect to the number of pairwise communications.]

For the asynchronous algorithm, we set the neighbor selection distribution π_i of each agent i ∈ [n] to be uniform over its set of neighbors N_i. Figure 2 (right) shows the results on a problem instance generated as in the previous experiment (with ε = 1). Since the asynchronous algorithm is randomized, we average its results over 100 random runs. We see that our asynchronous algorithm achieves an accuracy/communication trade-off which is almost as good as that of the synchronous one, without requiring any synchronization.
It is thus expected to be much faster than the synchronous algorithm on large decentralized networks with communication delays and/or without efficient global synchronization.

5.2 Collaborative Linear Classification

In the previous mean estimation task, the squared distance between two model parameters (i.e., estimated means) translates into the same difference in L2 error with respect to the target mean. Therefore, our collaborative learning formulation is essentially equivalent to our model propagation approach. To show the benefits that can be brought by collaborative learning, we now consider a linear classification task. Since two linear separators with significantly different parameters can lead to similar predictions on a given dataset, incorporating the local errors into the objective function rather than simply the distances between parameters should lead to more accurate models.

We consider a set of 100 agents whose goal is to perform linear classification in R^p. For ease of visualization, the target (true) model of each agent lies in a 2-dimensional subspace: we represent it as a vector in R^p with the first two entries drawn from a normal distribution centered at the origin and the remaining ones equal to 0. We consider the similarity graph where the weight between two agents i and j is a Gaussian kernel on the distance between target models, where the distance here refers to the length of the chord of the angle φ_ij between target models projected on a unit circle. More formally, W_ij = exp((cos(φ_ij) − 1)/σ) with σ = 0.1 for appropriate scaling. Edges with negligible weights are ignored to speed up computation. We refer the reader to Appendix E for a 2D visualization of the target models and the links between them.
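The weight computation above can be sketched as follows; the kernel width σ = 0.1 matches the value in the text, while the cutoff below which an edge is considered negligible is our assumption:

```python
import numpy as np

def similarity_weights(targets, sigma=0.1, threshold=1e-3):
    # Gaussian-kernel weights W_ij = exp((cos(phi_ij) - 1) / sigma),
    # where phi_ij is the angle between target models i and j.
    # Edges with weight below `threshold` are dropped (the paper ignores
    # negligible weights; the cutoff value itself is an assumption).
    norms = np.linalg.norm(targets, axis=1, keepdims=True)
    unit = targets / norms                     # rows on the unit sphere
    cos_phi = np.clip(unit @ unit.T, -1.0, 1.0)  # dot product = cos(phi_ij)
    W = np.exp((cos_phi - 1.0) / sigma)
    W[W < threshold] = 0.0
    np.fill_diagonal(W, 0.0)                   # no self-loops
    return W

rng = np.random.default_rng(0)
targets = np.zeros((100, 50))
targets[:, :2] = rng.normal(size=(100, 2))     # 2D subspace as described
W = similarity_weights(targets)
```

Note that exp((cos φ − 1)/σ) is indeed a Gaussian kernel on the chord length: the squared chord between unit vectors is 2(1 − cos φ).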
Every agent receives a random number of training points drawn uniformly between 1 and 20. Each training point (in R^p) is drawn uniformly around the origin, and the binary label is given by the prediction of the target linear separator. We then add some label noise by randomly flipping each label with probability 0.05. The loss function used by the agents is the hinge loss, given by ℓ(θ; (x_i, y_i)) = max(0, 1 − y_i θ^⊤ x_i). As in the previous experiment, for each algorithm we tune the value of α on a held-out set of random problem instances. Finally, we evaluate the quality of the learned model of each agent by computing the accuracy on a separate sample of 100 test points drawn from the same distribution as the training set.

Figure 3: Results on the linear classification task. (Left) Test accuracy of model propagation and collaborative learning with varying feature space dimension. (Middle) Average test accuracy of model propagation and collaborative learning with respect to the number of training points available to the agent (feature dimension p = 50). (Right) Test accuracy of synchronous and asynchronous collaborative learning and asynchronous model propagation with respect to the number of pairwise communications (linear classification task, p = 50).
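A sketch of one agent's local data generation and of the hinge loss under these assumptions; "uniformly around the origin" is taken here as uniform on [-1, 1]^p, which is our guess at the exact distribution:

```python
import numpy as np

rng = np.random.default_rng(0)

def local_dataset(target, p, flip_prob=0.05):
    # m ~ Uniform{1, ..., 20} training points; labels come from the target
    # separator, then each label is flipped with probability 0.05.
    m = rng.integers(1, 21)
    X = rng.uniform(-1.0, 1.0, size=(m, p))   # assumption: uniform cube
    y = np.sign(X @ target)
    y[y == 0] = 1.0                           # break ties (measure zero)
    flip = rng.random(m) < flip_prob
    y[flip] *= -1.0
    return X, y

def hinge_loss(theta, X, y):
    # Average hinge loss: mean of max(0, 1 - y_i theta^T x_i).
    return np.maximum(0.0, 1.0 - y * (X @ theta)).mean()

target = np.zeros(50)
target[:2] = [1.0, -0.5]                      # target model in a 2D subspace
X, y = local_dataset(target, p=50)
loss = hinge_loss(target, X, y)
```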
In the following, we use this linear classification task to compare the performance of collaborative learning against model propagation, and to evaluate the efficiency of our asynchronous algorithms.

MP vs. CL. In this first experiment, we compare the accuracy of the models learned by model propagation and collaborative learning with feature space dimension p ranging from 2 to 100. Figure 3 (left) shows the results averaged over 10 randomly generated problem instances for each value of p. As baselines, we also plot the average accuracy of the solitary models and of the global consensus model minimizing (2). The accuracy of all models decreases with the feature space dimension, which comes from the fact that the expected number of training samples remains constant. As expected, the consensus model achieves very poor performance since agents have very different objectives. On the other hand, both model propagation and collaborative learning are able to improve very significantly over the solitary models, even in higher dimensions where on average these initial models barely outperform a random guess. Furthermore, collaborative learning always outperforms model propagation. We further analyze these results by plotting the accuracy with respect to the size of the local training set (Figure 3, middle). As expected, the accuracy of the solitary models is higher for larger training sets. Furthermore, collaborative learning converges to models which have similar accuracy regardless of the training size, effectively correcting for the initial unbalancedness. While model propagation also performs well, it is consistently outperformed by collaborative learning on all training sizes.
This gap is larger for agents with more training data: in model propagation, the large confidence values associated with these agents prevent them from deviating much from their solitary model, thereby limiting their own gain in accuracy.

Asynchronous algorithms. This second experiment compares our asynchronous collaborative learning algorithm with a synchronous variant also based on ADMM (see Appendix D for details) in terms of the number of pairwise communications. Figure 3 (right) shows that our asynchronous algorithm performs as well as its synchronous counterpart and should thus be largely preferred for deployment in real peer-to-peer networks. It is also worth noting that asynchronous model propagation converges an order of magnitude faster than collaborative learning, as it only propagates models that are pre-trained locally. Model propagation can thus provide a valuable warm-start initialization for collaborative learning.

Scalability. We also observe experimentally that the number of iterations needed by our decentralized algorithms to converge scales favorably with the size of the network (see Appendix E for details).

6 Conclusion

We proposed, analyzed and evaluated two asynchronous peer-to-peer algorithms for the novel setting of decentralized collaborative learning of personalized models. This work opens up interesting perspectives. The link between the similarity graph and the generalization performance of the resulting models should be formally analyzed. This could in turn guide the design of generic methods to estimate the graph weights, making our approaches widely applicable. Other directions of interest include the development of privacy-preserving algorithms as well as extensions to time-evolving networks and sequential arrival of data.
Acknowledgments

This work was partially supported by grant ANR-16-CE23-0016-01 and by a grant from CPER Nord-Pas de Calais/FEDER DATA Advanced data science and technologies 2015-2020.

References

Boutet, A., Frey, D., Guerraoui, R., Jégou, A., and Kermarrec, A.-M. (2013). WHATSUP: A Decentralized Instant News Recommender. In Proceedings of the 27th IEEE International Symposium on Parallel and Distributed Processing (IPDPS), pages 741-752.

Boutet, A., Frey, D., Guerraoui, R., Kermarrec, A.-M., and Patra, R. (2014). HyRec: leveraging browsers for scalable recommenders. In Proceedings of the 15th International Middleware Conference, pages 85-96.

Boyd, S., Ghosh, A., Prabhakar, B., and Shah, D. (2006). Randomized gossip algorithms. IEEE/ACM Transactions on Networking (TON), 14(SI):2508-2530.

Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1-122.

Caruana, R. (1997). Multitask Learning. Machine Learning, 28(1):41-75.

Colin, I., Bellet, A., Salmon, J., and Clémençon, S. (2015). Extending Gossip Algorithms to Distributed Estimation of U-statistics. In Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS).

Colin, I., Bellet, A., Salmon, J., and Clémençon, S. (2016). Gossip Dual Averaging for Decentralized Optimization of Pairwise Functions. In Proceedings of the 33rd International Conference on Machine Learning (ICML).

Dimakis, A. G., Kar, S., Moura, J. M. F., Rabbat, M. G., and Scaglione, A. (2010). Gossip Algorithms for Distributed Signal Processing. Proceedings of the IEEE, 98(11):1847-1864.

Duchi, J. C., Agarwal, A., and Wainwright, M. J. (2012).
Dual Averaging for Distributed Optimization: Convergence Analysis and Network Scaling. IEEE Transactions on Automatic Control, 57(3):592-606.

Iutzeler, F., Bianchi, P., Ciblat, P., and Hachem, W. (2013). Asynchronous Distributed Optimization using a Randomized Alternating Direction Method of Multipliers. In Proceedings of the 52nd IEEE Conference on Decision and Control (CDC), pages 3671-3676.

Kempe, D., Dobra, A., and Gehrke, J. (2003). Gossip-Based Computation of Aggregate Information. In Proceedings of the 44th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 482-491.

Li, C.-K., Tsai, M.-C., Wang, K.-Z., and Wong, N.-C. (2015). The spectrum of the product of operators, and the product of their numerical ranges. Linear Algebra and its Applications, 469:487-499.

McMahan, H. B., Moore, E., Ramage, D., and Agüera y Arcas, B. (2016). Federated Learning of Deep Networks using Model Averaging. Technical report, arXiv:1602.05629.

Nedic, A. and Ozdaglar, A. E. (2009). Distributed Subgradient Methods for Multi-Agent Optimization. IEEE Transactions on Automatic Control, 54(1):48-61.

Ram, S. S., Nedic, A., and Veeravalli, V. V. (2010). Distributed Stochastic Subgradient Projection Algorithms for Convex Optimization. Journal of Optimization Theory and Applications, 147(3):516-545.

Shah, D. (2009). Gossip Algorithms. Foundations and Trends in Networking, 3(1):1-125.

Shi, W., Ling, Q., Yuan, K., Wu, G., and Yin, W. (2014). On the linear convergence of the ADMM in decentralized consensus optimization. IEEE Transactions on Signal Processing, 62(7):1750-1761.

Wang, J., Kolar, M., and Srebro, N. (2016). Distributed Multi-Task Learning. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 751-760.

Wei, E. and Ozdaglar, A. E.
(2012). Distributed Alternating Direction Method of Multipliers. In Proceedings of the 51st IEEE Conference on Decision and Control (CDC), pages 5445-5450.

Wei, E. and Ozdaglar, A. E. (2013). On the O(1/k) Convergence of Asynchronous Distributed Alternating Direction Method of Multipliers. In IEEE Global Conference on Signal and Information Processing (GlobalSIP).

Zhou, D., Bousquet, O., Lal, T. N., Weston, J., and Schölkopf, B. (2004). Learning with local and global consistency. In Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS), volume 16, pages 321-328.

Appendix A Proof of Proposition 1

Proposition 1 (Closed-form solution). Let P = D^{-1} W be the stochastic similarity matrix associated with the graph G and Θ^sol = [θ^sol_1; ...; θ^sol_n] ∈ R^{n×p}. The solution Θ⋆ = argmin_{Θ ∈ R^{n×p}} Q_MP(Θ) is given by

Θ⋆ = ᾱ (I − ᾱ(I − C) − αP)^{-1} C Θ^sol,

with α ∈ (0, 1) such that μ = ᾱ/α, and ᾱ = 1 − α.

Proof. We write the objective function in matrix form:

Q_MP(Θ) = (1/2) (tr[Θ^⊤ L Θ] + μ tr[(Θ − Θ^sol)^⊤ D C (Θ − Θ^sol)]),

where L = D − W is the graph Laplacian matrix and tr denotes the trace of a matrix. As Q_MP(Θ) is convex and quadratic in Θ, we can find its global minimum by setting its derivative to 0:

∇Q_MP(Θ) = L Θ + μ D C (Θ − Θ^sol),

so at the optimum Θ∗,

0 = L Θ∗ + μ D C (Θ∗ − Θ^sol) = (D − W + μ D C) Θ∗ − μ D C Θ^sol.

Hence, multiplying by D^{-1},

∇Q_MP(Θ∗) = 0 ⇔ (I − P + μC) Θ∗ − μC Θ^sol = 0
⇔ (I − ᾱ(I − C) − αP) Θ∗ − ᾱC Θ^sol = 0,

where the second equivalence follows by multiplying by α and using μ = ᾱ/α. Since P is a stochastic matrix, its eigenvalues are in [−1, 1]. Moreover, (I − C)_ii < 1 for all i, thus ρ(ᾱ(I − C) + αP) < 1, where ρ(·) denotes the spectral radius. Consequently, I − ᾱ(I − C) − αP is invertible and we get the desired result.
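As a quick numerical sanity check of Proposition 1 (ours, not from the paper), one can build a small random instance and verify that the closed form zeroes the gradient of Q_MP:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, alpha = 6, 2, 0.6
abar = 1.0 - alpha
mu = abar / alpha

# Random symmetric weight matrix, degree matrix D, stochastic P = D^{-1} W,
# and confidence matrix C with c_i in (0, 1).
W = rng.random((n, n))
W = (W + W.T) / 2
np.fill_diagonal(W, 0.0)
D = np.diag(W.sum(axis=1))
P = np.linalg.inv(D) @ W
C = np.diag(rng.uniform(0.1, 0.9, size=n))
L = D - W
theta_sol = rng.normal(size=(n, p))

# Closed form of Proposition 1.
I = np.eye(n)
theta_star = abar * np.linalg.inv(I - abar * (I - C) - alpha * P) @ C @ theta_sol

# The gradient of Q_MP at theta_star should vanish.
grad = L @ theta_star + mu * D @ C @ (theta_star - theta_sol)
```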
Appendix B Convergence of the Iterative Form (5)

We can rewrite the equation

Θ(t + 1) = (αI + ᾱC)^{-1} (αP Θ(t) + ᾱC Θ^sol),   (5)

as

Θ(t) = ((αI + ᾱC)^{-1} αP)^t Θ(0) + Σ_{k=0}^{t−1} ((αI + ᾱC)^{-1} αP)^k (αI + ᾱC)^{-1} ᾱC Θ^sol.

Since α/(α + ᾱc_i) < 1 for any i ∈ ⟦n⟧, we have ρ((αI + ᾱC)^{-1} αP) < 1 and therefore

lim_{t→∞} ((αI + ᾱC)^{-1} αP)^t = 0,

hence

lim_{t→∞} Θ(t) = (I − (αI + ᾱC)^{-1} αP)^{-1} (αI + ᾱC)^{-1} ᾱC Θ^sol = (I − ᾱ(I − C) − αP)^{-1} ᾱC Θ^sol = Θ∗.

Appendix C Proof of Theorem 1

Theorem 1 (Convergence). Let ˜Θ(0) ∈ R^{n²×p} be some arbitrary initial value and (˜Θ(t))_{t∈N} be the sequence generated by our model propagation algorithm. Let Θ⋆ = argmin_{Θ∈R^{n×p}} Q_MP(Θ) be the optimal solution to model propagation. For any i ∈ ⟦n⟧, we have:

lim_{t→∞} E[˜Θ_i^j(t)] = Θ⋆_j for j ∈ N_i ∪ {i}.

Proof. In order to prove the convergence of our algorithm, we need to introduce an equivalent formulation as a random iterative process over ˜Θ ∈ R^{n²×p}, the stacking of all the ˜Θ_i's. The communication step of agent i with its neighbor j consists in overwriting ˜Θ_i^j and ˜Θ_j^i with respectively ˜Θ_j^j and ˜Θ_i^i. This step will be handled by multiplication with the matrix O(i, j) ∈ R^{n²×n²} defined as

O(i, j) = I + e_i^j (e_j^j − e_i^j)^⊤ + e_j^i (e_i^i − e_j^i)^⊤,

where for i, j ∈ ⟦n⟧, the vector e_i^j ∈ R^{n²} has 1 as its ((i − 1)n + j)-th coordinate and 0 in all others. The update step of nodes i and j consists in replacing ˜Θ_i^i and ˜Θ_j^j with respectively the i-th line of (αI + ᾱC)^{-1}(αP ˜Θ_i + ᾱC Θ^sol) and the j-th line of (αI + ᾱC)^{-1}(αP ˜Θ_j + ᾱC Θ^sol).
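As a small numerical illustration (ours, not from the paper), the communication matrix O(i, j) defined above can be checked to copy agent j's own model into agent i's copy of it, and vice versa, while leaving all other entries untouched:

```python
import numpy as np

n = 4

def e(i, j):
    # Basis vector of R^{n^2} with a 1 at coordinate (i - 1) * n + j (1-indexed).
    v = np.zeros(n * n)
    v[(i - 1) * n + (j - 1)] = 1.0
    return v

def O(i, j):
    # O(i, j) = I + e_i^j (e_j^j - e_i^j)^T + e_j^i (e_i^i - e_j^i)^T
    return (np.eye(n * n)
            + np.outer(e(i, j), e(j, j) - e(i, j))
            + np.outer(e(j, i), e(i, i) - e(j, i)))

rng = np.random.default_rng(0)
theta = rng.normal(size=n * n)  # stacked model copies, with p = 1 for brevity
out = O(1, 2) @ theta           # agents 1 and 2 communicate
```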
This step will be handled by multiplication with the matrix U(i, j) ∈ R^{n²×n²} and addition of the vector u(i, j) ∈ R^{n²×p} defined as follows:

U(i, j) = I + (e_i^i e_i^{i⊤} + e_j^j e_j^{j⊤})(M − I),
u(i, j) = (e_i^i e_i^{i⊤} + e_j^j e_j^{j⊤})(αI + ᾱC)^{-1} ᾱC ˜Θ^sol,

where M ∈ R^{n²×n²} is a block diagonal matrix with repetitions of (αI + ᾱC)^{-1} αP on the diagonal and ˜Θ^sol ∈ R^{n²×p} is built by stacking n times the matrix Θ^sol. We can now write down a global iterative process which is equivalent to our model propagation algorithm. For any t ≥ 0:

˜Θ(t + 1) = A(t) ˜Θ(t) + b(t),

where A(t) = I_E U(i, j) O(i, j) and b(t) = u(i, j) with probability π_i^j/n for i, j ∈ ⟦1, n⟧, and I_E is an n² × n² diagonal matrix with its ((i − 1)n + j)-th value equal to 1 if (i, j) ∈ E or i = j, and equal to 0 otherwise. Note that I_E is used simply to simplify our analysis by setting to 0 the lines of A(t) corresponding to non-existing edges (which can be safely ignored).

First, let us write the expected value of ˜Θ(t) given ˜Θ(t − 1):

E[˜Θ(t) | ˜Θ(t − 1)] = E[A(t)] ˜Θ(t − 1) + E[b(t)].   (9)

Since the A(t)'s and b(t)'s are i.i.d., for any t ≥ 0 we have Ā = E[A(t)] and b̄ = E[b(t)] where

Ā = (1/n) I_E Σ_{i,j=1}^n π_i^j U(i, j) O(i, j),  b̄ = (1/n) Σ_{i,j=1}^n π_i^j u(i, j).

In order to prove Theorem 1, we first need to show that ρ(Ā) < 1, where ρ(Ā) denotes the spectral radius of Ā. First, recall that ρ((αI + ᾱC)^{-1} αP) < 1 (see Appendix B). We thus have ρ(M) < 1 by construction of M and λ(I − M) ⊂ (0, 2), where λ(·) denotes the spectrum of a matrix. Furthermore, from the properties in Li et al.
(2015) we know that

λ((e_i^i e_i^{i⊤} + e_j^j e_j^{j⊤})(I − M)) ⊂ [0, 2],

and finally we have

λ(I + (e_i^i e_i^{i⊤} + e_j^j e_j^{j⊤})(M − I)) = λ(U(i, j)) ⊂ [−1, 1].

As we also have λ(O(i, j)) ⊂ [0, 1], it follows that λ(U(i, j) O(i, j)) ⊂ [−1, 1].

Let us first suppose that −1 is an eigenvalue of U(i, j) O(i, j) associated with the eigenvector ˜v. From the previous inequalities we deduce that ˜v must be an eigenvector of O(i, j) associated with the eigenvalue +1 and an eigenvector of U(i, j) associated with the eigenvalue −1. Then from ˜v = O(i, j)˜v we have ˜v_i^j = ˜v_j^j and ˜v_j^i = ˜v_i^i. From −˜v = U(i, j)˜v we can deduce that ˜v_k^l = 0 for any k ≠ l or k = l ∈ ⟦n⟧ \ {i, j}. Finally we can see that ˜v_i^i = ˜v_j^j = 0 and therefore ˜v = 0. This proves by contradiction that −1 is not an eigenvalue of U(i, j) O(i, j), and furthermore that −1 is not an eigenvalue of Ā.

Let us now suppose that +1 is an eigenvalue of Ā, associated with the eigenvector ˜v ∈ R^{n²}. This would imply that

˜v = Ā ˜v = (1/n) I_E Σ_{i,j=1}^n π_i^j U(i, j) O(i, j) ˜v.

This can be expressed line by line as the following set of equations:

Σ_{k=1}^n (π_1^k + π_k^1) ˜v_1^1 = e_1^{1⊤} Σ_{k=1}^n (π_1^k + π_k^1) M O(1, k) ˜v
(π_1^2 + π_2^1) ˜v_1^2 = (π_1^2 + π_2^1) ˜v_2^2 if (1, 2) ∈ E, else ˜v_1^2 = 0
(π_1^3 + π_3^1) ˜v_1^3 = (π_1^3 + π_3^1) ˜v_3^3 if (1, 3) ∈ E, else ˜v_1^3 = 0
...
(π_2^1 + π_1^2) ˜v_2^1 = (π_2^1 + π_1^2) ˜v_1^1 if (2, 1) ∈ E, else ˜v_2^1 = 0
Σ_{k=1}^n (π_2^k + π_k^2) ˜v_2^2 = e_2^{2⊤} Σ_{k=1}^n (π_2^k + π_k^2) M O(2, k) ˜v
(π_2^3 + π_3^2) ˜v_2^3 = (π_2^3 + π_3^2) ˜v_3^3 if (2, 3) ∈ E, else ˜v_2^3 = 0
...
(π_n^{n−2} + π_{n−2}^n) ˜v_n^{n−2} = (π_n^{n−2} + π_{n−2}^n) ˜v_{n−2}^{n−2} if (n, n−2) ∈ E, else ˜v_n^{n−2} = 0
(π_n^{n−1} + π_{n−1}^n) ˜v_n^{n−1} = (π_n^{n−1} + π_{n−1}^n) ˜v_{n−1}^{n−1} if (n, n−1) ∈ E, else ˜v_n^{n−1} = 0
Σ_{k=1}^n (π_n^k + π_k^n) ˜v_n^n = e_n^{n⊤} Σ_{k=1}^n (π_n^k + π_k^n) M O(n, k) ˜v

We can rewrite the above system as

˜v_i^j = v_j if (i, j) ∈ E or i = j, and 0 otherwise,
0 = (I − (αI + ᾱC)^{-1} αP) v,   (10)

with v ∈ R^{n×p}. As seen in Appendix B, the matrix I − ᾱ(I − C) − αP is invertible. Consequently v = 0 and thus ˜v = 0, which proves by contradiction that +1 is not an eigenvalue of Ā.

Now that we have shown that ρ(Ā) < 1, let us write the expected value of ˜Θ(t) by "unrolling" the recursion (9):

E[˜Θ(t)] = Ā^t ˜Θ(0) + Σ_{k=0}^{t−1} Ā^k b̄.

Let us denote ˜Θ∗ = lim_{t→∞} E[˜Θ(t)]. Because ρ(Ā) < 1, we can write ˜Θ∗ = (I − Ā)^{-1} b̄, and finally (I − Ā) ˜Θ∗ = b̄. Similarly as in (10), we can identify ˆΘ ∈ R^{n×p} such that

˜Θ∗_i^j = ˆΘ_j if (i, j) ∈ E or i = j, and 0 otherwise,
ᾱC Θ^sol = (I − ᾱ(I − C) − αP) ˆΘ.

Recalling the results from Appendix A, we have

ˆΘ = ᾱ (I − ᾱ(I − C) − αP)^{-1} C Θ^sol,

and thus ˆΘ = Θ∗ = argmin_{Θ∈R^{n×p}} Q_MP(Θ), and the theorem follows.

Appendix D Synchronous Decentralized ADMM Algorithm for Collaborative Learning

For completeness, we present here the synchronous decentralized ADMM algorithm for collaborative learning. Based on our reformulation of Section 4.2 and following Wei and Ozdaglar (2012), the algorithm to find ˜Θ⋆ consists in iterating over the following steps, starting at t = 0:

1.
Every agent i ∈ ⟦n⟧ updates its primal variables:

˜Θ_i(t + 1) = argmin_{Θ ∈ R^{(|N_i|+1)×p}} L_ρ^i(Θ, Z^i(t), Λ^i(t)),

and sends ˜Θ_i^i(t + 1), ˜Θ_i^j(t + 1), Λ_{ei}^i(t), Λ_{ei}^j(t) to agent j for all j ∈ N_i.

2. Using values received from its neighbors, every agent i ∈ ⟦n⟧ updates its secondary variables for all e = (i, j) ∈ E such that j ∈ N_i:

Z_{ei}^i(t + 1) = (1/2) [(1/ρ)(Λ_{ei}^i(t) + Λ_{ej}^i(t)) + ˜Θ_i^i(t + 1) + ˜Θ_j^i(t + 1)],
Z_{ei}^j(t + 1) = (1/2) [(1/ρ)(Λ_{ej}^j(t) + Λ_{ei}^j(t)) + ˜Θ_j^j(t + 1) + ˜Θ_i^j(t + 1)].

By construction, this update maintains Z(t + 1) ∈ C_E.

3. Every agent i ∈ ⟦n⟧ updates its dual variables for all e = (i, j) ∈ E such that j ∈ N_i:

Λ_{ei}^i(t + 1) = Λ_{ei}^i(t) + ρ (˜Θ_i^i(t + 1) − Z_{ei}^i(t + 1)),
Λ_{ei}^j(t + 1) = Λ_{ei}^j(t) + ρ (˜Θ_i^j(t + 1) − Z_{ei}^j(t + 1)).

Synchronous ADMM is known to converge to an optimal solution at rate O(1/t) when the objective function is convex (Wei and Ozdaglar, 2012), and at a faster (linear) rate when it is strongly convex (Shi et al., 2014). However, it requires global synchronization across the network, which can be very costly in practice.

Appendix E Additional Experimental Results

Target models in collaborative linear classification. For the experiment of Section 5.2, Figure 4 shows the target models of the agents as well as the links between them. We can see that the target models can be very different from one agent to another, and that two agents are linked when there is a small enough (yet non-negligible) angle between their target models.

Figure 4: Target models of the agents (represented as points in R^2) for the collaborative linear classification task.
Two models are linked together when the angle between them is small, which corresponds to a small Euclidean distance after projection onto the unit circle.

Figure 5: Number of pairwise communications needed to reach 90% of the accuracy of the optimal models with varying number of agents (linear classification task, p = 50).

Scalability with respect to the number of nodes. In this experiment, we study how the number of iterations needed by our decentralized algorithms to converge to good solutions scales with the size of the network. We focus on the collaborative linear classification task introduced in Section 5.2, with the number n of agents ranging from 100 to 1000. The network is a k-nearest neighbor graph: each agent is linked to the k agents for which the angle similarity introduced in Section 5.2 is largest, and W_ij = 1 if i and j are neighbors and 0 otherwise. Figure 5 shows the number of iterations needed by our algorithms to reach 90% of the accuracy of the optimal set of models. We can see that the number of iterations scales linearly with n. In asynchronous gossip algorithms, the number of iterations that can be done in parallel also scales roughly linearly with n, so we can expect our algorithms to scale nicely to very large networks.
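To make the edge-based structure of the ADMM updates in Appendix D concrete, here is a minimal sketch for the simplest special case: consensus averaging, where each agent i holds a scalar a_i and the personalized local loss is replaced by (x_i − a_i)². This is an illustrative simplification of our design, not the paper's collaborative learning objective; it mirrors the same three steps (primal update, secondary edge-variable update, dual update).

```python
import numpy as np

def admm_consensus(a, edges, rho=1.0, iters=200):
    # Edge-based synchronous ADMM for min sum_i (x_i - a_i)^2 subject to
    # x_i = z_e = x_j for every edge e = (i, j); on a connected graph the
    # optimum is the global mean of the a_i's.
    n = len(a)
    z = {e: 0.0 for e in edges}                     # secondary variable per edge
    lam = {(e, v): 0.0 for e in edges for v in e}   # dual variable per (edge, endpoint)
    deg = np.zeros(n)
    for (i, j) in edges:
        deg[i] += 1
        deg[j] += 1
    x = np.array(a, dtype=float)
    for _ in range(iters):
        # Step 1: primal update (closed form for the quadratic local loss).
        for i in range(n):
            acc = sum(rho * z[e] - lam[(e, i)] for e in edges if i in e)
            x[i] = (2 * a[i] + acc) / (2 + rho * deg[i])
        # Step 2: secondary variables (average of endpoints plus scaled duals).
        for e in edges:
            i, j = e
            z[e] = 0.5 * (x[i] + x[j] + (lam[(e, i)] + lam[(e, j)]) / rho)
        # Step 3: dual updates.
        for e in edges:
            for v in e:
                lam[(e, v)] += rho * (x[v] - z[e])
    return x

a = [0.0, 2.0, 4.0, 6.0]
edges = [(0, 1), (1, 2), (2, 3)]        # a small path graph
x = admm_consensus(a, edges)            # each x_i should approach mean(a) = 3
```

In the paper's setting, the closed-form primal step is replaced by the minimization of the local augmented Lagrangian L_ρ^i, and each agent maintains copies of its neighbors' models rather than a single scalar.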