Towards Effective Device-Aware Federated Learning


Authors: Vito Walter Anelli, Yashar Deldjoo, Tommaso Di Noia

Vito Walter Anelli, Yashar Deldjoo, Tommaso Di Noia, and Antonio Ferrara⋆

Polytechnic University of Bari, Bari, Italy
firstname.lastname@poliba.it

Abstract. With the wealth of information produced by social networks, smartphones, medical or financial applications, speculations have been raised about the sensitivity of such data in terms of users' personal privacy and data security. To address the above issues, Federated Learning (FL) has recently been proposed as a means to leave data and computational resources distributed over a large number of nodes (clients), where a central coordinating server aggregates only locally computed updates without knowing the original data. In this work, we extend the FL framework by pushing forward the state of the art in the field along several dimensions: (i) unlike the original FedAvg approach, which relies on a single criterion (i.e., local dataset size), a suite of domain- and client-specific criteria constitutes the basis to compute each local client's contribution; (ii) the multi-criteria contribution of each device is computed in a prioritized fashion by leveraging a priority-aware aggregation operator used in the field of information retrieval; and (iii) a mechanism is proposed for online adjustment of the aggregation operator parameters via a local search strategy with backtracking. Extensive experiments on a publicly available dataset indicate the merits of the proposed approach compared to the standard FedAvg baseline.

Keywords: federated learning, aggregation, data distribution

1 Introduction and Context

The vast amount of data generated by billions of mobile and online IoT devices worldwide holds the promise of significantly improved usability and user experience in intelligent applications.
This large-scale quantity of rich data has created an opportunity to greatly advance the intelligence of machine learning models by catering powerful deep neural network models. Despite this opportunity, nowadays such pervasive devices can capture a lot of data about the user: what she does, what she sees, and even where she goes [14]. Actually, most of these data contain sensitive information that a user may deem private. To respond to concerns about the sensitivity of user data in terms of data privacy and security, in the last few years initiatives have been taken by governments to prioritize and improve the security and privacy of user data. For instance, in 2018 the General Data Protection Regulation (GDPR) was enforced by the European Union to protect users' personal privacy and data security. These issues and regulations pose a new challenge to traditional AI models, where one party is involved in collecting, processing, and transferring all data to other parties. As a matter of fact, it is easy to foresee the risks and responsibilities involved in storing/processing such sensitive data in the traditional centralized AI fashion. Federated learning is an approach recently proposed by Google [9,10,13] with the goal of training a global machine learning model from a massive amount of data distributed over client devices such as personal mobile phones and/or IoT devices. In principle, a FL model is able to deal with fundamental issues related to privacy, ownership, and locality of data [2]. In [13], the authors introduced the FederatedAveraging (FedAvg) algorithm, which combines local stochastic gradient descent on each client with a central server that performs model aggregation by averaging the values of the local model parameters.

⋆ Corresponding author.
To ensure that the developments made in FL scenarios uphold real-world assumptions, in [3] the authors introduced LEAF, a modular benchmarking framework supplying developers/researchers with a rich set of resources including open-source federated datasets, an evaluation framework, and a number of reference implementations. Despite its potentially disruptive contribution, we argue that FedAvg exposes some major shortcomings. First, the aggregation operation in FedAvg sets the contribution of each agent proportional to the individual client's local dataset size. A wealth of qualitative measures, such as the number of sample classes held by each agent, the divergence of each computed local model from the global model (which may be critical for convergence [15]), or some estimations about the agents' computing and connection capabilities or about their honesty and trustworthiness, are ignored. While FedAvg only uses limited knowledge about local data, we argue that the integration of the above-mentioned qualitative measures and the expert's domain knowledge is indispensable for increasing the quality of the global model.
The work at hand considerably extends the FedAvg approach [13] by building on three main assumptions:

– we can substantially improve the quality of the global model by incorporating a set of criteria about the domain and the clients, and properly assigning the contribution of each individual update in the final model based on these criteria;
– the introduced criteria can be combined by using different aggregation operators; toward this goal, we assert the potential benefits of using a prioritized multi-criteria aggregation operator over the identified set of criteria to define each individual's local update contribution to the federation process;
– the computation of the parameters of the aggregation operator (the priority order of the above-mentioned criteria) via online monitoring and adjustment is an important factor for improving the quality of the global model.

The remainder of the paper is structured as follows. Section 2 is devoted to introducing the proposed FL system: it first describes the standard FL model and then provides a formal description of the proposed FL approach and the key concepts behind the integration of local criteria and the prioritized multi-criteria aggregation operator in the proposed system. Section 3 details the experimental setup of the entire system, which relies on LEAF, an open-source benchmarking framework for federated settings that comes with a suite of datasets realistically pre-processed for FL scenarios. Section 4 presents results and discussion. Finally, Section 5 concludes the paper and discusses future perspectives.

2 Federated Learning and Aggregation Operator

In the following, we introduce the main elements behind the proposed approach. We start by presenting a formal description of the standard FL approach (cf.
Section 2.1) and then we describe our proposed FL approach (cf. Section 2.2).

2.1 Background: Standard FL

In a FL setup, a set A = {A_1, ..., A_K} of agents (clients) participates in the training federation, with a server S coordinating them. Each agent A_k stores its local data D_k = {(x^k_1, y^k_1), (x^k_2, y^k_2), ..., (x^k_{|D_k|}, y^k_{|D_k|})} and never shares them with S. In our setting, x^k_i represents data sample i of agent k, and y^k_i is the corresponding label. The motivation behind a FL setup is mainly efficiency (K can be very large) and privacy [1,13]. As the local training data D_k never leaves the federating agents' machines, FL models can be trained on user-private (and sensitive) data, e.g., the history of a user's typed messages, which can be considerably different from publicly accessible datasets.

The final objective in FL is to learn a global model characterized by a parameter vector w^G ∈ R^d, with d being the number of parameters of the model, such that a global loss is minimized without direct access to the data across clients. The basic idea is to train the global model separately for each agent k on D_k, such that a local loss is minimized, and the agents have to share with S only the computed model parameters w^k, which will be aggregated at the server level. By means of a communication protocol, the agents and the global server exchange information about the parameters of the local and global models. At the t-th round of communication, the central server S broadcasts the current global model w^G_t to a fraction of agents A^- ⊂ A. Then, every agent k in A^- carries out some optimization steps over its local data D_k in order to optimize a local loss. Finally, the computed local parameter vector w^k_{t+1} is sent back to the central server.
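For illustration, the client side of one such round can be sketched in a few lines of Python. This is a minimal sketch only: the paper leaves the local model and loss unspecified, so a least-squares loss with plain gradient descent stands in for them, and all function and variable names here are ours.

```python
import numpy as np

def local_update(w_global, X, y, epochs=5, lr=0.01):
    """One agent's step in a round: start from the broadcast global
    parameters w_global and run a few gradient steps on the private
    local data (X, y). A least-squares loss stands in for the
    unspecified local loss."""
    w = w_global.copy()
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)  # gradient of 0.5*mean((Xw - y)^2)
        w -= lr * grad
    return w  # only the parameters are sent back, never (X, y)

rng = np.random.default_rng(0)
X, y = rng.normal(size=(50, 3)), rng.normal(size=50)
w_k = local_update(np.zeros(3), X, y)  # local parameter vector for agent k
```

Note that only `w_k` crosses the network; the raw samples never leave the device, which is the privacy property the protocol relies on.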
The central server S computes a weighted mean of the resulting local models in order to obtain an updated global model w^G_{t+1}:

    w^G_{t+1} = Σ_{k=1}^{|A^-|} p^k_{t+1} w^k_{t+1}.    (1)

For the sake of simplicity of discussion, throughout this work we do not consider the time dimension and focus our attention on one time instance, as given by Equation (2):

    w^G = Σ_{k=1}^{|A^-|} p^k w^k,    (2)

in which p^k ∈ [0,1] is the weight associated with agent k and Σ_{k=1}^{|A^-|} p^k = 1. We argue that collecting information about clients and incorporating that knowledge to compute the appropriate agent-dependent value p^k is important for computing an effective and efficient federated model. Moreover, it is worth noticing that p^k may encode and carry useful knowledge in the optimization of the global model with respect to relevant domain-specific dimensions.

2.2 Proposed Federated Learning Approach

As discussed at the end of the previous section, we may have different factors and/or criteria influencing the computation of p^k. Given a set of properly identified criteria about clients, it could then be possible to enhance the global model update procedure by using this information. To connect this to the formalism presented before, let C = {C_1, ..., C_m} be a set of measurable properties (criteria) characterizing the local agent k or the local data D_k. We use the term c^k_i ∈ [0,1] to denote, for each agent k, the degree of satisfaction of criterion C_i in a specific round of communication.
Hence, in the proposed FL aggregation protocol, the central server computes p^k as

    p^k = f(c^k_1, ..., c^k_m) / Z = s^k / Z,    (3)

where f is a local aggregation operation over the set of properties (criteria) which represent agent k, s^k ∈ R is a numerical score evaluating the k-th agent's contribution based on the m identified properties, and, finally, Z is a normalization factor. In order to ensure that Σ_{k=1}^{|A^-|} p^k = 1 with p^k ∈ [0,1], we compute Z = Σ_{k=1}^{|A^-|} s^k. In the following, we briefly discuss the identified set of criteria (together with a motivation for their selection), the selected aggregation operator, and the online adjustment procedure.

Identification of local criteria. In FedAvg, the server performs the aggregation to compute p^k without knowing any information about the participating clients, except for a purely quantitative measure of local dataset size. Our approach relies on the assumption that it might be much better to use multiple criteria encoding different useful knowledge about the clients to obtain a more informative global model during training. This makes it possible for a domain expert to build the federated model by leveraging any additional domain- and client-specific knowledge. For instance, one may want to choose the criteria in such a way that the rounds of communication needed to reach a desired target accuracy are minimized. Moreover, a domain expert could ask users/clients to measure their adherence to some other target properties (e.g., their nationality, gender, age, job, behavioral characteristics, etc.), in order to build a global model emphasizing the contribution of some classes of users; in this way, the domain expert may, in principle, build a model favoring some targeted commercial purposes.
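Equations (2) and (3) amount to a score-weighted average of the local parameter vectors. The server-side computation can be sketched as follows; the aggregation f is abstracted into precomputed scores s^k, and the names are illustrative rather than part of the protocol.

```python
import numpy as np

def aggregate(local_models, scores):
    """Server-side update of Equations (2)-(3): normalize the per-agent
    scores s_k into weights p_k = s_k / Z with Z = sum of all s_k, then
    build the global model as the weighted mean of the local vectors."""
    s = np.asarray(scores, dtype=float)
    p = s / s.sum()              # each p_k in [0, 1], and sum_k p_k = 1
    W = np.stack(local_models)   # shape (|A^-|, d)
    return p @ W, p              # w_G = sum_k p_k * w_k

models = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
w_G, p = aggregate(models, scores=[3.0, 1.0])  # first agent scores 3x higher
# p = [0.75, 0.25] and w_G = [0.75, 0.25]
```

Whatever f produces, normalizing by Z keeps the update a convex combination of the local models, exactly as Equation (2) requires.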
All in all, we may have a suite of criteria to reach the final global goal (in Section 3 we will see the example adopted in our experimental setup).

Prioritized multi-criteria aggregation operator. Once the local criteria evaluations have been collected, the central server aggregates them for each device in order to obtain a final score associated with that device. Over the years, a wide range of aggregation operators have been proposed in the field of information retrieval (IR) [12]. We selected some prominent ones and exploited them in our FL setup. In particular, we focused on the weighted averaging operator; the ordered weighted averaging (OWA) models [17,16], which extend the binary logic of AND and OR operators by allowing the representation of intermediate quantifiers; the Choquet-based models [4,8,7], which are able to interpret positive and negative interactions between criteria; and, finally, the priority-based models [6]. Due to the lack of space, here we report only the approach and the experimental evaluation related to the last one, modeled in terms of an MCDM problem, because of its better performance.

The core idea of the prioritized multi-criteria aggregation operator proposed in [6] is to assign a priority order to the involved criteria. The main rationale behind the idea is to allow a domain expert to model circumstances where the lack of fulfillment of a higher-priority criterion cannot be compensated by the fulfillment of a lower-priority one [12]. As an example, we may consider the case where the domain expert considers the age of an agent's user far more important than its dataset size, so that even a large local dataset would be penalized if the user-age criterion is not satisfied.
Formally, the prioritized multi-criteria aggregation operator f : [0,1]^m → [0,m] measures an overall score from a prioritized set of criteria evaluations on the local model w^k as follows [6]:

    s^k = f(c^k_1, ..., c^k_m) = Σ_{i=1}^{m} λ_i · c^k_{(i)},
    λ_1 = 1,  λ_i = λ_{i-1} · c^k_{(i-1)},  i ∈ [2, m],    (4)

where c^k_{(i)} is the evaluation of C_{(i)} for device k, and the ·_{(i)} notation indicates the indices of a sorted priority order for the criteria, as specified by the domain expert, from the most important to the least important one. For each score c^k_{(i)}, an importance weight λ_i is computed, depending both on the specified priority order over the criteria and on the fulfillment and the weight of the immediately preceding criterion.

Example 1. Let us suppose that we are interested in evaluating device k based on three criteria C_1, C_2, C_3, and that their respective evaluations are c^k_1 = 0.5, c^k_2 = 0.8, c^k_3 = 0.9. Let the priority order of the criteria be C_(1) = C_1, C_(2) = C_2, C_(3) = C_3, from the most important to the least important; then λ_1 = 1, λ_2 = λ_1 · c^k_(1) = 0.5, λ_3 = λ_2 · c^k_(2) = 0.4. Hence, the final device score will be s^k = (1 · 0.5) + (0.5 · 0.8) + (0.4 · 0.9) = 1.26. If we change the priority order to C_(1) = C_3, C_(2) = C_2, C_(3) = C_1, we would obtain λ_1 = 1, λ_2 = λ_1 · c^k_(1) = 0.9, λ_3 = λ_2 · c^k_(2) = 0.72, with a final device score of s^k = (1 · 0.9) + (0.9 · 0.8) + (0.72 · 0.5) = 1.98. We see that this latter value is higher than the previous one, since the most important criterion is here better fulfilled.

Online adjustment. The aggregation operator we are using takes as a parameter the priority order of the involved criteria; as a consequence, one of the problems is to identify the best ordering for Equation (4), one which takes advantage of the gathered information.
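Equation (4) and Example 1 above can be checked with a short sketch. Note that, applying Equation (4) literally, the second ordering uses λ_3 = λ_2 · c_(2) = 0.72, which yields a score of 1.98; the function name below is ours.

```python
def prioritized_score(evaluations, priority):
    """Prioritized multi-criteria aggregation of Equation (4): criteria are
    visited in the expert's priority order; each weight lambda_i is the
    previous weight scaled by the previous criterion's fulfillment, so a
    poorly met high-priority criterion dampens all later contributions."""
    score, lam = 0.0, 1.0            # lambda_1 = 1
    for i in priority:               # indices from most to least important
        score += lam * evaluations[i]
        lam *= evaluations[i]        # lambda_{i+1} = lambda_i * c_(i)
    return score

c = [0.5, 0.8, 0.9]                    # c_1, c_2, c_3 from Example 1
s_a = prioritized_score(c, [0, 1, 2])  # priority order C1, C2, C3
s_b = prioritized_score(c, [2, 1, 0])  # priority order C3, C2, C1
# s_a = 1*0.5 + 0.5*0.8 + 0.4*0.9  = 1.26
# s_b = 1*0.9 + 0.9*0.8 + 0.72*0.5 = 1.98
```

The non-compensatory behavior is visible in the weights: once a high-priority criterion is poorly fulfilled, every subsequent λ_i (and hence every later contribution) is scaled down.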
Although by definition this priority order could be defined by a domain expert, here we propose to choose the best one in an online fashion, such that we maximize the performance of the model at each round of communication. Let (C_{(1),t}, ..., C_{(m),t}) be the last priority ordering of the criteria used to compute the local scores p^k_t (see Equations (3) and (4)) at time t. The sequence of steps needed to compute the updates to the global model is formalized in Algorithm 1 and commented in the following.

Lines 1–7  On each device, we locally train the last broadcast global model w^G_t with the local training data, in order to compute w^k_{t+1}; then, we measure the local scores for each of the identified criteria.
Lines 9–11  For each device, we use the priority ordering of the criteria already used in the previous round of communication to compute the local score p^k_{t+1}.
Line 12  A new candidate global model w^G_{t+1} is built by computing a weighted average of the local models w.r.t. the computed p^k_{t+1}.
Lines 13–15  On each device, w^G_{t+1} is locally tested using the local test set.
Lines 16–29  An estimation of a global accuracy is computed by weighting the local accuracies w.r.t. the local test set sizes; then, if the obtained accuracy is higher on average than the accuracy obtained with w^G_t, we update the global value w^G_{t+1} and proceed with the next round of communication; otherwise, another permutation is considered and, once a new p^k_{t+1} is computed for each device, we go back to step 3; if no other permutations are available, the candidate global model which produced the least-bad test accuracy is assigned to w^G_{t+1}.

The above-mentioned steps are also graphically illustrated by means of a plot in Figure 1, where an exemplification with dummy values is presented.
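The server-side search with backtracking that these steps describe can be sketched as follows. Here `candidate(order)` abstracts the weighting, aggregation, and device-side testing of the middle steps, and all names are illustrative, not part of the paper's protocol.

```python
from itertools import permutations

def online_adjust(candidate, m, last_order, last_acc):
    """Online adjustment with backtracking: evaluate the previous priority
    ordering first; while the candidate global model's estimated accuracy
    stays below last_acc, try another permutation. If no ordering improves,
    keep the least-bad candidate. candidate(order) -> (model, accuracy)."""
    orders = [tuple(last_order)] + [o for o in permutations(range(m))
                                    if o != tuple(last_order)]
    best = None
    for order in orders:
        model, acc = candidate(order)
        if acc >= last_acc:
            return order, model, acc          # improvement: stop searching
        if best is None or acc > best[2]:
            best = (order, model, acc)        # remember the least-bad option
    return best

# Toy run: accuracies keyed by ordering; the search backtracks past the
# previous ordering (0, 1, 2) until one beats last_acc = 0.70.
accs = {(0, 1, 2): 0.60, (0, 2, 1): 0.58, (1, 0, 2): 0.71}
order, _, acc = online_adjust(lambda o: ("w_G", accs.get(o, 0.50)),
                              m=3, last_order=(0, 1, 2), last_acc=0.70)
# order == (1, 0, 2), acc == 0.71
```

With m criteria the search space is m! orderings, which is why the procedure stops at the first ordering that does not degrade the accuracy instead of exhaustively evaluating all of them.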
Training steps proceed with the same parametrization until a lower accuracy is obtained (blue point in round of communication 8); then, the previous model is restored and the other configurations are tested until a higher accuracy is found (e.g., orange point in round 8). When a higher accuracy cannot be found, the least-bad option is selected (e.g., green point in round 10).

Algorithm 1 Sequence of steps executed by the server to compute the new global model with online adjustment of the aggregation operator parameters. Functions ModelUpdate, PropertyMeasure, and LocalTestAccuracy are executed locally on the k-th device. Variable acc_t is an estimation of the global accuracy.

Require: w^G_t, acc_t, (C_{(1),t}, ..., C_{(m),t})
Ensure: w^G_{t+1}, acc_{t+1}, (C_{(1),t+1}, ..., C_{(m),t+1})
 1: broadcast w^G_t to clients in A^-
 2: for each client k ∈ A^- in parallel do
 3:   w^k_{t+1} ← ModelUpdate(k, w^G_t)
 4:   for each criterion C_i ∈ C do
 5:     c^k_{i,t+1} ← PropertyMeasure(k, w^k_{t+1}, C_i)
 6:   end for
 7: end for
 8: P ← (C_{(1),t}, ..., C_{(m),t})
 9: for each client k ∈ A^- do
10:   p^k_{t+1} ← f(c^k_{(1),t+1}, ..., c^k_{(m),t+1}) / Z
11: end for
12: w^G_{t+1} ← Σ_{k=1}^{|A^-|} p^k_{t+1} w^k_{t+1}
13: for each client k ∈ A in parallel do
14:   acc^k_{t+1} ← LocalTestAccuracy(k, w^G_{t+1})
15: end for
16: acc_{t+1} ← weighted average of acc^k_{t+1} w.r.t.
local test set size, ∀k ∈ A
17: while acc_{t+1} < acc_t do
18:   if other priority orderings are available then
19:     P ← another priority ordering of the criteria (C_{(1)}, ..., C_{(m)})
20:     repeat steps 9–16
21:   else
22:     P ← the priority ordering for which we get the maximum value of acc_{t+1}
23:     acc^k_{t+1} ← accuracy of the model which performed best
24:     repeat steps 9–12
25:     break
26:   end if
27: end while
28: (C_{(1),t+1}, ..., C_{(m),t+1}) ← P
29: w^G_{t+1} ← w^G_{t+1}

Fig. 1. An illustration of the online parameter adjustment for the aggregation operator (accuracy per round of communication, for parametrizations A, B, and C).

3 Experimental setup

In this section we describe the experimental setup used to validate the performance of the proposed FL system.

Experimental Evaluation Framework. In order to perform the experimental validation and performance evaluation, an extensive set of experiments has been carried out relying on LEAF [3], a modular open-source benchmarking framework for federated settings, which comes with a suite of datasets appropriately preprocessed for FL scenarios. LEAF also provides reproducible reference implementations and introduces rigorous system and statistical metrics for understanding the quality of the FL approach. As for the metrics computation, the global model is tested on each device over the local test sets. The objective of LEAF is to capture the distribution of performance across devices by considering the 10th and 90th percentiles of the local accuracy values and by estimating a global accuracy (local accuracy values are averaged, weighted by local test set size).
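The accuracy aggregation just described can be sketched as a simple re-implementation of the weighted average and percentile summary; the function name and toy values are ours.

```python
import numpy as np

def leaf_style_metrics(local_accs, test_sizes):
    """Summarize per-device accuracies as described above: a global
    accuracy as the test-set-size-weighted average of the local
    accuracies, plus the 10th and 90th percentiles, which capture
    the spread of performance across devices."""
    a = np.asarray(local_accs, dtype=float)
    n = np.asarray(test_sizes, dtype=float)
    global_acc = float(np.sum(a * n) / np.sum(n))
    p10, p90 = np.percentile(a, [10, 90])
    return global_acc, p10, p90

g, p10, p90 = leaf_style_metrics([0.9, 0.5, 0.7], [100, 50, 50])
# g = (0.9*100 + 0.5*50 + 0.7*50) / 200 = 0.75
```

Weighting by test set size prevents devices with only a handful of test samples from dominating the global estimate, while the percentiles expose disparities that a single average would hide.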
In this work, we improve the validation of the FL setting by using an approach which offers an overview of the whole training performance, instead of metrics describing a single round of communication. More specifically, we measure the number of rounds of communication required to allow a certain percentage of the devices participating in the federation process to reach a target accuracy (e.g., 75% or 80%), since this measurement is able to fairly show how effective and efficient the model is across the devices.

Federated dataset. We run our experiments using the FEMNIST dataset [3], which contains handwritten characters and digits from various writers, together with their true labels. Unlike the original FedAvg algorithm [13], which uses the MNIST dataset [11] artificially split by labels, the FEMNIST dataset [3] is larger and more realistically distributed. The dataset contains 805,263 examples of 62 classes of handwritten characters and digits from 3,550 writers, and it is built by partitioning the data in ExtendedMNIST [5] (an extended version of MNIST with letters and digits) based on the writers of the digits/characters. It is important to note that the data in FEMNIST are inherently non-IID distributed, as the local training data can vary between clients; therefore, they are not representative of the whole population distribution. We use the described dataset to perform a digit/character classification task, although for computational limits we use a subsampled version (10% of the total, with 371 clients involved).

Convolutional model. Similar to [13], the classification task is performed by using a convolutional neural network (CNN). The network has two convolutional
layers with 5x5 filters (the first with 32 channels, the second with 64, each followed by 2x2 max pooling), a fully connected layer with 2048 units and ReLU activation, and a final softmax output layer, for a total of 6,603,710 parameters.

Hyperparameter settings. We set the hyperparameters for the whole set of our experiments as follows, also guided by the results obtained in [13]. As for the FedAvg client fraction parameter, in each round of communication only 10% of the clients are selected to perform the computation. Concerning the parameters of stochastic gradient descent (SGD), we set the local batch size to 10 and the number of local epochs to 5; this is the configuration that, in the baseline, makes it possible to reach the target accuracy in the fewest rounds of communication. Moreover, we set the learning rate to η = 0.01. Finally, we set the maximum number of rounds of communication per experiment to 1,000.

Identified local criteria. In our experimental setting, the proposed FL system extends the purely quantitative criterion of FedAvg [13] (dataset size) and leverages two new criteria. Please note that we are not stating that the proposed ones are the only possible criteria; we present them just to show how the introduction of new information may lead to a better final model. More specifically, in our experimental evaluation, we aim both at reducing the number of rounds of communication necessary to reach a target accuracy and at preventing the global model from diverging towards local specializations and overfitting. The criteria have been defined so that c^k_i ∈ [0,1], with 0 meaning bad performance and 1 good performance.
Moreover, in order to bring each criterion onto the same interval scale, we normalized them such that Σ_{k=1}^{|A^-|} c^k_i = 1.

Local dataset size (baseDS). The first criterion we considered is the one already used by FedAvg [13], namely the local dataset size, given by c^k_1 = |D_k| / |∪_{i∈A^-} D_i|. This criterion is a purely quantitative measure of the local data, which serves both as the baseline in the empirical validation of the results (i.e., when used in isolation) and as part of the entire identified set of criteria in the developed FL system (i.e., when used in a group).

Local label diversity (Ld). The second considered criterion is the diversity of labels in each local dataset, i.e., the diversity of each local dataset in terms of class labels. We assert this criterion to be important since it can provide a clue about how useful each device can be for learning to predict different labels. To quantify this criterion we use c^k_2 = δ(D_k) / Σ_{i∈A^-} δ(D_i), where δ measures the number of different labels (classes) present over the samples of a dataset.

Local model divergence (Md). With non-IID distributions (and this is the case for our dataset), model performance dramatically degrades [18]. Moreover, a large number of local training epochs may lead each device to move further away from the initial global model, towards the opposite of the global objective [15]. Therefore, a possible solution inspired by [15] is to limit these negative effects by penalizing higher divergences and highlighting local models that are not very far from the received global model. We evaluate the local model divergence as c^k_3 = φ_k / Σ_{i∈A^-} φ_i, where φ_i = 1 / (√‖w^G − w^i‖₂ + 1).
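The three criteria can be computed as follows. This is an illustrative sketch with names of our own choosing: it reads the divergence term as φ_i = 1 / (√‖w^G − w^i‖₂ + 1), normalizes each criterion over the participating agents as described above, and uses the sum of local dataset sizes for baseDS, which coincides with the size of their union when the local datasets are disjoint (as in FEMNIST, where data is partitioned by writer).

```python
import numpy as np

def criteria_for_agent(k, datasets, labelsets, local_models, w_global):
    """Compute (baseDS, Ld, Md) for agent k, each normalized so that a
    criterion's values sum to 1 over the participating agents."""
    sizes = np.array([len(d) for d in datasets], dtype=float)         # |D_k|
    divers = np.array([len(set(l)) for l in labelsets], dtype=float)  # delta
    phis = np.array([1.0 / (np.sqrt(np.linalg.norm(w_global - w)) + 1.0)
                     for w in local_models])                          # phi_k
    return (sizes[k] / sizes.sum(),    # c1: local dataset size (baseDS)
            divers[k] / divers.sum(),  # c2: local label diversity (Ld)
            phis[k] / phis.sum())      # c3: local model divergence (Md)

datasets = [list(range(100)), list(range(50))]  # local sizes 100 vs 50
labels = [[0, 1, 2, 3], [0, 1]]                 # 4 vs 2 distinct classes
models = [np.zeros(3), np.ones(3)]              # agent 0 matches w_G exactly
c1, c2, c3 = criteria_for_agent(0, datasets, labels, models, np.zeros(3))
```

Since agent 0's local model coincides with the global one, its φ is maximal (equal to 1) and it takes the larger share of the Md criterion.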
4 Results and Discussion

In order to validate the empirical performance of the proposed FL system, an extensive set of experiments has been carried out with respect to three under-study exploration dimensions, in agreement with the assumptions presented in Section 1. The final results are presented in Table 1. Note that the results are presented for reaching two distinct desired target global accuracies of 75% and 80%.¹ Each column indicates the percentage of devices participating in the federation process that reach a desired target accuracy.² In addition, we present the results in three groups (Low, Mid, High) of percentages of participating devices.

Study A: Effect of individual criteria. Study A contemplates answering the question: "Are we able to introduce a set of device- and data-dependent criteria through the help of which we can train a better global model?". The results for this study are summarized in the rows Ind of Table 1. To answer this question, we considered the effect of each of the three identified criteria baseDS, Md, Ld in isolation. The results with respect to both desired accuracies of 75% and 80% show that the newly identified criteria (Md and Ld) have an impact on the final quality of the global model which is comparable (in the Low and Mid cases) or superior (in the High case) with respect to the conventional baseDS criterion. For example, when comparing baseDS and Md in the Low case, one can notice that the results are 25.5 vs. 27, a marginal difference of only 6%. However, if we desire to satisfy a higher number of devices (High case) reaching a certain accuracy, the introduced criteria show a quality substantially better than the baseDS criterion. For example, Ld has a mean performance of 405, compared with 552.5 obtained by baseDS; this is equal to an improvement of 36% with respect to the existing baseline.
These initial results already show how the global model can benefit from considering criteria other than just the dataset size.

Study B: Impact of priority order in multi-criteria aggregation. Study B focuses on the question: "Are we able to exploit the potential benefits of a prioritized multi-criteria aggregation operator to build a more informative global model based on the identified criteria?". The results for this study are summarized in the rows MCA of Table 1. To answer this research question, we performed one experiment for each individual permutation of the criteria in the prioritized multi-criteria aggregation setting. Since there are 3 identified criteria, we have in total 6 permutations. For a fine-grained analysis, we provide the results obtained for all the permutation runs, denoted, e.g., by Ds ≻ Ld ≻ Md, Ds ≻ Md ≻ Ld. By looking at the results, we can notice that in the Low and Mid categories the best results are obtained for Ds ≻ Ld ≻ Md and Ds ≻ Md ≻ Ld. These results share a similar characteristic: by considering Ds as the most important criterion, we can grant a smaller subset of devices the chance to reach a desired target accuracy at a faster pace. This result is in agreement with the individual results (see Ind in Table 1), in the sense that the criterion Ds provides the best quality in the Low and Mid study cases for both desired target accuracies of 75% and 80%.

¹ We chose these accuracy values since they represent reasonable values, and prediction tasks higher than 80% accuracy are not reached in the 1,000 allowed rounds of communication.
² The total number of participating devices in the federation is 371; thus 20%, as an example, indicates the rounds of communication required for 0.2 × 371 ≈ 75 devices to reach the desired target accuracy.
However, when concentrating on the High category, one can notice that Md ≻ Ds ≻ Ld provides the best performance. This result is a bit surprising and shows that, to satisfy a higher number of devices, the criterion Md plays the most important role. The result is surprising in the sense that, in the individual results (see Ind in Table 1), Ld has the best performance, while in the obtained result it has the lowest priority. Interestingly, we may notice that in all these best cases the pattern Ds ≻ Ld always occurs.³

Study C: Impact of online adjustment of the priority order in multi-criteria aggregation. Finally, Study C studies the question: "Is it possible to update the parameters of the aggregation operator (the priority order of the above-mentioned criteria) via online monitoring and adjustment for improving the quality of the global model?". The results for this study are summarized in the rows Final of Table 1. This study is in fact concerned with the dynamic behavior of our proposed FL approach, letting the server choose at each round of communication the priority ordering maximizing the accuracy (i.e., obtaining the best sub-optimal accuracy). Similar to the previous study, here we also ran six experiments, related to the six possible initializations of the priority combinations. In Table 1 we show the results related to the best run and to their mean. In this final experimental setting, we see an overall improvement in the performance of the proposed approach when we initialize the priority ordering with Md ≻ Ds ≻ Ld. Also in this case, the pattern Ds ≻ Ld occurs.
5 Conclusions and Future Perspectives

In this work we presented a practical protocol for effective aggregation in federated learning, proposing a set of device- and data-aware properties (criteria) that are exploited by a central server in order to obtain a more informative global model. Our experiments show that the standard federated learning baseline, FedAvg, can be substantially improved, training high-quality models in relatively few rounds of communication, by using a properly defined set of local criteria together with an aggregation strategy that can exploit the information

[3] We recall that a preference relation ≻ is transitive; hence Ds ≻ Md ≻ Ld implies Ds ≻ Ld.

Table 1. The final results of the empirical evaluation. Each table cell provides the number of rounds of communication necessary for the percentage of devices specified in the columns to reach a desired target accuracy (either 75% or 80% in our case). Runs that did not reach the target accuracy for the specified percentage of devices within the allowed rounds (1,000) are marked with —. The best results obtained in study MCA are shown in bold violet, while the best results in study Final are shown in bold italic blue.
Target accuracy 75%
                        Low                 Mid                 High
Study/% devices      20%   30%   mean   40%   50%   mean   70%    75%    mean
Ind
  Dataset size (base) 22    29   25.5    39    62   50.5   304    801    552.5
  Model divergence    24    30   27      41    67   54     274    768    521
  Label diversity     25    32   28.5    43    70   56.5   278    532    405
MCA
  Ds ≻ Ld ≻ Md        20    29   24.5    39    60   49.5   300    823    561.5
  Ds ≻ Md ≻ Ld        20    29   24.5    39    60   49.5   300    669    484.5
  Ld ≻ Ds ≻ Md        24    31   27.5    41    68   54.5   259    768    513.5
  Md ≻ Ds ≻ Ld        24    32   28      45    70   57.5   255    532    393.5
  Ld ≻ Md ≻ Ds        23    30   26.5    41    68   54.5   270    729    499.5
  Md ≻ Ld ≻ Ds        24    32   28      46    70   58     255    620    437.5
  mean                22.5  30.5 26.5    41.8  66   53.9   273.17 690.1  481.6
Final
  Md ≻ Ds ≻ Ld        12    19   15.5    26    57   41.5   164    494    329
  mean                20.5  27.5 24      38.6  61.8 50.2   223    611.8  417.4

Target accuracy 80%
                        Low                 Mid                 High
Study/% devices      20%   30%   mean   40%   50%   mean   70%    75%    mean
Ind
  Dataset size (base) 31    45   38      72    136  104     —      —      —
  Model divergence    31    46   38.5    82    151  116.5   —      —      —
  Label diversity     36    53   44.5    90    161  125.5   —      —      —
MCA
  Ds ≻ Ld ≻ Md        30    45   37.5    72    135  103.5   —      —      —
  Ds ≻ Md ≻ Ld        30    45   37.5    72    135  103.5   —      —      —
  Ld ≻ Ds ≻ Md        31    46   38.5    82    149  115.5   —      —      —
  Md ≻ Ds ≻ Ld        36    53   44.5    84    161  122.5   —      —      —
  Ld ≻ Md ≻ Ds        31    46   38.5    82    151  116.5   —      —      —
  Md ≻ Ld ≻ Ds        36    53   44.5    90    161  125.5   —      —      —
  mean                32.3  48   40.1    80.3  148.6 114.5  —      —      —
Final
  Md ≻ Ds ≻ Ld        21    36   28.5    61    133  97      —      —      —
  mean                30    43.5 36.7    78.1  142.6 110.4  —      —      —

from such criteria. Future perspectives for this work concern the identification of other local criteria, both general-purpose and domain-specific, the experimentation with other aggregation operators and with other interesting datasets, as well as the extension of this federated approach to other machine learning systems, such as those in the recommendation domain.

Acknowledgements. The authors wish to thank Angelo Schiavone for fruitful discussions and for helping with the implementation of the framework.

References
1. Bagdasaryan, E., Veit, A., Hua, Y., Estrin, D., Shmatikov, V.: How to backdoor federated learning. arXiv preprint arXiv:1807.00459 (2018)
2. Bonawitz, K., Eichner, H., Grieskamp, W., Huba, D., Ingerman, A., Ivanov, V., Kiddon, C., Konečný, J., Mazzocchi, S., McMahan, H.B., Van Overveldt, T., Petrou, D., Ramage, D., Roselander, J.: Towards federated learning at scale: System design. CoRR abs/1902.01046 (2019), http://arxiv.org/abs/1902.01046
3. Caldas, S., Wu, P., Li, T., Konečný, J., McMahan, H.B., Smith, V., Talwalkar, A.: LEAF: A benchmark for federated settings. arXiv preprint arXiv:1812.01097 (2018)
4. Choquet, G.: Theory of capacities. Annales de l'Institut Fourier 5, 131–295 (1954). https://doi.org/10.5802/aif.53
5. Cohen, G., Afshar, S., Tapson, J., van Schaik, A.: EMNIST: Extending MNIST to handwritten letters. In: 2017 International Joint Conference on Neural Networks (IJCNN), pp. 2921–2926. IEEE (2017)
6. da Costa Pereira, C., Dragoni, M., Pasi, G.: Multidimensional relevance: Prioritized aggregation in a personalized information retrieval setting. Inf. Process. Manage. 48(2), 340–357 (2012). https://doi.org/10.1016/j.ipm.2011.07.001
7. Grabisch, M.: The application of fuzzy integrals in multicriteria decision making. European Journal of Operational Research 89(3), 445–456 (1996). https://doi.org/10.1016/0377-2217(95)00176-X
8. Grabisch, M., Roubens, M.: Application of the Choquet integral in multicriteria decision making. Fuzzy Measures and Integrals, pp. 348–374 (2000)
9. Konečný, J., McMahan, B., Ramage, D.: Federated optimization: Distributed optimization beyond the datacenter. CoRR abs/1511.03575 (2015), http://arxiv.org/abs/1511.03575
10. Konečný, J., McMahan, H.B., Ramage, D., Richtárik, P.: Federated optimization: Distributed machine learning for on-device intelligence. CoRR abs/1610.02527 (2016), http://arxiv.org/abs/1610.02527
11. Lecun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11), 2278–2324 (Nov 1998). https://doi.org/10.1109/5.726791
12. Marrara, S., Pasi, G., Viviani, M.: Aggregation operators in information retrieval. Fuzzy Sets and Systems 324, 3–19 (2017). https://doi.org/10.1016/j.fss.2016.12.018
13. McMahan, B., Moore, E., Ramage, D., Hampson, S., y Arcas, B.A.: Communication-efficient learning of deep networks from decentralized data. In: Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, AISTATS 2017, 20–22 April 2017, Fort Lauderdale, FL, USA, pp. 1273–1282 (2017), http://proceedings.mlr.press/v54/mcmahan17a.html
14. Miller, K.W., Voas, J.M., Hurlburt, G.F.: BYOD: Security and privacy considerations. IT Professional 14(5), 53–55 (2012). https://doi.org/10.1109/MITP.2012.93
15. Sahu, A.K., Li, T., Sanjabi, M., Zaheer, M., Talwalkar, A., Smith, V.: On the convergence of federated optimization in heterogeneous networks. arXiv preprint arXiv:1812.06127 (2018)
16. Yager, R.R.: On ordered weighted averaging aggregation operators in multicriteria decisionmaking. IEEE Trans. Systems, Man, and Cybernetics 18(1), 183–190 (1988). https://doi.org/10.1109/21.87068
17. Yager, R.R.: Quantifier guided aggregation using OWA operators. International Journal of Intelligent Systems 11(1), 49–73 (1996)
18. Zhao, Y., Li, M., Lai, L., Suda, N., Civin, D., Chandra, V.: Federated learning with non-IID data. CoRR abs/1806.00582 (2018), http://arxiv.org/abs/1806.00582
