Towards Effective Device-Aware Federated Learning


Authors: Vito Walter Anelli, Yashar Deldjoo, Tommaso Di Noia

Vito Walter Anelli, Yashar Deldjoo, Tommaso Di Noia, and Antonio Ferrara⋆

Polytechnic University of Bari, Bari, Italy
firstname.lastname@poliba.it

Abstract. With the wealth of information produced by social networks, smartphones, medical or financial applications, speculations have been raised about the sensitivity of such data in terms of users' personal privacy and data security. To address the above issues, Federated Learning (FL) has recently been proposed as a means to leave data and computational resources distributed over a large number of nodes (clients), where a central coordinating server aggregates only locally computed updates without knowing the original data. In this work, we extend the FL framework by pushing forward the state of the art in the field along several dimensions: (i) unlike the original FedAvg approach, which relies on a single criterion (i.e., local dataset size), a suite of domain- and client-specific criteria constitutes the basis to compute each local client's contribution; (ii) the multi-criteria contribution of each device is computed in a prioritized fashion by leveraging a priority-aware aggregation operator used in the field of information retrieval; and (iii) a mechanism is proposed for online adjustment of the aggregation operator parameters via a local search strategy with backtracking. Extensive experiments on a publicly available dataset indicate the merits of the proposed approach compared to the standard FedAvg baseline.

Keywords: federated learning, aggregation, data distribution

1 Introduction and Context

The vast amount of data generated by billions of mobile and online IoT devices worldwide holds the promise of significantly improved usability and user experience in intelligent applications.
This large-scale quantity of rich data has created an opportunity to greatly advance the intelligence of machine learning models by catering powerful deep neural network models. Despite this opportunity, nowadays such pervasive devices can capture a lot of data about the user: what she does, what she sees, and even where she goes [14]. Actually, most of these data contain sensitive information that a user may deem private. To respond to concerns about the sensitivity of user data in terms of data privacy and security, in the last few years initiatives have been taken by governments to prioritize and improve the security and privacy of user data. For instance, in 2018 the General Data Protection Regulation (GDPR) was enforced by the European Union to protect users' personal privacy and data security. These issues and regulations pose a new challenge to traditional AI models, where one party is involved in collecting, processing, and transferring all data to other parties. As a matter of fact, it is easy to foresee the risks and responsibilities involved in storing/processing such sensitive data in the traditional centralized AI fashion. Federated learning is an approach recently proposed by Google [9,10,13] with the goal of training a global machine learning model from a massive amount of data distributed over client devices such as personal mobile phones and/or IoT devices. In principle, a FL model is able to deal with fundamental issues related to privacy, ownership, and locality of data [2]. In [13], the authors introduced the FederatedAveraging (FedAvg) algorithm, which combines local stochastic gradient descent on each client with a central server that performs model aggregation by averaging the values of the local model parameters.

⋆ Corresponding author.
To ensure that the developments made in FL scenarios uphold real-world assumptions, in [3] the authors introduced LEAF, a modular benchmarking framework supplying developers/researchers with a rich set of resources including open-source federated datasets, an evaluation framework, and a number of reference implementations. Despite its potentially disruptive contribution, we argue that FedAvg exposes some major shortcomings. First, the aggregation operation in FedAvg sets the contribution of each agent proportional to the individual client's local dataset size. A wealth of qualitative measures, such as the number of sample classes held by each agent, the divergence of each computed local model from the global model (which may be critical for convergence [15]), or some estimations about the agents' computing and connection capabilities or about their honesty and trustworthiness, are ignored. While FedAvg only uses limited knowledge about local data, we argue that the integration of the above-mentioned qualitative measures and the expert's domain knowledge is indispensable for increasing the quality of the global model.
The work at hand considerably extends the FedAvg approach [13] by building on three main assumptions:

– we can substantially improve the quality of the global model by incorporating a set of criteria about the domain and the clients, and properly assigning the contribution of each individual update in the final model based on these criteria;
– the introduced criteria can be combined by using different aggregation operators; toward this goal, we assert the potential benefits of using a prioritized multi-criteria aggregation operator over the identified set of criteria to define each individual's local update contribution to the federation process;
– the computation of the parameters of the aggregation operator (the priority order of the above-mentioned criteria) via online monitoring and adjustment is an important factor for improving the quality of the global model.

The remainder of the paper is structured as follows. Section 2 is devoted to introducing the proposed FL system: it first describes the standard FL model and then provides a formal description of the proposed FL approach and the key concepts behind the integration of local criteria and the prioritized multi-criteria aggregation operator in the proposed system. Section 3 details the experimental setup of the entire system, which relies on LEAF, an open-source benchmarking framework for federated settings that comes with a suite of datasets realistically pre-processed for FL scenarios. Section 4 presents results and discussion. Finally, Section 5 concludes the paper and discusses future perspectives.

2 Federated Learning and Aggregation Operator

In the following, we introduce the main elements behind the proposed approach. We start by presenting a formal description of the standard FL approach (cf.
Section 2.1) and then we describe our proposed FL approach (cf. Section 2.2).

2.1 Background: Standard FL

In a FL setup, a set A = {A_1, ..., A_K} of agents (clients) participates in the training federation, with a server S coordinating them. Each agent A_k stores its local data D_k = {(x^k_1, y^k_1), (x^k_2, y^k_2), ..., (x^k_{|D_k|}, y^k_{|D_k|})} and never shares them with S. In our setting, x^k_i represents data sample i of agent k, and y^k_i is the corresponding label. The motivation behind a FL setup is mainly efficiency (K can be very large) and privacy [1,13]. As the local training data D_k never leaves the federating agents' machines, FL models can be trained on user-private (and sensitive) data, e.g., the history of a user's typed messages, which can be considerably different from publicly accessible datasets.

The final objective in FL is to learn a global model characterized by a parameter vector w^G ∈ R^d, with d being the number of parameters of the model, such that a global loss is minimized without direct access to the data across clients. The basic idea is to train the global model separately for each agent k on D_k, such that a local loss is minimized, and the agents have to share with S only the computed model parameters w^k, which will be aggregated at the server level. By means of a communication protocol, the agents and the global server exchange information about the parameters of the local and global models. At the t-th round of communication, the central server S broadcasts the current global model w^G_t to a fraction of agents A^- ⊂ A. Then, every agent k in A^- carries out some optimization steps over its local data D_k in order to optimize a local loss. Finally, the computed local parameter vector w^k_{t+1} is sent back to the central server.
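For illustration, the client side of one such round can be sketched in a few lines of Python. This is a minimal sketch only: the paper leaves the local model and loss unspecified, so a least-squares loss with plain gradient descent stands in for them, and all function and variable names here are ours.

```python
import numpy as np

def local_update(w_global, X, y, epochs=5, lr=0.01):
    """One agent's step in a round: start from the broadcast global
    parameters w_global and run a few gradient steps on the private
    local data (X, y). A least-squares loss stands in for the
    unspecified local loss."""
    w = w_global.copy()
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)  # gradient of 0.5*mean((Xw - y)^2)
        w -= lr * grad
    return w  # only the parameters are sent back, never (X, y)

rng = np.random.default_rng(0)
X, y = rng.normal(size=(50, 3)), rng.normal(size=50)
w_k = local_update(np.zeros(3), X, y)  # local parameter vector for agent k
```

Note that only `w_k` crosses the network; the raw samples never leave the device, which is the privacy property the protocol relies on.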
The central server S computes a weighted mean of the resulting local models in order to obtain an updated global model w^G_{t+1}:

    w^G_{t+1} = Σ_{k=1}^{|A^-|} p^k_{t+1} w^k_{t+1}.    (1)

For the sake of simplicity of discussion, throughout this work we do not consider the time dimension and focus our attention on one time instance, as given by Equation (2):

    w^G = Σ_{k=1}^{|A^-|} p^k w^k,    (2)

in which p^k ∈ [0,1] is the weight associated with agent k and Σ_{k=1}^{|A^-|} p^k = 1. We argue that collecting information about clients and incorporating that knowledge to compute the appropriate agent-dependent value p^k is important for computing an effective and efficient federated model. Moreover, it is worth noticing that p^k may encode and carry useful knowledge in the optimization of the global model with respect to relevant domain-specific dimensions.

2.2 Proposed Federated Learning Approach

As discussed at the end of the previous section, we may have different factors and/or criteria influencing the computation of p^k. Given a set of properly identified criteria about clients, it could then be possible to enhance the global model update procedure by using this information. To connect this to the formalism presented before, let C = {C_1, ..., C_m} be a set of measurable properties (criteria) characterizing the local agent k or the local data D_k. We use the term c^k_i ∈ [0,1] to denote, for each agent k, the degree of satisfaction of criterion C_i in a specific round of communication.
Hence, in the proposed FL aggregation protocol, the central server computes p^k as

    p^k = f(c^k_1, ..., c^k_m) / Z = s^k / Z,    (3)

where f is a local aggregation operation over the set of properties (criteria) which represent agent k, s^k ∈ R is a numerical score evaluating the k-th agent's contribution based on the m identified properties, and, finally, Z is a normalization factor. In order to ensure that Σ_{k=1}^{|A^-|} p^k = 1 with p^k ∈ [0,1], we compute Z = Σ_{k=1}^{|A^-|} s^k. In the following, we briefly discuss the identified set of criteria (together with a motivation for their selection), the selected aggregation operator, and the online adjustment procedure.

Identification of local criteria. In FedAvg, the server performs the aggregation to compute p^k without knowing any information about the participating clients, except for a purely quantitative measure of local dataset size. Our approach relies on the assumption that it might be much better to use multiple criteria encoding different useful knowledge about the clients to obtain a more informative global model during training. This makes it possible for a domain expert to build the federated model by leveraging any additional domain- and client-specific knowledge. For instance, one may want to choose the criteria in such a way that the rounds of communication needed to reach a desired target accuracy are minimized. Moreover, a domain expert could ask users/clients to measure their adherence to some other target properties (e.g., their nationality, gender, age, job, behavioral characteristics, etc.), in order to build a global model emphasizing the contribution of some classes of users; in this way, the domain expert may, in principle, build a model favoring some targeted commercial purposes.
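Equations (2) and (3) amount to a score-weighted average of the local parameter vectors. The server-side computation can be sketched as follows; the aggregation f is abstracted into precomputed scores s^k, and the names are illustrative rather than part of the protocol.

```python
import numpy as np

def aggregate(local_models, scores):
    """Server-side update of Equations (2)-(3): normalize the per-agent
    scores s_k into weights p_k = s_k / Z with Z = sum of all s_k, then
    build the global model as the weighted mean of the local vectors."""
    s = np.asarray(scores, dtype=float)
    p = s / s.sum()              # each p_k in [0, 1], and sum_k p_k = 1
    W = np.stack(local_models)   # shape (|A^-|, d)
    return p @ W, p              # w_G = sum_k p_k * w_k

models = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
w_G, p = aggregate(models, scores=[3.0, 1.0])  # first agent scores 3x higher
# p = [0.75, 0.25] and w_G = [0.75, 0.25]
```

Whatever f produces, normalizing by Z keeps the update a convex combination of the local models, exactly as Equation (2) requires.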
All in all, we may have a suite of criteria to reach the final global goal (in Section 3 we will see the example adopted in our experimental setup).

Prioritized multi-criteria aggregation operator. Once the local criteria evaluations have been collected, the central server aggregates them for each device in order to obtain a final score associated with that device. Over the years, a wide range of aggregation operators have been proposed in the field of information retrieval (IR) [12]. We selected some prominent ones and exploited them in our FL setup. In particular, we focused on the weighted averaging operator; the ordered weighted averaging (OWA) models [17,16], which extend the binary logic of AND and OR operators by allowing the representation of intermediate quantifiers; the Choquet-based models [4,8,7], which are able to interpret positive and negative interactions between criteria; and, finally, the priority-based models [6]. Due to the lack of space, here we report only the approach and the experimental evaluation related to the last one, modeled in terms of an MCDM problem, because of its better performance.

The core idea of the prioritized multi-criteria aggregation operator proposed in [6] is to assign a priority order to the involved criteria. The main rationale behind the idea is to allow a domain expert to model circumstances where the lack of fulfillment of a higher-priority criterion cannot be compensated by the fulfillment of a lower-priority one [12]. As an example, we may consider the case where the domain expert considers the age of an agent's user far more important than its dataset size, so that even a large local dataset would be penalized if the user-age criterion is not satisfied.
Formally, the prioritized multi-criteria aggregation operator f : [0,1]^m → [0,m] measures an overall score from a prioritized set of criteria evaluations on the local model w^k as follows [6]:

    s^k = f(c^k_1, ..., c^k_m) = Σ_{i=1}^{m} λ_i · c^k_{(i)},
    λ_1 = 1,  λ_i = λ_{i-1} · c^k_{(i-1)},  i ∈ [2, m],    (4)

where c^k_{(i)} is the evaluation of C_{(i)} for device k, and the ·_{(i)} notation indicates the indices of a sorted priority order for the criteria, as specified by the domain expert, from the most important to the least important one. For each score c^k_{(i)}, an importance weight λ_i is computed, depending both on the specified priority order over the criteria and on the fulfillment and the weight of the immediately preceding criterion.

Example 1. Let us suppose that we are interested in evaluating device k based on three criteria C_1, C_2, C_3, and that their respective evaluations are c^k_1 = 0.5, c^k_2 = 0.8, c^k_3 = 0.9. Let the priority order of the criteria be C_(1) = C_1, C_(2) = C_2, C_(3) = C_3, from the most important to the least important; then λ_1 = 1, λ_2 = λ_1 · c^k_(1) = 0.5, λ_3 = λ_2 · c^k_(2) = 0.4. Hence, the final device score will be s^k = (1 · 0.5) + (0.5 · 0.8) + (0.4 · 0.9) = 1.26. If we change the priority order to C_(1) = C_3, C_(2) = C_2, C_(3) = C_1, we would obtain λ_1 = 1, λ_2 = λ_1 · c^k_(1) = 0.9, λ_3 = λ_2 · c^k_(2) = 0.72, with a final device score of s^k = (1 · 0.9) + (0.9 · 0.8) + (0.72 · 0.5) = 1.98. We see that this latter value is higher than the previous one, since the most important criterion is here better fulfilled.

Online adjustment. The aggregation operator we are using takes as a parameter the priority order of the involved criteria; as a consequence, one of the problems is to identify the best ordering for Equation (4), one which takes advantage of the gathered information.
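Equation (4) and Example 1 above can be checked with a short sketch. Note that, applying Equation (4) literally, the second ordering uses λ_3 = λ_2 · c_(2) = 0.72, which yields a score of 1.98; the function name below is ours.

```python
def prioritized_score(evaluations, priority):
    """Prioritized multi-criteria aggregation of Equation (4): criteria are
    visited in the expert's priority order; each weight lambda_i is the
    previous weight scaled by the previous criterion's fulfillment, so a
    poorly met high-priority criterion dampens all later contributions."""
    score, lam = 0.0, 1.0            # lambda_1 = 1
    for i in priority:               # indices from most to least important
        score += lam * evaluations[i]
        lam *= evaluations[i]        # lambda_{i+1} = lambda_i * c_(i)
    return score

c = [0.5, 0.8, 0.9]                    # c_1, c_2, c_3 from Example 1
s_a = prioritized_score(c, [0, 1, 2])  # priority order C1, C2, C3
s_b = prioritized_score(c, [2, 1, 0])  # priority order C3, C2, C1
# s_a = 1*0.5 + 0.5*0.8 + 0.4*0.9  = 1.26
# s_b = 1*0.9 + 0.9*0.8 + 0.72*0.5 = 1.98
```

The non-compensatory behavior is visible in the weights: once a high-priority criterion is poorly fulfilled, every subsequent λ_i (and hence every later contribution) is scaled down.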
Although by definition this priority order could be defined by a domain expert, here we propose to choose the best one in an online fashion, such that we maximize the performance of the model at each round of communication. Let (C_{(1),t}, ..., C_{(m),t}) be the last priority ordering of the criteria used to compute the local scores p^k_t (see Equations (3) and (4)) at time t. The sequence of steps needed to compute the updates to the global model is formalized in Algorithm 1 and commented in the following.

Lines 1–7  On each device, we locally train the last broadcast global model w^G_t with the local training data, in order to compute w^k_{t+1}; then, we measure the local scores for each of the identified criteria.
Lines 9–11  For each device, we use the priority ordering of the criteria already used in the previous round of communication to compute the local score p^k_{t+1}.
Line 12  A new candidate global model w^G_{t+1} is built by computing a weighted average of the local models w.r.t. the computed p^k_{t+1}.
Lines 13–15  On each device, w^G_{t+1} is locally tested using the local test set.
Lines 16–29  An estimation of a global accuracy is computed by weighting the local accuracies w.r.t. the local test set sizes; then, if the obtained accuracy is higher on average than the accuracy obtained with w^G_t, we update the global value w^G_{t+1} and proceed with the next round of communication; otherwise, another permutation is considered and, once a new p^k_{t+1} is computed for each device, we go back to step 3; if no other permutations are available, the candidate global model which produced the least-bad test accuracy is assigned to w^G_{t+1}.

The above-mentioned steps are also graphically illustrated by means of a plot in Figure 1, where an exemplification with dummy values is presented.
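The server-side search with backtracking that these steps describe can be sketched as follows. Here `candidate(order)` abstracts the weighting, aggregation, and device-side testing of the middle steps, and all names are illustrative, not part of the paper's protocol.

```python
from itertools import permutations

def online_adjust(candidate, m, last_order, last_acc):
    """Online adjustment with backtracking: evaluate the previous priority
    ordering first; while the candidate global model's estimated accuracy
    stays below last_acc, try another permutation. If no ordering improves,
    keep the least-bad candidate. candidate(order) -> (model, accuracy)."""
    orders = [tuple(last_order)] + [o for o in permutations(range(m))
                                    if o != tuple(last_order)]
    best = None
    for order in orders:
        model, acc = candidate(order)
        if acc >= last_acc:
            return order, model, acc          # improvement: stop searching
        if best is None or acc > best[2]:
            best = (order, model, acc)        # remember the least-bad option
    return best

# Toy run: accuracies keyed by ordering; the search backtracks past the
# previous ordering (0, 1, 2) until one beats last_acc = 0.70.
accs = {(0, 1, 2): 0.60, (0, 2, 1): 0.58, (1, 0, 2): 0.71}
order, _, acc = online_adjust(lambda o: ("w_G", accs.get(o, 0.50)),
                              m=3, last_order=(0, 1, 2), last_acc=0.70)
# order == (1, 0, 2), acc == 0.71
```

With m criteria the search space is m! orderings, which is why the procedure stops at the first ordering that does not degrade the accuracy instead of exhaustively evaluating all of them.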
Training steps proceed with the same parametrization until a lower accuracy is obtained (blue point in round of communication 8); then, the previous model is restored and the other configurations are tested until a higher accuracy is found (e.g., orange point in round 8). When a higher accuracy cannot be found, the least-bad option is selected (e.g., green point in round 10).

Algorithm 1 Sequence of steps executed by the server to compute the new global model with online adjustment of the aggregation operator parameters. Functions ModelUpdate, PropertyMeasure, and LocalTestAccuracy are executed locally on the k-th device. Variable acc_t is an estimation of the global accuracy.

Require: w^G_t, acc_t, (C_{(1),t}, ..., C_{(m),t})
Ensure: w^G_{t+1}, acc_{t+1}, (C_{(1),t+1}, ..., C_{(m),t+1})
 1: broadcast w^G_t to clients in A^-
 2: for each client k ∈ A^- in parallel do
 3:   w^k_{t+1} ← ModelUpdate(k, w^G_t)
 4:   for each criterion C_i ∈ C do
 5:     c^k_{i,t+1} ← PropertyMeasure(k, w^k_{t+1}, C_i)
 6:   end for
 7: end for
 8: P ← (C_{(1),t}, ..., C_{(m),t})
 9: for each client k ∈ A^- do
10:   p^k_{t+1} ← f(c^k_{(1),t+1}, ..., c^k_{(m),t+1}) / Z
11: end for
12: w^G_{t+1} ← Σ_{k=1}^{|A^-|} p^k_{t+1} w^k_{t+1}
13: for each client k ∈ A in parallel do
14:   acc^k_{t+1} ← LocalTestAccuracy(k, w^G_{t+1})
15: end for
16: acc_{t+1} ← weighted average of acc^k_{t+1} w.r.t.
local test set size, ∀k ∈ A
17: while acc_{t+1} < acc_t do
18:   if other priority orderings are available then
19:     P ← another priority ordering of the criteria (C_{(1)}, ..., C_{(m)})
20:     repeat steps 9–16
21:   else
22:     P ← the priority ordering for which we get the maximum value of acc_{t+1}
23:     acc^k_{t+1} ← accuracy of the model which performed best
24:     repeat steps 9–12
25:     break
26:   end if
27: end while
28: (C_{(1),t+1}, ..., C_{(m),t+1}) ← P
29: w^G_{t+1} ← w^G_{t+1}

Fig. 1. An illustration of the online parameter adjustment for the aggregation operator (accuracy per round of communication, for parametrizations A, B, and C).

3 Experimental setup

In this section we describe the experimental setup used to validate the performance of the proposed FL system.

Experimental Evaluation Framework. In order to perform the experimental validation and performance evaluation, an extensive set of experiments has been carried out relying on LEAF [3], a modular open-source benchmarking framework for federated settings, which comes with a suite of datasets appropriately preprocessed for FL scenarios. LEAF also provides reproducible reference implementations and introduces rigorous system and statistical metrics for understanding the quality of the FL approach. As for the metrics computation, the global model is tested on each device over the local test sets. The objective of LEAF is to capture the distribution of performance across devices by considering the 10th and 90th percentiles of the local accuracy values and by estimating a global accuracy (local accuracy values are averaged, weighted by local test set size).
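The accuracy aggregation just described can be sketched as a simple re-implementation of the weighted average and percentile summary; the function name and toy values are ours.

```python
import numpy as np

def leaf_style_metrics(local_accs, test_sizes):
    """Summarize per-device accuracies as described above: a global
    accuracy as the test-set-size-weighted average of the local
    accuracies, plus the 10th and 90th percentiles, which capture
    the spread of performance across devices."""
    a = np.asarray(local_accs, dtype=float)
    n = np.asarray(test_sizes, dtype=float)
    global_acc = float(np.sum(a * n) / np.sum(n))
    p10, p90 = np.percentile(a, [10, 90])
    return global_acc, p10, p90

g, p10, p90 = leaf_style_metrics([0.9, 0.5, 0.7], [100, 50, 50])
# g = (0.9*100 + 0.5*50 + 0.7*50) / 200 = 0.75
```

Weighting by test set size prevents devices with only a handful of test samples from dominating the global estimate, while the percentiles expose disparities that a single average would hide.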
In this work, we improve the validation of the FL setting by using an approach which offers an overview of the whole training performance, instead of metrics describing a single round of communication. More specifically, we measure the number of rounds of communication required to allow a certain percentage of the devices participating in the federation process to reach a target accuracy (e.g., 75% or 80%), since this measurement is able to fairly show how effective and efficient the model is across the devices.

Federated dataset. We run our experiments using the FEMNIST dataset [3], which contains handwritten characters and digits from various writers, together with their true labels. Unlike the original FedAvg algorithm [13], which uses the MNIST dataset [11] artificially split by labels, the FEMNIST dataset [3] is larger and more realistically distributed. The dataset contains 805,263 examples of 62 classes of handwritten characters and digits from 3,550 writers, and it is built by partitioning the data in ExtendedMNIST [5] (an extended version of MNIST with letters and digits) based on the writers of the digits/characters. It is important to note that the data in FEMNIST are inherently non-IID distributed, as the local training data can vary between clients; therefore, they are not representative of the whole population distribution. We use the described dataset to perform a digit/character classification task, although for computational limits we use a subsampled version (10% of the total, with 371 clients involved).

Convolutional model. Similar to [13], the classification task is performed by using a convolutional neural network (CNN). The network has two convolutional
layers with 5x5 filters (the first with 32 channels, the second with 64, each followed by 2x2 max pooling), a fully connected layer with 2048 units and ReLU activation, and a final softmax output layer, for a total of 6,603,710 parameters.

Hyperparameter settings. We set the hyperparameters for the whole set of our experiments as follows, also guided by the results obtained in [13]. As for the FedAvg client fraction parameter, in each round of communication only 10% of the clients are selected to perform the computation. Concerning the parameters of stochastic gradient descent (SGD), we set the local batch size to 10 and the number of local epochs to 5; this is the configuration that, in the baseline, makes it possible to reach the target accuracy in the fewest rounds of communication. Moreover, we set the learning rate to η = 0.01. Finally, we set the maximum number of rounds of communication per experiment to 1,000.

Identified local criteria. In our experimental setting, the proposed FL system extends the purely quantitative criterion of FedAvg [13] (dataset size) and leverages two new criteria. Please note that we are not stating that the proposed ones are the only possible criteria; we present them just to show how the introduction of new information may lead to a better final model. More specifically, in our experimental evaluation, we aim both at reducing the number of rounds of communication necessary to reach a target accuracy and at preventing the global model from diverging towards local specializations and overfitting. The criteria have been defined so that c^k_i ∈ [0,1], with 0 meaning bad performance and 1 good performance.
Moreover, in order to bring each criterion onto the same interval scale, we normalized them such that Σ_{k=1}^{|A^-|} c^k_i = 1.

Local dataset size (baseDS). The first criterion we considered is the one already used by FedAvg [13], namely the local dataset size, given by c^k_1 = |D_k| / |∪_{i∈A^-} D_i|. This criterion is a purely quantitative measure of the local data, which serves both as the baseline in the empirical validation of the results (i.e., when used in isolation) and as part of the entire identified set of criteria in the developed FL system (i.e., when used in a group).

Local label diversity (Ld). The second considered criterion is the diversity of labels in each local dataset, i.e., the diversity of each local dataset in terms of class labels. We assert this criterion to be important since it can provide a clue about how useful each device can be for learning to predict different labels. To quantify this criterion we use c^k_2 = δ(D_k) / Σ_{i∈A^-} δ(D_i), where δ measures the number of different labels (classes) present over the samples of a dataset.

Local model divergence (Md). With non-IID distributions (and this is the case for our dataset), model performance dramatically degrades [18]. Moreover, a large number of local training epochs may lead each device to move further away from the initial global model, towards the opposite of the global objective [15]. Therefore, a possible solution inspired by [15] is to limit these negative effects by penalizing higher divergences and highlighting local models that are not very far from the received global model. We evaluate the local model divergence as c^k_3 = φ_k / Σ_{i∈A^-} φ_i, where φ_i = 1 / (√‖w^G − w^i‖₂ + 1).
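The three criteria can be computed as follows. This is an illustrative sketch with names of our own choosing: it reads the divergence term as φ_i = 1 / (√‖w^G − w^i‖₂ + 1), normalizes each criterion over the participating agents as described above, and uses the sum of local dataset sizes for baseDS, which coincides with the size of their union when the local datasets are disjoint (as in FEMNIST, where data is partitioned by writer).

```python
import numpy as np

def criteria_for_agent(k, datasets, labelsets, local_models, w_global):
    """Compute (baseDS, Ld, Md) for agent k, each normalized so that a
    criterion's values sum to 1 over the participating agents."""
    sizes = np.array([len(d) for d in datasets], dtype=float)         # |D_k|
    divers = np.array([len(set(l)) for l in labelsets], dtype=float)  # delta
    phis = np.array([1.0 / (np.sqrt(np.linalg.norm(w_global - w)) + 1.0)
                     for w in local_models])                          # phi_k
    return (sizes[k] / sizes.sum(),    # c1: local dataset size (baseDS)
            divers[k] / divers.sum(),  # c2: local label diversity (Ld)
            phis[k] / phis.sum())      # c3: local model divergence (Md)

datasets = [list(range(100)), list(range(50))]  # local sizes 100 vs 50
labels = [[0, 1, 2, 3], [0, 1]]                 # 4 vs 2 distinct classes
models = [np.zeros(3), np.ones(3)]              # agent 0 matches w_G exactly
c1, c2, c3 = criteria_for_agent(0, datasets, labels, models, np.zeros(3))
```

Since agent 0's local model coincides with the global one, its φ is maximal (equal to 1) and it takes the larger share of the Md criterion.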
4 Results and Discussion

In order to validate the empirical performance of the proposed FL system, an extensive set of experiments has been carried out with respect to three under-study exploration dimensions, in agreement with the assumptions presented in Section 1. The final results are presented in Table 1. Note that the results are presented for reaching two distinct desired target global accuracies of 75% and 80%.¹ Each column indicates the percentage of devices participating in the federation process that reach a desired target accuracy.² In addition, we present the results in three groups (Low, Mid, High) of percentages of participating devices.

Study A: Effect of individual criteria. Study A contemplates answering the question: "Are we able to introduce a set of device- and data-dependent criteria through the help of which we can train a better global model?". The results for this study are summarized in the rows Ind of Table 1. To answer this question, we considered the effect of each of the three identified criteria baseDS, Md, Ld in isolation. The results with respect to both desired accuracies of 75% and 80% show that the newly identified criteria (Md and Ld) have an impact on the final quality of the global model which is comparable (in the Low and Mid cases) or superior (in the High case) with respect to the conventional baseDS criterion. For example, when comparing baseDS and Md in the Low case, one can notice that the results are 25.5 vs. 27, a marginal difference of only 6%. However, if we desire to satisfy a higher number of devices (High case) reaching a certain accuracy, the introduced criteria show a quality substantially better than the baseDS criterion. For example, Ld has a mean performance of 405, compared with 552.5 obtained by baseDS; this is equal to an improvement of 36% with respect to the existing baseline.
These initial results already show how the global model can benefit from considering criteria other than just the dataset size.

Study B: Impact of priority order in multi-criteria aggregation. Study B focuses on the question: "Are we able to exploit the potential benefits of a prioritized multi-criteria aggregation operator to build a more informative global model based on the identified criteria?". The results for this study are summarized in the rows MCA of Table 1. To answer this research question, we performed one experiment for each individual permutation of the criteria in the prioritized multi-criteria aggregation setting. Since there are 3 identified criteria, we have in total 6 permutations. For a fine-grained analysis, we provide the results obtained for all the permutation runs, denoted, e.g., by Ds ≻ Ld ≻ Md, Ds ≻ Md ≻ Ld. By looking at the results, we can notice that in the Low and Mid categories the best results are obtained for Ds ≻ Ld ≻ Md and Ds ≻ Md ≻ Ld. These results share a similar characteristic: by considering Ds as the most important criterion, we can grant a smaller subset of devices the chance to reach a desired target accuracy at a faster pace. This result is in agreement with the individual results (see Ind in Table 1), in the sense that the criterion Ds provides the best quality in the Low and Mid study cases for both desired target accuracies of 75% and 80%.

¹ We chose these accuracy values since they represent reasonable values, and prediction tasks higher than 80% accuracy are not reached in the 1,000 allowed rounds of communication.
² The total number of participating devices in the federation is 371; thus 20%, as an example, indicates the rounds of communication required for 0.2 × 371 ≈ 75 devices to reach the desired target accuracy.
However, when concentrating on the High category, one can notice that Md ≻ Ds ≻ Ld provides the best performance. This result is a bit surprising and shows that, to satisfy a higher number of devices, the criterion Md plays the most important role. The result is surprising in the sense that, in the individual results (see Ind in Table 1), Ld has the best performance, while in the obtained result it has the lowest priority. Interestingly, we may notice that in all these best cases the pattern Ds ≻ Ld always occurs.³

Study C: Impact of online adjustment of the priority order in multi-criteria aggregation. Finally, Study C studies the question: "Is it possible to update the parameters of the aggregation operator (the priority order of the above-mentioned criteria) via online monitoring and adjustment for improving the quality of the global model?". The results for this study are summarized in the rows Final of Table 1. This study is in fact concerned with the dynamic behavior of our proposed FL approach, letting the server choose at each round of communication the priority ordering maximizing the accuracy (i.e., obtaining the best sub-optimal accuracy). Similar to the previous study, here we also ran six experiments, related to the six possible initializations of the priority combinations. In Table 1 we show the results related to the best run and to their mean. In this final experimental setting, we see an overall improvement in the performance of the proposed approach when we initialize the priority ordering with Md ≻ Ds ≻ Ld. Also in this case, the pattern Ds ≻ Ld occurs.
5 Conclusions and Future Perspectives

In this work we presented a practical protocol for effective aggregation in federated learning, proposing a set of device- and data-aware properties (criteria) that are exploited by a central server in order to obtain a more informative global model. Our experiments show that the standard federated learning baseline, FedAvg, can be substantially improved, training high-quality models in relatively few rounds of communication, by using a properly defined set of local criteria together with an aggregation strategy that can exploit the information

[3] We recall that a preference relation ≻ is transitive; hence Ds ≻ Md ≻ Ld implies Ds ≻ Ld.

Table 1. The final results of the empirical evaluation. Each table cell provides the number of rounds of communication necessary for the percentage of devices specified in the columns to reach a desired target accuracy (either 75% or 80% in our case). Runs that did not reach the target accuracy for the specified percentage of devices within the allowed rounds (1,000) are marked with —. The best results obtained in study MCA are shown in bold violet, while the best results in study Final are shown in bold italic blue.
Target accuracy 75%
                        Low                 Mid                 High
Study/% devices      20%   30%   mean   40%   50%   mean   70%    75%    mean
Ind
  Dataset size (base) 22    29   25.5    39    62   50.5   304    801    552.5
  Model divergence    24    30   27      41    67   54     274    768    521
  Label diversity     25    32   28.5    43    70   56.5   278    532    405
MCA
  Ds ≻ Ld ≻ Md        20    29   24.5    39    60   49.5   300    823    561.5
  Ds ≻ Md ≻ Ld        20    29   24.5    39    60   49.5   300    669    484.5
  Ld ≻ Ds ≻ Md        24    31   27.5    41    68   54.5   259    768    513.5
  Md ≻ Ds ≻ Ld        24    32   28      45    70   57.5   255    532    393.5
  Ld ≻ Md ≻ Ds        23    30   26.5    41    68   54.5   270    729    499.5
  Md ≻ Ld ≻ Ds        24    32   28      46    70   58     255    620    437.5
  mean                22.5  30.5 26.5    41.8  66   53.9   273.17 690.1  481.6
Final
  Md ≻ Ds ≻ Ld        12    19   15.5    26    57   41.5   164    494    329
  mean                20.5  27.5 24      38.6  61.8 50.2   223    611.8  417.4

Target accuracy 80%
                        Low                 Mid                 High
Study/% devices      20%   30%   mean   40%   50%   mean   70%    75%    mean
Ind
  Dataset size (base) 31    45   38      72    136  104     —      —      —
  Model divergence    31    46   38.5    82    151  116.5   —      —      —
  Label diversity     36    53   44.5    90    161  125.5   —      —      —
MCA
  Ds ≻ Ld ≻ Md        30    45   37.5    72    135  103.5   —      —      —
  Ds ≻ Md ≻ Ld        30    45   37.5    72    135  103.5   —      —      —
  Ld ≻ Ds ≻ Md        31    46   38.5    82    149  115.5   —      —      —
  Md ≻ Ds ≻ Ld        36    53   44.5    84    161  122.5   —      —      —
  Ld ≻ Md ≻ Ds        31    46   38.5    82    151  116.5   —      —      —
  Md ≻ Ld ≻ Ds        36    53   44.5    90    161  125.5   —      —      —
  mean                32.3  48   40.1    80.3  148.6 114.5  —      —      —
Final
  Md ≻ Ds ≻ Ld        21    36   28.5    61    133  97      —      —      —
  mean                30    43.5 36.7    78.1  142.6 110.4  —      —      —

from such criteria. Future perspectives for this work concern the identification of other local criteria, both general-purpose and domain-specific, the experimentation with other aggregation operators and with other interesting datasets, as well as the extension of this federated approach to other machine learning systems, such as those in the recommendation domain.

Acknowledgements. The authors wish to thank Angelo Schiavone for fruitful discussions and for helping with the implementation of the framework.

References
1. Bagdasaryan, E., Veit, A., Hua, Y., Estrin, D., Shmatikov, V.: How to backdoor federated learning. arXiv preprint arXiv:1807.00459 (2018)
2. Bonawitz, K., Eichner, H., Grieskamp, W., Huba, D., Ingerman, A., Ivanov, V., Kiddon, C., Konečný, J., Mazzocchi, S., McMahan, H.B., Van Overveldt, T., Petrou, D., Ramage, D., Roselander, J.: Towards federated learning at scale: System design. CoRR abs/1902.01046 (2019), http://arxiv.org/abs/1902.01046
3. Caldas, S., Wu, P., Li, T., Konečný, J., McMahan, H.B., Smith, V., Talwalkar, A.: LEAF: A benchmark for federated settings. arXiv preprint arXiv:1812.01097 (2018)
4. Choquet, G.: Theory of capacities. Annales de l'Institut Fourier 5, 131–295 (1954). https://doi.org/10.5802/aif.53
5. Cohen, G., Afshar, S., Tapson, J., van Schaik, A.: EMNIST: Extending MNIST to handwritten letters. In: 2017 International Joint Conference on Neural Networks (IJCNN), pp. 2921–2926. IEEE (2017)
6. da Costa Pereira, C., Dragoni, M., Pasi, G.: Multidimensional relevance: Prioritized aggregation in a personalized information retrieval setting. Inf. Process. Manage. 48(2), 340–357 (2012). https://doi.org/10.1016/j.ipm.2011.07.001
7. Grabisch, M.: The application of fuzzy integrals in multicriteria decision making. European Journal of Operational Research 89(3), 445–456 (1996). https://doi.org/10.1016/0377-2217(95)00176-X
8. Grabisch, M., Roubens, M.: Application of the Choquet integral in multicriteria decision making. Fuzzy Measures and Integrals, pp. 348–374 (2000)
9. Konečný, J., McMahan, B., Ramage, D.: Federated optimization: Distributed optimization beyond the datacenter. CoRR abs/1511.03575 (2015), http://arxiv.org/abs/1511.03575
10. Konečný, J., McMahan, H.B., Ramage, D., Richtárik, P.: Federated optimization: Distributed machine learning for on-device intelligence. CoRR abs/1610.02527 (2016), http://arxiv.org/abs/1610.02527
11. Lecun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11), 2278–2324 (Nov 1998). https://doi.org/10.1109/5.726791
12. Marrara, S., Pasi, G., Viviani, M.: Aggregation operators in information retrieval. Fuzzy Sets and Systems 324, 3–19 (2017). https://doi.org/10.1016/j.fss.2016.12.018
13. McMahan, B., Moore, E., Ramage, D., Hampson, S., y Arcas, B.A.: Communication-efficient learning of deep networks from decentralized data. In: Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, AISTATS 2017, 20–22 April 2017, Fort Lauderdale, FL, USA, pp. 1273–1282 (2017), http://proceedings.mlr.press/v54/mcmahan17a.html
14. Miller, K.W., Voas, J.M., Hurlburt, G.F.: BYOD: Security and privacy considerations. IT Professional 14(5), 53–55 (2012). https://doi.org/10.1109/MITP.2012.93
15. Sahu, A.K., Li, T., Sanjabi, M., Zaheer, M., Talwalkar, A., Smith, V.: On the convergence of federated optimization in heterogeneous networks. arXiv preprint arXiv:1812.06127 (2018)
16. Yager, R.R.: On ordered weighted averaging aggregation operators in multicriteria decisionmaking. IEEE Trans. Systems, Man, and Cybernetics 18(1), 183–190 (1988). https://doi.org/10.1109/21.87068
17. Yager, R.R.: Quantifier guided aggregation using OWA operators. International Journal of Intelligent Systems 11(1), 49–73 (1996)
18. Zhao, Y., Li, M., Lai, L., Suda, N., Civin, D., Chandra, V.: Federated learning with non-IID data. CoRR abs/1806.00582 (2018), http://arxiv.org/abs/1806.00582
