DETOX: A Redundancy-based Framework for Faster and More Robust Gradient Aggregation


Authors: Shashank Rajput, Hongyi Wang, Zachary Charles, Dimitris Papailiopoulos

Shashank Rajput* (University of Wisconsin-Madison, rajput3@wisc.edu), Hongyi Wang* (University of Wisconsin-Madison, hongyiwang@cs.wisc.edu), Zachary Charles (University of Wisconsin-Madison, zcharles@wisc.edu), Dimitris Papailiopoulos (University of Wisconsin-Madison, dimitris@papail.io)

Abstract

To improve the resilience of distributed training to worst-case, or Byzantine, node failures, several recent approaches have replaced gradient averaging with robust aggregation methods. Such techniques can have high computational costs, often quadratic in the number of compute nodes, and only have limited robustness guarantees. Other methods have instead used redundancy to guarantee robustness, but can tolerate only a limited number of Byzantine failures. In this work, we present DETOX, a Byzantine-resilient distributed training framework that combines algorithmic redundancy with robust aggregation. DETOX operates in two steps: a filtering step that uses limited redundancy to significantly reduce the effect of Byzantine nodes, and a hierarchical aggregation step that can be used in tandem with any state-of-the-art robust aggregation method. We show theoretically that this leads to a substantial increase in robustness, and has a per-iteration runtime that can be nearly linear in the number of compute nodes. We provide extensive experiments over real distributed setups across a variety of large-scale machine learning tasks, showing that DETOX leads to orders-of-magnitude accuracy and speedup improvements over many state-of-the-art Byzantine-resilient approaches.

1 Introduction

To scale the training of machine learning models, gradient computations can often be distributed across multiple compute nodes. After computing these local gradients, a parameter server then averages them and updates a global model.
As the scale of data and available compute power grows, so does the probability that some compute nodes output unreliable gradients. This can be due to power outages, faulty hardware, or communication failures, or due to security issues, such as the presence of an adversary governing the output of a compute node. Due to the difficulty of quantifying these different types of errors separately, we often model them as Byzantine failures, which are assumed to be able to produce any output, adversarial or otherwise. Unfortunately, the presence of a single Byzantine compute node can result in arbitrarily bad global models when aggregating gradients via their average [1]. In the context of distributed training, there have generally been two distinct approaches to improving Byzantine robustness. The first replaces the gradient averaging step at the parameter server with a robust aggregation step, such as the geometric median and variants thereof [1, 2, 3, 4, 5, 6]. The second approach instead assigns each node redundant gradients, and uses this redundancy to eliminate the effect of Byzantine failures [7, 8, 9].

* Authors contributed equally to this paper and are listed alphabetically.

Both of the above approaches have their own limitations. For the first, robust aggregators are typically expensive to compute and scale super-linearly (in many cases quadratically [10, 4]) with the number of compute nodes. Moreover, such methods often come with limited theoretical guarantees of Byzantine robustness (e.g., only establishing convergence in the limit, or only guaranteeing that the output of the aggregator has positive inner product with the true gradient [1, 10]) and often require strong assumptions, such as bounds on the dimension of the model being trained. On the other hand, redundancy or coding-theoretic approaches offer strong guarantees of perfect recovery of the aggregated gradients.
However, such approaches, in the worst case, require each node to compute Ω(q) times more gradients, where q is the number of Byzantine machines [7]. This overhead is prohibitive in settings with a large number of Byzantine machines.

Figure 1: DETOX is a hierarchical scheme for Byzantine gradient aggregation. In its first step, the parameter server partitions the compute nodes into groups and assigns each node in a group the same batch of data. After the nodes compute gradients with respect to this batch, the PS takes a majority vote of their outputs, which filters out a large fraction of the Byzantine gradients. In the second step, the parameter server partitions the filtered gradients into large groups and applies a given aggregation method to each group. In the last step, the parameter server applies a robust aggregation method (e.g., geometric median) to the previous outputs. The final output is used to perform a gradient update step.

Figure 2: Top: convergence comparisons among various vanilla robust aggregation methods and their versions after deploying DETOX, under the "a little is enough" Byzantine attack [11]. Bottom: per-iteration runtime analysis of the various methods. All results are for ResNet-18 trained on CIFAR-10. The prefix "D-" stands for a robust aggregation method paired with DETOX.

Our contributions. In this work, we present DETOX, a Byzantine-resilient distributed training framework that first uses computational redundancy to filter out almost all Byzantine gradients, and then performs a hierarchical robust aggregation method. DETOX is scalable, flexible, and is designed to be used on top of any robust aggregation method to obtain improved robustness and efficiency. A high-level description of the hierarchical nature of DETOX is given in Fig. 1. DETOX proceeds in three steps. First, the parameter server arranges the compute nodes into groups of r that compute the same gradients. While this step requires redundant computation at the node level, it eventually allows for much faster computation at the PS level, as well as improved robustness. After all compute nodes send their gradients to the PS, the PS takes the majority vote within each group of gradients. We show that by setting r to be logarithmic in the number of compute nodes, only a constant number of Byzantine gradients remain after the majority-vote step, even if the number of Byzantine nodes is a constant fraction of the total number of compute nodes. DETOX then performs hierarchical robust aggregation in two steps: first, it partitions the filtered gradients into a small number of groups and aggregates them using simple techniques such as averaging; second, it applies any robust aggregator (e.g., geometric median [2, 6], Bulyan [10], Multi-Krum [4], etc.)
to the averaged gradients, to further minimize the effect of any remaining traces of the original Byzantine gradients.

We prove that DETOX can obtain orders-of-magnitude improved robustness guarantees compared to its competitors, and can achieve this at nearly linear complexity in the number of compute nodes p, unlike methods such as Bulyan [10] that require runtime quadratic in p. We extensively test our method in real distributed setups and large-scale settings, showing that by combining DETOX with previously proposed Byzantine-robust methods, such as Multi-Krum, Bulyan, and coordinate-wise median, we increase the robustness and reduce the overall runtime of the algorithm. Moreover, we show that under strong Byzantine attacks, DETOX can lead to almost a 40% increase in accuracy over vanilla implementations of Byzantine-robust aggregation. A brief performance comparison with some of the current state-of-the-art aggregators is shown in Fig. 2.

Related work. The topic of Byzantine fault tolerance has been extensively studied since the early 80s, starting with Lamport et al. [12], and deals with worst-case and/or adversarial failures, e.g., system crashes, power outages, software bugs, and adversarial agents that exploit security flaws. In the context of distributed optimization, these failures manifest through a subset of compute nodes returning flawed or adversarial updates to the master. It is now well understood that first-order methods, such as gradient descent or mini-batch SGD, are not robust to Byzantine errors; even a single erroneous update can introduce arbitrary errors to the optimization variables. Byzantine-tolerant ML has been extensively studied in recent years [13, 14, 15, 16, 17, 2], establishing that while average-based gradient methods are susceptible to adversarial nodes, median-based update methods can in some cases achieve better convergence while being robust to some attacks.
Although theoretical guarantees are provided in many works, the proposed algorithms often ensure only a weak form of resilience against Byzantine failures, and often fail against strong Byzantine attacks [10]. A stronger form of Byzantine resilience is desirable for most distributed machine learning applications. To the best of our knowledge, DRACO [7] and BULYAN [10] are the only proposed methods that guarantee strong Byzantine resilience. However, as mentioned above, DRACO requires heavy redundant computation from the compute nodes, while BULYAN incurs heavy computational overhead on the parameter server. We note that [18] presents an alternative approach that does not fit easily under either category, but requires convexity of the underlying loss function. Finally, [19] examines the robustness of signSGD with majority-vote aggregation, but studies a restricted Byzantine failure setup that only allows for a blind multiplicative adversary.

2 Problem Setup

Our goal is to solve the following empirical risk minimization problem:

$$\min_w F(w) := \frac{1}{n} \sum_{i=1}^{n} f_i(w),$$

where w ∈ R^d denotes the parameters of a model, and f_i is the loss function on the i-th training sample. To approximately solve this problem, we often use mini-batch SGD. First, we initialize at some w_0. At iteration t, we sample S_t uniformly at random from {1, ..., n}, and then update via

$$w_{t+1} = w_t - \frac{\eta_t}{|S_t|} \sum_{i \in S_t} \nabla f_i(w_t), \qquad (1)$$

where S_t is a randomly selected subset of the n data points. To perform mini-batch SGD in a distributed manner, the global model w_t is stored at a parameter server (PS) and updated according to (1), i.e., by using the mean of the gradients evaluated at the compute nodes. Let p denote the total number of compute nodes. At each iteration t of distributed mini-batch SGD, the PS broadcasts w_t to each compute node.
Each compute node is assigned S_{i,t} ⊆ S_t, and then evaluates the sum of gradients

$$g_i = \sum_{j \in S_{i,t}} \nabla f_j(w_t).$$

The PS then updates the global model via

$$w_{t+1} = w_t - \frac{\eta_t}{p} \sum_{i=1}^{p} g_i.$$

We note that in our setup we assume that the parameter server is the owner of the data, and has access to the entire data set of size n.

Distributed training with Byzantine nodes. We assume that a fixed subset Q of size q of the p compute nodes are Byzantine. Let ĝ_i be the output of node i. If i is not Byzantine (i ∉ Q), we say it is "honest", in which case its output is ĝ_i = g_i, where g_i is the true sum of gradients assigned to node i. If i is Byzantine (i ∈ Q), its output ĝ_i can be any d-dimensional vector. The PS receives {ĝ_i}_{i=1}^p, and can then process these vectors to produce some approximation to the true gradient update in (1). We make no assumptions on the Byzantine outputs. In particular, we allow adversaries with full information about F and w_t, and the Byzantine compute nodes can collude. Let ε = q/p be the fraction of Byzantine nodes. We will assume ε < 1/2 throughout.

3 DETOX: A Redundancy Framework to Filter Most Byzantine Gradients

We now describe DETOX, a framework for Byzantine-resilient mini-batch SGD with p nodes, q of which are Byzantine. Let b ≥ p be the desired batch size, and let r be an odd integer. We refer to r as the redundancy ratio. For simplicity, we will assume that r divides p and that p divides b; DETOX can be directly extended to the setting where this does not hold. DETOX first computes a random partition of [p] into p/r node groups A_1, ..., A_{p/r}, each of size r. This partition is fixed throughout. We then initialize at some w_0. For t ≥ 0, we wish to compute some approximation to the gradient update in (1). To do so, we need a Byzantine-robust estimate of the true gradient. Fix t, and let us suppress the notation t when possible.
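The vote-and-aggregate pipeline described in this section can be sketched end to end as a small NumPy simulation. This is a minimal sketch and not the paper's implementation: it assumes A_0 = mean and A_1 = coordinate-wise median (one of the pairings evaluated later), and `detox_iteration` and its arguments are illustrative names of ours.

```python
import numpy as np

def detox_iteration(grads, groups, k, rng):
    """One DETOX aggregation step at the PS.

    grads:  (p, d) array; row i is the (possibly Byzantine) output of node i.
    groups: list of index arrays A_1, ..., A_{p/r}, each of size r; all honest
            nodes in a group were assigned the same batch, so their rows agree.
    k:      vote-group size for the hierarchical aggregation step.
    """
    # Step 1: majority vote inside each node group (filtering).
    votes = []
    for A in groups:
        rows = [tuple(grads[i]) for i in A]
        best = max(set(rows), key=rows.count)
        if rows.count(best) > len(A) // 2:
            votes.append(np.array(best))
        else:
            # No strict majority: emit the zero vector, as in Algorithm 1.
            votes.append(np.zeros(grads.shape[1]))
    votes = np.array(votes)

    # Step 2: hierarchical aggregation (HIER-AGGR) with A_0 = mean and
    # A_1 = coordinate-wise median over the vote-group means.
    perm = rng.permutation(len(votes))
    vote_groups = np.array_split(votes[perm], max(1, len(votes) // k))
    group_means = np.array([g.mean(axis=0) for g in vote_groups])
    return np.median(group_means, axis=0)
```

Because honest nodes in a group are assigned the same batch, their outputs are bit-identical, which is what makes the exact-match majority vote in step 1 meaningful.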
As in mini-batch SGD, let S be a subset of [n] of size b, with each element sampled uniformly at random from [n]. We then partition S into groups S_1, ..., S_{p/r} of size br/p. For each i ∈ A_j, the PS assigns node i the task of computing

$$g_j := \frac{1}{|S_j|} \sum_{k \in S_j} \nabla f_k(w) = \frac{p}{rb} \sum_{k \in S_j} \nabla f_k(w). \qquad (2)$$

If i is an honest node, then its output is ĝ_i = g_j, while if i is Byzantine, it outputs some arbitrary d-dimensional ĝ_i. The ĝ_i are then sent to the PS. The PS then computes z_j := maj({ĝ_i | i ∈ A_j}), where maj denotes the majority vote; if there is no majority, we set z_j = 0. We will refer to z_j as the "vote" of group j. Since some of these votes may still be Byzantine, we must perform some robust aggregation of the votes. We employ a hierarchical robust aggregation process HIER-AGGR, which uses two user-specified aggregation methods A_0 and A_1. First, the votes are partitioned into k groups. Let ẑ_1, ..., ẑ_k denote the outputs of A_0 on each group. The PS then computes Ĝ = A_1(ẑ_1, ..., ẑ_k) and updates the model via w = w − η Ĝ. This hierarchical aggregation resembles a median-of-means approach on the votes [20], and has the benefit of improved robustness and efficiency. We discuss this in further detail in Section 4. A description of DETOX is given in Algorithm 1.

Algorithm 1 DETOX: Algorithm to be performed at the parameter server
input Batch size b, redundancy ratio r, compute nodes 1, ..., p, step sizes {η_t}_{t≥0}.
1: Randomly partition [p] into "node groups" {A_j | 1 ≤ j ≤ p/r} of size r.
2: for t = 0 to T do
3:   Draw S_t of size b randomly from [n].
4:   Partition S_t into groups {S_{t,j} | 1 ≤ j ≤ p/r} of size rb/p.
5:   For each j ∈ [p/r], i ∈ A_j, push w_t and S_{t,j} to compute node i.
6:   Receive the (potentially Byzantine) p gradients ĝ_{t,i} from each node.
7:   Let z_{t,j} := maj({ĝ_{t,i} | i ∈ A_j}), and 0 if no majority exists.
%Filtering step
8:   Set Ĝ_t = HIER-AGGR({z_{t,1}, ..., z_{t,p/r}}). %Hierarchical aggregation
9:   Set w_{t+1} = w_t − η_t Ĝ_t. %Gradient update
10: end for

Algorithm 2 HIER-AGGR: Hierarchical aggregation
input Aggregators A_0, A_1, votes {z_1, ..., z_{p/r}}, vote group size k.
1: Let p̂ := p/r.
2: Randomly partition {z_1, ..., z_p̂} into "vote groups" {Z_j | 1 ≤ j ≤ p̂/k} of size k.
3: For each vote group Z_j, calculate ẑ_j = A_0(Z_j).
4: Return A_1({ẑ_1, ..., ẑ_{p̂/k}}).

3.1 Filtering out Almost Every Byzantine Node

We now show that DETOX filters out the vast majority of Byzantine gradients. Fix the iteration t. Recall that all honest nodes in a node group A_j send ĝ_i = g_j, as in (2), to the PS. If A_j has more honest nodes than Byzantine nodes, then z_j = g_j and we say z_j is honest. If not, then z_j may not equal g_j, in which case z_j is a Byzantine vote. Let X_j be the indicator variable for whether block A_j has more Byzantine nodes than honest nodes, and let q̂ = Σ_j X_j; this is the number of Byzantine votes. By filtering, DETOX goes from a Byzantine compute-node ratio of ε = q/p to a Byzantine vote ratio of ε̂ = q̂/p̂, where p̂ = p/r. We first show that E[q̂] decreases exponentially with r, while p̂ only decreases linearly with r. That is, by incurring a constant-factor loss in compute resources, we gain an exponential reduction in the number of Byzantine votes. Thus, even small r can drastically reduce the Byzantine ratio of the votes. This observation will allow us to instead use robust aggregation methods on the votes z_j, greatly improving our Byzantine robustness. We have the following theorem about E[q̂]. All proofs can be found in the appendix. Note that throughout, we did not focus on optimizing constants.

Theorem 1.
There is a universal constant c such that if the fraction of Byzantine nodes satisfies ε < c, then the effective number of Byzantine votes after filtering satisfies

$$E[\hat{q}] = O\left( \epsilon^{(r-1)/2} \, q/r \right).$$

We now wish to use this to derive high-probability bounds on q̂. While the variables X_j are not independent, they are negatively correlated. By using a version of Hoeffding's inequality for weakly dependent variables, we can show that if the redundancy is logarithmic, i.e., r ≈ log(q), then with high probability the number of effective Byzantine votes drops to a constant, i.e., q̂ = O(1).

Corollary 2. There is a constant c such that if ε ≤ c and r ≥ 3 + 2 log_2(q), then for any δ ∈ (0, 1/2), with probability at least 1 − δ, we have q̂ ≤ 1 + 2 log(1/δ).

In the next section, we exploit this dramatic reduction of Byzantine votes to derive strong robustness guarantees for DETOX.

4 DETOX Improves the Speed and Robustness of Robust Estimators

Using the results of the previous section, if we set the redundancy ratio to r ≈ log(q), the filtering stage of DETOX reduces the number of Byzantine votes q̂ to roughly a constant. While we could apply some robust aggregator A directly to the output votes of the filtering stage, such methods often scale poorly with the number of votes p̂. By instead applying HIER-AGGR, we greatly improve efficiency and robustness. Recall that in HIER-AGGR, we partition the votes into k "vote groups", apply some A_0 to each group, and apply some A_1 to the k outputs of A_0. We analyze the case where k is roughly constant, A_0 computes the mean of its inputs, and A_1 is a robust aggregator. In this case, HIER-AGGR is analogous to the median-of-means (MoM) method from robust statistics [20].

Improved speed. Suppose that without redundancy, the time required for the compute nodes to finish is T.
Applying Krum [1], Multi-Krum [4], or Bulyan [10] to the p outputs requires O(p²d) operations, so the overall runtime is O(T + p²d). In DETOX, the compute nodes require r times more computation to evaluate the redundant gradients. If r ≈ log(q), this can be done in O(ln(q) T) time. With HIER-AGGR as above, DETOX performs three major operations: (1) majority voting, (2) mean computation over the k vote groups, and (3) robust aggregation of these k means using A_1. Steps (1) and (2) require O(pd) time. For practical A_1 aggregators, including Multi-Krum and Bulyan, step (3) requires O(k²d) time. Since k ≪ p, DETOX has runtime O(ln(q) T + pd). If T = O(d) (which generally holds for gradient computations), Krum, Multi-Krum, and Bulyan require O(p²d) time, but DETOX requires only O(pd) time. Thus, DETOX can lead to significant speedups, especially when the number of workers is large.

Improved robustness. To analyze robustness, we first need some distributional assumptions. At any given iteration, let G denote the full gradient of F(w). Throughout this section, we assume that the gradient of each sample is drawn from a distribution D on R^d with mean G and variance σ². In DETOX, the "honest" votes z_j will also have mean G, but their variance will be σ²p/(rb). This is because each honest compute node gets a sample of size rb/p, so its variance is reduced by a factor of rb/p. Suppose Ĝ is some approximation to the true gradient G. We say that Ĝ is a Δ-inexact gradient oracle for G if ‖Ĝ − G‖ ≤ Δ. [5] shows that access to a Δ-inexact gradient oracle is sufficient to upper bound the error of a model ŵ produced by performing gradient updates with Ĝ. To bound the robustness of an aggregator, it suffices to bound Δ. Under the distributional assumptions above, we will derive bounds on Δ for the hierarchical aggregator with different base aggregators A_1.
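Minimal sketches of the three standard base aggregators from the robust statistics literature that appear below; the geometric median is approximated here with Weiszfeld iterations, which is an implementation choice on our part, not a detail fixed by the paper:

```python
import numpy as np

def coordinate_median(X):
    """Coordinate-wise median of the rows of X (shape (k, d))."""
    return np.median(X, axis=0)

def trimmed_mean(X, alpha=0.25):
    """Alpha-trimmed mean: per coordinate, drop the alpha-fraction smallest
    and largest values, then average what remains."""
    k = X.shape[0]
    t = int(alpha * k)
    Xs = np.sort(X, axis=0)
    return Xs[t:k - t].mean(axis=0)

def geometric_median(X, iters=100, eps=1e-8):
    """Approximate geometric median of the rows of X via Weiszfeld iterations."""
    z = X.mean(axis=0)
    for _ in range(iters):
        dist = np.linalg.norm(X - z, axis=1)
        w = 1.0 / np.maximum(dist, eps)   # guard against division by zero
        z_new = (w[:, None] * X).sum(axis=0) / w.sum()
        if np.linalg.norm(z_new - z) < eps:
            break
        z = z_new
    return z
```

All three collapse a (k, d) stack of vectors to a single d-dimensional estimate while limiting the influence of a minority of outlying rows.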
We will analyze DETOX when A_0 computes the mean of the vote groups and A_1 is the geometric median, coordinate-wise median, or α-trimmed mean [6]. We denote the approximations Ĝ to G computed by DETOX in these three instances by Ĝ_1, Ĝ_2, and Ĝ_3, respectively. Using the proof techniques in [20], we get the following.

Theorem 3. Assume r ≥ 3 + 2 log_2(q) and ε ≤ c, where c is the constant from Corollary 2. There are constants c_1, c_2, c_3 such that for all δ ∈ (0, 1/2), with probability at least 1 − 2δ:

1. If k = 128 ln(1/δ), then Ĝ_1 is a $c_1 \sigma \sqrt{\ln(1/\delta)/b}$-inexact gradient oracle.
2. If k = 128 ln(d/δ), then Ĝ_2 is a $c_2 \sigma \sqrt{\ln(d/\delta)/b}$-inexact gradient oracle.
3. If k = 128 ln(d/δ) and α = 1/4, then Ĝ_3 is a $c_3 \sigma \sqrt{\ln(d/\delta)/b}$-inexact gradient oracle.

The theorem above has three important implications. First, we can derive robustness guarantees for DETOX that are virtually independent of the Byzantine ratio ε. Second, even when there are no Byzantine machines, it is known that no aggregator can achieve Δ = o(σ/√b) [21]; because we achieve Δ = Õ(σ/√b), we cannot expect an order-of-magnitude better robustness from any other aggregator. Third, other than a logarithmic dependence on q, there is no dependence on the number of nodes p: even as p and q increase, we maintain roughly the same robustness guarantees. By comparison, the robustness guarantees of Krum and the geometric median applied directly to the compute nodes worsen as p increases [17, 3]. Similarly, [6] shows that if we apply coordinate-wise median to p nodes, each of which is assigned b/p gradients, we get a Δ-inexact gradient oracle with $\Delta = O(\sigma \sqrt{\epsilon p/b} + \sigma \sqrt{d/b})$. If ε is constant and p is comparable to b, then this is roughly σ, whereas DETOX produces a Δ-inexact gradient oracle with Δ = Õ(σ/√b).
Thus, the robustness of DETOX can scale much better with the number of nodes than naive robust aggregation of gradients.

5 Experiments

In this section we present an experimental study pairing DETOX with a set of previously proposed robust aggregation methods, including MULTI-KRUM [17], BULYAN [10], and coordinate-wise median [5]. We also incorporate DETOX into a recently proposed Byzantine-resilient distributed training method, SIGNSGD with majority vote [19]. We conduct extensive experiments on the scalability and robustness of these Byzantine-resilient methods, and on the improvements gained when pairing them with DETOX. All our experiments are deployed on real distributed clusters under various Byzantine attack models. Our implementation is publicly available for reproducibility at https://github.com/hwang595/DETOX. The main findings are as follows: 1) applying DETOX leads to significant speedups, e.g., up to an order-of-magnitude end-to-end training speedup; 2) in defending against state-of-the-art Byzantine attacks, DETOX leads to significant Byzantine resilience, e.g., applying BULYAN on top of DETOX improves the test-set prediction accuracy from 11% to 60% when training VGG13-BN on CIFAR-100 under the "a little is enough" (ALIE) [11] Byzantine attack. Moreover, incorporating SIGNSGD with DETOX improves the test-set prediction accuracy from 34.92% to 78.75% when defending against a constant Byzantine attack for ResNet-18 trained on CIFAR-10.

5.1 Experimental Setup

We implemented vanilla versions of the aforementioned Byzantine-resilient methods, as well as versions of these methods paired with DETOX, in PyTorch [22] with MPI4py [23]. Our experimental comparisons are deployed on a cluster of 46 m5.2xlarge instances on Amazon EC2, where one node serves as the PS and the remaining p = 45 nodes are compute nodes.
In all following experiments, we set the number of Byzantine nodes to q = 5. In each iteration of the vanilla Byzantine-resilient methods, each compute node evaluates b/p = 32 gradients sampled from its partition of the data, while in DETOX each compute node evaluates r = 3 times more gradients, i.e., rb/p = 96. The average of these locally computed gradients is then sent back to the PS. After receiving all gradient summations from the compute nodes, the PS applies either the vanilla Byzantine-resilient methods or their DETOX-paired variants.

5.2 Implementation of DETOX

We emphasize that DETOX is not simply a new robust aggregation technique. It is instead a general Byzantine-resilient distributed training framework, and any robust aggregation method can be immediately implemented on top of it to increase its Byzantine resilience and scalability. Note that after the majority-voting stage on the PS, one has a wide range of choices for A_0 and A_1. In our implementations, we used the following setups: 1) A_0 = mean, A_1 = coordinate-wise median; 2) A_0 = MULTI-KRUM, A_1 = mean; 3) A_0 = BULYAN, A_1 = mean; and 4) A_0 = A_1 = coordinate-wise majority vote (designed specifically for pairing DETOX with SIGNSGD). We also tried A_0 = mean with A_1 = MULTI-KRUM or BULYAN, but found that setups 2) and 3) had better resilience than these choices. More details on the implementation and the system-level optimizations that we performed can be found in Appendix B.1.

Byzantine attack models. We consider two Byzantine attack models for pairing MULTI-KRUM, BULYAN, and coordinate-wise median with DETOX. First, we consider the "reversed gradient" attack, where adversarial nodes that were supposed to send g to the PS instead send −c g, for some c > 0.
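The reversed-gradient attack just described can be sketched in a few lines, together with its effect on naive averaging; the values p = 45 and q = 5 mirror the experimental setup above, and for c = 1 the naive mean shrinks from g to (1 − 2q/p)·g:

```python
import numpy as np

def reversed_gradient(g, c=1.0):
    """Byzantine output under the reversed-gradient attack: send -c*g instead of g."""
    return -c * g

# Effect on naive averaging: (p - q) honest nodes send g, q Byzantine nodes
# send -g, so the mean is ((p - q) - q) / p * g = (1 - 2q/p) * g.
p, q = 45, 5
g = np.ones(4)
outputs = [g] * (p - q) + [reversed_gradient(g)] * q
mean = np.mean(outputs, axis=0)
```

With these numbers the averaged gradient keeps its direction but loses magnitude; larger q or c > 1 can flip it entirely, which is what robust aggregation must prevent.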
The second Byzantine attack model we study is the recently proposed ALIE attack [11], where the Byzantine compute nodes collude, using their locally calculated gradients to estimate the mean and standard deviation of the gradients across all other compute nodes. The Byzantine nodes then use the estimated mean and standard deviation to manipulate the gradients they send back to the PS. More specifically, Byzantine nodes send μ̂ + z·σ̂, where μ̂ and σ̂ are the mean and standard deviation estimated by the Byzantine nodes, and z is a hyperparameter that was tuned empirically in [11]. Then, to compare the resilience of vanilla SIGNSGD and SIGNSGD paired with DETOX, we consider a simple attack, the constant Byzantine attack, in which Byzantine compute nodes simply send a constant matrix with dimension equal to that of the true gradient and all elements equal to −1. Under this attack, and specifically for SIGNSGD, the Byzantine gradients mislead the model updates towards wrong directions and corrupt the final model trained via SIGNSGD.

Datasets and models. We conducted our experiments with ResNet-18 [24] on CIFAR-10 and VGG13-BN [25] on CIFAR-100. For each dataset, we use data augmentation (random crops and flips) and normalize each individual image. Moreover, we tune the learning-rate schedule and use a constant momentum of 0.9 in all experiments. The details of parameter tuning and dataset normalization are reported in Appendix B.2.

5.3 Results

Scalability. We report a per-iteration runtime analysis of the aforementioned robust aggregation methods and their DETOX-paired variants on both CIFAR-10 with ResNet-18 and CIFAR-100 with VGG13-BN. The results for ResNet-18 and VGG13-BN are shown in Figures 2 and 3, respectively.
We observe that although DETOX requires slightly more compute time per iteration, due to its algorithmic redundancy, it largely reduces the PS computation cost during the aggregation stage, which matches our theoretical analysis. Surprisingly, we observe that applying DETOX also decreases communication costs. This is because the variance of computation time among compute nodes increases with heavier computational redundancy; after applying DETOX, compute nodes therefore tend not to send their gradients to the PS at the same time, which mitigates potential network bandwidth congestion. In a nutshell, applying DETOX can lead to up to a 3× per-iteration speedup.

Figure 3: Left: convergence performance of various robust aggregation methods under the ALIE attack. Right: per-iteration runtime analysis of various robust aggregation methods. Results of VGG13-BN on CIFAR-100.

Figure 4: End-to-end convergence comparisons when applying DETOX to different baseline methods under the reverse gradient attack. (a)-(c): comparisons between the vanilla and DETOX-deployed versions of MULTI-KRUM, BULYAN, and coordinate-wise median for ResNet-18 trained on CIFAR-10. (d)-(f): the same comparisons for VGG13-BN trained on CIFAR-100.

Figure 5: Speedups in converging to specific accuracies for vanilla robust aggregation methods and their DETOX-deployed variants under the reverse gradient attack: (a) ResNet-18 trained on CIFAR-10; (b) VGG13-BN trained on CIFAR-100.

Table 1: Summary of defense results under the ALIE attack [11]; the numbers reported are test-set prediction accuracies.

Method        | ResNet-18 | VGG13-BN
D-MULTI-KRUM  | 80.3%     | 42.98%
D-BULYAN      | 76.8%     | 46.82%
D-Med.        | 86.21%    | 59.51%
MULTI-KRUM    | 45.24%    | 17.18%
BULYAN        | 42.56%    | 11.06%
Med.          | 43.7%     | 8.64%

Byzantine resilience under various attacks. We first study the Byzantine resilience of all methods and baselines under the ALIE attack, which is, to the best of our knowledge, the strongest known Byzantine attack. The results for ResNet-18 and VGG13-BN are shown in Figures 2 and 3, respectively. Applying DETOX leads to significant improvements in Byzantine resilience compared to vanilla MULTI-KRUM, BULYAN, and coordinate-wise median on both datasets, as shown in Table 1.
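The ALIE perturbation evaluated above can be sketched as follows. For simplicity this sketch estimates the statistics directly from a supplied set of honest gradients, whereas in the actual attack the Byzantine nodes estimate them from their own local gradients; the function name is illustrative.

```python
import numpy as np

def alie_gradient(honest_grads, z):
    """ALIE-style Byzantine output: estimate the per-coordinate mean and
    standard deviation of the honest gradients (shape (m, d)) and send
    mu_hat + z * sigma_hat, a vector that stays within the benign spread
    while biasing the aggregate (z is tuned empirically in the attack)."""
    mu_hat = honest_grads.mean(axis=0)
    sigma_hat = honest_grads.std(axis=0)
    return mu_hat + z * sigma_hat
```

Because the crafted vector sits only z standard deviations from the empirical mean, distance-based defenses such as Krum struggle to distinguish it from honest noise, which is why ALIE is such a strong attack.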
We then consider the reverse gradient attack; the results are shown in Figure 4. Since reverse gradient is a much weaker attack, all vanilla robust aggregation methods and their DETOX-paired variants defend well. Moreover, applying DETOX leads to significant end-to-end speedups. In particular, combining coordinate-wise median with DETOX led to a 5x speedup in the time needed to reach 90% test set prediction accuracy for ResNet-18 trained on CIFAR-10. The speedup results are shown in Figure 5. For VGG13-BN trained on CIFAR-100, applying coordinate-wise median on top of DETOX yields up to an order of magnitude end-to-end speedup. For completeness, we also compare versions of DETOX with DRACO [7]. This is not the focus of this work, as we are primarily interested in showing that DETOX improves the robustness of traditional robust aggregators; the comparisons with DRACO can be found in Appendix B.4.

Comparison between DETOX and SIGNSGD. We compare DETOX-paired SIGNSGD with vanilla SIGNSGD, where only the sign of each gradient element is sent to the PS. On receiving the sign information, the PS takes a coordinate-wise majority vote to obtain the model update. As argued in [19], the gradient distribution of many modern deep networks is close to unimodal and symmetric, so a random sign-flip attack is weak since it barely perturbs this distribution. We thus consider the stronger constant Byzantine attack introduced in Section 5.2. To pair DETOX with SIGNSGD, after the majority voting stage of DETOX we set both A0 and A1 to the coordinate-wise majority vote described in Algorithm 1 of [19]. For hyper-parameter tuning, we follow the suggestion in [19] and set the initial learning rate to 0.0001.
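The coordinate-wise majority vote used by SIGNSGD can be summarized in one line; this sketch illustrates the voting rule only (see Algorithm 1 in [19] for the full method):

```python
import numpy as np

def sign_majority_vote(sign_grads):
    """Coordinate-wise majority vote over worker sign vectors in {-1, +1}:
    each coordinate of the update is the sign of the column-wise sum."""
    return np.sign(np.stack(sign_grads).sum(axis=0))
```

Under the constant attack, a Byzantine worker always reports the same fixed sign vector, but it is outvoted in every coordinate where honest workers hold the majority.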
However, when defending against our proposed constant Byzantine attack, we observe that constant learning rates lead to model divergence. We therefore tune the learning rate schedule and use 0.0001 x 0.99^(t mod 10) for both vanilla SIGNSGD and DETOX-paired SIGNSGD. The results for ResNet-18 trained on CIFAR-10 and VGG13-BN trained on CIFAR-100 are shown in Figure 6, where we observe that DETOX-paired SIGNSGD significantly improves the Byzantine resilience of SIGNSGD. For ResNet-18 trained on CIFAR-10, DETOX improves the test set prediction accuracy of vanilla SIGNSGD from 34.92% to 78.75%; for VGG13-BN trained on CIFAR-100, DETOX improves the top-1 test set prediction accuracy of vanilla SIGNSGD from 2.12% to 40.37%.

[Figure 6: test accuracy vs. number of iterations for D-signSGD and signSGD.]

Figure 6: Convergence comparisons between DETOX-paired SIGNSGD and vanilla SIGNSGD under the constant Byzantine attack on: (a) ResNet-18 trained on CIFAR-10; (b) VGG13-BN trained on CIFAR-100.

[Figure 7: $\ell_2$ norm of the aggregate vs. dimension for D-Coord-Median, D-Geo-Median, vanilla coordinate-wise median, and vanilla geometric median.]

Figure 7: Experiment with synthetic data for robust mean estimation: error is reported against the dimension (lower is better).

Mean estimation on synthetic data. To verify our theoretical analysis, we finally conduct an experiment on a simple mean estimation task. The results of our synthetic mean experiment are shown in Figure 7. In this experiment, we set $p = 220000$, $r = 11$, $q = \lfloor r/3 \rfloor$, and for each dimension $d \in \{20, 30, \ldots, 100\}$, we generate 20 samples iid from $N(0, I_d)$.
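The flavor of this synthetic experiment can be reproduced with a tiny sketch; the values of the sample count and corruption level below are our own made-up, much smaller stand-ins, contrasting the plain mean with the coordinate-wise median under the constant-vector attack:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, q = 50, 20, 3                           # dimension, samples, corrupted
honest = rng.standard_normal((n - q, d))      # iid N(0, I_d); true mean is 0
attack = np.full((q, d), 100.0 / np.sqrt(d))  # constant vector, ell_2 norm 100
samples = np.vstack([honest, attack])

err_mean = np.linalg.norm(samples.mean(axis=0))       # dragged off by the attack
err_median = np.linalg.norm(np.median(samples, axis=0))
assert err_median < err_mean
```

The mean absorbs the full pull of the corrupted vectors, while the coordinate-wise median stays close to the true mean as long as honest samples dominate each coordinate.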
The Byzantine nodes instead send a constant vector of the same dimension with $\ell_2$ norm 100. The robustness of an estimator is reflected in the $\ell_2$ norm of its mean estimate. Our experimental results show that DETOX increases the robustness of both the geometric median and the coordinate-wise median, and decreases the dependence of the error on $d$.

6 Conclusion

In this paper we presented DETOX, a new framework for Byzantine-resilient distributed training. Notably, any robust aggregator can be immediately used with DETOX to increase its robustness and efficiency. We demonstrated these improvements theoretically and empirically. In the future, we would like to devise a privacy-preserving version of DETOX: the current version requires the PS to own the data and to partition it among the compute nodes, and is therefore not privacy preserving. Overcoming this limitation would allow us to develop variants of DETOX for federated learning.

References

[1] Peva Blanchard, El Mahdi El Mhamdi, Rachid Guerraoui, and Julien Stainer. Machine learning with adversaries: Byzantine tolerant gradient descent. In Advances in Neural Information Processing Systems 30, pages 118-128, 2017. URL http://papers.nips.cc/paper/6617-machine-learning-with-adversaries-byzantine-tolerant-gradient-descent.
[2] Yudong Chen, Lili Su, and Jiaming Xu. Distributed statistical machine learning in adversarial settings: Byzantine gradient descent. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 1(2):44, 2017.
[3] Cong Xie, Oluwasanmi Koyejo, and Indranil Gupta. Generalized Byzantine-tolerant SGD. arXiv preprint arXiv:1802.10116, 2018.
[4] Georgios Damaskinos, El Mahdi El Mhamdi, Rachid Guerraoui, Arsany Guirguis, and Sébastien Rouault.
Aggregathor: Byzantine machine learning via robust gradient aggregation. In Conference on Systems and Machine Learning, 2019.
[5] Dong Yin, Yudong Chen, Kannan Ramchandran, and Peter Bartlett. Defending against saddle point attack in Byzantine-robust distributed learning. CoRR, abs/1806.05358, 2018. URL http://arxiv.org/abs/1806.05358.
[6] Dong Yin, Yudong Chen, Kannan Ramchandran, and Peter Bartlett. Byzantine-robust distributed learning: Towards optimal statistical rates. In International Conference on Machine Learning, pages 5636-5645, 2018.
[7] Lingjiao Chen, Hongyi Wang, Zachary Charles, and Dimitris Papailiopoulos. Draco: Byzantine-resilient distributed training via redundant gradients. In International Conference on Machine Learning, pages 902-911, 2018.
[8] Deepesh Data, Linqi Song, and Suhas Diggavi. Data encoding for Byzantine-resilient distributed gradient descent. In 2018 56th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 863-870. IEEE, 2018.
[9] Qian Yu, Netanel Raviv, Jinhyun So, and A. Salman Avestimehr. Lagrange coded computing: Optimal design for resiliency, security and privacy. arXiv preprint arXiv:1806.00939, 2018.
[10] El Mahdi El Mhamdi, Rachid Guerraoui, and Sébastien Rouault. The hidden vulnerability of distributed learning in Byzantium. arXiv preprint arXiv:1802.07927, 2018.
[11] Moran Baruch, Gilad Baruch, and Yoav Goldberg. A little is enough: Circumventing defenses for distributed learning. arXiv preprint arXiv:1902.06156, 2019.
[12] Leslie Lamport, Robert Shostak, and Marshall Pease. The Byzantine generals problem. ACM Transactions on Programming Languages and Systems (TOPLAS), 4(3):382-401, 1982.
[13] El-Mahdi El-Mhamdi, Rachid Guerraoui, Arsany Guirguis, and Sébastien Rouault. SGD: Decentralized Byzantine resilience. arXiv preprint, 2019.
[14] Cong Xie, Oluwasanmi Koyejo, and Indranil Gupta. Zeno: Byzantine-suspicious stochastic gradient descent.
arXiv preprint, 2018.
[15] Cong Xie, Sanmi Koyejo, and Indranil Gupta. Fall of empires: Breaking Byzantine-tolerant SGD by inner product manipulation. arXiv preprint, 2019.
[16] El-Mahdi El-Mhamdi and Rachid Guerraoui. Fast and secure distributed learning in high dimension. arXiv preprint, 2019.
[17] Peva Blanchard, Rachid Guerraoui, Julien Stainer, et al. Machine learning with adversaries: Byzantine tolerant gradient descent. In Advances in Neural Information Processing Systems, pages 119-129, 2017.
[18] Dan Alistarh, Zeyuan Allen-Zhu, and Jerry Li. Byzantine stochastic gradient descent. In Advances in Neural Information Processing Systems 31, pages 4618-4628. Curran Associates, Inc., 2018. URL http://papers.nips.cc/paper/7712-byzantine-stochastic-gradient-descent.pdf.
[19] Jeremy Bernstein, Jiawei Zhao, Kamyar Azizzadenesheli, and Anima Anandkumar. signSGD with majority vote is communication efficient and fault tolerant. arXiv, 2018.
[20] Stanislav Minsker et al. Geometric median and robust estimation in Banach spaces. Bernoulli, 21(4):2308-2335, 2015.
[21] Gábor Lugosi, Shahar Mendelson, et al. Sub-Gaussian estimators of the mean of a random vector. The Annals of Statistics, 47(2):783-794, 2019.
[22] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017.
[23] Lisandro D. Dalcin, Rodrigo R. Paz, Pablo A. Kler, and Alejandro Cosimo. Parallel distributed computing using Python. Advances in Water Resources, 34(9):1124-1139, 2011.
[24] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778, 2016.
[25] Karen Simonyan and Andrew Zisserman.
Very deep convolutional networks for large-scale image recognition. arXiv preprint, 2014.
[26] Nathan Linial and Zur Luria. Chernoff's inequality - a very elementary proof. arXiv preprint arXiv:1403.7739, 2014.
[27] C. Pelekis and J. Ramon. Hoeffding's inequality for sums of weakly dependent random variables. Mediterranean Journal of Mathematics, 2017.
[28] Georgios Damaskinos, El Mahdi El Mhamdi, Rachid Guerraoui, Arsany Guirguis, and Sébastien Rouault. Aggregathor: Byzantine machine learning via robust gradient aggregation. In SysML, 2019.

A Proofs

A.1 Proof of Theorem 1

The following is a more precise statement of the theorem.

Theorem. If $r > 3$, $p \geq 2r$ and $\epsilon < 1/40$, then $\mathbb{E}[\hat{q}]$ falls as $O\left(q\,(40\epsilon(1-\epsilon))^{(r-1)/2}/r\right)$, which decays exponentially in $r$.

Proof. By direct computation,
$$\mathbb{E}[\hat{q}] = \mathbb{E}\left[\sum_{i=1}^{p/r} X_i\right] = \frac{p}{r}\,\mathbb{E}[X_i] = \frac{p}{r}\sum_{i=0}^{(r-1)/2}\frac{\binom{q}{r-i}\binom{p-q}{i}}{\binom{p}{r}} \leq \frac{p}{r}\cdot\frac{r+1}{2}\cdot\frac{\binom{q}{(r+1)/2}\binom{p-q}{(r-1)/2}}{\binom{p}{r}}$$
(bounding each of the $(r+1)/2$ terms by the largest one)
$$\leq \frac{p}{r}\cdot\frac{r+1}{2}\binom{r}{(r-1)/2}\frac{q^{(r+1)/2}(p-q)^{(r-1)/2}}{(p-r)^r} = \frac{p}{r}\cdot\frac{r+1}{2}\binom{r}{(r-1)/2}\frac{q^{(r+1)/2}(p-q)^{(r-1)/2}}{p^r(1-r/p)^r}$$
$$\leq \frac{p}{r}\cdot\frac{r+1}{2}\binom{r}{(r-1)/2}\frac{q^{(r+1)/2}(p-q)^{(r-1)/2}}{p^r(1/2)^r} = \frac{p}{r}(r+1)\,2^{r-1}\binom{r}{(r-1)/2}\,\epsilon^{(r+1)/2}(1-\epsilon)^{(r-1)/2}.$$
Note that $\binom{r}{(r-1)/2}$ is the coefficient of $x^{(r+1)/2}(1-x)^{(r-1)/2}$ in the binomial expansion of $1 = 1^r = (x + (1-x))^r$. Setting $x = 1/2$, we find that $\binom{r}{(r-1)/2} \leq 2^r$. Therefore,
$$\frac{p}{r}(r+1)2^{r-1}\binom{r}{(r-1)/2}\epsilon^{(r+1)/2}(1-\epsilon)^{(r-1)/2} \leq \frac{p}{r}(r+1)2^{2r-1}\epsilon^{(r+1)/2}(1-\epsilon)^{(r-1)/2}$$
$$= \frac{2q}{r}(r+1)\left(16\epsilon(1-\epsilon)\right)^{(r-1)/2} = \frac{2q}{r}\left(16(r+1)^{2/(r-1)}\epsilon(1-\epsilon)\right)^{(r-1)/2}.$$
Since $r > 3$ and $r$ is odd, we have $r \geq 5$, so $16(r+1)^{2/(r-1)} \leq 16\sqrt{6} \leq 40$. Therefore,
$$\mathbb{E}[\hat{q}] \leq \frac{2q}{r}\left(40\epsilon(1-\epsilon)\right)^{(r-1)/2}.$$
For $r = 3$, we have the following lemma.

Lemma 4. If $r = 3$, then $\mathbb{E}[\hat{q}] \leq q(4\epsilon - 2\epsilon^2)/3$ when $p \geq 6$.

Proof.
$$\mathbb{E}[\hat{q}] = \mathbb{E}\left[\sum_{i=1}^{p/3} X_i\right] = \frac{p}{3}\,\mathbb{E}[X_i] = \frac{p}{3}\cdot\frac{\binom{q}{3} + \binom{q}{2}\binom{p-q}{1}}{\binom{p}{3}} = \frac{p}{3}\cdot\frac{q(q-1)(3p-2q-2)}{p(p-1)(p-2)}$$
$$= \frac{q}{3}\cdot\frac{\left(\epsilon - \frac{1}{p}\right)\left(3 - 2\epsilon - \frac{2}{p}\right)}{\left(1 - \frac{1}{p}\right)\left(1 - \frac{2}{p}\right)} \leq \frac{q}{3}\cdot\frac{\epsilon\left(3 - 2\epsilon - \frac{2}{p}\right)}{1 - \frac{2}{p}} \leq q\epsilon(4 - 2\epsilon)/3.$$

A.2 Proof of Corollary 2

From Theorem 1 we see that $\mathbb{E}[\hat{q}] \leq 2q(40\epsilon(1-\epsilon))^{(r-1)/2}/r \leq 2q(40\epsilon)^{(r-1)/2}$. Straightforward analysis then implies that if $\epsilon \leq 1/80$ and $r \geq 3 + 2\log_2 q$, then $\mathbb{E}[\hat{q}] \leq 1$. We will use the following lemma.

Lemma 5. For all $\theta > 0$,
$$P\left[\hat{q} \geq \mathbb{E}[\hat{q}](1+\theta)\right] \leq \left(\frac{1}{1+\theta/2}\right)^{\mathbb{E}[\hat{q}]\theta/2}.$$

Now, using Lemma 5 and assuming $\theta \geq 2$,
$$P\left[\hat{q} \geq \mathbb{E}[\hat{q}](1+\theta)\right] \leq \left(\frac{1}{1+\theta/2}\right)^{\mathbb{E}[\hat{q}]\theta/2} \implies P\left[\hat{q} \geq 1 + \mathbb{E}[\hat{q}]\theta\right] \leq \left(\frac{1}{1+\theta/2}\right)^{\mathbb{E}[\hat{q}]\theta/2} \implies P\left[\hat{q} \geq 1 + \mathbb{E}[\hat{q}]\theta\right] \leq 2^{-\mathbb{E}[\hat{q}]\theta/2},$$
where we used $\mathbb{E}[\hat{q}] \leq 1$ in the first implication and $\theta \geq 2$ in the second. Setting $\delta := 2^{-\mathbb{E}[\hat{q}]\theta/2}$, we get the probability bound. Finally, setting $\delta \leq 1/2$ makes $\theta \geq 2$, which completes the proof.

A.3 Proof of Lemma 5

We will prove the following:
$$P\left[\hat{q} \geq \mathbb{E}[\hat{q}](1+\theta)\right] \leq \left(\frac{1}{1+\theta/2}\right)^{\mathbb{E}[\hat{q}]\theta/2}.$$

Proof. We will use the following theorem [26, 27].

Theorem (Linial [26]). Let $X_1, \ldots, X_{\hat{p}}$ be Bernoulli 0/1 random variables. Let $\beta \in (0,1)$ be such that $\beta\hat{p}$ is a positive integer, and let $k$ be any positive integer such that $0 < k < \beta\hat{p}$. Then
$$P\left[\sum_{i=1}^{\hat{p}} X_i \geq \beta\hat{p}\right] \leq \frac{1}{\binom{\beta\hat{p}}{k}}\sum_{|A|=k} P\left[\wedge_{i\in A}(X_i = 1)\right].$$

Let $\beta\hat{p} = \mathbb{E}[\hat{q}](1+\theta)$. Now, $P[X_i = 1] = \mathbb{E}[X_i] = \mathbb{E}[\hat{q}]/\hat{p}$. We will show that $P[\wedge_{i\in A}(X_i = 1)] \leq (\mathbb{E}[\hat{q}]/\hat{p})^k$ for any $A \subseteq \{1, \ldots, \hat{p}\}$ of size $k$. To see this, note that for any $i$, $P[X_i = 1] = \mathbb{E}[\hat{q}]/\hat{p}$.
Conditioned on $X_i = 1$, the probability that some other $X_j$ equals 1 can only decrease: formally, for $i \neq j$, $P[X_j = 1 \mid X_i = 1] \leq P[X_j = 1] = \mathbb{E}[\hat{q}]/\hat{p}$. Indeed, for $X_i$ to be 1, the Byzantine machines must be in the majority in the $i$-th block, so conditioning on $X_i = 1$ removes more Byzantine machines than honest machines from the leftover pool. Since the total number of Byzantine machines is less than the number of honest machines, the probability of a Byzantine majority in block $j$ can only decrease. Therefore,
$$P\left[\sum_{i=1}^{\hat{p}} X_i \geq \mathbb{E}[\hat{q}](1+\theta)\right] \leq \frac{\binom{\hat{p}}{k}}{\binom{\mathbb{E}[\hat{q}](1+\theta)}{k}}\,P\left[\wedge_{i\in A}(X_i = 1)\right] \leq \frac{\binom{\hat{p}}{k}}{\binom{\mathbb{E}[\hat{q}](1+\theta)}{k}}\left(\frac{\mathbb{E}[\hat{q}]}{\hat{p}}\right)^k \leq \frac{\hat{p}^k}{k!\,\binom{\mathbb{E}[\hat{q}](1+\theta)}{k}}\left(\frac{\mathbb{E}[\hat{q}]}{\hat{p}}\right)^k.$$
Letting $k = \mathbb{E}[\hat{q}]\theta/2$, we then have
$$P\left[\sum_{i=1}^{\hat{p}} X_i \geq \mathbb{E}[\hat{q}](1+\theta)\right] \leq \frac{\hat{p}^k}{\left(\mathbb{E}[\hat{q}](1+\theta/2)\right)^k}\left(\frac{\mathbb{E}[\hat{q}]}{\hat{p}}\right)^k = \left(\frac{1}{1+\theta/2}\right)^{\mathbb{E}[\hat{q}]\theta/2}.$$

A.4 Proof of Theorem 3

We adapt the techniques of Theorem 3.1 in [20].

Lemma 6 ([20], Lemma 2). Let $H$ be a Hilbert space, and for $x_1, \ldots, x_k \in H$, let $x_{gm}$ be their geometric median. Fix $\alpha \in (0, \frac{1}{2})$ and suppose that $z \in H$ satisfies $\|x_{gm} - z\| > C_\alpha r$, where $C_\alpha = (1-\alpha)\sqrt{\frac{1}{1-2\alpha}}$ and $r > 0$. Then there exists $J \subseteq \{1, \ldots, k\}$ with $|J| > \alpha k$ such that for all $j \in J$, $\|x_j - z\| > r$.

Note that for a general Hilbert or Banach space $H$, the geometric median is defined as
$$x_{gm} := \arg\min_x \sum_{j=1}^{k}\|x - x_j\|_H,$$
where $\|\cdot\|_H$ is the norm on $H$. This coincides with the usual notion of the geometric median in $\mathbb{R}^d$ under the $\ell_2$ norm. Note also that the coordinate-wise median is the geometric median of the real space equipped with the $\ell_1$ norm, which forms a Banach space.

First, we use Corollary 2 to see that with probability $1-\delta$, $\hat{q} \leq 1 + 2\log(1/\delta)$. We now condition on the event that $\hat{q} \leq 1 + 2\log(1/\delta)$ holds, and show that the remainder of the theorem holds with probability at least $1-\delta$.
Hence, with total probability at least $(1-\delta)^2 \geq 1 - 2\delta$, the statement of the theorem holds.

(1): Let the number of clusters be $k = 128\log(1/\delta)$ for some $\delta < 1$; note that because $\delta \in [0, 1/2]$, we have $k = 128\log(1/\delta) \geq 64(0.5 + \log(1/\delta)) \geq 8\hat{q}$. Now choose $\alpha = 1/4$ and $r = 4\sigma\sqrt{k/b}$. Suppose, for contradiction, that the geometric median is more than $C_\alpha r$ away from the true mean. Then by Lemma 6, at least an $\alpha = 1/4$ fraction of the empirical cluster means must lie at least $r$ away from the true mean. Because the number of clusters is at least $8\hat{q}$, at least a $1/8$ fraction of the empirical means of uncorrupted clusters must also lie at least $r$ away from the true mean.

Recall that the variance of the mean of an "honest" vote group is $(\sigma')^2 = \sigma^2 k/b$. Applying Chebyshev's inequality to the $i$-th uncorrupted vote group, its empirical mean $\hat{G}[i]$ satisfies
$$P\left(\|\hat{G}[i] - G\| \geq 4\sigma\sqrt{\frac{k}{b}}\right) \leq \frac{1}{16}.$$
Now define a Bernoulli event that is 1 if the empirical mean of an uncorrupted vote group lies at distance larger than $r$ from the true mean, and 0 otherwise. By the computation above, the probability of this event, and hence its mean, is less than $1/16$, and we want to upper bound the probability that its empirical mean exceeds $1/8$. With $k = 128\log(1/\delta)$ events, this holds with probability at least $1-\delta$, by the following version of Hoeffding's inequality (also used in part (3) of this proof): for Bernoulli events with mean $\mu$, empirical mean $\hat{\mu}$, number of events $m$, and deviation $\theta$,
$$P(\hat{\mu} - \mu \geq \theta) \leq \exp(-2m\theta^2).$$
To finish the proof, plug in the value of $C_\alpha$ given in Lemma 2.1 of [20] (stated above), which for the geometric median gives $C_\alpha = 3/(2\sqrt{2})$.

(2): For the coordinate-wise median, we set $k = 128\log(d/\delta)$.
Then we apply the result proved in the previous part to each coordinate of $\hat{G}$: with probability at least $1 - \delta/d$,
$$|\hat{G}_i - G_i| \leq C_1\sigma_i\sqrt{\frac{\log(d/\delta)}{b}},$$
where $\hat{G}_i$ is the $i$-th coordinate of $\hat{G}$, $G_i$ is the $i$-th coordinate of $G$, and $\sigma_i^2$ is the $i$-th diagonal entry of $\Sigma$. Taking a union bound over the $d$ coordinates, with probability at least $1-\delta$,
$$\|\hat{G} - G\| \leq C_1\sigma\sqrt{\frac{\log(d/\delta)}{b}}.$$

(3): Define
$$\Delta_i = \sigma_i\sqrt{\frac{k/b}{\sqrt{\frac{1}{2k}\log\frac{d}{\delta}}}},$$
where $\sigma_i^2$ is the $i$-th diagonal entry of $\Sigma$. For each uncorrupted vote group, Chebyshev's inequality gives
$$P\left(|\hat{G}_i - G_i| \geq \Delta_i\right) \leq \sqrt{\frac{1}{2k}\log\frac{d}{\delta}}.$$
Now, the $i$-th coordinate of the $\alpha$-trimmed mean lies more than $\Delta_i$ away from $G_i$ only if at least $\alpha k$ of the $i$-th coordinates of the vote-group empirical means lie more than $\Delta_i$ away from $G_i$. By the assumption of the proposition, $\alpha k \geq 2\hat{q}$. Because at most $\hat{q}$ of these groups can be corrupted, at least $\alpha k/2$ of the uncorrupted empirical means, i.e., an $\alpha/2$ fraction of them, must have $i$-th coordinates lying more than $\Delta_i$ away from $G_i$.

Define a Bernoulli variable $X$ for a vote group that is 1 if the $i$-th coordinate of that group's empirical mean lies more than $\Delta_i$ away from $G_i$, and 0 otherwise. The mean of $X$ therefore satisfies
$$\mathbb{E}[X] < \sqrt{\frac{1}{2k}\log\frac{d}{\delta}}.$$
Set $\alpha = 4\sqrt{\frac{1}{2k}\log\frac{d}{\delta}}$. Using Hoeffding's inequality as in part (1), the probability that the $i$-th coordinate of the $\alpha$-trimmed mean is more than $\Delta_i$ away from $G_i$ is less than $\delta/d$. Taking a union bound over all $d$ coordinates, the probability that the $\alpha$-trimmed mean lies more than
$$\sigma\sqrt{\frac{k/b}{\sqrt{\frac{1}{2k}\log\frac{d}{\delta}}}} = \sigma\sqrt{\frac{4k}{b\alpha}}$$
away from $G$ is less than $\delta$. Hence we have proved that if $\alpha = 4\sqrt{\frac{1}{2k}\log\frac{d}{\delta}}$ and $\alpha k \geq 2\hat{q}$, then with probability at least $1-\delta$,
$$\Delta \leq \sigma\sqrt{\frac{4k}{b\alpha}}.$$
Now, set $\alpha = 1/4$ and $k = 128\log(d/\delta)$.
One can easily check that $\alpha k \geq 2\hat{q}$ is satisfied, and we get that with probability at least $1-\delta$, for some constant $C_3$,
$$\Delta \leq C_3\sigma\sqrt{\frac{\log(d/\delta)}{b}}.$$

B Extra Experimental Details

B.1 Implementation and system-level optimization details

We describe the details of combining BULYAN, MULTI-KRUM, and coordinate-wise median with DETOX.

- BULYAN: according to [10], BULYAN requires $p \geq 4q + 3$. In DETOX, after the first majority-voting level, the corresponding requirement becomes $p/r \geq 4\hat{q} + 3 = 11$. We therefore assign all "winning" gradients to a single cluster, i.e., BULYAN is conducted across 15 gradients.
- MULTI-KRUM: according to [1], MULTI-KRUM requires $p \geq 2q + 3$. For a similar reason, we assign the 15 "winning" gradients to two groups of uneven sizes, 7 and 8 respectively.
- Coordinate-wise median: for this baseline we follow the theoretical analysis in Section 3.1, i.e., the 15 "winning" gradients are evenly assigned to 5 clusters of size 3 under the reverse gradient Byzantine attack. Under the ALIE attack, we instead assign the 15 gradients evenly to 3 clusters of size 5. The reason for this choice is simply that we observed the reported strategies perform better in our experiments. The mean of the gradients is then computed within each cluster, and finally we take the coordinate-wise median across the cluster means.

One important point is that we applied system-level optimizations to our implementations of MULTI-KRUM and BULYAN, e.g., parallelizing the computationally heavy parts, to make the comparisons fairer, following [28]. Our system-level optimizations are two-fold: i) the gradients of all layers of the neural network are first vectorized and concatenated into a single high-dimensional vector, and robust aggregation is then applied to these high-dimensional gradient vectors from all compute nodes.
ii) Computationally heavy steps exist in several methods, e.g., the median computations in the second stage of BULYAN. To optimize such steps, we chunk the high-dimensional gradient vectors evenly into pieces and parallelize the median computations across all pieces. Our system-level optimizations lead to a 2-4x speedup in the robust aggregation stage.

B.2 Hyper-parameter tuning

Table 2: Tuned stepsize schedules for experiments under the reverse gradient Byzantine attack

Experiments   | CIFAR-10 on ResNet-18    | CIFAR-100 on VGG13-BN
D-MULTI-KRUM  | 0.1                      | 0.1
D-BULYAN      | 0.1                      | 0.1
D-Med.        | 0.1 x 0.99^(t mod 10)    | 0.1 x 0.99^(t mod 10)
MULTI-KRUM    | 0.03125                  | 0.03125
BULYAN        | 0.1                      | 0.1
Med.          | 0.1                      | 0.1 x 0.995^(t mod 10)

Table 3: Tuned stepsize schedules for experiments under the ALIE Byzantine attack

Experiments   | CIFAR-10 on ResNet-18          | CIFAR-100 on VGG13-BN
D-MULTI-KRUM  | 0.1 x 0.98^(t mod 10)          | 0.1 x 0.965^(t mod 10)
D-BULYAN      | 0.1 x 0.99^(t mod 10)          | 0.1 x 0.965^(t mod 10)
D-Med.        | 0.1 x 0.98^(t mod 10)          | 0.1 x 0.98^(t mod 10)
MULTI-KRUM    | 0.0078125 x 0.96^(t mod 10)    | 0.00390625 x 0.965^(t mod 10)
BULYAN        | 0.001953125 x 0.95^(t mod 10)  | 0.00390625 x 0.965^(t mod 10)
Med.          | 0.001953125 x 0.95^(t mod 10)  | 0.001953125 x 0.965^(t mod 10)

B.3 Data augmentation and normalization details

In preprocessing the images of the CIFAR-10/100 datasets, we follow the standard data augmentation and normalization process. For data augmentation, random cropping and horizontal random flipping are used. Each color channel is normalized with mean and standard deviation $\mu_r = 0.491372549$, $\mu_g = 0.482352941$, $\mu_b = 0.446666667$; $\sigma_r = 0.247058824$, $\sigma_g = 0.243529412$, $\sigma_b = 0.261568627$. Each pixel is normalized by subtracting the mean of its color channel and then dividing by the standard deviation of that channel.
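For concreteness, this per-channel normalization amounts to the following (a NumPy sketch of the standard transform, not the training pipeline itself):

```python
import numpy as np

CIFAR_MEAN = np.array([0.491372549, 0.482352941, 0.446666667])
CIFAR_STD = np.array([0.247058824, 0.243529412, 0.261568627])

def normalize(img):
    """img: H x W x 3 float array with pixel values scaled to [0, 1].
    Subtracts each channel's mean and divides by its standard deviation."""
    return (img - CIFAR_MEAN) / CIFAR_STD
```

In practice the same operation is performed by the usual framework-provided normalization transform after converting images to float tensors.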
B.4 Comparison between DETOX and DRACO

We provide experimental results comparing DETOX with DRACO.

[Figure 8: test accuracy vs. wall-clock time (mins.) for D-Multi-Krum, D-Bulyan, D-Med-Mean, and Draco.]

Figure 8: Convergence with respect to runtime for DETOX-backed robust aggregation methods and DRACO under the reverse gradient Byzantine attack on different dataset and model combinations: (a) ResNet-18 trained on CIFAR-10; (b) VGG13-BN trained on CIFAR-100.

[Figure 9: per-iteration time breakdown (computation, communication, aggregation, total) for D-Multi-Krum, D-Bulyan, D-MoM, and Draco.]

Figure 9: Per-iteration runtime breakdown for DETOX-backed robust aggregation methods and DRACO under the reverse gradient Byzantine attack on different dataset and model combinations: (a) ResNet-18 trained on CIFAR-10; (b) VGG13-BN trained on CIFAR-100.
