Phocas: dimensional Byzantine-resilient stochastic gradient descent

Model

We consider the following optimization problem:

\begin{align*}
\min_{x} F(x),
\end{align*}

where $`F(x) = \E_{z \sim \mathcal{D}}[f(x; z)]`$ and $`z`$ is sampled from some unknown distribution $`\mathcal{D}`$. We assume that there exists a minimizer of $`F(x)`$, denoted by $`x^*`$.

We solve this problem in a distributed manner with $`m`$ workers. In each iteration, each worker samples $`n`$ independent and identically distributed (i.i.d.) data points from the distribution $`\mathcal{D}`$, and computes the gradient of its local empirical loss $`F_i(x) = \frac{1}{n} \sum_{j=1}^n f(x; z^{i,j}), \forall i \in [m]`$, where $`z^{i,j}`$ is the $`j`$th data point sampled on the $`i`$th worker. The servers collect and aggregate the gradients sent by the workers, and update the model as follows:

\begin{align*}
x^{t+1} = x^t - \gamma^t Aggr(\{\tilde{v}_i^t: i \in [m]\}),
\end{align*}

where $`Aggr(\cdot)`$ is an aggregation rule (e.g., averaging), and $`\{\tilde{v}_i^t: i \in [m]\}`$ is the set of gradient estimators received by the servers in the $`t`$th iteration. Under Byzantine failures/attacks, $`\{v_i^t = \nabla F_i(x^t): i \in [m]\}`$ is partially replaced by arbitrary values, which yields $`\{\tilde{v}_i^t: i \in [m]\}`$.
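The server-side update above can be sketched as follows. This is a minimal illustration, not the paper's implementation; `aggregate_mean`, `sgd_step`, and the list-of-lists gradient representation are our own illustrative names.

```python
def aggregate_mean(grads):
    """Baseline aggregation rule Aggr: coordinate-wise average of the received gradients."""
    m, d = len(grads), len(grads[0])
    return [sum(g[j] for g in grads) / m for j in range(d)]

def sgd_step(x, received_grads, lr, aggr=aggregate_mean):
    """One synchronous update: x^{t+1} = x^t - gamma^t * Aggr({v_i^t})."""
    a = aggr(received_grads)
    return [xj - lr * aj for xj, aj in zip(x, a)]
```

Swapping `aggr` for a Byzantine-resilient rule changes only the aggregation step; the rest of the training loop is untouched.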

Conclusion

We investigate generalized Byzantine resilience and propose trimmed-mean-based aggregation rules for synchronous SGD. The algorithms have low time complexity and provable convergence, and our empirical results show good performance. In future work, we will study Byzantine resilience in other scenarios, such as asynchronous training.

Byzantine resilience

In this section, we formally define the classic Byzantine resilience property and its generalized version: dimensional Byzantine resilience.

Suppose that in a specific iteration, the correct vectors $`\{v_i: i \in [m]\}`$ are i.i.d. samples drawn from the random variable $`G = \frac{1}{n} \sum_{j=1}^n \nabla f(x; z^j)`$, which is an unbiased estimator of the gradient at the current parameter $`x`$, i.e., $`\E[G] = g`$. Thus, $`\E[v_i] = \E[G] = g`$ for any $`i \in [m]`$. We simplify the notation by dropping the iteration index $`t`$.

We first introduce the classic Byzantine model, which is reformulated from the model proposed by . With the Byzantine workers, the vectors $`\{\tilde{v}_i: i \in [m]\}`$ which are actually received by the server nodes are as follows:

\begin{align}
\tilde{v}_i = 
\begin{cases}
v_i, &\mbox{if the $i$th worker is correct,}\\
arbitrary, &\mbox{if the $i$th worker is Byzantine}.
\end{cases}
\label{equ:byz_worker}
\end{align}

Note that the indices of Byzantine workers can change across different iterations. Furthermore, the server nodes are not aware of which workers are Byzantine. The only information given is the number of Byzantine workers, if necessary.

We then introduce the classic Byzantine resilience.

(Classic $`\Delta`$-Byzantine Resilience). Assume that $`0 \leq q \leq m`$. Let $`\{v_i: i \in [m]\}`$ be any i.i.d. random vectors in $`\R^d`$, $`v_i \sim G`$, with $`\E[G] = g`$. Let $`\{\tilde{v}_i: i \in [m]\}`$ be the set of vectors, of which up to $`q`$ are replaced by arbitrary vectors in $`\R^d`$, while the others remain equal to the corresponding $`v_i`$. Aggregation rule $`\aggr(\cdot)`$ is said to be classic $`\Delta`$-Byzantine resilient if $`\E \|\aggr(\{\tilde{v}_i: i \in [m]\}) - g \|^2 \leq \Delta,`$ where $`\Delta`$ is a constant dependent on $`m`$ and $`q`$.

The baseline algorithm Krum is defined as follows.

Krum chooses the vector with the minimal local sum of distances: $`\krum (\{\tilde{v}_i: i \in [m]\}) = \tilde{v}_k, \quad k = \argmin_{i \in [m]} \sum_{i \rightarrow j} \| \tilde{v}_i - \tilde{v}_j \|^2,`$ where $`i \rightarrow j`$ ranges over the indices of the $`m-q-2`$ nearest neighbours of $`\tilde{v}_i`$ in $`\{\tilde{v}_i: i \in [m]\}`$, measured by Euclidean distance.
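A direct sketch of this rule (ours, for illustration; the quadratic pairwise-distance loop is the naive $`O(dm^2)`$ implementation):

```python
def krum(grads, q):
    """Krum: return the received vector whose summed squared distance
    to its m - q - 2 nearest neighbours is minimal."""
    m = len(grads)

    def sqdist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    scores = []
    for i in range(m):
        # Distances from vector i to all other vectors, smallest first.
        dists = sorted(sqdist(grads[i], grads[j]) for j in range(m) if j != i)
        scores.append(sum(dists[: m - q - 2]))  # sum over the m-q-2 nearest
    return grads[min(range(m), key=scores.__getitem__)]
```

With `q = 0` and four one-dimensional inputs clustered near zero plus one outlier, Krum picks a vector from the cluster rather than the outlier.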

The Krum aggregation is classic $`\Delta`$-Byzantine resilient under certain assumptions. The proof is given by Proposition 1 of .

Let $`v_1, \ldots, v_m`$ be any i.i.d. random $`d`$-dimensional vectors s.t. $`v_i \sim G`$, with $`\E[G] = g`$ and $`\E\|G-g\|^2 \leq V`$. $`q`$ of $`\{\tilde{v}_i: i \in [m]\}`$ are Byzantine. If $`2q + 2 < m`$, we have $`\E \|\krum(\{\tilde{v}_i: i \in [m]\}) - g \|^2 \leq \Delta_0,`$ where $`\Delta_0 = \left(6m-6q + \frac{4q(m-q-2) + 4q^2(m-q-1)}{m-2q-2} \right)V.`$

The generalized Byzantine model is defined as follows:

\begin{align}
(\tilde{v}_i)_j = 
\begin{cases}
(v_i)_j, &\mbox{if the $j$th dimension of $v_i$ is correct,}\\
arbitrary, &\mbox{otherwise},
\end{cases}
\label{equ:byz_model}
\end{align}

where $`(v_i)_j`$ is the $`j`$th dimension of the vector $`v_i`$.

Based on the Byzantine model above, we introduce a generalized Byzantine resilience property, dimensional $`\Delta`$-Byzantine resilience, which is defined as follows:

(Dimensional $`\Delta`$-Byzantine Resilience). Assume that $`0 \leq q \leq m`$. Let $`\{v_i: i \in [m]\}`$ be any i.i.d. random vectors in $`\R^d`$, $`v_i \sim G`$, with $`\E[G] = g`$. Let $`\{\tilde{v}_i: i \in [m]\}`$ be the set of candidate vectors. For each dimension, up to $`q`$ of the $`m`$ values are replaced by arbitrary values, i.e., for dimension $`j \in [d]`$, $`q`$ of $`\{(\tilde{v}_i)_j: i \in [m]\}`$ are Byzantine, where $`(\tilde{v}_i)_j`$ is the $`j`$th dimension of the vector $`\tilde{v}_i`$. Aggregation rule $`\aggr(\cdot)`$ is said to be dimensional $`\Delta`$-Byzantine resilient if $`\E \|\aggr(\{\tilde{v}_i: i \in [m]\}) - g \|^2 \leq \Delta,`$ where $`\Delta`$ is a constant dependent on $`m`$ and $`q`$.

Note that classic $`\Delta`$-Byzantine resilience is a special case of dimensional $`\Delta`$-Byzantine resilience. For classic Byzantine resilience defined in Definition [def:byz], all the Byzantine values must lie in the same subset of workers, as shown in Figure [fig:viz_byz](a).

In the following propositions, we show that Mean and Krum are not dimensional Byzantine resilient ($`\E \|\aggr(\{\tilde{v}_i: i \in [m]\}) - g \|^2`$ is unbounded). The proofs are provided in the appendix.

The averaging aggregation rule is not dimensional Byzantine-resilient.

Any aggregation rule $`\aggr(\{\tilde{v}_i: i \in [m]\})`$ that outputs $`\aggr \in \{\tilde{v}_i: i \in [m]\}`$ is not dimensional Byzantine resilient.

Since Krum always outputs one of the received vectors (the one with the minimal score), it follows in particular that:

$`\krum(\cdot)`$ is not dimensional Byzantine-resilient.
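The obstruction behind these propositions can be made concrete with a toy construction (a sketch of ours; `dimensional_attack` and the choice $`g = 0`$, $`d = m`$ are illustrative assumptions): each worker corrupts a single, distinct coordinate, so every received vector is arbitrarily far from $`g`$, and any rule that outputs one of its inputs inherits an unbounded error.

```python
def squared_error(v, g):
    """Squared Euclidean distance between an aggregated vector and the true gradient."""
    return sum((vj - gj) ** 2 for vj, gj in zip(v, g))

def dimensional_attack(m, big=1e9):
    """Each of the m workers is correct in every dimension except one distinct
    coordinate, so no received vector is entirely clean."""
    tilde = [[0.0] * m for _ in range(m)]
    for i in range(m):
        tilde[i][i] = big  # one Byzantine coordinate per vector
    return tilde
```

Since `big` can be made arbitrarily large, the error of any select-one-input rule is unbounded, while a coordinate-wise rule can still discard the single bad value in each dimension.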

Trimmed-mean-based aggregation

With the Byzantine failure model defined in Equation ([equ:byz_worker]) and ([equ:byz_model]), we propose two trimmed-mean-based aggregation rules, which are Byzantine resilient under certain conditions.

Trimmed mean

To define the trimmed mean, we first define the order statistics.

(Order Statistics) By sorting the scalar sequence $`\{u_i: i \in [m]\}`$, we get $`u_{1:m} \leq u_{2:m} \leq \ldots \leq u_{m:m}`$, where $`u_{k:m}`$ is the $`k`$th smallest element in $`\{u_i: i \in [m]\}`$.

Then, we define the trimmed mean.

(Trimmed Mean) For $`b \in \mathbb{Z} \cap [0, \lceil m/2 \rceil - 1]`$, the $`b`$-trimmed mean of the set of scalars $`\{u_i: i \in [m]\}`$ is defined as follows:

\begin{align*}
\trmean_b(\{u_i: i \in [m]\}) = \frac{1}{m-2b} \sum_{k=b+1}^{m-b} u_{k:m},
\end{align*}

where $`u_{k:m}`$ is the $`k`$th smallest element in $`\{u_i: i \in [m]\}`$ defined in Definition [def:ord_stat]. The high-dimensional version, $`Trmean_b(\{\tilde{v}_i: i \in [m]\})`$, simply applies $`\trmean_b(\cdot)`$ in the coordinate-wise manner.
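A minimal sketch of this definition (illustrative code, sorting for simplicity rather than linear-time selection):

```python
def trmean(values, b):
    """b-trimmed mean of scalars: drop the b smallest and b largest, average the rest."""
    s = sorted(values)
    kept = s[b: len(s) - b]  # u_{b+1:m}, ..., u_{m-b:m}
    return sum(kept) / len(kept)

def trmean_vectors(grads, b):
    """High-dimensional Trmean_b: apply trmean coordinate-wise."""
    d = len(grads[0])
    return [trmean([g[j] for g in grads], b) for j in range(d)]
```

Note that the trimming is per dimension, which is what makes the rule robust under the generalized Byzantine model: an outlier in one coordinate does not force the whole vector to be discarded.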

The following theorem claims that by using $`\trmean_b(\cdot)`$, the resulting vector is dimensional Byzantine resilient. A proof is provided in the appendix.

(Bounded Variance) Let $`v_1, \ldots, v_m`$ be any i.i.d. random $`d`$-dimensional vectors s.t. $`v_i \sim G`$, with $`\E[G] = g`$ and $`\E\|G-g\|^2 \leq V`$. In each dimension, $`q`$ values are Byzantine, which yields $`\{\tilde{v}_i: i \in [m]\}`$. If $`2q < m`$, we have $`\E \|\trmean_b(\{\tilde{v}_i: i \in [m]\}) - g \|^2 \leq \Delta_1,`$ where $`\Delta_1 = \frac{2(b+1)(m-q)}{(m-b-q)^2} V.`$

Theorem [thm:bound_trmean_var] tells us that the upper bound of the variance $`\E \left\| \trmean_b(\{\tilde{v}_i: i \in [m]\}) - g \right\|^2`$ decreases when $`m`$ increases, $`b`$ decreases, $`q`$ decreases, or $`V`$ decreases.

Beyond trimmed mean

Using the trimmed mean, we have to drop $`2b`$ elements in each dimension. In this section, we explore the possibility of aggregating more elements. Specifically, for each dimension, we take the average of the $`m-b`$ values nearest to the trimmed mean. We call the resulting aggregation rule Phocas 1, which is defined as follows:

(Phocas) We sort the scalar sequence $`\{u_i: i \in [m]\}`$ by using the distance to a certain value $`y`$: $`|u_{1/y} - y| \leq |u_{2/y} - y| \leq \ldots \leq |u_{m/y} - y|,`$ where $`u_{k/y}`$ is the $`k`$th nearest element to $`y`$ in $`\{u_i: i \in [m]\}`$. Phocas is the average of the first $`(m-b)`$ nearest elements to the $`b`$-trimmed mean $`Trmean_b = \trmean_b(\{u_i: i \in [m]\})`$:

\begin{align*}
\phocas_b(\{u_i: i \in [m]\}) = \frac{\sum_{i=1}^{m-b} u_{i/Trmean_b}}{m-b}.
\end{align*}

The high-dimensional version, $`\phocas_b(\{\tilde{v}_i: i \in [m]\})`$, simply applies $`\phocas_b(\cdot)`$ in the coordinate-wise manner.
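The scalar rule can be sketched as follows (our illustrative code; `trmean` is repeated here so the snippet is self-contained):

```python
def trmean(values, b):
    """b-trimmed mean, as in the previous definition."""
    s = sorted(values)
    return sum(s[b: len(s) - b]) / (len(s) - 2 * b)

def phocas(values, b):
    """Phocas: average of the m - b values nearest to the b-trimmed mean."""
    center = trmean(values, b)
    # Sort by distance to the trimmed mean and keep the m - b nearest values.
    nearest = sorted(values, key=lambda u: abs(u - center))[: len(values) - b]
    return sum(nearest) / len(nearest)
```

For example, on `[1, 2, 3, 100]` with `b = 1`, the trimmed mean is `2.5`, and Phocas averages the three nearest values `{2, 3, 1}`, keeping one more correct element than the trimmed mean does.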

We show that $`\phocas(\cdot)`$ is dimensional Byzantine-resilient.

(Bounded Variance) Let $`v_1, \ldots, v_m`$ be any i.i.d. random $`d`$-dimensional vectors s.t. $`v_i \sim G`$, with $`\E[G] = g`$ and $`\E\|G-g\|^2 \leq V`$. In each dimension, $`q`$ values are Byzantine, which yields $`\{\tilde{v}_i: i \in [m]\}`$. If $`2q < m`$, we have $`\E \|\phocas_b(\{\tilde{v}_i: i \in [m]\}) - g \|^2 \leq \Delta_2,`$ where $`\Delta_2 = \left[4 + \frac{12(b+1)(m-q)}{(m-b-q)^2} \right] V.`$

The Phocas aggregation can be viewed as a trimmed average centering at the trimmed mean, which filters out the values far away from the trimmed mean. Similar to the trimmed mean, the variance of Phocas decreases when $`m`$ increases, $`b`$ decreases, $`q`$ decreases, or $`V`$ decreases.
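The two bounds are easy to evaluate side by side (the parameter values below are our own illustrative choices, not from the paper):

```python
def delta_trmean(m, b, q, V):
    """Variance bound Delta_1 = 2(b+1)(m-q) / (m-b-q)^2 * V for the b-trimmed mean."""
    return 2 * (b + 1) * (m - q) / (m - b - q) ** 2 * V

def delta_phocas(m, b, q, V):
    """Variance bound Delta_2 = [4 + 12(b+1)(m-q) / (m-b-q)^2] * V for Phocas."""
    return (4 + 12 * (b + 1) * (m - q) / (m - b - q) ** 2) * V
```

For instance, with `m = 10`, `b = q = 2`, and `V = 1`, the bounds are `4/3` and `12` respectively; both shrink as `m` grows or as `b`, `q`, `V` shrink, matching the monotonicity discussed above.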

Convergence analysis

In this section, we provide the convergence guarantees for synchronous SGD with $`\Delta`$-Byzantine-resilient aggregation rules. The proofs can be found in the appendix. We first introduce the two conditions necessary in our convergence analysis.

If $`F(x)`$ is $`L_F`$-smooth, then $`F(y) - F(x) \leq \ip{\nabla F(x)}{y-x} + \frac{L_F}{2} \|y-x\|^2, \forall x,y \in \R^d,`$ where $`L_F \geq 0`$. If $`F(x)`$ is $`\mu_F`$-strongly convex, then $`\ip{\nabla F(x)}{y-x} + \frac{\mu_F}{2} \|y-x\|^2 \leq F(y) - F(x), \forall x,y \in \R^d,`$ where $`\mu_F \geq 0`$.

First, we prove that for strongly convex and smooth loss functions, SGD with $`\Delta`$-Byzantine-resilient aggregation rules has linear convergence with a constant error.

Assume that $`F(x)`$ is $`\mu_F`$-strongly convex and $`L_F`$-smooth, where $`0 < \mu_F \leq L_F`$. We take $`\gamma \leq \frac{2}{\mu_F + L_F}`$. In any iteration $`t`$, the correct gradients are $`v_i^t = \nabla F_i(x^t)`$. Using any (classic or dimensional) $`\Delta`$-Byzantine-resilient aggregation rule with the corresponding assumptions, we obtain linear convergence with a constant error after $`T`$ iterations of synchronous SGD:

\begin{align*}
\E\|x^{T} - x^*\| \leq \left( 1- \frac{\gamma \mu_F L_F}{\mu_F + L_F} \right)^T \|x^0-x^*\| + \frac{\mu_F + L_F}{\mu_F L_F} \gamma  \sqrt{\Delta}.
\end{align*}

Then, we prove the convergence of SGD for general smooth loss functions.

Assume that $`F(x)`$ is $`L_F`$-smooth and potentially non-convex, where $`0 < L_F`$. We take $`\gamma \leq \frac{1}{L_F}`$. In any iteration $`t`$, the correct gradients are $`v_i^t = \nabla F_i(x^t)`$. Using any (classic or dimensional) $`\Delta`$-Byzantine-resilient aggregation rule with the corresponding assumptions, after $`T`$ iterations of synchronous SGD the averaged squared gradient norm is bounded up to a constant error:

\begin{align*}
\frac{\sum_{i=0}^{T-1} \E\|\nabla F(x^i)\|^2}{T} \leq \frac{2}{\gamma T} \left[ F(x^0) - F(x^*) \right] + \Delta.
\end{align*}
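As a sanity check on these guarantees, consider a toy simulation (ours, not the paper's experiments; `run_sgd`, the quadratic loss, and the constant-value attack are illustrative assumptions): synchronous SGD on $`F(x) = x^2/2`$ (so $`\nabla F(x) = x`$ and $`x^* = 0`$) with $`m = 10`$ workers, $`q = 2`$ of which always send a large arbitrary value. Plain averaging is dragged far from $`x^*`$, while the $`b`$-trimmed mean stays near it.

```python
import random

def trmean(vals, b):
    """b-trimmed mean, as defined earlier."""
    s = sorted(vals)
    return sum(s[b: len(s) - b]) / (len(s) - 2 * b)

def mean(vals):
    return sum(vals) / len(vals)

def run_sgd(aggr, steps=300, m=10, q=2, lr=0.1):
    """Synchronous SGD on F(x) = x^2 / 2: correct workers send x + noise,
    the q Byzantine workers always send the arbitrary value 100.0."""
    random.seed(0)
    x = 5.0
    for _ in range(steps):
        grads = [x + random.gauss(0.0, 0.1) for _ in range(m - q)]
        grads += [100.0] * q  # Byzantine gradients
        x -= lr * aggr(grads)
    return x
```

Under averaging, the attack shifts the fixed point of the update to roughly $`x = -25`$, whereas with `trmean(·, 2)` the two corrupted values are trimmed away every step and the iterate settles near zero.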

Time complexity

For the trimmed mean, we only need the order statistics of each dimension. To find the $`k`$th smallest element, we use the selection algorithm, which runs in linear time. Since the trimmed mean needs $`m-2b`$ order statistics per dimension, the overall time complexity is $`O(d(m-2b)m)`$. When $`b`$ is large, the factor $`m-2b \ll m`$ can be treated as a constant, which yields a nearly linear time complexity of $`O(dm)`$. When $`b`$ is small, it is cheaper to sort each dimension instead, which takes $`O(dm\log m)`$. For Phocas, the computation in addition to the trimmed mean takes linear time $`O(dm)`$, so its time complexity is the same as Trmean. Note that for Krum and Multi-Krum, the time complexity is $`O(dm^2)`$ .
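The per-dimension selection step can be sketched with randomized quickselect (an illustrative expected-linear-time variant; the source's citation may refer to a deterministic worst-case-linear algorithm):

```python
import random

def kth_smallest(vals, k):
    """Randomized quickselect: the k-th smallest element (1-indexed)
    in expected linear time, as used per dimension by the trimmed mean."""
    pivot = random.choice(vals)
    less = [v for v in vals if v < pivot]
    equal = [v for v in vals if v == pivot]
    if k <= len(less):
        return kth_smallest(less, k)
    if k <= len(less) + len(equal):
        return pivot
    # Recurse on the strictly larger elements, adjusting the rank k.
    return kth_smallest([v for v in vals if v > pivot], k - len(less) - len(equal))
```

Each recursive call discards a constant fraction of the elements in expectation, which is where the expected $`O(m)`$ cost per order statistic comes from.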


  1. The name of a Byzantine emperor. ↩︎