Delayed Momentum Aggregation: Communication-efficient Byzantine-robust Federated Learning with Partial Participation


Partial participation is essential for communication-efficient federated learning at scale, yet existing Byzantine-robust methods typically assume full client participation. Under partial participation, a majority of the sampled clients may be Byzantine, and once Byzantine clients dominate the sample, existing methods break down immediately. We introduce delayed momentum aggregation, a principle by which the central server aggregates cached momentum from non-sampled clients alongside fresh momentum from sampled clients. This ensures that Byzantine clients remain a minority from the server's perspective even when they dominate the sampled set. We instantiate this principle in our optimizer DeMoA and analyze its convergence rate, showing that DeMoA is Byzantine-robust under partial participation. Experiments show that, with a 20% Byzantine ratio and only a 10% participation rate, DeMoA achieves the best accuracy in settings where existing methods fail empirically.


💡 Research Summary

The paper tackles a critical gap in federated learning (FL): the combination of partial client participation and Byzantine robustness. Existing Byzantine‑robust FL methods assume that every client participates in every communication round. In practice, especially on mobile or IoT devices, only a random subset of clients is available at any given time. This partial participation creates “Byzantine‑majority rounds” where the sampled set may contain more malicious clients than honest ones, even though the global Byzantine fraction (\delta) is below 0.5. In such rounds, conventional robust aggregators (median, Krum, Bulyan, etc.) fail because they only see the sampled updates and cannot distinguish malicious outliers when the majority is adversarial.
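To see how often Byzantine-majority rounds arise, one can compute the probability exactly under independent per-client sampling. The sketch below uses illustrative numbers matching the abstract's setting (100 clients, 20% Byzantine, 10% sampling probability); the function and parameter names are ours, not the paper's:

```python
import math

def binom_pmf(n, k, p):
    """Probability of exactly k successes in n independent Bernoulli(p) trials."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def byzantine_majority_prob(n_clients=100, delta=0.2, p_sample=0.1):
    """P(sampled Byzantine clients >= sampled honest clients) in a non-empty
    sample, under independent per-client sampling with probability p_sample."""
    n_byz = int(delta * n_clients)
    n_hon = n_clients - n_byz
    prob = 0.0
    for b in range(n_byz + 1):
        for h in range(n_hon + 1):
            if b >= h and b + h > 0:  # Byzantine clients (weakly) dominate
                prob += binom_pmf(n_byz, b, p_sample) * binom_pmf(n_hon, h, p_sample)
    return prob

print(f"{byzantine_majority_prob():.3f}")
```

With full participation (p_sample = 1.0) this probability is exactly zero whenever (\delta<0.5), which is precisely the regime existing robust aggregators rely on; under 10% sampling it becomes non-negligible, so such rounds will occur during training.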

The authors propose a new principle called Delayed Momentum Aggregation. The central server, in each round, aggregates not only the fresh momentum vectors from the sampled clients but also the most recent cached momentum from all non‑sampled clients. The cached momentum is “delayed” by the number of rounds since the client was last sampled, denoted (\tau(i,t)). A lightweight preprocessing function (P(\cdot)) removes the implicit momentum effect introduced by the delay, following ideas from asynchronous momentum SGD literature. By always including a (possibly stale) momentum estimate from every honest client, the server’s view of the client population always respects the global Byzantine bound (\delta<0.5). Consequently, the robust aggregator never sees a Byzantine majority, preserving its statistical guarantees.
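The server-side principle can be made concrete with a minimal sketch. Here coordinate-wise median is a stand-in robust aggregator, the preprocessing (P(\cdot)) is omitted (identity) for brevity, and all class and function names are illustrative rather than taken from the paper:

```python
import numpy as np

def coordinate_median(vectors):
    """Stand-in robust aggregator (coordinate-wise median); the paper's
    aggregator may differ, this is only illustrative."""
    return np.median(np.stack(vectors), axis=0)

class DelayedMomentumServer:
    """Sketch of delayed momentum aggregation: the server caches every
    client's last momentum and always aggregates over ALL clients, so the
    Byzantine fraction in its view never exceeds the global bound delta."""

    def __init__(self, n_clients, dim):
        # One cached momentum slot per client, initialized to zero.
        self.cache = [np.zeros(dim) for _ in range(n_clients)]

    def round(self, fresh):
        """fresh: dict {client_id: momentum} from the sampled clients only."""
        for i, m in fresh.items():
            self.cache[i] = m  # refresh entries of sampled clients
        # Aggregate fresh momenta together with cached (delayed) ones.
        return coordinate_median(self.cache)
```

Because aggregation always runs over all cached entries, a Byzantine-majority *sample* cannot swamp the aggregator as long as the *global* fraction satisfies (\delta<0.5).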

Building on this principle, the authors introduce DeMoA (Delayed Momentum Aggregation optimizer). In each round (t), each client is independently sampled with probability (p_t). For a sampled client (i), the local momentum is updated as

(m_i^t = \beta\, m_i^{t-\tau(i,t)} + (1-\beta)\, g_i(x^t)),

where (g_i(x^t)) is client (i)'s stochastic gradient at the current model (x^t) and (\beta) is the momentum parameter; non-sampled clients keep their cached momentum (m_i^{t-\tau(i,t)}) unchanged on the server.
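A minimal sketch of the sampled-client side, assuming the standard momentum-SGD recursion with momentum parameter (\beta) (the function name is hypothetical, and the paper's exact rule, including the delay correction (P(\cdot)), may differ):

```python
import numpy as np

def local_momentum_update(m_cached, grad, beta=0.9):
    """Hypothetical sampled-client update, assuming the standard momentum form
    m_i^t = beta * m_i^{t - tau(i,t)} + (1 - beta) * g_i(x^t).
    m_cached is the momentum from the round the client was last sampled."""
    return beta * m_cached + (1.0 - beta) * grad
```

The cached term carries a delay of (\tau(i,t)) rounds, which is why the server-side preprocessing described above is needed to remove the implicit momentum the delay introduces.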

