# Optimal Gradient Compression for Distributed and Federated Learning

@article{Albasyoni2020OptimalGC, title={Optimal Gradient Compression for Distributed and Federated Learning}, author={Alyazeed Albasyoni and Mher Safaryan and Laurent Condat and Peter Richt{\'a}rik}, journal={ArXiv}, year={2020}, volume={abs/2010.03246} }

Communicating information, like gradient vectors, between computing nodes in distributed and federated learning is typically an unavoidable burden, resulting in scalability issues. Indeed, communication might be slow and costly. Recent advances in communication-efficient training algorithms have reduced this bottleneck by using compression techniques, in the form of sparsification, quantization, or low-rank approximation. Since compression is a lossy, or inexact, process, the iteration…

#### 9 Citations

Distributed Learning and Democratic Embeddings: Polynomial-Time Source Coding Schemes Can Achieve Minimax Lower Bounds for Distributed Gradient Descent under Communication Constraints

- Computer Science
- ArXiv
- 2021

The proposed source coding scheme can remarkably improve performance when used in conjunction with other compression schemes, and achieves the minimax-optimal convergence rate to within (near) constant factors for a class of communication-constrained distributed optimization algorithms.

Smoothness-Aware Quantization Techniques

- Computer Science, Mathematics
- ArXiv
- 2021

This work extends Safaryan et al.'s smoothness-aware compression strategy to arbitrary unbiased compression operators, which also includes sparsification, and provides extensive numerical evidence that their smoothness-aware quantization strategies outperform existing quantization schemes as well as the aforementioned smoothness-aware sparsification strategies with respect to all relevant success measures.

Rethinking gradient sparsification as total error minimization

- Computer Science, Mathematics
- ArXiv
- 2021

This work identifies that the total error, i.e. the sum of the compression errors over all iterations, encapsulates sparsification throughout training, and proposes a communication complexity model that minimizes the total error under a communication budget for the entire training.

Wyner-Ziv Estimators: Efficient Distributed Mean Estimation with Side Information

- Computer Science, Mathematics
- AISTATS
- 2021

Without any probabilistic assumptions on the underlying data, the problem of distributed mean estimation where the server has access to side information is studied and Wyner-Ziv estimators are proposed, which are communication and computationally efficient and near-optimal when an upper bound for the distance between the side information and the data is known.

MURANA: A Generic Framework for Stochastic Variance-Reduced Optimization

- Computer Science, Mathematics
- ArXiv
- 2021

A generic variance-reduced algorithm is proposed for minimizing a sum of several smooth functions plus a regularizer, in a sequential or distributed manner; it is formulated with general stochastic operators, which allows it to cover many existing randomization mechanisms within a unified framework.

Improved Communication Efficiency for Distributed Mean Estimation with Side Information

- Computer Science, Mathematics
- 2021 IEEE International Symposium on Information Theory (ISIT)
- 2021

This paper proposes a practical and efficient estimator based on an r-bit Wyner-Ziv estimator proposed by Mayekar et al., which requires no probabilistic assumption on the data.

Federated Learning with Spiking Neural Networks

- Computer Science
- ArXiv
- 2021

This paper proposes a federated learning framework for decentralized and privacy-preserving training of Spiking Neural Networks, and observes that SNNs outperform ANNs in terms of overall accuracy by over 15% when the data is distributed across a large number of clients in the federation, while providing up to 5.3× energy efficiency.

Layer-wise Adaptive Model Aggregation for Scalable Federated Learning

- Computer Science
- 2021

This work proposes FedLAMA, a layer-wise model aggregation scheme for scalable Federated Learning that reduces the communication cost by up to 60% for IID data and 70% for non-IID data while achieving a comparable accuracy to FedAvg.

Parallel and Distributed algorithms for ML problems

- Mathematics, Computer Science
- 2020

A survey is made of modern parallel and distributed approaches to solving sum-type convex minimization problems arising in ML applications.

#### References

SHOWING 1-10 OF 42 REFERENCES

Communication-Efficient Distributed Blockwise Momentum SGD with Error-Feedback

- Computer Science, Mathematics
- NeurIPS
- 2019

A general distributed compressed SGD with Nesterov's momentum is proposed, which achieves the same testing accuracy as momentum SGD using full-precision gradients, but with 46% less wall-clock time.

Natural Compression for Distributed Deep Learning

- Computer Science, Mathematics
- ArXiv
- 2019

This work introduces a new, simple yet theoretically and practically effective compression technique: *natural compression* (NC), which is applied individually to all entries of the to-be-compressed update vector and works by randomized rounding to the nearest (negative or positive) power of two, which can be computed in a "natural" way by ignoring the mantissa.
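The randomized power-of-two rounding described above can be sketched as follows (a minimal illustration, not the authors' implementation; the function name and scalar interface are our own assumptions):

```python
import math
import random

def natural_compress(x: float) -> float:
    """Randomized rounding of x to the nearest power of two, preserving sign.
    Unbiased: E[natural_compress(x)] == x."""
    if x == 0.0:
        return 0.0
    sign = 1.0 if x > 0 else -1.0
    mag = abs(x)
    low = 2.0 ** math.floor(math.log2(mag))  # largest power of two <= |x|
    if low == mag:                           # |x| is already a power of two
        return sign * mag
    high = 2.0 * low                         # smallest power of two >= |x|
    p = (mag - low) / (high - low)           # round up with this probability
    return sign * (high if random.random() < p else low)
```

For example, 5.0 is rounded to 8.0 with probability 1/4 and to 4.0 with probability 3/4, so the expectation is exactly 5.0.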

Federated Learning: Strategies for Improving Communication Efficiency

- Computer Science
- ArXiv
- 2016

Two ways to reduce the uplink communication costs are proposed: structured updates, where the user directly learns an update from a restricted space parametrized using a smaller number of variables, e.g. either low-rank or a random mask; and sketched updates, which learn a full model update and then compress it using a combination of quantization, random rotations, and subsampling.
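As a toy illustration of the random-mask idea, a masked update might be formed as in the sketch below (names and the keep fraction are our own assumptions; the shared seed lets the server reconstruct the mask):

```python
import numpy as np

def random_mask_update(update: np.ndarray, keep_frac: float, seed: int) -> np.ndarray:
    """Restrict a model update to a seeded random mask: only the surviving
    entries (plus the shared seed) need to be sent uplink."""
    rng = np.random.default_rng(seed)            # seed is shared with the server
    mask = rng.random(update.shape) < keep_frac  # keep roughly keep_frac of entries
    return np.where(mask, update, 0.0)
```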

Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training

- Computer Science, Mathematics
- ICLR
- 2018

This paper finds that 99.9% of the gradient exchange in distributed SGD is redundant, and proposes Deep Gradient Compression (DGC) to greatly reduce the communication bandwidth, which enables large-scale distributed training on inexpensive commodity 1 Gbps Ethernet and facilitates distributed training on mobile.

QSGD: Communication-Efficient SGD via Gradient Quantization and Encoding

- Computer Science
- NIPS
- 2017

Quantized SGD is proposed, a family of compression schemes for gradient updates which provides convergence guarantees and leads to significant reductions in end-to-end training time, and can be extended to stochastic variance-reduced techniques.
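The core of QSGD-style quantization is unbiased stochastic rounding onto s levels scaled by the vector norm; a simplified rendering is sketched below (our own assumptions on names and interface; the paper's efficient bit encoding of the result is omitted):

```python
import numpy as np

def qsgd_quantize(x: np.ndarray, s: int, rng: np.random.Generator) -> np.ndarray:
    """Unbiased stochastic quantization: each coordinate is rounded to one of
    the s+1 levels sign(x_i) * ||x||_2 * l/s, so that E[Q(x)] == x."""
    norm = np.linalg.norm(x)
    if norm == 0.0:
        return np.zeros_like(x)
    scaled = np.abs(x) / norm * s           # lies in [0, s]
    lower = np.floor(scaled)
    prob_up = scaled - lower                # probability of rounding up
    levels = lower + (rng.random(x.shape) < prob_up)
    return np.sign(x) * norm * levels / s
```

For x = (3, -4) with s = 4, every quantized coordinate is a signed multiple of ||x||/s = 1.25, and averaging many samples recovers x.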

DoubleSqueeze: Parallel Stochastic Gradient Descent with Double-Pass Error-Compensated Compression

- Computer Science
- ICML
- 2019

This work provides a detailed analysis on this two-pass communication model and its asynchronous parallel variant, with error-compensated compression both on the worker nodes and on the parameter server, and admits three very nice properties: it is compatible with an arbitrary compression technique, it admits an improved convergence rate, and it admits linear speedup with respect to the number of workers.

PowerSGD: Practical Low-Rank Gradient Compression for Distributed Optimization

- Computer Science, Mathematics
- NeurIPS
- 2019

A new low-rank gradient compressor based on power iteration is proposed that can compress gradients rapidly, efficiently aggregate the compressed gradients using all-reduce, and achieve test performance on par with SGD.

Distributed learning with compressed gradients

- Mathematics
- 2018

Asynchronous computation and gradient compression have emerged as two key techniques for achieving scalability in distributed optimization for large-scale machine learning. This paper presents a…

Distributed Learning with Compressed Gradient Differences

- Computer Science, Mathematics
- ArXiv
- 2019

This work proposes a new distributed learning method, DIANA, which resolves these issues via compression of gradient differences, performs a theoretical analysis in the strongly convex and nonconvex settings, and shows that its rates are superior to existing rates.

On Biased Compression for Distributed Learning

- Computer Science, Mathematics
- ArXiv
- 2020

It is shown for the first time that biased compressors can lead to linear convergence rates both in the single-node and distributed settings, and a new highly performing biased compressor is proposed, a combination of Top-k and natural dithering, which in the authors' experiments outperforms all other compression techniques.
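For reference, the Top-k half of that combination is the standard greedy sparsifier; a minimal sketch is shown below (this is the generic operator, not the paper's combined compressor):

```python
import numpy as np

def top_k(g: np.ndarray, k: int) -> np.ndarray:
    """Biased Top-k sparsifier: keep the k largest-magnitude entries of g,
    zero out the rest."""
    out = np.zeros_like(g)
    idx = np.argpartition(np.abs(g), -k)[-k:]  # indices of the k largest |g_i|
    out[idx] = g[idx]
    return out
```

The operator is biased (its expectation is not g), which is why such compressors need the error-feedback or dedicated analyses discussed above.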