Authors: Alham Fikri Aji, Kenneth Heafield, Nikolay Bogoychev
Where published: EMNLP-IJCNLP 2019
Abstract URL: https://www.aclweb.org/anthology/D19-1373/
One way to reduce network traffic in multi-node data-parallel stochastic gradient descent is to only exchange the largest gradients. However, doing so damages the gradient and degrades the model's performance. Transformer models degrade dramatically while the impact on RNNs is smaller. We restore gradient quality by combining the compressed global gradient with the node's locally computed uncompressed gradient. Neural machine translation experiments show that Transformer convergence is restored while RNNs converge faster. With our method, training on 4 nodes converges up to 1.5x as fast as with uncompressed gradients and scales 3.5x relative to single-node training.
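Code sketch: the following is a minimal, self-contained illustration of the idea described in the abstract, written as one plausible realization rather than the authors' implementation. Each node keeps its full local gradient; only the top-k largest-magnitude entries are exchanged (the cross-node all-reduce is simulated here by a plain sum), and the sparse global gradient is combined with the node's local gradient by falling back to the local value wherever no global entry was exchanged. The helper names top_k_mask and combined_update, and the simple fallback rule, are assumptions for illustration.

    # Sketch of top-k gradient exchange combined with the local gradient.
    import numpy as np

    def top_k_mask(grad: np.ndarray, k: int) -> np.ndarray:
        """Boolean mask selecting the k largest-magnitude entries of grad."""
        idx = np.argpartition(np.abs(grad), -k)[-k:]
        mask = np.zeros(grad.shape, dtype=bool)
        mask[idx] = True
        return mask

    def combined_update(local_grads: list[np.ndarray], k: int) -> list[np.ndarray]:
        """One update per node: the sparse global gradient where entries were
        exchanged, the node's own local gradient everywhere else."""
        # Each node contributes only its top-k entries to the exchange.
        sparse = [g * top_k_mask(g, k) for g in local_grads]
        global_sparse = np.sum(sparse, axis=0)   # stands in for a sparse all-reduce
        exchanged = global_sparse != 0           # coordinates with global information
        return [np.where(exchanged, global_sparse, g) for g in local_grads]

    # Toy usage: 4 "nodes", each with a random local gradient of 10 parameters.
    rng = np.random.default_rng(0)
    grads = [rng.normal(size=10) for _ in range(4)]
    updates = combined_update(grads, k=2)
    print(updates[0])

In a real multi-node run the summation over nodes would be a sparse all-reduce, and the exact combination rule and any error feedback for dropped entries follow the paper rather than this toy fallback.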