
Combining Global Sparse Gradients with Local Gradients in Distributed Neural Network Training

lib:1f45db93f307f3d8 (v1.0.0)

Authors: Alham Fikri Aji, Kenneth Heafield, Nikolay Bogoychev
Where published: EMNLP-IJCNLP 2019
Abstract URL: https://www.aclweb.org/anthology/D19-1373/


One way to reduce network traffic in multi-node data-parallel stochastic gradient descent is to only exchange the largest gradients. However, doing so damages the gradient and degrades the model's performance. Transformer models degrade dramatically while the impact on RNNs is smaller. We restore gradient quality by combining the compressed global gradient with the node's locally computed uncompressed gradient. Neural machine translation experiments show that Transformer convergence is restored while RNNs converge faster. With our method, training on 4 nodes converges up to 1.5x as fast as with uncompressed gradients and scales 3.5x relative to single-node training.
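As a rough, hypothetical illustration of the combination the abstract describes (not the authors' released implementation), the NumPy sketch below has each node keep only its largest-magnitude gradient entries for exchange, then substitute its own full local gradient for its sparse contribution before applying the update. The helper names, the keep_ratio parameter, and the explicit Python sum standing in for a sparse all-reduce are assumptions made for illustration.

import numpy as np

def sparsify_top_k(grad, keep_ratio=0.01):
    """Hypothetical helper: keep only the largest-magnitude entries of a gradient.

    Returns a dense array that is zero everywhere except the top-k entries,
    i.e. the part of the gradient a node would send over the network.
    """
    flat = grad.ravel()
    k = max(1, int(keep_ratio * flat.size))
    # Indices of the k largest-magnitude entries.
    top_idx = np.argpartition(np.abs(flat), -k)[-k:]
    sparse = np.zeros_like(flat)
    sparse[top_idx] = flat[top_idx]
    return sparse.reshape(grad.shape)

def combine_global_and_local(local_grad, all_node_grads, keep_ratio=0.01):
    """Sketch of combining a sparse global gradient with the local dense gradient."""
    # Sum of every node's sparsified gradient; in a real system this would
    # arrive via a sparse all-reduce rather than an explicit Python sum.
    global_sparse = sum(sparsify_top_k(g, keep_ratio) for g in all_node_grads)
    # Replace this node's own sparse contribution with its full local gradient,
    # so the update sees the other nodes' top-k entries plus the complete
    # uncompressed local gradient.
    return global_sparse - sparsify_top_k(local_grad, keep_ratio) + local_grad

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    grads = [rng.normal(size=(4, 8)) for _ in range(4)]  # gradients from 4 nodes
    combined = combine_global_and_local(grads[0], grads, keep_ratio=0.1)
    print(combined.shape)  # (4, 8)

In this reading, communication cost stays proportional to the top-k entries exchanged per node, while the full local gradient fills in the information that sparsification would otherwise discard.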

Relevant initiatives  

Related knowledge about this paper:
  Reproduced results (crowd-benchmarking and competitions)
  Artifact and reproducibility checklists
  Common formats for research projects and shared artifacts
  Reproducibility initiatives
