Authors: Alham Fikri Aji, Kenneth Heafield, Nikolay Bogoychev
Where published: EMNLP-IJCNLP 2019
Abstract URL: https://www.aclweb.org/anthology/D19-1373/
One way to reduce network traffic in multi-node data-parallel stochastic gradient descent is to only exchange the largest gradients. However, doing so damages the gradient and degrades the model's performance. Transformer models degrade dramatically while the impact on RNNs is smaller. We restore gradient quality by combining the compressed global gradient with the node's locally computed uncompressed gradient. Neural machine translation experiments show that Transformer convergence is restored while RNNs converge faster. With our method, training on 4 nodes converges up to 1.5x as fast as with uncompressed gradients and scales 3.5x relative to single-node training.
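Code sketch: the following is a minimal, self-contained illustration of the idea described in the abstract, written as one plausible realization rather than the authors' implementation. Each node keeps its full local gradient; only the top-k largest-magnitude entries are exchanged (the cross-node all-reduce is simulated here by a plain sum), and the sparse global gradient is combined with the node's local gradient by falling back to the local value wherever no global entry was exchanged. The helper names top_k_mask and combined_update, and the simple fallback rule, are assumptions for illustration.

    # Sketch of top-k gradient exchange combined with the local gradient.
    import numpy as np

    def top_k_mask(grad: np.ndarray, k: int) -> np.ndarray:
        """Boolean mask selecting the k largest-magnitude entries of grad."""
        idx = np.argpartition(np.abs(grad), -k)[-k:]
        mask = np.zeros(grad.shape, dtype=bool)
        mask[idx] = True
        return mask

    def combined_update(local_grads: list[np.ndarray], k: int) -> list[np.ndarray]:
        """One update per node: the sparse global gradient where entries were
        exchanged, the node's own local gradient everywhere else."""
        # Each node contributes only its top-k entries to the exchange.
        sparse = [g * top_k_mask(g, k) for g in local_grads]
        global_sparse = np.sum(sparse, axis=0)   # stands in for a sparse all-reduce
        exchanged = global_sparse != 0           # coordinates with global information
        return [np.where(exchanged, global_sparse, g) for g in local_grads]

    # Toy usage: 4 "nodes", each with a random local gradient of 10 parameters.
    rng = np.random.default_rng(0)
    grads = [rng.normal(size=10) for _ in range(4)]
    updates = combined_update(grads, k=2)
    print(updates[0])

In a real multi-node run the summation over nodes would be a sparse all-reduce, and the exact combination rule and any error feedback for dropped entries follow the paper rather than this toy fallback.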