Approximating neural machine translation for efficiency
Aji, Alham Fikri
Neural machine translation (NMT) has been shown to outperform statistical machine translation. However, NMT models typically require a large number of parameters and are expensive to train and deploy. Moreover, their large size makes parallel training inefficient due to costly network communication. Likewise, distributing the model and running it locally on a client, such as a web browser or mobile device, remains challenging. This thesis investigates ways to approximately train an NMT system by compressing either the gradients or the parameters for faster communication or reduced memory consumption. We propose a gradient compression technique that exchanges only the top 1% of the most significant gradient values while delaying the rest to be considered in the next iteration. This method reduces the network communication cost 50-fold but causes noisy gradient updates. We also find that the Transformer, the current state-of-the-art NMT architecture, is highly sensitive to noisy gradients. Therefore, we extend the compression technique by restoring the compressed gradient with locally computed gradients. We obtain a linear scale-up in parallel training without sacrificing model performance. We also explore transfer learning as a better method of initialising the training. With transfer learning, the model converges faster and can be trained with more aggressive hyperparameters. Lastly, we propose a log-based quantisation method to compress the model size. Models are quantised to 4-bit precision with no noticeable quality degradation when re-training is combined with retaining the quantisation errors as feedback.
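
As an illustration of the gradient compression idea described above, the following is a minimal sketch rather than the thesis implementation. It assumes that "most significant" means largest absolute value, and it represents the exchanged update as a mapping from index to value. The function name `sparsify_with_residual` and the parameter `keep_ratio` are hypothetical names chosen for this example.

```python
def sparsify_with_residual(gradient, residual, keep_ratio=0.01):
    """Keep the largest-magnitude entries and delay the rest.

    Assumption: significance is measured by absolute value.
    Returns (sparse_update, new_residual), where sparse_update maps
    index -> value for the entries that would be exchanged, and
    new_residual holds the delayed values for the next iteration.
    """
    # Add the values delayed from earlier iterations to the fresh gradient.
    accumulated = [g + r for g, r in zip(gradient, residual)]

    # Number of entries to exchange (always at least one).
    k = max(1, int(len(accumulated) * keep_ratio))

    # Indices of the k entries with the largest absolute value.
    by_magnitude = sorted(
        range(len(accumulated)),
        key=lambda i: abs(accumulated[i]),
        reverse=True,
    )
    kept = set(by_magnitude[:k])

    # Exchanged entries leave the residual; everything else is delayed.
    sparse_update = {i: accumulated[i] for i in kept}
    new_residual = [
        0.0 if i in kept else value for i, value in enumerate(accumulated)
    ]
    return sparse_update, new_residual


# Two iterations on a toy gradient of four values, keeping one value each time.
residual = [0.0, 0.0, 0.0, 0.0]
update, residual = sparsify_with_residual(
    [0.1, -5.0, 0.2, 0.3], residual, keep_ratio=0.25
)
update, residual = sparsify_with_residual(
    [0.1, 0.0, 0.2, 0.3], residual, keep_ratio=0.25
)
```

In this sketch, small values that are not exchanged in one iteration accumulate in the residual, so they can eventually become large enough to be selected in a later iteration instead of being discarded.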