Structural pruning for speed in neural machine translation
Authors
Behnke, Maximiliana
Abstract
Neural machine translation (NMT) strongly outperforms previous statistical techniques. Since
the emergence of the transformer architecture, we consistently train and deploy deeper and
larger models, often with billions of parameters, in an ongoing effort to achieve ever better
quality. At the same time, there is a constant pursuit of optimisation opportunities to
reduce inference runtime.
Parameter pruning is a staple optimisation technique. Although coefficient-wise sparsity is
the most popular choice for compression, it rarely makes a model run faster by itself:
sparse matrix multiplication requires custom routines, usually relying on low-level hardware
implementations for the best efficiency. In my thesis, I focus on structural pruning for
NMT, which yields smaller but still dense architectures that need no further modification
to run efficiently.
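The contrast above can be illustrated with a minimal sketch (my own, not taken from the thesis): structurally pruning a weight matrix means deleting whole rows or columns, so the result is a smaller dense matrix that ordinary GEMM routines handle at full speed. The median-norm criterion here is a hypothetical stand-in for a real pruning criterion.

```python
import numpy as np

def prune_columns(W, keep_mask):
    """Structurally prune a dense weight matrix by dropping whole columns.

    Unlike coefficient-wise sparsity, the result is a smaller *dense*
    matrix, so no custom sparse kernels are needed to run it fast.
    """
    return W[:, keep_mask]

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 6))  # toy feedforward weight matrix

# Hypothetical criterion: keep columns whose L2 norm is above the median.
norms = np.linalg.norm(W, axis=0)
keep = norms > np.median(norms)
W_small = prune_columns(W, keep)

# The pruned matrix is dense and strictly narrower than the original.
assert W_small.shape == (4, keep.sum())
```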
My research follows two main directions. The first explores the Lottery Ticket Hypothesis
(LTH), a well-known pruning algorithm, but in a structural setup with a custom pruning
criterion. It interleaves partial training and pruning steps in a loop. Experiments with
LTH produced a substantial speed-up when pruning heads in the attention mechanism
of a transformer. While this method has proven successful, it carries the burden of prolonged
training, making an already expensive training routine even more costly.
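The train-then-prune loop can be sketched as follows. This is a simplified illustration under my own assumptions, not the thesis procedure: `train_fn` stands in for a partial training run that returns per-head statistics, and the importance criterion is a hypothetical mean-absolute-score heuristic.

```python
import numpy as np

def head_importance(scores):
    # Hypothetical criterion: mean absolute contribution per attention head.
    return np.abs(scores).mean(axis=1)

def lth_style_head_pruning(n_heads, rounds, prune_per_round, train_fn):
    """Sketch of an LTH-style loop: partially train, score heads, prune the
    least important ones, and repeat for a fixed number of rounds.
    (In the LTH setting, weights would be rewound before each retraining.)"""
    active = np.ones(n_heads, dtype=bool)
    for _ in range(rounds):
        scores = train_fn(active)          # partial training; per-head stats
        imp = head_importance(scores)
        imp[~active] = np.inf              # never re-rank pruned heads
        for idx in np.argsort(imp)[:prune_per_round]:
            active[idx] = False
    return active

# Toy usage with a stand-in "training" function.
rng = np.random.default_rng(1)
toy_train = lambda active: rng.normal(size=(8, 16)) * active[:, None]
mask = lth_style_head_pruning(n_heads=8, rounds=3, prune_per_round=2,
                              train_fn=toy_train)
assert mask.sum() == 2  # 8 heads minus 2 pruned per round over 3 rounds
```

The repeated partial-training calls are exactly where the prolonged training cost mentioned above comes from: the loop multiplies an already expensive routine by the number of pruning rounds.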
From that point, I concentrate exclusively on incorporating pruning into training via
regularisation. I experiment with standard group lasso, which zeroes out parameters together
in pre-defined structural groups. By targeting the feedforward and attention layers of a
transformer, group lasso significantly improves the inference speed of already optimised
state-of-the-art fast models. Building on that work, I designed a novel approach called aided
regularisation, in which each layer's penalty is scaled by statistics gathered as training
progresses. Both the gradient- and parameter-based variants aim to decrease the depth of a
model, further improving speed while maintaining the translation quality of an unpruned baseline.
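As a minimal sketch of the regularisation idea (my own illustration, not the thesis formulation): group lasso penalises the L2 norm of each pre-defined group, here the columns of a weight matrix, which drives whole groups to zero together; the `aided_scale` function is a hypothetical example of per-layer penalty scaling driven by training statistics.

```python
import numpy as np

def group_lasso_penalty(W, lam=0.01):
    """Group lasso over columns of W: penalising each column's L2 norm
    pushes entire columns (structural groups) to zero together, rather
    than zeroing individual coefficients independently."""
    return lam * np.linalg.norm(W, axis=0).sum()

def aided_scale(grad_norms):
    # Hypothetical "aided" scaling: layers with less gradient activity
    # receive a stronger penalty, nudging the model to shed whole layers.
    return grad_norms.max() / (grad_norms + 1e-8)

W = np.array([[3.0, 0.0],
              [4.0, 0.0]])
# One column with norm 5 and one zero column -> penalty 0.01 * 5 = 0.05.
assert np.isclose(group_lasso_penalty(W, lam=0.01), 0.05)

grad_norms = np.array([4.0, 2.0, 1.0])
scales = aided_scale(grad_norms)
assert scales[2] > scales[0]  # least active layer is penalised hardest
```

The penalty term would simply be added to the training loss, so pruning pressure is applied during training itself rather than in a separate post-hoc step.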
The goal of this dissertation is to advance the state of the art in efficient NMT with simple
but effective structural sparsity methods. Most experiments in the thesis use highly optimised
models as baselines to show that this work pushes the Pareto frontier of the quality-versus-speed
trade-off forward. For example, a model can be pruned to run 50% faster with no change in
translation quality.