Edinburgh Research Archive

Structural pruning for speed in neural machine translation

dc.contributor.advisor
Heafield, Kenneth
dc.contributor.advisor
Grot, Boris
dc.contributor.advisor
Grundkiewicz, Roman
dc.contributor.author
Behnke, Maximiliana
dc.contributor.sponsor
Engineering and Physical Sciences Research Council (EPSRC)
en
dc.date.accessioned
2022-12-05T14:12:32Z
dc.date.available
2022-12-05T14:12:32Z
dc.date.issued
2022-12-05
dc.description.abstract
Neural machine translation (NMT) strongly outperforms previous statistical techniques. With the emergence of the transformer architecture, we consistently train and deploy deeper and larger models, often with billions of parameters, in an ongoing effort to achieve even better quality. At the same time, there is a constant pursuit of optimisation opportunities to reduce inference runtime. Parameter pruning is one of the staple optimisation techniques. Even though coefficient-wise sparsity is the most popular for compression purposes, it does not easily make a model run faster: sparse matrix multiplication routines require custom approaches, usually depending on low-level hardware implementations for maximum efficiency. In my thesis, I focus on structural pruning for NMT, which results in smaller but still dense architectures that need no further modifications to run efficiently. My research follows two main directions. The first explores the Lottery Ticket Hypothesis (LTH), a well-known pruning algorithm, but this time in a structural setup with a custom pruning criterion. It involves partial training and pruning steps performed in a loop. Experiments with LTH produced substantial speed-ups when applied to prune heads in the attention mechanism of a transformer. While this method has proven successful, it carries the burden of prolonged training that makes an already expensive training routine even more so. From that point, I concentrate exclusively on research that incorporates pruning into training via regularisation. I experiment with standard group lasso, which zeroes out parameters together in pre-defined structural groups. By targeting the feedforward and attention layers of a transformer, group lasso significantly improves inference speed even for already optimised state-of-the-art fast models.
Improving upon that work, I designed a novel approach called aided regularisation, where each layer's penalty is scaled based on statistics gathered as training progresses. Both gradient- and parameter-based variants aim to decrease the depth of a model, further optimising speed while maintaining the translation quality of an unpruned baseline. The goal of this dissertation is to advance state-of-the-art efficient NMT with simple but tangible structural sparsity methods. The majority of experiments in the thesis use highly optimised models as baselines to show that this work pushes the quality-vs-speed Pareto frontier forward. For example, it is possible to prune a model to be 50% faster with no change in translation quality.
en
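The group lasso mechanism the abstract describes can be sketched as follows. This is an illustrative NumPy example, not the thesis implementation: it treats each row of a weight matrix as one structural group (e.g. one neuron) and applies the standard group-lasso proximal step, which shrinks whole groups and zeroes out those whose norm falls below the threshold, yielding the dense, smaller matrices that make structural pruning fast.

```python
import numpy as np

def group_lasso_penalty(W, lam=0.01):
    # Group lasso term: lambda times the sum of L2 norms of the rows,
    # where each row is one structural group (e.g. a neuron's weights).
    return lam * np.sum(np.linalg.norm(W, axis=1))

def prox_group_lasso(W, step=0.1):
    # Proximal step: shrink each row's norm by `step`; rows whose norm
    # is below `step` are zeroed entirely, giving structured sparsity.
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - step / np.maximum(norms, 1e-12))
    return W * scale

W = np.array([[0.05, 0.02],   # small group -> pruned away
              [1.00, 2.00]])  # large group -> only slightly shrunk
W_pruned = prox_group_lasso(W, step=0.1)
# W_pruned[0] is all zeros; W_pruned[1] keeps its direction, slightly shrunk
```

After training with such a penalty, fully zeroed groups can be removed from the matrix, so inference runs on a smaller dense model with unmodified kernels.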
dc.identifier.uri
https://hdl.handle.net/1842/39558
dc.identifier.uri
http://dx.doi.org/10.7488/era/2808
dc.language.iso
en
en
dc.publisher
The University of Edinburgh
en
dc.relation.hasversion
Edinburgh’s Submissions to the 2020 Machine Translation Efficiency Task Bogoychev, N., Grundkiewicz, R., Aji, A. F., Behnke, M., Heafield, K., Kashyap, S., Farsarakis, E-I. & Chudyk, M., 10 Jul 2020, Proceedings of the Fourth Workshop on Neural Generation and Translation. Seattle: Association for Computational Linguistics (ACL), p. 218–224 7 p. Research output: Chapter in Book/Report/Conference proceeding › Conference contribution
en
dc.relation.hasversion
Losing Heads in the Lottery: Pruning Transformer Attention in Neural Machine Translation Behnke, M. & Heafield, K., 16 Nov 2020, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics (ACL), p. 2664–2674 11 p. Research output: Chapter in Book/Report/Conference proceeding › Conference contribution
en
dc.subject
NMT
en
dc.subject
machine translation
en
dc.subject
NLP
en
dc.subject
pruning
en
dc.subject
deep learning
en
dc.subject
efficiency
en
dc.subject
transformer
en
dc.subject
optimisation
en
dc.title
Structural pruning for speed in neural machine translation
en
dc.type
Thesis or Dissertation
en
dc.type.qualificationlevel
Doctoral
en
dc.type.qualificationname
PhD Doctor of Philosophy
en

Files

Original bundle

Name:
BehnkeM_2022.pdf
Size:
6.19 MB
Format:
Adobe Portable Document Format