Edinburgh Research Archive

Parameter-efficient transfer learning for pre-trained transformers

Authors

Cooper Stickland, Asa

Abstract

This thesis tackles a problem in machine learning and natural language processing (NLP) that I will refer to as ‘parameter-efficient transfer learning’. This involves taking large-scale, ‘general-purpose’ models trained on huge amounts of data and specializing them to a particular task while changing the underlying model as little as possible. A recent paradigm in machine learning is to ‘pre-train’ a model at large scale on unlabeled data before specializing it to a particular task. Typically this specialization means ‘full fine-tuning’: updating every parameter of the pre-trained model on the new task. In this thesis we consider an alternative in which we update only a subset of the pre-trained model parameters, or a small number of additional parameters, hence the term ‘parameter-efficient’ transfer learning. This can save on computation and storage space, unlock new capabilities, and in some situations outperform fine-tuning every parameter.

In the first section we consider parameter-efficient transfer learning on English classification tasks. Our first contribution is an approach to fine-tuning pre-trained models on multiple tasks simultaneously. Typical ‘multi-task’ approaches underperform task-specific models due to a lack of capacity and interference between tasks. We introduce small task-specific modules for each task, which enable us to match the performance of task-specific fine-tuning with only a fraction of the parameters. This initial exploration was done on models that are relatively small compared to the current state-of-the-art, and did not cover the popular approach of freezing the pre-trained model parameters and training only the small modules. In the second half of this section we address these limitations, contributing a survey of parameter-efficient approaches that shows which parameter-efficient architectures work best as model scale increases and details trade-offs between performance, memory efficiency and other factors.

In the second section we apply parameter-efficient transfer learning approaches to machine translation (MT), which involves modeling sequences rather than class labels and is multilingual rather than English-only. This means approaches designed for English classification can underperform. We explore adapting systems that have only been trained on an unsupervised objective (involving multilingual text but not machine translation) to the MT task. We were the first to apply parameter-efficient techniques to this problem. We explore which parts of the transformer sequence-to-sequence architecture are important to adapt, and what percentage of the original model we need to update to match fine-tuning every parameter. In further chapters we contribute a new approach in which we train independent ‘adapters’ (a particular parameter-efficient architecture) for source language, target language and ‘domain’ (e.g. legal text), allowing us to compose them in ways not seen during training. Finally, we contribute an extensive series of experiments on what matters for the performance of parameter-efficient methods on machine translation.
