Edinburgh Research Archive

Parameter-efficient transfer learning for pre-trained transformers

dc.contributor.advisor
Murray, Iain
dc.contributor.advisor
Titov, Ivan
dc.contributor.advisor
Hospedales, Timothy
dc.contributor.author
Cooper Stickland, Asa
dc.date.accessioned
2024-10-02T09:30:57Z
dc.date.available
2024-10-02T09:30:57Z
dc.date.issued
2024-10-02
dc.description.abstract
In this thesis I will tackle a problem in machine learning and natural language processing (NLP) that I will refer to as ‘parameter-efficient transfer learning’. This involves taking ‘general-purpose’, large-scale models trained on huge amounts of data and specializing them to a particular task without substantially changing the underlying model. A recent paradigm in machine learning has been to do large-scale ‘pre-training’ of a model on unsupervised data before specializing it to a particular task. Typically this means ‘full fine-tuning’: updating every parameter of the pre-trained model on the new task. In this thesis we consider an alternative to full fine-tuning in which we update only a subset of the pre-trained model's parameters (or a small number of additional parameters), hence the term ‘parameter-efficient’ transfer learning. This can save on computation and storage space, unlock new capabilities, and in some situations outperform fine-tuning every parameter. In the first section we consider parameter-efficient transfer learning on English classification tasks. Our first contribution is an approach to fine-tuning pre-trained models on multiple tasks simultaneously. Typical approaches underperform task-specific models due to a lack of capacity and interference between tasks. Our contribution is an approach to ‘multi-task’ learning in which we introduce small task-specific modules for each task, which enable us to achieve the same performance as task-specific fine-tuning with only a fraction of the parameters. This initial exploration was done on relatively small models compared to the current state-of-the-art, and did not cover a popular approach of freezing pre-trained model parameters and only training the small modules.
In the second half of this section we address these limitations, contributing a survey of parameter-efficient approaches, showing which parameter-efficient architectures work best as model scale increases, and detailing trade-offs between performance, memory efficiency and other factors. In the second section we consider applying parameter-efficient transfer learning approaches to machine translation (MT), which involves modeling sequences rather than class labels and is multilingual rather than English-only. This means approaches designed for English classification can underperform. We explore adapting systems that have only been trained on an unsupervised objective (involving multilingual text but not machine translation) to the MT task. We were the first to apply parameter-efficient techniques to this problem. We explore which parts of the transformer sequence-to-sequence architecture are important to adapt, and what percentage of the original model we need to update to match fine-tuning every parameter. In further chapters we contribute a new approach where we train independent ‘adapters’ (a particular parameter-efficient architecture) for source language, target language and ‘domain’ (e.g. legal text), allowing us to compose them in ways not seen during training. Finally, we contribute an extensive series of experiments on what matters for the performance of parameter-efficient methods on machine translation.
en
dc.identifier.uri
https://hdl.handle.net/1842/42243
dc.identifier.uri
http://dx.doi.org/10.7488/era/4963
dc.language.iso
en
en
dc.publisher
The University of Edinburgh
en
dc.relation.hasversion
A. Cooper Stickland, A. Berard, and V. Nikoulina. Multilingual domain adaptation for NMT: Decoupling language and domain information with adapters. In Proceedings of the Sixth Conference on Machine Translation, pages 578–598, Online, Nov. 2021. Association for Computational Linguistics. URL https://aclanthology.org/2021.wmt-1.64
en
dc.relation.hasversion
A. C. Stickland and I. Murray. BERT and PALs: Projected attention layers for efficient adaptation in multi-task learning. In International Conference on Machine Learning, pages 5986–5995, 2019. URL http://proceedings.mlr.press/v97/stickland19a.html
en
dc.relation.hasversion
A. C. Stickland, A. Bérard, and V. Nikoulina. Multilingual domain adaptation for NMT: decoupling language and domain information with adapters. Sixth Conference on Machine Translation (WMT2021), 2021a. URL https://www.statmt.org/wmt21/pdf/2021.wmt-1.64.pdf
en
dc.relation.hasversion
A. C. Stickland, X. Li, and M. Ghazvininejad. Recipes for adapting pre-trained monolingual and multilingual models to machine translation. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 3440–3453, Online, Apr. 2021b. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/2021.eacl-main.301
en
dc.relation.hasversion
A. Üstün and A. Cooper Stickland. When does parameter-efficient transfer learning work for machine translation? In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 7919–7933, Abu Dhabi, United Arab Emirates, Dec. 2022. Association for Computational Linguistics. URL https://aclanthology.org/2022.emnlp-main.540
en
dc.subject
Machine learning
en
dc.subject
Natural Language Processing
en
dc.title
Parameter-efficient transfer learning for pre-trained transformers
en
dc.type
Thesis or Dissertation
en
dc.type.qualificationlevel
Doctoral
en
dc.type.qualificationname
PhD Doctor of Philosophy
en

Files

Original bundle

Name:
Cooper SticklandA_2024.pdf
Size:
5.73 MB
Format:
Adobe Portable Document Format