Edinburgh Research Archive

Improving natural language processing for under-served languages through increased training data diversity

dc.contributor.advisor
Heafield, Kenneth
dc.contributor.advisor
Birch-Mayne, Alexandra
dc.contributor.author
Burchell, Laurie Vear
dc.date.accessioned
2024-10-21T11:21:03Z
dc.date.available
2024-10-21T11:21:03Z
dc.date.issued
2024-10-21
dc.description.abstract
More and better data is often the most effective way to improve the quality of natural language processing (NLP), with the highest-performing applications requiring terabytes of data. However, most of the world's language varieties do not have anything like this amount of data available, limiting performance. This thesis aims to increase the diversity of training data for under-served language varieties as a means of improving downstream NLP applications. We take two broad approaches to increasing diversity in this thesis. We look firstly at diverse data augmentation, quantifying different types of induced diversity and how these affect downstream performance. Using neural machine translation as a specific application, we measure the diversity of different methods of generating back translation (BT), a popular data augmentation method. We find that some types of diversity are more important than others for downstream performance and make recommendations about how to make BT more effective. The second approach towards increasing training data diversity taken in this thesis is to improve language identification (LID), a fundamental part of any data-gathering pipeline. Given that poor LID is a significant impediment to diverse corpus creation, we curate an open dataset covering around 200 language varieties to facilitate further research. We demonstrate the quality of this dataset by using it to train a high-performing LID model and by carrying out further analysis into its capability. We use our LID dataset and model to explore two challenging problems for LID: identifying code-switched text and improving classification for Arabic dialects. We focus on making these challenges tractable for realistic corpus building, employing metrics which reflect downstream performance more faithfully. Our findings demonstrate the limitations of current LID techniques and lay the groundwork for future research in this area.
A key finding throughout this thesis is that quality matters, particularly for under-served languages. Furthermore, even as corpus sizes grow, it is crucial not to lose sight of the quirks of individual languages. We provide resources and future research directions for increasing the diversity of useful training data for under-served languages, and in so doing facilitate the development of effective NLP applications for a wider variety of users.
en
dc.identifier.uri
https://hdl.handle.net/1842/42315
dc.identifier.uri
http://dx.doi.org/10.7488/era/5035
dc.language.iso
en
en
dc.publisher
The University of Edinburgh
en
dc.relation.hasversion
Burchell, L., Birch, A., Bogoychev, N., and Heafield, K. (2023). An open dataset and model for language identification. In Rogers, A., Boyd-Graber, J., and Okazaki, N., editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 865–879, Toronto, Canada. Association for Computational Linguistics.
en
dc.relation.hasversion
Burchell, L., Birch, A., and Heafield, K. (2022). Exploring diversity in back translation for low-resource machine translation. In Cherry, C., Fan, A., Foster, G., Haffari, G. R., Khadivi, S., Peng, N. V., Ren, X., Shareghi, E., and Swayamdipta, S., editors, Proceedings of the Third Workshop on Deep Learning for Low-Resource Natural Language Processing, pages 67–79, Hybrid. Association for Computational Linguistics.
en
dc.relation.hasversion
Burchell, L., Birch, A., Thompson, R., and Heafield, K. (2024). Code-switched language identification is harder than you think. In Graham, Y. and Purver, M., editors, Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 646–658, St. Julian’s, Malta. Association for Computational Linguistics.
en
dc.rights.license
Creative Commons: Attribution (CC-BY)
en
dc.rights.uri
https://creativecommons.org/licenses/by/4.0/
en
dc.subject
machine translation
en
dc.subject
language identification
en
dc.subject
natural language processing
en
dc.subject
data augmentation
en
dc.subject
low-resource languages
en
dc.subject
under-served languages
en
dc.subject
code switching
en
dc.subject
Arabic dialects
en
dc.subject
dataset
en
dc.title
Improving natural language processing for under-served languages through increased training data diversity
en
dc.type
Thesis or Dissertation
en
dc.type.qualificationlevel
Doctoral
en
dc.type.qualificationname
PhD Doctor of Philosophy
en

Files

Original bundle

Name:
BurchellLV_2024.pdf
Size:
1.29 MB
Format:
Adobe Portable Document Format

This item appears in the following Collection(s)