Improving natural language processing for under-served languages through increased training data diversity

Burchell, Laurie Vear

Files

BurchellLV_2024.pdf (1.29 MB)

Date

2024-10-21

Authors

Burchell, Laurie Vear

Abstract

More and better data is often the most effective way to improve the quality of natural language processing (NLP), with the highest-performing applications requiring terabytes of data. However, most of the world's language varieties have nowhere near this amount of data available, which limits performance. This thesis aims to increase the diversity of training data for under-served language varieties as a means of improving downstream NLP applications.

We take two broad approaches to increasing diversity. First, we look at diverse data augmentation, quantifying different types of induced diversity and how these affect downstream performance. Using neural machine translation as a specific application, we measure the diversity of different methods of generating back translation (BT), a popular data augmentation method. We find that some types of diversity are more important than others for downstream performance and make recommendations about how to make BT more effective.

The second approach is to improve language identification (LID), a fundamental part of any data-gathering pipeline. Given that poor LID is a significant impediment to diverse corpus creation, we curate an open dataset covering around 200 language varieties to facilitate further research. We demonstrate the quality of this dataset by using it to train a high-performing LID model and by carrying out further analysis of its capabilities. We use our LID dataset and model to explore two challenging problems for LID: identifying code-switched text and improving classification for Arabic dialects. We focus on making these challenges tractable for realistic corpus building, employing metrics which reflect downstream performance more faithfully. Our findings demonstrate the limitations of current LID techniques and lay the groundwork for future research in this area.

A key finding throughout this thesis is that quality matters, particularly for under-served languages. Furthermore, even as corpus sizes grow, it is crucial not to lose sight of the quirks of individual languages. We provide resources and future research directions for increasing the diversity of useful training data for under-served languages, and in so doing facilitate the development of effective NLP applications for a wider variety of users.

URI

https://hdl.handle.net/1842/42315
http://dx.doi.org/10.7488/era/5035

This item appears in the following Collection(s)

Informatics thesis and dissertation collection