Improving natural language processing for under-served languages through increased training data diversity

Burchell, Laurie Vear

dc.contributor.advisor

Heafield, Kenneth

dc.contributor.advisor

Birch-Mayne, Alexandra

dc.contributor.author

Burchell, Laurie Vear

dc.date.accessioned

2024-10-21T11:21:03Z

dc.date.available

2024-10-21T11:21:03Z

dc.date.issued

2024-10-21

dc.description.abstract

More and better data is often the most effective way to improve the quality of natural language processing (NLP), with the highest-performing applications requiring terabytes of data. However, most of the world's language varieties do not have anything like this amount of data available, limiting performance. This thesis aims to increase the diversity of training data for under-served language varieties as a means of improving downstream NLP applications. We take two broad approaches to increasing diversity in this thesis. We look firstly at diverse data augmentation, quantifying different types of induced diversity and how these affect downstream performance. Using neural machine translation as a specific application, we measure the diversity of different methods of generating back translation (BT), a popular data augmentation method. We find that some types of diversity are more important than others for downstream performance and make recommendations about how to make BT more effective. The second approach towards increasing training data diversity taken in this thesis is to improve language identification (LID), a fundamental part of any data-gathering pipeline. Given that poor LID is a significant impediment to diverse corpus creation, we curate an open dataset covering around 200 language varieties to facilitate further research. We demonstrate the quality of this dataset by using it to train a high-performing LID model and by carrying out further analysis into its capability. We use our LID dataset and model to explore two challenging problems for LID: identifying code-switched text and improving classification for Arabic dialects. We focus on making these challenges tractable for realistic corpus building, employing metrics which reflect downstream performance more faithfully. Our findings demonstrate the limitations of current LID techniques and lay the groundwork for future research in this area.
A key finding throughout this thesis is that quality matters, particularly for under-served languages. Furthermore, even as corpus sizes grow, it is crucial not to lose sight of the quirks of individual languages. We provide resources and future research directions for increasing the diversity of useful training data for under-served languages, and in so doing facilitate the development of effective NLP applications for a wider variety of users.

en

dc.identifier.uri

https://hdl.handle.net/1842/42315

dc.identifier.uri

http://dx.doi.org/10.7488/era/5035

dc.language.iso

en

dc.publisher

The University of Edinburgh

en

dc.relation.hasversion

Burchell, L., Birch, A., Bogoychev, N., and Heafield, K. (2023). An open dataset and model for language identification. In Rogers, A., Boyd-Graber, J., and Okazaki, N., editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 865–879, Toronto, Canada. Association for Computational Linguistics.

en

dc.relation.hasversion

Burchell, L., Birch, A., and Heafield, K. (2022). Exploring diversity in back translation for low-resource machine translation. In Cherry, C., Fan, A., Foster, G., Haffari, G. R., Khadivi, S., Peng, N. V., Ren, X., Shareghi, E., and Swayamdipta, S., editors, Proceedings of the Third Workshop on Deep Learning for Low-Resource Natural Language Processing, pages 67–79, Hybrid. Association for Computational Linguistics.

en

dc.relation.hasversion

Burchell, L., Birch, A., Thompson, R., and Heafield, K. (2024). Code-switched language identification is harder than you think. In Graham, Y. and Purver, M., editors, Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 646–658, St. Julian’s, Malta. Association for Computational Linguistics.

en

dc.rights.license

Creative Commons: Attribution (CC-BY)

en

dc.rights.uri

https://creativecommons.org/licenses/by/4.0/

en

dc.subject

machine translation

en

dc.subject

language identification

en

dc.subject

natural language processing

en

dc.subject

data augmentation

en

dc.subject

low-resource languages

en

dc.subject

under-served languages

en

dc.subject

code switching

en

dc.subject

Arabic dialects

en

dc.subject

dataset

en

dc.title

Improving natural language processing for under-served languages through increased training data diversity

en

dc.type

Thesis or Dissertation

en

dc.type.qualificationlevel

Doctoral

en

dc.type.qualificationname

PhD Doctor of Philosophy

en

Files

Original bundle

Name: BurchellLV_2024.pdf
Size: 1.29 MB
Format: Adobe Portable Document Format

This item appears in the following Collection(s)

Informatics thesis and dissertation collection