Improving the modelling of Arabic varieties in NLP

Keleg, Amr

Files

KelegA_2026.pdf (8.99 MB)

Date

2026-03-19

Abstract

Natural Language Processing (NLP) systems generally focus on supporting standardized varieties of languages. Developing systems for a non-standardized variety (dialect) requires finding or selecting samples in that variety, whether to create customized mixtures of pretraining data or to develop dialect-specific benchmarks. Dialect Identification (DI) is typically employed for this purpose. This thesis explores some limitations of the long-standing approach that frames DI as a single-label classification task, in which each sentence is linked to a single dialect. I focus specifically on Arabic, a language with a rich diversity of regional dialects. Arabic also exists in a diglossic state in which two varieties co-exist within the same speech community: Modern Standard Arabic (MSA) and local varieties of Dialectal Arabic (DA). The thesis's main contributions are twofold. First, the different levels between pure MSA sentences and highly colloquial sentences are operationalized as a continuous variable in the range [0, 1], termed the Arabic Level of Dialectness (ALDi). ALDi estimation is modeled as a regression task, with a fine-tuned BERT-based model achieving an RMSE of 0.18. Second, Arabic Dialect Identification (ADI) is reframed as a multi-label task, in which the validity of a sentence in each regional variety is assessed independently. This reframing is based on the finding that in 66% of the cases where a single-label ADI model made errors, both its predictions and the gold-standard labels were valid. Accordingly, each sentence has a set of regions in which it is valid and an ALDi score indicating how far it diverges from MSA. By definition, MSA sentences are expected to be labeled as valid in all the considered regions, with near-zero ALDi scores. Following the newly proposed framing, I created the first multi-label ADI dataset, consisting of 1,120 sentences (tweets) with ALDi ratings, labeled by 33 annotators from 11 Arab countries.
This dataset allowed for investigating some widely adopted assumptions about Arabic. For instance, I show that Arabic dialects overlap considerably at both the country and regional levels. Additionally, the conscious choice of dialect level that Arabic speakers make, operationalized as ALDi, is a better predictor of the number of dialects in which a sentence is valid than the sentence's length. Lastly, signs of systematic differences in the ALDi ratings that speakers of different dialects give to the same sentences point to the need for further investigation of how ALDi is annotated. ALDi is a valuable variable for many applications. For instance, I used it to identify stylistic differences in Arab presidents' speeches, which was previously only possible through qualitative analysis. For tasks requiring data annotation, I found that high-ALDi samples should be given higher priority for routing to speakers of the samples' dialects.
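
The framing described in the abstract can be illustrated with a minimal Python sketch. This is not code from the thesis: the class name LabeledSentence, the helper functions, the region names, and all values are hypothetical and chosen only for illustration. It shows a sentence carrying a set of regions in which it is valid plus an ALDi score in [0, 1], a check of whether a single-label prediction falls within that set, and the standard RMSE formula used to score ALDi regression.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class LabeledSentence:
    """Hypothetical record: one sentence under the multi-label framing."""
    text: str
    valid_regions: frozenset  # regions in which the sentence is judged valid
    aldi: float               # Arabic Level of Dialectness, in [0, 1]

    def __post_init__(self):
        if not 0.0 <= self.aldi <= 1.0:
            raise ValueError("ALDi must lie in [0, 1]")


def prediction_is_valid(predicted_region, sentence):
    """A single-label prediction counts as valid if it is any valid region."""
    return predicted_region in sentence.valid_regions


def rmse(gold, predicted):
    """Root mean squared error between gold and predicted ALDi scores."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(g - p) ** 2 for g, p in zip(gold, predicted)]
    return math.sqrt(sum(squared_errors) / len(gold))


# Illustrative values only; the text is a placeholder, not real data.
example = LabeledSentence(
    text="<sentence>",
    valid_regions=frozenset({"Egypt", "Sudan"}),
    aldi=0.7,
)
```

Under this representation, a model that predicts "Sudan" for a sentence whose single gold label is "Egypt" would not be counted as wrong when both regions are in valid_regions, which is the situation the 66% finding describes.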

URI

https://era.ed.ac.uk/handle/1842/44503
https://doi.org/10.7488/era/7020

This item appears in the following Collection(s)

Informatics thesis and dissertation collection