Improving the modelling of Arabic varieties in NLP

Keleg, Amr

dc.contributor.advisor

Magdy, Walid

dc.contributor.advisor

Goldwater, Sharon

dc.contributor.author

Keleg, Amr

dc.contributor.sponsor

UKRI CDT in Natural Language Processing

dc.date.accessioned

2026-03-19T16:02:20Z

dc.date.issued

2026-03-19

dc.description.abstract

Natural Language Processing (NLP) systems generally focus on supporting standardized varieties of languages. Developing systems for a non-standardized variety (dialect) requires finding and selecting samples in this variety, either to create customized mixtures of pretraining data or to develop dialect-specific benchmarks. To this end, Dialect Identification (DI) is typically employed. This thesis explores some limitations of the long-standing approach that frames DI as a single-label classification task, where each sentence is linked to a single dialect. I specifically focus on Arabic, a language with a rich diversity of regional dialects. Arabic also exists in a diglossic state in which two varieties co-exist within the same speaking community: Modern Standard Arabic (MSA) and local varieties of Dialectal Arabic (DA). The thesis’s main contributions are twofold. First, the different levels between pure MSA sentences and highly colloquial sentences are operationalized as a continuous variable in the range [0, 1], termed the Arabic Level of Dialectness (ALDi). ALDi estimation is modeled as a regression task, with a fine-tuned BERT-based model achieving an RMSE of 0.18. Second, Arabic Dialect Identification (ADI) is reframed as a multi-label task, where the validity of a sentence in each regional variety is assessed independently. This is based on the finding that in 66% of the cases where a single-label ADI model made errors, both its predictions and the gold-standard labels were valid. Accordingly, each sentence has a set of regions in which it is valid and an ALDi score indicating how much it diverges from MSA. By definition, MSA sentences are expected to be labeled as valid in all the considered regions, with near-zero ALDi scores. Following the newly proposed framing, I created the first multi-label ADI dataset of 1,120 sentences (tweets), labeled by 33 annotators from 11 Arab countries, with ALDi ratings.
This dataset allowed for investigating some widely adopted assumptions about Arabic. For instance, I show that Arabic dialects overlap considerably at both the country and regional levels. Additionally, the conscious dialect-level choice that Arabic speakers make, operationalized as ALDi, is a better predictor of the number of dialects in which a sentence is valid than the sentence's length. Lastly, signs of systematic differences in the ALDi ratings that speakers of different dialects give to the same sentences show the need for further investigation of how ALDi is annotated. ALDi is a valuable variable for many applications. For instance, I used it to identify stylistic differences in Arab presidents’ speeches, an analysis previously possible only through qualitative methods. For tasks requiring data annotation, I found that high-ALDi samples should be given higher priority for routing to speakers of the samples’ dialects.
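The framing described in the abstract, where each sentence has a set of regions in which it is valid plus an ALDi score in [0, 1], and where ALDi estimation is evaluated with RMSE, can be illustrated with a minimal sketch. This is not the thesis's code: the class name `LabeledSentence`, the field names, the region names, and the example scores are all hypothetical and chosen only for illustration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class LabeledSentence:
    """Hypothetical record for one sentence under the multi-label framing."""
    text: str
    valid_regions: frozenset  # regions in which the sentence is judged valid
    aldi: float               # level of dialectness; 0 is MSA-like, 1 is highly colloquial

    def __post_init__(self):
        # The abstract defines ALDi as a continuous variable in [0, 1].
        if not 0.0 <= self.aldi <= 1.0:
            raise ValueError("ALDi score must lie in [0, 1]")


def rmse(gold, predicted):
    """Root mean squared error between gold and predicted scores."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(g - p) ** 2 for g, p in zip(gold, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Illustrative region inventory (an assumption, not the thesis's label set).
ALL_REGIONS = frozenset({"Egypt", "Levant", "Gulf"})

# An MSA-like sentence: valid in every considered region, near-zero ALDi.
msa_like = LabeledSentence("<placeholder MSA text>", ALL_REGIONS, 0.02)

# A highly colloquial sentence: valid in a single region, high ALDi.
colloquial = LabeledSentence("<placeholder dialectal text>", frozenset({"Egypt"}), 0.9)

gold_scores = [msa_like.aldi, colloquial.aldi]
model_scores = [0.10, 0.75]  # made-up predictions from some regression model
error = rmse(gold_scores, model_scores)
```

Storing the valid regions as a set rather than a single label is what lets one sentence be correct for several dialects at once, which is the case the single-label framing cannot represent.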

dc.identifier.uri

https://era.ed.ac.uk/handle/1842/44503

dc.identifier.uri

https://doi.org/10.7488/era/7020

dc.language.iso

en

dc.publisher

The University of Edinburgh

dc.relation.hasversion

Abdul-Mageed, M., Elmadany, A., Zhang, C., Nagoudi, E. M. B., Bouamor, H., and Habash, N. (2023). NADI 2023: The fourth nuanced Arabic dialect identification shared task. In Sawaf, H., El-Beltagy, S., Zaghouani, W., Magdy, W., Abdelali, A., Tomeh, N., Abu Farha, I., Habash, N., Khalifa, S., Keleg, A., Haddad, H., Zitouni, I., Mrini, K., and Almatham, R., editors, Proceedings of ArabicNLP 2023, pages 600–613, Singapore (Hybrid). Association for Computational Linguistics

dc.relation.hasversion

Abdul-Mageed, M., Keleg, A., Elmadany, A., Zhang, C., Hamed, I., Magdy, W., Bouamor, H., and Habash, N. (2024). NADI 2024: The fifth nuanced Arabic dialect identification shared task. In Habash, N., Bouamor, H., Eskander, R., Tomeh, N., Abu Farha, I., Abdelali, A., Touileb, S., Hamed, I., Onaizan, Y., Alhafni, B., Antoun, W., Khalifa, S., Haddad, H., Zitouni, I., AlKhamissi, B., Almatham, R., and Mrini, K., editors, Proceedings of the Second Arabic Natural Language Processing Conference, pages 709–728, Bangkok, Thailand. Association for Computational Linguistics

dc.relation.hasversion

Keleg, A., Goldwater, S., and Magdy, W. (2023). ALDi: Quantifying the Arabic level of dialectness of text. In Bouamor, H., Pino, J., and Bali, K., editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 10597–10611, Singapore. Association for Computational Linguistics

dc.relation.hasversion

Keleg, A., Goldwater, S., and Magdy, W. (2025). Revisiting common assumptions about Arabic dialects in NLP. In Che, W., Nabende, J., Shutova, E., and Pilehvar, M. T., editors, Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3309–3327, Vienna, Austria. Association for Computational Linguistics

dc.relation.hasversion

Keleg, A. and Magdy, W. (2023). Arabic dialect identification under scrutiny: Limitations of single-label classification. In Sawaf, H., El-Beltagy, S., Zaghouani, W., Magdy, W., Abdelali, A., Tomeh, N., Abu Farha, I., Habash, N., Khalifa, S., Keleg, A., Haddad, H., Zitouni, I., Mrini, K., and Almatham, R., editors, Proceedings of ArabicNLP 2023, pages 385–398, Singapore (Hybrid). Association for Computational Linguistics

dc.relation.hasversion

Keleg, A., Magdy, W., and Goldwater, S. (2024). Estimating the level of dialectness predicts inter-annotator agreement in multi-dialect Arabic datasets. In Ku, L.-W., Martins, A., and Srikumar, V., editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 766–777, Bangkok, Thailand. Association for Computational Linguistics

dc.relation.hasversion

Olsen, H., Touileb, S., and Velldal, E. (2023). Arabic dialect identification: An in-depth error analysis on the MADAR parallel corpus. In Sawaf, H., El-Beltagy, S., Zaghouani, W., Magdy, W., Abdelali, A., Tomeh, N., Abu Farha, I., Habash, N., Khalifa, S., Keleg, A., Haddad, H., Zitouni, I., Mrini, K., and Almatham, R., editors, Proceedings of ArabicNLP 2023, pages 370–384, Singapore (Hybrid). Association for Computational Linguistics

dc.subject

Arabic Dialects

dc.subject

Arabic Varieties

dc.subject

Interspeaker Variation

dc.subject

Intraspeaker Variation

dc.subject

Dialect Levels

dc.subject

Interannotator Agreement

dc.subject

Arabic Dialect Identification

dc.title

Improving the modelling of Arabic varieties in NLP

dc.type

Thesis

dc.type.qualificationlevel

Doctoral

dc.type.qualificationname

PhD Doctor of Philosophy

Files

Original bundle

Name: KelegA_2026.pdf
Size: 8.99 MB
Format: Adobe Portable Document Format

This item appears in the following Collection(s)

Informatics thesis and dissertation collection