Improving the modelling of Arabic varieties in NLP
dc.contributor.advisor
Magdy, Walid
dc.contributor.advisor
Goldwater, Sharon
dc.contributor.author
Keleg, Amr
dc.contributor.sponsor
UKRI CDT in Natural Language Processing
dc.date.accessioned
2026-03-19T16:02:20Z
dc.date.issued
2026-03-19
dc.description.abstract
Natural Language Processing (NLP) systems generally focus on supporting standardized varieties of languages. Developing systems for a non-standardized variety (dialect) requires finding/selecting samples in this variety, to create customized mixtures of pretraining data or to develop dialect-specific benchmarks. To this end, Dialect Identification (DI) is typically employed. This thesis explores some limitations of the long-standing approach that frames DI as a single-label classification task, where each sentence is linked to a single dialect. I specifically focus on Arabic, a language with a rich diversity of regional dialects. Arabic also exists in a diglossic state where two varieties co-exist within the same speaking community—Modern Standard Arabic (MSA) and local varieties of Dialectal Arabic (DA).
The thesis’s main contributions are twofold. First, the different levels between pure MSA sentences and highly colloquial sentences are operationalized as a continuous variable in the range [0, 1], termed Arabic Level of Dialectness (ALDi). ALDi estimation is modeled as a regression task, with a fine-tuned BERT-based model achieving an RMSE of 0.18. Second, Arabic Dialect Identification (ADI) is reframed as a multi-label task, where the validity of a sentence in each regional variety is assessed independently. This reframing is motivated by the finding that in 66% of the cases where a single-label ADI model made errors, both its predictions and the gold-standard labels were valid.
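As a rough illustration of how an RMSE figure such as the one above is computed for a score bounded in [0, 1], the following minimal sketch may help. It is not the thesis's evaluation code: the function names and the toy gold and predicted values are assumptions made purely for illustration, and clipping predictions to the valid range is an assumed post-processing step.

```python
import math


def clip_aldi(score: float) -> float:
    """Clip a raw regression output to the [0, 1] range used for ALDi."""
    return min(1.0, max(0.0, score))


def rmse(gold: list[float], predicted: list[float]) -> float:
    """Root mean squared error between gold and predicted scores."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and equal in length")
    squared_errors = [(g - p) ** 2 for g, p in zip(gold, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy values (hypothetical, not taken from the thesis).
gold_scores = [0.0, 0.5, 1.0]
raw_outputs = [-0.1, 0.5, 0.8]  # a regressor may overshoot the valid range
predictions = [clip_aldi(s) for s in raw_outputs]
error = rmse(gold_scores, predictions)
```

Because the score is bounded, an RMSE can be read directly as an average deviation on the 0-to-1 dialectness scale.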
Accordingly, each sentence has a set of regions in which it is valid, and an ALDi score to indicate how it diverges from MSA. By definition, MSA sentences are expected to be labeled as valid in all the considered regions, with almost-zero ALDi scores. Following the newly proposed framing, I created the first multi-label ADI dataset of 1,120 sentences (tweets), labeled by 33 annotators from 11 Arab countries, with ALDi ratings.
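The labeling scheme described above (a set of regions in which a sentence is valid, plus an ALDi score) can be sketched as a small data structure. This is a hypothetical representation, not the dataset's actual schema: the region names are placeholders, the class and function names are invented for illustration, and the `max_aldi` threshold of 0.1 is an assumed stand-in for "almost-zero".

```python
from dataclasses import dataclass

# Placeholder region inventory; the real dataset covers 11 Arab countries.
REGIONS = frozenset({"region_a", "region_b", "region_c"})


@dataclass(frozen=True)
class LabeledSentence:
    text: str
    valid_regions: frozenset  # regions in which the sentence is valid
    aldi: float               # level of dialectness in [0, 1]

    def __post_init__(self):
        if not 0.0 <= self.aldi <= 1.0:
            raise ValueError("ALDi must lie in [0, 1]")
        if not self.valid_regions <= REGIONS:
            raise ValueError("unknown region label")


def looks_like_msa(sentence: LabeledSentence, max_aldi: float = 0.1) -> bool:
    """MSA-like: valid in every region and with a near-zero ALDi score."""
    return sentence.valid_regions == REGIONS and sentence.aldi <= max_aldi


def prediction_is_valid(sentence: LabeledSentence, predicted_region: str) -> bool:
    """A single predicted region counts as valid if it is in the label set."""
    return predicted_region in sentence.valid_regions


msa_like = LabeledSentence("<MSA sentence>", REGIONS, 0.02)
dialectal = LabeledSentence(
    "<dialectal sentence>", frozenset({"region_a", "region_b"}), 0.8
)
```

Under this representation, a single-label prediction of `region_b` for `dialectal` is counted as valid even if a single-label gold annotation had recorded `region_a`, which is the situation the multi-label framing is meant to capture.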
This dataset allowed for investigating some widely adopted assumptions about Arabic. For instance, I show that Arabic dialects overlap considerably at both the country and regional levels. Additionally, the conscious Dialect Level choice that Arabic speakers make—operationalized as ALDi—is a better predictor of the number of dialects in which a sentence is valid than its length. Lastly, signs of systematic differences in the ALDi ratings provided by speakers of different dialects for the same sentences show the need for further investigations of ALDi’s annotation.
ALDi is a valuable variable for many applications. For instance, I used it to identify stylistic differences in Arab presidents’ speeches, which was previously only possible through qualitative analysis. For tasks requiring data annotation, I found that high-ALDi samples should be prioritized for routing to speakers of the samples’ dialects.
dc.identifier.uri
https://era.ed.ac.uk/handle/1842/44503
dc.identifier.uri
https://doi.org/10.7488/era/7020
dc.language.iso
en
dc.publisher
The University of Edinburgh
dc.relation.hasversion
Abdul-Mageed, M., Elmadany, A., Zhang, C., Nagoudi, E. M. B., Bouamor, H., and Habash, N. (2023). NADI 2023: The fourth nuanced Arabic dialect identification shared task. In Sawaf, H., El-Beltagy, S., Zaghouani, W., Magdy, W., Abdelali, A., Tomeh, N., Abu Farha, I., Habash, N., Khalifa, S., Keleg, A., Haddad, H., Zitouni, I., Mrini, K., and Almatham, R., editors, Proceedings of ArabicNLP 2023, pages 600–613, Singapore (Hybrid). Association for Computational Linguistics
dc.relation.hasversion
Abdul-Mageed, M., Keleg, A., Elmadany, A., Zhang, C., Hamed, I., Magdy, W., Bouamor, H., and Habash, N. (2024). NADI 2024: The fifth nuanced Arabic dialect identification shared task. In Habash, N., Bouamor, H., Eskander, R., Tomeh, N., Abu Farha, I., Abdelali, A., Touileb, S., Hamed, I., Onaizan, Y., Alhafni, B., Antoun, W., Khalifa, S., Haddad, H., Zitouni, I., AlKhamissi, B., Almatham, R., and Mrini, K., editors, Proceedings of the Second Arabic Natural Language Processing Conference, pages 709–728, Bangkok, Thailand. Association for Computational Linguistics
dc.relation.hasversion
Keleg, A., Goldwater, S., and Magdy, W. (2023). ALDi: Quantifying the Arabic level of dialectness of text. In Bouamor, H., Pino, J., and Bali, K., editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 10597–10611, Singapore. Association for Computational Linguistics
dc.relation.hasversion
Keleg, A., Goldwater, S., and Magdy, W. (2025). Revisiting common assumptions about Arabic dialects in NLP. In Che, W., Nabende, J., Shutova, E., and Pilehvar, M. T., editors, Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3309–3327, Vienna, Austria. Association for Computational Linguistics
dc.relation.hasversion
Keleg, A. and Magdy, W. (2023). Arabic dialect identification under scrutiny: Limitations of single-label classification. In Sawaf, H., El-Beltagy, S., Zaghouani, W., Magdy, W., Abdelali, A., Tomeh, N., Abu Farha, I., Habash, N., Khalifa, S., Keleg, A., Haddad, H., Zitouni, I., Mrini, K., and Almatham, R., editors, Proceedings of ArabicNLP 2023, pages 385–398, Singapore (Hybrid). Association for Computational Linguistics
dc.relation.hasversion
Keleg, A., Magdy, W., and Goldwater, S. (2024). Estimating the level of dialectness predicts inter-annotator agreement in multi-dialect Arabic datasets. In Ku, L.-W., Martins, A., and Srikumar, V., editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 766–777, Bangkok, Thailand. Association for Computational Linguistics
dc.relation.hasversion
Olsen, H., Touileb, S., and Velldal, E. (2023). Arabic dialect identification: An in-depth error analysis on the MADAR parallel corpus. In Sawaf, H., El-Beltagy, S., Zaghouani, W., Magdy, W., Abdelali, A., Tomeh, N., Abu Farha, I., Habash, N., Khalifa, S., Keleg, A., Haddad, H., Zitouni, I., Mrini, K., and Almatham, R., editors, Proceedings of ArabicNLP 2023, pages 370–384, Singapore (Hybrid). Association for Computational Linguistics
dc.subject
Arabic Dialects
dc.subject
Arabic Varieties
dc.subject
Interspeaker Variation
dc.subject
Intraspeaker Variation
dc.subject
Dialect Levels
dc.subject
Interannotator Agreement
dc.subject
Arabic Dialect Identification
dc.title
Improving the modelling of Arabic varieties in NLP
dc.type
Thesis
dc.type.qualificationlevel
Doctoral
dc.type.qualificationname
PhD Doctor of Philosophy
Files
Original bundle
Name: KelegA_2026.pdf
Size: 8.99 MB
Format: Adobe Portable Document Format