Edinburgh Research Archive

Linguistic typology for neural machine translation

dc.contributor.advisor
Birch-Mayne, Alexandra
dc.contributor.advisor
Haddow, Barry
dc.contributor.author
Oncevay, Arturo
dc.date.accessioned
2023-10-09T10:35:12Z
dc.date.available
2023-10-09T10:35:12Z
dc.date.issued
2023-10-09
dc.description.abstract
The vast structural diversity of languages worldwide, compounded by the problem of scarce resources, remains a challenge for machine translation research. To address this problem, we leverage knowledge from the field of linguistic typology, which describes the structural diversity and common properties of the world's languages. In this thesis, we investigate which variables or concepts of linguistic typology impact neural machine translation performance. First, we propose a combined language representation that encodes language variables from typology databases, specifically syntax variables, and fuses them with pre-trained multilingual language embeddings. We demonstrate the higher quality of the new language representations by assessing their performance on computational typology tasks such as typological feature prediction and phylogenetic inference. Next, we show that the combined language space can be leveraged to improve multilingual machine translation tasks and reduce negative transfer by creating clusters of multilingual models with significantly related languages. This approach yields benefits across languages with different training sizes compared to robust baselines, and has the advantage of working efficiently when new languages are added to a multilingual setting. Furthermore, we investigate the impact of typological variables associated with morphology on machine translation. Morphology examines how words are formed, a process that varies across languages and is related to subword segmentation, a critical aspect of current machine translation systems. Specifically, we demonstrate that a higher degree of morphological fusion and synthesis usually corresponds to lower translation quality. We perform this analysis at the word level and obtain consistent results at the segment level for several language pairs.
Finally, we study the extreme case of high synthesis (polysynthesis) and low-resource scenarios, which are typically present in endangered languages from the Americas. We build machine translation resources for Amerindian languages, find that unsupervised segmentation methods perform comparably to or better than morphologically supervised ones, and propose a less data-dependent segmentation strategy based on syllable units with promising results in our case study. Overall, our work sheds light on the impact of linguistic typology on machine translation, specifically on the relevance of syntax and morphological variables in low-resource and structurally diverse languages.
en
dc.identifier.uri
https://hdl.handle.net/1842/41033
dc.identifier.uri
http://dx.doi.org/10.7488/era/3772
dc.language.iso
en
en
dc.publisher
The University of Edinburgh
en
dc.relation.hasversion
Quantifying Synthesis and Fusion and their Impact on Machine Translation Oncevay, A., Ataman, D., van Berkel, N., Haddow, B., Birch-Mayne, A. & Bjerva, J., 1 Jul 2022, Proceedings of The 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Carpuat, M., de Marneffe, M-C. & Meza Ruiz, I. V. (eds.). Stroudsburg, PA, USA: Association for Computational Linguistics (ACL), p. 1308-1321 14 p.
en
dc.relation.hasversion
Bridging linguistic typology and multilingual machine translation with multi-view language representations Oncevay, A., Haddow, B. & Birch, A., 20 Nov 2020, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics (ACL), p. 2391–2406 16 p.
en
dc.relation.hasversion
Alva, C. and Oncevay, A. (2017). Spell-checking based on syllabification and character-level graphs for a Peruvian agglutinative language. In Proceedings of the First Workshop on Subword and Character Level Models in NLP, pages 109–116, Copenhagen, Denmark. Association for Computational Linguistics.
en
dc.relation.hasversion
Bawden, R., Birch, A., Dobreva, R., Oncevay, A., Miceli Barone, A. V., and Williams, P. (2020). The University of Edinburgh’s English-Tamil and English-Inuktitut submissions to the WMT20 news translation task. In Proceedings of the Fifth Conference on Machine Translation, pages 92–99, Online. Association for Computational Linguistics.
en
dc.relation.hasversion
Bustamante, G., Oncevay, A., and Zariquiey, R. (2020). No data to crawl? Monolingual corpus creation from PDF files of truly low-resource languages in Peru. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 2914–2923, Marseille, France. European Language Resources Association.
en
dc.relation.hasversion
Ebrahimi, A., Mager, M., Oncevay, A., Chaudhary, V., Chiruzzo, L., Fan, A., Ortega, J., Ramos, R., Rios, A., Meza Ruiz, I. V., Giménez-Lugo, G., Mager, E., Neubig, G., Palmer, A., Coto-Solano, R., Vu, T., and Kann, K. (2022). AmericasNLI: Evaluating zero-shot natural language understanding of pretrained multilingual models in truly low-resource languages. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6279–6299, Dublin, Ireland. Association for Computational Linguistics.
en
dc.relation.hasversion
Galarreta, A.-P., Melgar, A., and Oncevay, A. (2017). Corpus creation and initial SMT experiments between Spanish and Shipibo-konibo. In Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017, pages 238–244, Varna, Bulgaria. INCOMA Ltd.
en
dc.relation.hasversion
Gómez Montoya, H. E., Rojas, K. D. R., and Oncevay, A. (2019). A continuous improvement framework of machine translation for Shipibo-konibo. In Proceedings of the 2nd Workshop on Technologies for MT of Low Resource Languages, pages 17–23, Dublin, Ireland. European Association for Machine Translation.
en
dc.relation.hasversion
Kann, K., Ebrahimi, A., Mager, M., Oncevay, A., Ortega, J. E., Rios, A., Fan, A., Gutierrez-Vasques, X., Chiruzzo, L., Giménez-Lugo, G. A., Ramos, R., Meza Ruiz, I. V., Mager, E., Chaudhary, V., Neubig, G., Palmer, A., Coto-Solano, R., and Vu, N. T. (2022). AmericasNLI: Machine translation and natural language inference systems for indigenous languages of the Americas. Frontiers in Artificial Intelligence, 5.
en
dc.relation.hasversion
Mager, M., Oncevay, A., Ebrahimi, A., Ortega, J., Rios, A., Fan, A., Gutierrez-Vasques, X., Chiruzzo, L., Giménez-Lugo, G., Ramos, R., Meza Ruiz, I. V., Coto-Solano, R., Palmer, A., Mager-Hois, E., Chaudhary, V., Neubig, G., Vu, N. T., and Kann, K. (2021). Findings of the AmericasNLP 2021 shared task on open machine translation for indigenous languages of the Americas. In Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas, pages 202–217, Online. Association for Computational Linguistics.
en
dc.relation.hasversion
Mager, M., Oncevay, A., Mager, E., Kann, K., and Vu, T. (2022). BPE vs. morphological segmentation: A case study on machine translation of four polysynthetic languages. In Findings of the Association for Computational Linguistics: ACL 2022, pages 961–971, Dublin, Ireland. Association for Computational Linguistics.
en
dc.relation.hasversion
Oncevay, A. (2021). Peru is multilingual, its machine translation should be too? In Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas, pages 194–201, Online. Association for Computational Linguistics.
en
dc.relation.hasversion
Oncevay, A., Rivas Rojas, K. D., Chavez Sanchez, L. K., and Zariquiey, R. (2022b). Revisiting syllables in language modelling and their application on low-resource machine translation. In Proceedings of the 29th International Conference on Computational Linguistics, pages 4258–4267, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.
en
dc.relation.hasversion
Pereira-Noriega, J., Mercado-Gonzales, R., Melgar, A., Sobrevilla-Cabezudo, M., and Oncevay-Marcos, A. (2017). Ship-lemmatagger: Building an NLP toolkit for a Peruvian native language. In Ekštein, K. and Matoušek, V., editors, Text, Speech, and Dialogue, pages 473–481, Cham. Springer International Publishing.
en
dc.relation.hasversion
Vasquez, A., Ego Aguirre, R., Angulo, C., Miller, J., Villanueva, C., Agić, Ž., Zariquiey, R., and Oncevay, A. (2018). Toward Universal Dependencies for Shipibo-Konibo. In Proceedings of the Second Workshop on Universal Dependencies (UDW 2018), pages 151–161, Brussels, Belgium. Association for Computational Linguistics.
en
dc.relation.hasversion
Zariquiey, R., Oncevay, A., and Vera, J. (2022). CLD² language documentation meets natural language processing for revitalising endangered languages. In Proceedings of the Fifth Workshop on the Use of Computational Methods in the Study of Endangered Languages, pages 20–30, Dublin, Ireland. Association for Computational Linguistics.
en
dc.subject
Machine Translation
en
dc.subject
Natural Language Processing
en
dc.subject
Linguistic Typology
en
dc.title
Linguistic typology for neural machine translation
en
dc.type
Thesis or Dissertation
en
dc.type.qualificationlevel
Doctoral
en
dc.type.qualificationname
PhD Doctor of Philosophy
en

Files

Original bundle

Name:
OncevayA_2023.pdf
Size:
2.14 MB
Format:
Adobe Portable Document Format