Linguistic typology for neural machine translation

Oncevay, Arturo

dc.contributor.advisor

Birch-Mayne, Alexandra

dc.contributor.advisor

Haddow, Barry

dc.contributor.author

Oncevay, Arturo

dc.date.accessioned

2023-10-09T10:35:12Z

dc.date.available

2023-10-09T10:35:12Z

dc.date.issued

2023-10-09

dc.description.abstract

The vast structural diversity of languages worldwide, compounded by the problem of scarce resources, remains a challenge for machine translation research. To address this problem, we leverage knowledge from the field of linguistic typology, which describes the structural diversity and common properties of the world's languages. In this thesis, we investigate which variables or concepts of linguistic typology impact neural machine translation performance. First, we propose a combined language representation that encodes language variables from typology databases, specifically syntax variables, and fuses them with pre-trained multilingual language embeddings. We demonstrate the higher quality of the new language representations by assessing their performance on computational typology tasks such as typological feature prediction and phylogenetic inference. Next, we show that the combined language space can be leveraged to improve multilingual machine translation tasks and reduce negative transfer by creating clusters of multilingual models with significantly related languages, obtaining benefits across languages with different training sizes compared to robust baselines, and with the advantage of working efficiently when adding new languages to a multilingual setting. Furthermore, we investigate the impact of typological variables associated with morphology on machine translation. Morphology examines how words are formed, a process that varies across languages and is related to subword segmentation, a critical aspect of current machine translation systems. Specifically, we demonstrate that a higher degree of morphological fusion and synthesis usually corresponds to lower translation quality. We perform this analysis at the word level and obtain consistent results at the segment level for several language pairs. 
Finally, we study the extreme case of high synthesis (polysynthesis) and low-resource scenarios, which are typically present in endangered languages from the Americas. We build machine translation resources for Amerindian languages, find that unsupervised segmentation methods perform comparably to or better than morphologically supervised ones, and propose a less data-dependent segmentation strategy based on syllable units with promising results in our case study. Overall, our work sheds light on the impact of linguistic typology on machine translation, specifically on the relevance of syntax and morphological variables in low-resource and structurally diverse languages.

en

dc.identifier.uri

https://hdl.handle.net/1842/41033

dc.identifier.uri

http://dx.doi.org/10.7488/era/3772

dc.language.iso

en

dc.publisher

The University of Edinburgh

en

dc.relation.hasversion

Quantifying Synthesis and Fusion and their Impact on Machine Translation Oncevay, A., Ataman, D., van Berkel, N., Haddow, B., Birch-Mayne, A. & Bjerva, J., 1 Jul 2022, Proceedings of The 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Carpuat, M., de Marneffe, M-C. & Meza Ruiz, I. V. (eds.). Stroudsburg, PA, USA: Association for Computational Linguistics (ACL), p. 1308-1321 14 p.

en

dc.relation.hasversion

Bridging linguistic typology and multilingual machine translation with multi-view language representations Oncevay, A., Haddow, B. & Birch, A., 20 Nov 2020, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics (ACL), p. 2391–2406 16 p.

en

dc.relation.hasversion

Alva, C. and Oncevay, A. (2017). Spell-checking based on syllabification and character-level graphs for a Peruvian agglutinative language. In Proceedings of the First Workshop on Subword and Character Level Models in NLP, pages 109–116, Copenhagen, Denmark. Association for Computational Linguistics.

en

dc.relation.hasversion

Bawden, R., Birch, A., Dobreva, R., Oncevay, A., Miceli Barone, A. V., and Williams, P. (2020). The University of Edinburgh’s English-Tamil and English-Inuktitut submissions to the WMT20 news translation task. In Proceedings of the Fifth Conference on Machine Translation, pages 92–99, Online. Association for Computational Linguistics.

en

dc.relation.hasversion

Bustamante, G., Oncevay, A., and Zariquiey, R. (2020). No data to crawl? Monolingual corpus creation from PDF files of truly low-resource languages in Peru. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 2914–2923, Marseille, France. European Language Resources Association.

en

dc.relation.hasversion

Ebrahimi, A., Mager, M., Oncevay, A., Chaudhary, V., Chiruzzo, L., Fan, A., Ortega, J., Ramos, R., Rios, A., Meza Ruiz, I. V., Giménez-Lugo, G., Mager, E., Neubig, G., Palmer, A., Coto-Solano, R., Vu, T., and Kann, K. (2022). AmericasNLI: Evaluating zero-shot natural language understanding of pretrained multilingual models in truly low-resource languages. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6279–6299, Dublin, Ireland. Association for Computational Linguistics.

en

dc.relation.hasversion

Galarreta, A.-P., Melgar, A., and Oncevay, A. (2017). Corpus creation and initial SMT experiments between Spanish and Shipibo-konibo. In Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017, pages 238–244, Varna, Bulgaria. INCOMA Ltd.

en

dc.relation.hasversion

Gómez Montoya, H. E., Rojas, K. D. R., and Oncevay, A. (2019). A continuous improvement framework of machine translation for Shipibo-konibo. In Proceedings of the 2nd Workshop on Technologies for MT of Low Resource Languages, pages 17–23, Dublin, Ireland. European Association for Machine Translation.

en

dc.relation.hasversion

Kann, K., Ebrahimi, A., Mager, M., Oncevay, A., Ortega, J. E., Rios, A., Fan, A., Gutierrez-Vasques, X., Chiruzzo, L., Giménez-Lugo, G. A., Ramos, R., Meza Ruiz, I. V., Mager, E., Chaudhary, V., Neubig, G., Palmer, A., Coto-Solano, R., and Vu, N. T. (2022). AmericasNLI: Machine translation and natural language inference systems for indigenous languages of the Americas. Frontiers in Artificial Intelligence, 5.

en

dc.relation.hasversion

Mager, M., Oncevay, A., Ebrahimi, A., Ortega, J., Rios, A., Fan, A., Gutierrez-Vasques, X., Chiruzzo, L., Giménez-Lugo, G., Ramos, R., Meza Ruiz, I. V., Coto-Solano, R., Palmer, A., Mager-Hois, E., Chaudhary, V., Neubig, G., Vu, N. T., and Kann, K. (2021). Findings of the AmericasNLP 2021 shared task on open machine translation for indigenous languages of the Americas. In Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas, pages 202–217, Online. Association for Computational Linguistics.

en

dc.relation.hasversion

Mager, M., Oncevay, A., Mager, E., Kann, K., and Vu, T. (2022). BPE vs. morphological segmentation: A case study on machine translation of four polysynthetic languages. In Findings of the Association for Computational Linguistics: ACL 2022, pages 961–971, Dublin, Ireland. Association for Computational Linguistics.

en

dc.relation.hasversion

Oncevay, A. (2021). Peru is multilingual, its machine translation should be too? In Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas, pages 194–201, Online. Association for Computational Linguistics.

en

dc.relation.hasversion

Oncevay, A., Rivas Rojas, K. D., Chavez Sanchez, L. K., and Zariquiey, R. (2022b). Revisiting syllables in language modelling and their application on low-resource machine translation. In Proceedings of the 29th International Conference on Computational Linguistics, pages 4258–4267, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.

en

dc.relation.hasversion

Pereira-Noriega, J., Mercado-Gonzales, R., Melgar, A., Sobrevilla-Cabezudo, M., and Oncevay-Marcos, A. (2017). Ship-lemmatagger: Building an NLP toolkit for a Peruvian native language. In Ekštein, K. and Matoušek, V., editors, Text, Speech, and Dialogue, pages 473–481, Cham. Springer International Publishing.

en

dc.relation.hasversion

Vasquez, A., Ego Aguirre, R., Angulo, C., Miller, J., Villanueva, C., Agić, Ž., Zariquiey, R., and Oncevay, A. (2018). Toward Universal Dependencies for Shipibo-Konibo. In Proceedings of the Second Workshop on Universal Dependencies (UDW 2018), pages 151–161, Brussels, Belgium. Association for Computational Linguistics.

en

dc.relation.hasversion

Zariquiey, R., Oncevay, A., and Vera, J. (2022). CLD2 language documentation meets natural language processing for revitalising endangered languages. In Proceedings of the Fifth Workshop on the Use of Computational Methods in the Study of Endangered Languages, pages 20–30, Dublin, Ireland. Association for Computational Linguistics.

en

dc.subject

Machine Translation

en

dc.subject

Natural Language Processing

en

dc.subject

Linguistic Typology

en

dc.title

Linguistic typology for neural machine translation

en

dc.type

Thesis or Dissertation

en

dc.type.qualificationlevel

Doctoral

en

dc.type.qualificationname

PhD Doctor of Philosophy

en

Files

Original bundle

Name: OncevayA_2023.pdf
Size: 2.14 MB
Format: Adobe Portable Document Format

This item appears in the following Collection(s)

Informatics thesis and dissertation collection