Data augmentation for language generation inspired by machine translation

Chen, Pinzhen

dc.contributor.advisor

Haddow, Barry

dc.contributor.advisor

Heafield, Kenneth

dc.contributor.author

Chen, Pinzhen

dc.contributor.sponsor

other

en

dc.date.accessioned

2024-06-12T13:22:30Z

dc.date.available

2024-06-12T13:22:30Z

dc.date.issued

2024-06-12

dc.description.abstract

The field of natural language processing has witnessed a surge in the adoption of deep learning, which faces notable hurdles when the training data is scarce. This thesis aims to study automatic data augmentation for language generation tasks where acquiring human-annotated data is costly. Drawing from insights in machine translation research, we transfer techniques from this field to a wider array of sequence generation problems. The thesis's initial segment delves into parallel data retrieval for neural machine translation. We devise a method that scores cross-lingual sentences using a translation model itself and approximates pairwise comparisons with trie-constrained decoding. The process does not require document alignment and can identify a reasonable number of parallel sentences. Then, arguing for parallelism between contextualized words and their definitions, we propose to train a unified word-definition modelling system using data augmentation inspired by multilingual translation embeddings. Our system attains superior results for reverse dictionary and definition generation tasks on conventional research datasets and in an international shared task. Finally, we extend generation-based and self-data augmentation, including back-translation, monolingual copying, multilingualism, and numeric augmentation, to programming language generation tasks. In addition, we attempt to encode numbers as numeric values instead of strings. Significant improvement is observed in code-to-code translation and code-to-text summarization despite starting from powerful language models pre-trained on code and text data.

en

dc.identifier.uri

https://hdl.handle.net/1842/41873

dc.identifier.uri

http://dx.doi.org/10.7488/era/4596

dc.language.iso

en

dc.publisher

The University of Edinburgh

en

dc.relation.hasversion

Marta Bañón, Pinzhen Chen, Barry Haddow, Kenneth Heafield, Hieu Hoang, Miquel Esplà-Gomis, Mikel L. Forcada, Amir Kamran, Faheem Kirefu, Philipp Koehn, Sergio Ortiz Rojas, Leopoldo Pla Sempere, Gema Ramírez-Sánchez, Elsa Sarrías, Marek Strelec, Brian Thompson, William Waites, Dion Wiggins, and Jaume Zaragoza. 2020. ParaCrawl: Web-scale acquisition of parallel corpora. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4555–4567, Online. Association for Computational Linguistics.

en

dc.relation.hasversion

Pinzhen Chen, Nikolay Bogoychev, and Ulrich Germann. 2020a. Character mapping and ad-hoc adaptation: Edinburgh’s IWSLT 2020 open domain translation system. In Proceedings of the 17th International Conference on Spoken Language Translation, pages 122–129, Online. Association for Computational Linguistics.

en

dc.relation.hasversion

Pinzhen Chen, Nikolay Bogoychev, Kenneth Heafield, and Faheem Kirefu. 2020. Parallel sentence mining by constrained decoding. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1672–1678, Online. Association for Computational Linguistics.

en

dc.relation.hasversion

Pinzhen Chen and Kenneth Heafield. 2022. Approaching neural Chinese word segmentation as a low-resource machine translation task. In Proceedings of the 36th Pacific Asia Conference on Language, Information and Computation, pages 600–606, Manila, Philippines. Association for Computational Linguistics.

en

dc.relation.hasversion

Pinzhen Chen, Shaoxiong Ji, Nikolay Bogoychev, Andrey Kutuzov, Barry Haddow, and Kenneth Heafield. 2024. Monolingual or multilingual instruction tuning: Which makes a better Alpaca. In Findings of the Association for Computational Linguistics: EACL 2024, pages 1347–1356, St. Julian’s, Malta. Association for Computational Linguistics.

en

dc.relation.hasversion

Pinzhen Chen and Gerasimos Lampouras. 2023. Exploring data augmentation for code generation tasks. In Findings of the Association for Computational Linguistics: EACL 2023, pages 1542–1550, Dubrovnik, Croatia. Association for Computational Linguistics.

en

dc.relation.hasversion

Pinzhen Chen and Zheng Zhao. 2022a. Edinburgh at SemEval-2022 task 1: Jointly fishing for word embeddings and definitions. In Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022), pages 75–81, Seattle, United States. Association for Computational Linguistics.

en

dc.relation.hasversion

Pinzhen Chen and Zheng Zhao. 2022b. A unified model for reverse dictionary and definition modelling. In Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 8–13, Online only. Association for Computational Linguistics.

en

dc.relation.hasversion

Ashok Urlana, Pinzhen Chen, Zheng Zhao, Shay Cohen, Manish Shrivastava, and Barry Haddow. 2023. PMIndiaSum: Multilingual and cross-lingual headline summarization for languages in India. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 11606–11628, Singapore. Association for Computational Linguistics.

en

dc.subject

data augmentation

en

dc.subject

machine translation

en

dc.subject

natural language processing

en

dc.subject

language generation

en

dc.title

Data augmentation for language generation inspired by machine translation

en

dc.type

Thesis or Dissertation

en

dc.type.qualificationlevel

Doctoral

en

dc.type.qualificationname

PhD Doctor of Philosophy

en

Files

Original bundle

Name: Chen2024.pdf
Size: 1.59 MB
Format: Adobe Portable Document Format

This item appears in the following Collection(s)

Informatics thesis and dissertation collection