Edinburgh Research Archive

Data augmentation for language generation inspired by machine translation

dc.contributor.advisor
Haddow, Barry
dc.contributor.advisor
Heafield, Kenneth
dc.contributor.author
Chen, Pinzhen
dc.contributor.sponsor
other
en
dc.date.accessioned
2024-06-12T13:22:30Z
dc.date.available
2024-06-12T13:22:30Z
dc.date.issued
2024-06-12
dc.description.abstract
The field of natural language processing has witnessed a surge in the adoption of deep learning, which faces notable hurdles when training data is scarce. This thesis studies automatic data augmentation for language generation tasks where acquiring human-annotated data is costly. Drawing on insights from machine translation research, we transfer techniques from that field to a wider array of sequence generation problems. The first part of the thesis addresses parallel data retrieval for neural machine translation. We devise a method that scores cross-lingual sentence pairs using the translation model itself and approximates pairwise comparisons with trie-constrained decoding. The process does not require document alignment and can identify a reasonable number of parallel sentences. Then, arguing for parallelism between contextualized words and their definitions, we propose to train a unified word-definition modelling system using data augmentation inspired by multilingual translation embeddings. Our system attains superior results for reverse dictionary and definition generation tasks on conventional research datasets and in an international shared task. Finally, we extend generation-based and self-data augmentation techniques, including back-translation, monolingual copying, multilingualism, and numeric augmentation, to programming language generation tasks. In addition, we attempt to encode numbers as numeric values instead of strings. Significant improvement is observed in code-to-code translation and code-to-text summarization, despite starting from powerful language models pre-trained on code and text data.
en
dc.identifier.uri
https://hdl.handle.net/1842/41873
dc.identifier.uri
http://dx.doi.org/10.7488/era/4596
dc.language.iso
en
en
dc.publisher
The University of Edinburgh
en
dc.relation.hasversion
Marta Bañón, Pinzhen Chen, Barry Haddow, Kenneth Heafield, Hieu Hoang, Miquel Esplà-Gomis, Mikel L. Forcada, Amir Kamran, Faheem Kirefu, Philipp Koehn, Sergio Ortiz Rojas, Leopoldo Pla Sempere, Gema Ramírez-Sánchez, Elsa Sarrías, Marek Strelec, Brian Thompson, William Waites, Dion Wiggins, and Jaume Zaragoza. 2020. ParaCrawl: Web-scale acquisition of parallel corpora. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4555–4567, Online. Association for Computational Linguistics.
en
dc.relation.hasversion
Pinzhen Chen, Nikolay Bogoychev, and Ulrich Germann. 2020a. Character mapping and ad-hoc adaptation: Edinburgh’s IWSLT 2020 open domain translation system. In Proceedings of the 17th International Conference on Spoken Language Translation, pages 122–129, Online. Association for Computational Linguistics.
en
dc.relation.hasversion
Pinzhen Chen, Nikolay Bogoychev, Kenneth Heafield, and Faheem Kirefu. 2020. Parallel sentence mining by constrained decoding. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1672–1678, Online. Association for Computational Linguistics.
en
dc.relation.hasversion
Pinzhen Chen and Kenneth Heafield. 2022. Approaching neural Chinese word segmentation as a low-resource machine translation task. In Proceedings of the 36th Pacific Asia Conference on Language, Information and Computation, pages 600–606, Manila, Philippines. Association for Computational Linguistics.
en
dc.relation.hasversion
Pinzhen Chen, Shaoxiong Ji, Nikolay Bogoychev, Andrey Kutuzov, Barry Haddow, and Kenneth Heafield. 2024. Monolingual or multilingual instruction tuning: Which makes a better Alpaca. In Findings of the Association for Computational Linguistics: EACL 2024, pages 1347–1356, St. Julian’s, Malta. Association for Computational Linguistics.
en
dc.relation.hasversion
Pinzhen Chen and Gerasimos Lampouras. 2023. Exploring data augmentation for code generation tasks. In Findings of the Association for Computational Linguistics: EACL 2023, pages 1542–1550, Dubrovnik, Croatia. Association for Computational Linguistics.
en
dc.relation.hasversion
Pinzhen Chen and Zheng Zhao. 2022a. Edinburgh at SemEval-2022 task 1: Jointly fishing for word embeddings and definitions. In Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022), pages 75–81, Seattle, United States. Association for Computational Linguistics.
en
dc.relation.hasversion
Pinzhen Chen and Zheng Zhao. 2022b. A unified model for reverse dictionary and definition modelling. In Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 8–13, Online only. Association for Computational Linguistics.
en
dc.relation.hasversion
Ashok Urlana, Pinzhen Chen, Zheng Zhao, Shay Cohen, Manish Shrivastava, and Barry Haddow. 2023. PMIndiaSum: Multilingual and cross-lingual headline summarization for languages in India. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 11606–11628, Singapore. Association for Computational Linguistics.
en
dc.subject
data augmentation
en
dc.subject
machine translation
en
dc.subject
natural language processing
en
dc.subject
language generation
en
dc.title
Data augmentation for language generation inspired by machine translation
en
dc.type
Thesis or Dissertation
en
dc.type.qualificationlevel
Doctoral
en
dc.type.qualificationname
PhD Doctor of Philosophy
en

Files

Original bundle

Name:
Chen2024.pdf
Size:
1.59 MB
Format:
Adobe Portable Document Format

This item appears in the following Collection(s)