Edinburgh Research Archive

Data augmentation for language generation inspired by machine translation

Authors

Chen, Pinzhen

Abstract

Natural language processing has seen a surge in the adoption of deep learning, which faces notable hurdles when training data is scarce. This thesis studies automatic data augmentation for language generation tasks where acquiring human-annotated data is costly. Drawing on insights from machine translation research, we transfer techniques from that field to a wider array of sequence generation problems. The first part of the thesis examines parallel data retrieval for neural machine translation. We devise a method that scores cross-lingual sentence pairs with a translation model itself and approximates pairwise comparisons through trie-constrained decoding. The process requires no document alignment and identifies a reasonable number of parallel sentences. Next, arguing that contextualized words and their definitions form a parallel relationship, we propose training a unified word-definition modelling system with data augmentation inspired by multilingual translation embeddings. Our system attains superior results on reverse dictionary and definition generation tasks, both on conventional research datasets and in an international shared task. Finally, we extend generation-based and self-augmentation methods, including back-translation, monolingual copying, multilingual training, and numeric augmentation, to programming language generation tasks. In addition, we attempt to encode numbers as numeric values instead of strings. Significant improvements are observed in code-to-code translation and code-to-text summarization, even when starting from powerful language models pre-trained on code and text.
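The trie-constrained decoding mentioned above can be pictured with a short sketch. The Python below is a minimal, hypothetical illustration rather than the thesis's implementation: candidate target sentences are packed into a trie so that an autoregressive translation model scores each shared prefix only once, and next_token_logprobs is a dummy stand-in for a real model call.

import math
from collections import defaultdict

def build_trie(candidates):
    # Nested-dict trie over tokenized candidate sentences; "</s>" marks ends.
    trie = {}
    for tokens in candidates:
        node = trie
        for tok in list(tokens) + ["</s>"]:
            node = node.setdefault(tok, {})
    return trie

def next_token_logprobs(source, prefix):
    # Hypothetical stand-in for a translation model's next-token
    # distribution; a real system would query the model here.
    return defaultdict(lambda: math.log(1e-4))

def score_candidates(source, trie, prefix=(), logp=0.0, scores=None):
    # Depth-first walk of the trie: every shared prefix is scored once,
    # so comparing many candidates is cheaper than decoding each one
    # independently against the source sentence.
    if scores is None:
        scores = {}
    if not trie:  # leaf reached: a complete candidate sentence
        scores[" ".join(prefix[:-1])] = logp  # drop the "</s>" marker
        return scores
    step = next_token_logprobs(source, prefix)
    for tok, child in trie.items():
        score_candidates(source, child, prefix + (tok,), logp + step[tok], scores)
    return scores

candidates = [["das", "ist", "gut"], ["das", "ist", "schlecht"]]
print(score_candidates("this is good", build_trie(candidates)))

In the retrieval setting the abstract describes, the candidates would be target-language sentences gathered without document alignment, and the resulting model scores would rank them as potential translations of the source sentence.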