Edinburgh Research Archive

Enriching sentence-level machine translation

Authors

Pal, Proyag

Abstract

Neural machine translation (MT) has long been established as a successful paradigm for producing high-quality translations across many languages and domains. However, it suffers from one significant limitation: it is typically formulated as the task of translating isolated sentences in a source language into sentences in a target language. This renders standard MT models unable to capture any information outside the sentence, such as document context, speaker information, the domain of the text, or external constraints. This thesis studies this limitation, analyses the shortcomings of sentence-level MT, and presents approaches to enrich MT models so as to overcome it.

The first part of the thesis introduces a method, called “cheat codes”, to quantify the amount of information missing from source sentences that is needed to translate them perfectly. It allows us to establish an upper bound on how much additional information a model must be given to exactly reproduce reference translations. We find that a surprisingly small amount of leaked information about the target, in addition to the source, is enough to achieve this. We also use the method to study which parts of translation these models find difficult to learn correctly, even in the presence of extra information, and thereby signpost some hard problems in neural MT for future research to focus on.

The second part of the thesis presents two examples of how MT can be augmented with extra information to improve translation quality or the overall user experience in specific applications. The first example uses document context, which human translators always draw on when translating text but which is rarely present in parallel corpora. We extract and publish a large-scale dataset of parallel sentences with corresponding contexts from existing publicly available resources, and show that this data improves translation performance, both in overall quality and on specific document-level phenomena. The second example provides timing constraints to an isochronous MT model for use in automatic dubbing. By incorporating duration information and keeping track of it while translating, the model can produce translations that better match the source audio, which in turn yields a better user experience when viewing the automatically dubbed content.

On the whole, we find that even though relatively little information is missing from sentence-level MT, enriching models with these small pieces of information can have a significant positive impact on the quality and usefulness of MT systems in a wide variety of situations. We provide detailed analyses, datasets, and methods to build better MT systems and to encourage future research in this direction.
