Enriching sentence-level machine translation
Authors
Pal, Proyag
Abstract
Neural Machine Translation (MT) has long been established as a successful paradigm for producing high-quality translations across many languages and domains. However, it suffers from one significant limitation: it is usually formulated as the task of translating isolated sentences in a source language into sentences in the target language. This renders standard MT models unable to capture any information that is not contained in the sentence itself, such as document context, speaker information, the domain of the text, and external constraints. This thesis studies this limitation, analyses the shortcomings of sentence-level MT, and presents approaches to enrich MT models to overcome it.
The first part of this thesis introduces a method, called “cheat codes”, to quantify the amount of information missing from source sentences that is needed to translate them perfectly. It allows us to establish an upper bound on the amount of additional information the model must be provided to exactly reproduce reference translations. We find that a surprisingly small amount of leaked information about the target, in addition to the source, is enough to achieve this. We also use this method to study which parts of translation are difficult for these models to learn correctly even in the presence of extra information. This analysis allows us to signpost some hard problems in neural MT for further research to focus on.
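To picture what leaking target-side information alongside the source can look like, here is a minimal, purely illustrative sketch. It is not the thesis’s actual “cheat codes” method; the `<hint>` tag and the `augment_with_hint` helper are invented for this example, which simply splices a few reference tokens into the model input.

```python
# Purely illustrative sketch, NOT the thesis's "cheat codes" method:
# augment a source sentence with a few leaked tokens from the reference
# translation, so a model could condition on extra target-side information.

def augment_with_hint(source: str, reference: str, n_hint_tokens: int = 2) -> str:
    """Prepend the first n whitespace tokens of the reference as a tagged hint."""
    hint = " ".join(reference.split()[:n_hint_tokens])
    return f"<hint> {hint} </hint> {source}"

print(augment_with_hint("Das ist ein Haus .", "This is a house ."))
# → <hint> This is </hint> Das ist ein Haus .
```

Varying how much target information is leaked in this fashion is one way to probe how far extra input can close the gap to the reference.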
The second part of the thesis presents two examples of how MT can be augmented with extra information to improve translation quality or the overall user experience in specific applications. The first example uses document context, which human translators always draw on when translating text but which is rarely present in parallel corpora. We
extract and publish a large-scale dataset of parallel sentences with corresponding contexts
from existing publicly available resources, and show that this data helps improve
translation performance in terms of overall quality as well as specific document-level
phenomena. The second example is providing timing constraints to an isochronous
MT model for use in automatic dubbing. By incorporating duration information and
keeping track of it while translating, the model can produce translations that better
match the source audio, which eventually results in a better user experience when
viewing the automatically dubbed content.
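The timing constraint in dubbing can be sketched as a simple feasibility check. The function, the fixed speaking rate, and the tolerance below are assumptions for illustration only; the thesis’s isochronous model instead tracks duration while translating rather than filtering candidates afterwards.

```python
# Hypothetical illustration of an isochrony check, assuming a fixed
# per-character speaking rate. The thesis's model tracks duration during
# translation itself; this only shows what "fitting the source audio" means.

def fits_duration(candidate: str, source_duration_s: float,
                  chars_per_second: float = 15.0,
                  tolerance: float = 0.2) -> bool:
    """Estimate the spoken length of `candidate` and compare it to the
    duration of the source audio, within a relative tolerance."""
    estimated_s = len(candidate) / chars_per_second
    return abs(estimated_s - source_duration_s) <= tolerance * source_duration_s

# A 30-character line at 15 chars/s takes ~2 s, so it fits a 2-second clip;
# a 60-character line would not.
```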
On the whole, we find that even though a relatively small amount of information is
missing from sentence-level MT, enriching the models with these small pieces of
information can have a significant positive impact on the quality and usefulness of
MT systems in a wide variety of situations. We provide detailed analyses, datasets, and
methods to build better MT systems and encourage future research in this direction.