Stream-based statistical machine translation
Date
24/11/2011
Author
Levenberg, Abby D.
Abstract
We investigate a new approach for SMT system training within the streaming model
of computation. We develop and test incrementally retrainable models which, given
an incoming stream of new data, can efficiently incorporate the stream data online. A
naive approach using a stream would use an unbounded amount of space. Instead, our
online SMT system can incorporate information from unbounded incoming streams
and maintain constant space and time. Crucially, we are able to match (or even exceed)
the translation performance of comparable systems which are batch retrained and
use unbounded space. Our approach is particularly suited to situations where there are
arbitrarily large amounts of new training material and we wish to incorporate it efficiently
and in small space.
The novel contributions of this thesis are:
1. An online, randomised language model that can model unbounded input streams
in constant space and time.
2. An incrementally retrainable translation model for both phrase-based and grammar-based
systems. The model presented is efficient enough to incorporate novel
parallel text at the single sentence level.
3. Strategies for updating our stream-based language model and translation model
which demonstrate how such components can be successfully used in a streaming
translation setting. These strategies operate both within a single streaming environment
and in the novel situation of having to translate multiple streams.
4. Demonstration that recent data from the stream is beneficial to translation performance.
Our stream-based SMT system is efficient for tackling massive volumes of new
training data and offers new ways of thinking about translating web data and dealing
with other natural language streams.
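
To make the idea of modelling an unbounded stream in bounded space concrete, the following is a minimal sketch only. The abstract does not say which data structure the thesis uses, so a count-min sketch (a well-known approximate counter) is swapped in here as a stand-in; the names CountMinSketch, ngrams, and consume, and the width and depth defaults, are all hypothetical. The sketch keeps a fixed number of counters however many n-grams arrive, at the cost of possibly overestimating counts.

```python
import hashlib


class CountMinSketch:
    """Approximate counter with a fixed number of cells.

    A stand-in for illustration, not the model described in the thesis.
    """

    def __init__(self, width=2048, depth=4):
        self.width = width
        self.depth = depth
        # depth x width counters, allocated once and never resized
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, key):
        # One independent hash per row, derived by salting with the row number
        digest = hashlib.blake2b(
            key.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, key, count=1):
        for row in range(self.depth):
            self.table[row][self._index(row, key)] += count

    def estimate(self, key):
        # Collisions can only inflate a cell, so the minimum is the best guess
        return min(
            self.table[row][self._index(row, key)] for row in range(self.depth)
        )


def ngrams(tokens, n):
    """Yield the n-grams of a token list as space-joined strings."""
    return (" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def consume(stream, sketch, n=2):
    """Fold a stream of sentences into the sketch one sentence at a time."""
    for sentence in stream:
        for gram in ngrams(sentence.split(), n):
            sketch.add(gram)


sketch = CountMinSketch()
consume(["the cat sat", "the cat ran"], sketch)
```

Each sentence is processed once and then discarded, so the number of counters stays the same as the stream grows. The sketch does not address how older data might be aged out in favour of recent data, which the abstract identifies as beneficial.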