Exploiting linguistically-enriched models of phrase-based statistical machine translation
Item status: Restricted Access
This thesis presents the design and implementation of linguistically informed models for phrase-based statistical machine translation (SMT). Using Koehn’s Pharaoh (2004), a state-of-the-art SMT system, and Moses (Hoang, 2006), a variant of Pharaoh that supports factored translation models, we investigated two approaches: Combined Feature Models and Factored Models. Combined Feature Models enrich their models with concatenations of linguistic features, whereas Factored Models treat each token as a vector of factors, which makes it possible to build relatively independent models for each factor. In the context of machine translation, both approaches were expected to enrich the existing surface word model with additional linguistic information. The research focused on improving output translation quality for English-to-French and French-to-English translation from several standpoints. Better overall readability and understandability of a generated document should be achieved mainly by ensuring the text’s fluency in the target language (syntactic correctness), its adequacy (use of appropriate terminology) and its fidelity (semantic adequacy). These goals were addressed by first analysing Pharaoh’s current performance and identifying the language-specific and model-related problems encountered. Several experiments were then performed using the two approaches, and their results were compared. Despite some improvements on a few of the linguistic issues discussed, notably fixed-expression translation and part-of-speech ambiguity, major problems involving complex syntactic structures in the source language remained a hard challenge for the approach of linguistically augmenting phrase-based statistical machine translation.
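The difference between the two token representations described above can be illustrated with a minimal sketch. It is not taken from the thesis: the `Token` fields, the Penn-style tags, the choice of surface form plus part of speech as the combined feature, and the `_` separator are all assumptions made for illustration. The `|` separator mirrors the factored corpus format that Moses reads.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    """One annotated word; the three fields are illustrative assumptions."""
    surface: str
    lemma: str
    pos: str


def to_combined(tokens: List[Token], sep: str = "_") -> str:
    """Combined-feature style: fuse surface form and POS tag into a single
    atomic symbol, so the phrase-based system sees one enriched vocabulary.
    The separator and the choice of features are hypothetical."""
    return " ".join(f"{t.surface}{sep}{t.pos}" for t in tokens)


def to_factored(tokens: List[Token], sep: str = "|") -> str:
    """Factored style: keep each token as a vector of factors, written with
    the '|' separator used by Moses factored corpora, so that separate
    models can be built over individual factors."""
    return " ".join(sep.join(t) for t in tokens)


# Hypothetical example sentence with assumed annotations.
SENTENCE = [
    Token("the", "the", "DT"),
    Token("houses", "house", "NNS"),
]

if __name__ == "__main__":
    print(to_combined(SENTENCE))
    print(to_factored(SENTENCE))
```

In the combined form, `houses_NNS` is a single vocabulary item, which can sharpen part-of-speech distinctions at the cost of sparser statistics. In the factored form, the surface, lemma and tag remain separately addressable.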