Introducing corpus-based rules and algorithms in a rule-based machine translation system
Machine translation offers the challenge of automatically translating a text from one natural language into another. Statistical methods - originating from the field of information theory - have shown to be a major breakthrough in the field of machine translation. Prior to this paradigm, many systems had been developed following a rule-based approach. This denotes a system based on a linguistic description of the languages involved and of how translation occurs in the mind of the (human) translator. Statistical models on the contrary use empirical means and may work with very little linguistic hypothesis on language and translation as performed by humans. This had implications for rule-based translation systems, in terms of software architecture and the nature of the rules, which were manually input and lack any statistical feature. In the view of such diverging paradigms, we can imagine trying to combine both in a hybrid system. In the present work, we start by examining the state-of-the-art of both rule-based and statistical systems. We restrict the rule-based approach to transfer-based systems. We compare rule-based and statistical paradigms in terms of global translation quality and give a qualitative analysis of their respective specific errors. We also introduce initial black-box hybrid models that confirm there is an expected gain in combining the two approaches. Motivated by the qualitative analysis, we focus our study and experiments on lexical phrasal rules. We propose a setup allowing to extract such resources from corpora. Going one step further in the integration of rule-based and statistical approaches, we then examine how to combine the extracted rules with decoding modules that will allow for a corpus-based handling of ambiguity. This then leads to the final delivery of this work: a rule-based system for which we can learn non-deterministic rules from corpora, and whose decoder can be optimised on a tuning set in the same domain.