Using Discourse Strategies to Improve Sentence Alignment in Statistical Machine Translation
Training the translation model in statistical machine translation requires a sentence-aligned parallel corpus of the source and target languages. Most available parallel corpora are at best document-aligned, so sentence alignment is performed on the document-aligned corpus as a pre-processing step before word alignment and construction of the phrase translation table. During sentence alignment, some data is discarded for "quality reasons", usually because of N:1 sentence alignments. This work presents a set of rules based on an empirical analysis of discourse strategies in the data discarded while aligning Europarl. The rules split the long sentence in a 2:1 or 1:2 sentence alignment, yielding two 1:1 sentence alignments that are added to the training data. I present three evaluation methods, which address split performance, applicability, and the impact of the recovered data on the translation table. The sentence splits determined by the rules lead to more grammatical sentences on each side of the split than a proportionate split does, and a translation system trained with the additional data records small improvements in BLEU score over one trained without it. The findings also indicate that the rules are domain-specific to the Europarl corpus and produce poor sentence splits when applied to N:1 alignments in news report data.
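To make the comparison baseline concrete, here is a minimal sketch of what a proportionate split of a 2:1 alignment might look like. The abstract does not define the procedure, so the details are assumptions: whitespace tokenisation, a cut point chosen so that the two halves of the long sentence mirror the token-length ratio of the two short sentences, and the hypothetical function name `proportionate_split`. It does not implement the discourse-based rules themselves, which the abstract does not describe.

```python
def proportionate_split(long_sentence, short_a, short_b):
    """Split long_sentence into two parts whose token-length ratio
    mirrors that of short_a and short_b.

    Assumption: whitespace tokenisation and a length-ratio cut point;
    the thesis may define its proportionate split differently.
    """
    long_tokens = long_sentence.split()
    len_a = len(short_a.split())
    len_b = len(short_b.split())
    if len(long_tokens) < 2 or len_a + len_b == 0:
        raise ValueError("need at least two long-side tokens and non-empty short sentences")

    # Place the cut at the same relative position as the sentence
    # boundary on the two-sentence side of the alignment.
    cut = round(len(long_tokens) * len_a / (len_a + len_b))
    # Keep at least one token on each side of the split.
    cut = max(1, min(len(long_tokens) - 1, cut))

    return " ".join(long_tokens[:cut]), " ".join(long_tokens[cut:])


# A 2:1 alignment becomes two 1:1 alignments.
first, second = proportionate_split(
    "a b c d e f g h i j",   # the single long sentence (10 tokens)
    "x y z",                 # first short sentence (3 tokens)
    "p q r s t u v",         # second short sentence (7 tokens)
)
pairs = [("x y z", first), ("p q r s t u v", second)]
```

Because the cut depends only on length, it can fall in the middle of a clause, which is consistent with the finding that rule-based splits leave more grammatical sentences on each side.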