Edinburgh Research Archive

Segment-level evaluation of machine translation metrics

Authors

Moghe, Nikita

Abstract

Most metrics for evaluating Machine Translation (MT) establish their effectiveness by demonstrating that they can distinguish the quality of different MT systems over a large corpus (system-level evaluation). However, their ability to distinguish good translations from bad ones for a single instance (segment-level evaluation) is understudied and often overlooked. Segment-level evaluation influences system-level evaluation and is crucial in applications that use MT as part of their technology stack. In this thesis, we offer a new perspective on evaluating segment-level metrics through their use in extrinsic tasks and challenge sets, which allows us to identify their drawbacks and subsequently provide suggestions to improve them. Our first approach evaluates a metric's ability to correlate translation quality with translation utility in an extrinsic task. We find that contemporary MT metrics exhibit negligible correlation with the outcomes of a downstream task, indicating their inability to identify useful translations. We observe that the scores provided by neural metrics are not interpretable, in large part because their ranges are undefined. We further find that different tasks show varying sensitivity to MT errors. To assess the capability of individual metrics in identifying various machine translation errors, we create a contrastive challenge set, ACES. ACES consists of 68 phenomena, ranging from simple perturbations at the word/character level to more complex errors based on discourse and real-world knowledge, and spans 146 language pairs. We evaluate 47 metrics belonging to different design paradigms on the ACES dataset. We also investigate claims that Large Language Models (LLMs) are effective as MT evaluators, addressing the limitations of previous studies by providing a more holistic evaluation that covers a range of linguistic phenomena and language pairs and includes both low- and medium-resource languages.
Our results demonstrate that different metric families struggle with different phenomena and that LLM-based methods fail to demonstrate reliable performance. We conduct several analyses and observe that many metrics ignore the information in the source sentence, tend to prefer surface-level overlap, and inherit properties of their base multilingual models that are not always beneficial. Throughout the thesis, it becomes evident that the single scores produced by metrics are uninformative. We provide several recommendations to improve metric design, and we advocate MT evaluation based on the prediction of error labels instead of error scores. To facilitate this, we also release Span-ACES, in which the incorrect translations from ACES are annotated at the span level.
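
As a rough illustration of how a contrastive challenge set can be used to score a metric, the sketch below counts how often a metric ranks the good translation above the incorrect one and summarises this as a Kendall's tau-like statistic. This is a minimal sketch under stated assumptions, not the evaluation code used in the thesis: the names `ContrastivePair`, `overlap_metric`, and `contrastive_tau`, the treatment of ties, and the two example sentences are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class ContrastivePair:
    """One challenge-set item (hypothetical structure, for illustration only)."""
    reference: str
    good: str       # translation that should be preferred
    incorrect: str  # translation containing the targeted error


def overlap_metric(hypothesis: str, reference: str) -> float:
    """Toy surface-overlap metric: share of reference tokens present in the hypothesis."""
    ref_tokens = reference.lower().split()
    hyp_tokens = set(hypothesis.lower().split())
    if not ref_tokens:
        return 0.0
    return sum(tok in hyp_tokens for tok in ref_tokens) / len(ref_tokens)


def contrastive_tau(
    pairs: Iterable[ContrastivePair],
    metric: Callable[[str, str], float],
) -> float:
    """Return (concordant - discordant) / (concordant + discordant).

    A pair is concordant when the metric scores the good translation strictly
    higher than the incorrect one. Ties are counted as discordant here, which
    is an assumption of this sketch.
    """
    concordant = 0
    discordant = 0
    for pair in pairs:
        good_score = metric(pair.good, pair.reference)
        bad_score = metric(pair.incorrect, pair.reference)
        if good_score > bad_score:
            concordant += 1
        else:
            discordant += 1
    total = concordant + discordant
    return (concordant - discordant) / total if total else 0.0


# Invented examples: the second pair flips the meaning while keeping most words.
example_pairs = [
    ContrastivePair(
        reference="the cat sat on the mat",
        good="the cat sat on the mat",
        incorrect="the dog sat on the mat",
    ),
    ContrastivePair(
        reference="he did not sign the contract",
        good="he refused to sign the contract",
        incorrect="he did sign the contract",
    ),
]

example_tau = contrastive_tau(example_pairs, overlap_metric)
```

In this toy setup the overlap metric prefers the incorrect translation in the second pair, because dropping the negation keeps more reference words than the correct paraphrase does. That is the kind of surface-overlap preference a contrastive set is designed to expose.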
