Edinburgh Research Archive

Segment-level evaluation of machine translation metrics

dc.contributor.advisor
Birch-Mayne, Alexandra
dc.contributor.advisor
Steedman, Mark
dc.contributor.author
Moghe, Nikita
dc.contributor.sponsor
UK Research and Innovation (UKRI)
en
dc.contributor.sponsor
University of Edinburgh
en
dc.date.accessioned
2024-09-16T14:41:06Z
dc.date.available
2024-09-16T14:41:06Z
dc.date.issued
2024-09-16
dc.description.abstract
Most metrics evaluating Machine Translation (MT) claim their effectiveness by demonstrating their ability to distinguish the quality of different MT systems over a large corpus (system-level evaluation). However, their ability to distinguish good translations from bad ones for a single instance (segment-level evaluation) is largely understudied and overlooked. Segment-level evaluation influences system-level evaluation and is crucial in applications that use MT as a part of their technology stack. In this thesis, we offer a new perspective on evaluating segment-level metrics through their use in extrinsic tasks and challenge sets, allowing us to identify their drawbacks and subsequently provide suggestions to improve them. Our first approach evaluates a metric's ability to correlate translation quality with translation utility in an extrinsic task. We find that contemporary MT metrics exhibit negligible correlation with the outcomes of a downstream task, indicating their inability to identify useful translations. We observe that the scores provided by neural metrics are not interpretable, in large part due to having undefined ranges. We further find that different tasks show varying sensitivity to MT errors. To assess the capability of individual metrics in identifying various machine translation errors, we create ACES, a contrastive challenge set. ACES consists of 68 phenomena, ranging from simple perturbations at the word/character level to more complex errors based on discourse and real-world knowledge, and spans 146 language pairs. We evaluate 47 metrics belonging to different design paradigms on the ACES dataset. We also investigate claims that Large Language Models (LLMs) are effective as MT evaluators, addressing the limitations of previous studies by providing a more holistic evaluation that covers a range of linguistic phenomena and language pairs and includes both low- and medium-resource languages.
Our results demonstrate that different metric families struggle with different phenomena and that LLM-based methods fail to demonstrate reliable performance. We conduct several analyses and observe that many metrics ignore the information in the source sentence, tend to prefer surface-level overlap, and end up incorporating properties of base multilingual models that are not always beneficial. Throughout the thesis, it becomes evident that the single scores produced by metrics are uninformative. We provide several recommendations to improve metric design while advocating MT evaluation based on the prediction of error labels instead of error scores. To facilitate this, we also release Span-ACES, in which the incorrect translations from ACES are annotated at the span level.
en
dc.identifier.uri
https://hdl.handle.net/1842/42171
dc.identifier.uri
http://dx.doi.org/10.7488/era/4892
dc.language.iso
en
en
dc.publisher
The University of Edinburgh
en
dc.relation.hasversion
Amrhein, C., Moghe, N., and Guillou, L. (2022). ACES: Translation accuracy challenge sets for evaluating machine translation metrics. In Koehn, P., Barrault, L., Bojar, O., Bougares, F., Chatterjee, R., Costa-jussà, M. R., Federmann, C., Fishel, M., Fraser, A., Freitag, M., Graham, Y., Grundkiewicz, R., Guzmán, P., Haddow, B., Huck, M., Jimeno Yepes, A., Kocmi, T., Martins, A., Morishita, M., Monz, C., Nagata, M., Nakazawa, T., Negri, M., Névéol, A., Neves, M., Popel, M., Turchi, M., and Zampieri, M., editors, Proceedings of the Seventh Conference on Machine Translation (WMT), pages 479–513, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics
en
dc.relation.hasversion
Amrhein, C., Moghe, N., and Guillou, L. (2023). ACES: Translation accuracy challenge sets at WMT 2023. In Koehn, P., Haddow, B., Kocmi, T., and Monz, C., editors, Proceedings of the Eighth Conference on Machine Translation, pages 695–712, Singapore. Association for Computational Linguistics
en
dc.relation.hasversion
Moghe, N., Fazla, A., Amrhein, C., Kocmi, T., Steedman, M., Birch, A., Sennrich, R., and Guillou, L. (2024). Machine translation meta evaluation through translation accuracy challenge sets. Computing Research Repository, arXiv:2401.16313
en
dc.relation.hasversion
Moghe, N., Hardmeier, C., and Bawden, R. (2020). The University of Edinburgh-Uppsala University’s submission to the WMT 2020 chat translation task. In Barrault, L., Bojar, O., Bougares, F., Chatterjee, R., Costa-jussà, M. R., Federmann, C., Fishel, M., Fraser, A., Graham, Y., Guzmán, P., Haddow, B., Huck, M., Yepes, A. J., Koehn, P., Martins, A., Morishita, M., Monz, C., Nagata, M., Nakazawa, T., and Negri, M., editors, Proceedings of the Fifth Conference on Machine Translation, pages 473–478, Online. Association for Computational Linguistics
en
dc.relation.hasversion
Moghe, N., Razumovskaia, E., Guillou, L., Vulić, I., Korhonen, A., and Birch, A. (2023a). Multi3NLU++: A multilingual, multi-intent, multi-domain dataset for natural language understanding in task-oriented dialogue. In Rogers, A., Boyd-Graber, J., and Okazaki, N., editors, Findings of the Association for Computational Linguistics: ACL 2023, pages 3732–3755, Toronto, Canada. Association for Computational Linguistics
en
dc.relation.hasversion
Moghe, N., Sherborne, T., Steedman, M., and Birch, A. (2023b). Extrinsic evaluation of machine translation metrics. In Rogers, A., Boyd-Graber, J., and Okazaki, N., editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13060–13078, Toronto, Canada. Association for Computational Linguistics
en
dc.relation.hasversion
Moghe, N., Steedman, M., and Birch, A. (2021). Cross-lingual intermediate fine-tuning improves dialogue state tracking. In Moens, M.-F., Huang, X., Specia, L., and Yih, S. W.-t., editors, Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 1137–1150, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics
en
dc.subject
machine translation systems
en
dc.subject
segment-level evaluation
en
dc.subject
MT metrics
en
dc.subject
MT evaluation
en
dc.subject
evaluating translations
en
dc.title
Segment-level evaluation of machine translation metrics
en
dc.type
Thesis or Dissertation
en
dc.type.qualificationlevel
Doctoral
en
dc.type.qualificationname
PhD Doctor of Philosophy
en

Files

Original bundle

Name:
Moghe2024.pdf
Size:
1.49 MB
Format:
Adobe Portable Document Format