Segment-level evaluation of machine translation metrics

Moghe, Nikita


dc.contributor.advisor

Birch-Mayne, Alexandra

dc.contributor.advisor

Steedman, Mark

dc.contributor.author

Moghe, Nikita

dc.contributor.sponsor

UK Research and Innovation (UKRI)

en

dc.contributor.sponsor

University of Edinburgh

en

dc.date.accessioned

2024-09-16T14:41:06Z

dc.date.available

2024-09-16T14:41:06Z

dc.date.issued

2024-09-16

dc.description.abstract

Most metrics evaluating Machine Translation (MT) claim their effectiveness by demonstrating their ability to distinguish the quality of different MT systems over a large corpus (system-level evaluation). However, their ability to distinguish good translations from bad for a single instance (segment-level evaluation) is largely understudied and overlooked. Segment-level evaluation influences system-level evaluation and is crucial in applications that use MT as a part of their technology stack. In this thesis, we offer a new perspective on evaluating segment-level metrics through their use in extrinsic tasks and challenge sets, allowing us to identify their drawbacks and subsequently provide suggestions to improve them. Our first approach evaluates a metric's ability to correlate translation quality with translation utility in an extrinsic task. We find that contemporary MT metrics exhibit negligible correlation with the outcomes of a downstream task, indicating their inability to identify useful translations. We observe that the scores provided by neural metrics are not interpretable, in large part due to having undefined ranges. We further find that different tasks show varying sensitivity to MT errors. To assess the capability of individual metrics in identifying various machine translation errors, we create a contrastive challenge set, ACES. ACES consists of 68 phenomena ranging from simple perturbations at the word/character level to more complex errors based on discourse and real-world knowledge, spanning 146 language pairs. We evaluate 47 metrics belonging to different design paradigms on the ACES dataset. We also investigate claims that Large Language Models (LLMs) are effective as MT evaluators, addressing the limitations of previous studies by providing a more holistic evaluation that covers a range of linguistic phenomena and language pairs and includes both low- and medium-resource languages.
Our results demonstrate that different metric families struggle with different phenomena and that LLM-based methods fail to demonstrate reliable performance. We conduct several analyses and observe that many metrics ignore the information in the source sentence, tend to prefer surface-level overlap, and end up incorporating properties of base multilingual models, which are not always beneficial. Throughout the thesis, it becomes evident that single scores produced by metrics are uninformative. We provide several recommendations to improve metric design while advocating MT evaluation based on the prediction of error labels instead of error scores. To facilitate this, we also release Span-ACES, where the incorrect translations from ACES are annotated at the span level.

en

dc.identifier.uri

https://hdl.handle.net/1842/42171

dc.identifier.uri

http://dx.doi.org/10.7488/era/4892

dc.language.iso

en

dc.publisher

The University of Edinburgh

en

dc.relation.hasversion

Amrhein, C., Moghe, N., and Guillou, L. (2022). ACES: Translation accuracy challenge sets for evaluating machine translation metrics. In Koehn, P., Barrault, L., Bojar, O., Bougares, F., Chatterjee, R., Costa-jussà, M. R., Federmann, C., Fishel, M., Fraser, A., Freitag, M., Graham, Y., Grundkiewicz, R., Guzmán, P., Haddow, B., Huck, M., Jimeno Yepes, A., Kocmi, T., Martins, A., Morishita, M., Monz, C., Nagata, M., Nakazawa, T., Negri, M., Névéol, A., Neves, M., Popel, M., Turchi, M., and Zampieri, M., editors, Proceedings of the Seventh Conference on Machine Translation (WMT), pages 479–513, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics

en

dc.relation.hasversion

Amrhein, C., Moghe, N., and Guillou, L. (2023). ACES: Translation accuracy challenge sets at WMT 2023. In Koehn, P., Haddow, B., Kocmi, T., and Monz, C., editors, Proceedings of the Eighth Conference on Machine Translation, pages 695–712, Singapore. Association for Computational Linguistics

en

dc.relation.hasversion

Moghe, N., Fazla, A., Amrhein, C., Kocmi, T., Steedman, M., Birch, A., Sennrich, R., and Guillou, L. (2024). Machine translation meta evaluation through translation accuracy challenge sets. Computing Research Repository, arXiv:2401.16313

en

dc.relation.hasversion

Moghe, N., Hardmeier, C., and Bawden, R. (2020). The University of Edinburgh-Uppsala University’s submission to the WMT 2020 chat translation task. In Barrault, L., Bojar, O., Bougares, F., Chatterjee, R., Costa-jussà, M. R., Federmann, C., Fishel, M., Fraser, A., Graham, Y., Guzmán, P., Haddow, B., Huck, M., Yepes, A. J., Koehn, P., Martins, A., Morishita, M., Monz, C., Nagata, M., Nakazawa, T., and Negri, M., editors, Proceedings of the Fifth Conference on Machine Translation, pages 473–478, Online. Association for Computational Linguistics

en

dc.relation.hasversion

Moghe, N., Razumovskaia, E., Guillou, L., Vulić, I., Korhonen, A., and Birch, A. (2023a). Multi3NLU++: A multilingual, multi-intent, multi-domain dataset for natural language understanding in task-oriented dialogue. In Rogers, A., Boyd-Graber, J., and Okazaki, N., editors, Findings of the Association for Computational Linguistics: ACL 2023, pages 3732–3755, Toronto, Canada. Association for Computational Linguistics

en

dc.relation.hasversion

Moghe, N., Sherborne, T., Steedman, M., and Birch, A. (2023b). Extrinsic evaluation of machine translation metrics. In Rogers, A., Boyd-Graber, J., and Okazaki, N., editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13060–13078, Toronto, Canada. Association for Computational Linguistics

en

dc.relation.hasversion

Moghe, N., Steedman, M., and Birch, A. (2021). Cross-lingual intermediate fine-tuning improves dialogue state tracking. In Moens, M.-F., Huang, X., Specia, L., and Yih, S. W.-t., editors, Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 1137–1150, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics

en

dc.subject

machine translation systems

en

dc.subject

segment-level evaluation

en

dc.subject

MT metrics

en

dc.subject

MT evaluation

en

dc.subject

evaluating translations

en

dc.title

Segment-level evaluation of machine translation metrics

en

dc.type

Thesis or Dissertation

en

dc.type.qualificationlevel

Doctoral

en

dc.type.qualificationname

PhD Doctor of Philosophy

en

Files

Original bundle

Name: Moghe2024.pdf
Size: 1.49 MB
Format: Adobe Portable Document Format
This item appears in the following Collection(s)

Informatics thesis and dissertation collection