TY - GEN
T1 - Are LLMs Breaking MT Metrics? Results of the WMT24 Metrics Shared Task.
AU - Freitag, Markus
AU - Mathur, Nitika
AU - Deutsch, Daniel
AU - Lo, Chi-kiu
AU - Avramidis, Eleftherios
AU - Rei, Ricardo
AU - Thompson, Brian
AU - Blain, Frédéric
AU - Kocmi, Tom
AU - Wang, Jiayi
AU - Adelani, David Ifeoluwa
AU - Buchicchio, Marianna
AU - Zerva, Chrysoula
AU - Lavie, Alon
N1 - DBLP License: DBLP's bibliographic metadata records provided through http://dblp.org/ are distributed under a Creative Commons CC0 1.0 Universal Public Domain Dedication. Although the bibliographic metadata records are provided consistent with CC0 1.0 Dedication, the content described by the metadata records is not. Content may be subject to copyright, rights of privacy, rights of publicity and other restrictions.
PY - 2024
Y1 - 2024
N2 - The WMT24 Metrics Shared Task evaluated the performance of automatic metrics for machine translation (MT), with a major focus on LLM-based translations that were generated as part of the WMT24 General MT Shared Task. As LLMs become increasingly popular in MT, it is crucial to determine whether existing evaluation metrics can accurately assess the output of these systems. To provide a robust benchmark for this evaluation, human assessments were collected using Multidimensional Quality Metrics (MQM), continuing the practice from recent years. Furthermore, building on the success of the previous year, a challenge set subtask was included, requiring participants to design contrastive test suites that specifically target a metric’s ability to identify and penalize different types of translation errors. Finally, the meta-evaluation procedure was refined to better reflect real-world usage of MT metrics, focusing on pairwise accuracy at both the system- and segment-levels. We present an extensive analysis on how well metrics perform on three language pairs: English to Spanish (Latin America), Japanese to Chinese, and English to German. The results strongly confirm the results reported last year, that fine-tuned neural metrics continue to perform well, even when used to evaluate LLM-based translation systems.
AB - The WMT24 Metrics Shared Task evaluated the performance of automatic metrics for machine translation (MT), with a major focus on LLM-based translations that were generated as part of the WMT24 General MT Shared Task. As LLMs become increasingly popular in MT, it is crucial to determine whether existing evaluation metrics can accurately assess the output of these systems. To provide a robust benchmark for this evaluation, human assessments were collected using Multidimensional Quality Metrics (MQM), continuing the practice from recent years. Furthermore, building on the success of the previous year, a challenge set subtask was included, requiring participants to design contrastive test suites that specifically target a metric’s ability to identify and penalize different types of translation errors. Finally, the meta-evaluation procedure was refined to better reflect real-world usage of MT metrics, focusing on pairwise accuracy at both the system- and segment-levels. We present an extensive analysis on how well metrics perform on three language pairs: English to Spanish (Latin America), Japanese to Chinese, and English to German. The results strongly confirm the results reported last year, that fine-tuned neural metrics continue to perform well, even when used to evaluate LLM-based translation systems.
U2 - 10.18653/v1/2024.wmt-1.2
DO - 10.18653/v1/2024.wmt-1.2
M3 - Conference contribution
SP - 47
EP - 81
BT - Proceedings of the Ninth Conference on Machine Translation
A2 - Haddow, Barry
A2 - Kocmi, Tom
A2 - Koehn, Philipp
A2 - Monz, Christof
PB - Association for Computational Linguistics
ER -