Results of WMT23 Metrics Shared Task: Metrics Might Be Guilty but References Are Not Innocent.

Markus Freitag, Nitika Mathur, Chi-kiu Lo, Eleftherios Avramidis, Ricardo Rei, Brian Thompson, Tom Kocmi, Frédéric Blain, Daniel Deutsch, Craig Stewart, Chrysoula Zerva, Sheila Castilho, Alon Lavie, George F. Foster

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Scientific › peer-review

    Abstract

    This paper presents the results of the WMT23 Metrics Shared Task. Participants submitting automatic MT evaluation metrics were asked to score the outputs of the translation systems competing in the WMT23 News Translation Task. All metrics were evaluated on how well they correlate with human ratings at the system and segment level. Similar to last year, we acquired our own human ratings based on expert-based human evaluation via Multidimensional Quality Metrics (MQM). Following last year's success, we also included a challenge set subtask, where participants had to create contrastive test suites for evaluating metrics' ability to capture and penalise specific types of translation errors. Furthermore, we improved our meta-evaluation procedure by considering fewer tasks and calculating a global score by weighted averaging across the various tasks.
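    The global score described above is a weighted average of per-task results. The exact tasks and weights are not given in this record; the following is a minimal sketch of that idea, with illustrative task names, scores, and weights that are assumptions, not the actual WMT23 meta-evaluation configuration.

    ```python
    # Hypothetical sketch: a metric's global score as a weighted average of
    # per-task correlation scores. Task names, scores, and weights below are
    # illustrative only, not the real WMT23 setup.

    def global_score(task_scores, task_weights):
        """Weighted average of per-task scores (weights need not sum to 1)."""
        total_weight = sum(task_weights[t] for t in task_scores)
        return sum(task_scores[t] * task_weights[t] for t in task_scores) / total_weight

    # Example: one system-level and one segment-level score for a single metric.
    scores = {"system_level": 0.90, "segment_level": 0.60}
    weights = {"system_level": 1.0, "segment_level": 2.0}
    print(round(global_score(scores, weights), 3))  # 0.7
    ```

    Weighting lets the organisers emphasise some evaluation settings (e.g. segment-level agreement) over others without discarding any task entirely.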
    Original language: English
    Title of host publication: Proceedings of the Eighth Conference on Machine Translation
    Pages: 578-628
    Number of pages: 51
    Publication status: Published - 2023
