ReproHum: 0033-03: How Reproducible Are Fluency Ratings of Generated Text? A Reproduction of August et al. 2022

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Scientific › peer-review


    Abstract

    In earlier work, August et al. (2022) evaluated three different Natural Language Generation systems on their ability to generate fluent, relevant, and factual scientific definitions. As part of the ReproHum project (Belz et al., 2023), we carried out a partial reproduction study of their human evaluation procedure, focusing on human fluency ratings. Following the standardised ReproHum procedure, our reproduction study follows the original study as closely as possible, with two raters providing 300 ratings each. In addition, we carried out a second study in which we collected ratings from eight additional raters and analysed the variability of the ratings. We successfully reproduced the inferential statistics from the original study (i.e. the same hypotheses were supported), albeit with lower inter-annotator agreement. The remainder of our paper shows significant variation between different raters, raising questions about what it really means to reproduce human evaluation studies.
    Original language: English
    Title of host publication: Proceedings of the Fourth Workshop on Human Evaluation of NLP Systems (HumEval) @ LREC-COLING 2024
    Editors: Simone Balloccu, Anya Belz, Rudali Huidrom, Ehud Reiter, Joao Sedoc, Craig Thomson
    Place of publication: Torino, Italy
    Publisher: ELRA and ICCL
    Pages: 132-144
    Number of pages: 13
    Publication status: Published - 1 May 2024

    Keywords

    • Fluency ratings
    • Natural Language Generation
    • Evaluation
    • Reproduction
