How reproducible is best-worst scaling for human evaluation? A reproduction of 'Data-to-text Generation with Macro Planning'

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Scientific › peer-review

Abstract

This paper is part of the larger ReproHum project, in which different teams of researchers aim to reproduce published experiments from the NLP literature. Specifically, ReproHum focuses on the reproducibility of human evaluation studies, where participants indicate the quality of different outputs of Natural Language Generation (NLG) systems. This is necessary because, without reproduction studies, we do not know how reliable earlier results are. This paper aims to reproduce the second human evaluation study of Puduppully and Lapata (2021), while another lab is attempting to do the same. This experiment uses best-worst scaling to determine the relative performance of different NLG systems. We found that the worst-performing system in the original study is now in fact the best-performing system across the board. This means that we cannot fully reproduce the original results. We also carry out alternative analyses of the data, and discuss how our results may be combined with those of the other reproduction study carried out in parallel with this paper.
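To make the evaluation protocol concrete: in best-worst scaling, participants see small tuples of system outputs and mark one as best and one as worst; each system's score is then commonly computed as the number of times it was chosen as best minus the number of times it was chosen as worst, normalised by how often it appeared. The following is a minimal Python sketch of that scoring rule only; the system names and judgement data are hypothetical, and this is not the authors' code or data.

    from collections import Counter

    def bws_scores(judgements):
        """Best-worst scaling scores per system.

        judgements: iterable of (systems, best, worst), where `systems`
        lists the systems shown together in one tuple, and `best`/`worst`
        are the participant's picks. Score = (#best - #worst) / #seen,
        ranging from -1 (always worst) to 1 (always best).
        """
        best, worst, seen = Counter(), Counter(), Counter()
        for systems, b, w in judgements:
            seen.update(systems)   # count every appearance
            best[b] += 1           # count "best" picks
            worst[w] += 1          # count "worst" picks
        return {s: (best[s] - worst[s]) / seen[s] for s in seen}

    # Hypothetical example with three made-up systems:
    example = [
        (("SystemA", "SystemB", "SystemC"), "SystemA", "SystemC"),
        (("SystemA", "SystemB", "SystemC"), "SystemB", "SystemC"),
        (("SystemA", "SystemB", "SystemC"), "SystemA", "SystemB"),
    ]
    print(bws_scores(example))
    # {'SystemA': 0.667, 'SystemB': 0.0, 'SystemC': -0.667} (approx.)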
Original language: English
Title of host publication: Proceedings of the 3rd Workshop on Human Evaluation of NLP Systems
Editors: Anya Belz, Maja Popović, Ehud Reiter, Craig Thomson, João Sedoc
Place of Publication: Varna, Bulgaria
Publisher: Incoma Ltd., Shoumen, Bulgaria
Pages: 75-88
Number of pages: 14
Publication status: Published - 1 Sept 2023
