A reproduction study of methods for evaluating dialogue system output: Replicating Santhanam and Shaikh (2019)

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Scientific › peer-review

Abstract

In this paper, we describe our effort to reproduce the paper Towards Best Experiment Design for Evaluating Dialogue System Output by Santhanam and Shaikh (2019) for the 2022 ReproGen shared task. We aim to produce the same results using different human evaluators and a different implementation of the automatic metrics used in the original paper. Although the study overall posed some challenges to reproduce (e.g. difficulties with reproducing the automatic metrics and statistics), in the end we found that our results generally replicate the findings of Santhanam and Shaikh (2019) and appear to follow similar trends.
Original language: English
Title of host publication: Proceedings of the 15th International Conference on Natural Language Generation: Generation Challenges
Place of Publication: Waterville, Maine, USA and virtual meeting
Publisher: Association for Computational Linguistics
Pages: 86-93
Number of pages: 8
ISBN (Print): 978-1-955917-60-5
Publication status: Published - Jul 2022

Keywords

  • Reproduction study
  • NLP research
  • Human evaluation task
  • Dialogue system output
