Our Dialogue System Sucks - but Luckily we are at the Top of the Leaderboard!: A Discussion on Current Practices in NLP Evaluation

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Scientific › peer-review
Abstract

Leaderboards are currently a common means of evaluating natural language processing (NLP) systems, and large language models in particular. In this paper, we argue that we should step away from leaderboards and adopt a more inclusive approach to both developing and evaluating models. Evaluation should focus on the complete context in which a system operates. To accomplish this, researchers should take note of developments across multiple scientific fields, from NLP to communication science.

Original language: English
Title of host publication: Proceedings of the 6th Conference on ACM Conversational User Interfaces, CUI 2024
Number of pages: 5
ISBN (Electronic): 9798400705113
DOIs
Publication status: Published - 8 Jul 2024

Publication series

Name: Proceedings of the 6th Conference on ACM Conversational User Interfaces, CUI 2024

Keywords

  • Evaluation
  • NLP
  • leaderboards
  • multidisciplinary research