Automatic Construction of Evaluation Suites for Natural Language Generation Datasets

Simon Mille, Kaustubh Dhole, Saad Mahamood, Laura Perez-Beltrachini, Varun Prashant Gangal, Mihir Kale, Emiel van Miltenburg, Sebastian Gehrmann

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Scientific › peer-review



Machine learning approaches applied to NLP are often evaluated by summarizing their performance in a single number, for example accuracy. Since most test sets are constructed as an i.i.d. sample from the overall data, this approach overly simplifies the complexity of language and encourages overfitting to the head of the data distribution. As such, rare language phenomena or text about underrepresented groups are not equally included in the evaluation. To encourage more in-depth model analyses, researchers have proposed the use of multiple test sets, also called challenge sets, that assess specific capabilities of a model. In this paper, we develop a framework based on this idea that can generate controlled perturbations and identify subsets in text-to-scalar, text-to-text, or data-to-text settings. By applying this framework to the GEM generation benchmark, we develop evaluation suites made of 80 challenge sets, demonstrate the kinds of analyses that they enable, and shed light on the limits of current generation models.
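To give a rough sense of the two mechanisms the abstract mentions, here is a minimal, hypothetical sketch of a controlled perturbation (replacing numbers in an input with nearby values) and a subset-based challenge set (partitioning examples by input length). The function names `perturb_numbers` and `subset_by_length` are invented for illustration and are not the paper's actual API or implementation:

```python
import random
import re


def perturb_numbers(text, rng, max_delta=5):
    """Controlled perturbation: replace each integer in the input with a
    nearby value, so a model's output can be checked for faithfulness."""
    def repl(match):
        value = int(match.group())
        # Always add a positive offset so the value is guaranteed to change.
        return str(value + rng.randint(1, max_delta))
    return re.sub(r"\d+", repl, text)


def subset_by_length(examples, threshold=20):
    """Subset identification: split examples into 'short' and 'long'
    challenge sets based on whitespace-tokenized input length."""
    subsets = {"short": [], "long": []}
    for example in examples:
        key = "short" if len(example.split()) <= threshold else "long"
        subsets[key].append(example)
    return subsets


rng = random.Random(0)
perturbed = perturb_numbers("The team scored 3 goals in 90 minutes.", rng)
```

A real evaluation suite would pair each perturbed input with the original, regenerate model outputs for both, and compare scores per subset rather than reporting a single aggregate number.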
Original language: English
Title of host publication: Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1 (NeurIPS Datasets and Benchmarks 2021)
Publication status: Published - 2021
Event: Conference on Neural Information Processing Systems 2021: Datasets and Benchmarks - Hybrid
Duration: 28 Nov 2021 – 9 Dec 2021
Conference number: 35


Conference: Conference on Neural Information Processing Systems 2021
Abbreviated title: NeurIPS 2021


