CACAPO dataset



The Combinations of Aligned data-sentenCes from nAturally PrOduced texts (hereafter: CACAPO) dataset is a dataset for data-to-text generation. The dataset contains over 20,000 sentences from automatically scraped news reports for the sports, weather, stock, and incidents domain in English and Dutch, aligned with relevant attribute-value paired data. To our knowledge, this is the first dataset based on “naturally occurring” human-written texts (i.e., texts that were not collected in a task-based setting), that covers various domains, as well as multiple languages.
Date made available2 Aug 2022

Cite this