This paper describes the CACAPO dataset, built for training both neural pipeline and end-to-end data-to-text language generation systems. The dataset is multilingual (Dutch and English), and contains almost 10,000 sentences from human-written news texts in the sports, weather, stocks, and incidents domains, together with aligned attribute-value paired data. The dataset is unique in that the linguistic variation and indirect ways of expressing data in these texts reflect the challenges of real-world NLG tasks.
|Title of host publication||Proceedings of The 13th International Conference on Natural Language Generation|
|Place of Publication||Dublin, Ireland|
|Number of pages||15|
|Publication status||Published - 1 Dec 2020|
|Event||International Conference on Natural Language Generation - online, Dublin, Ireland. Duration: 15 Dec 2020 → 18 Dec 2020. Conference number: 13|
|Conference||International Conference on Natural Language Generation|
|Abbreviated title||INLG 2020|
|Period||15/12/20 → 18/12/20|
|Title||The CACAPO Dataset: A Multilingual, Multi-Domain Dataset for Neural Pipeline and End-to-End Data-to-Text Generation|
van der Lee, C. (Creator), Emmery, C. (Creator), Wubben, S. (Creator) & Krahmer, E. (Creator), DataverseNL, 2 Aug 2022
DOI: 10.34894/LIBYHP, https://dataverse.nl/citation?persistentId=doi:10.34894/LIBYHP