Advances in Natural Language Generation: Generating Varied Outputs from Semantic Inputs

Thiago Castro Ferreira

Research output: Thesis › Doctoral Thesis › Scientific


Abstract

Natural Language Generation (NLG) -- also known as Automatic Text Generation -- is the computational process of generating understandable natural language text from non-linguistic input data (Reiter & Dale, 2000; Gatt & Krahmer, 2018). Practical applications include automatically generated weather forecasts (Goldberg et al., 1994; Sripada et al., 2004; Belz, 2008; Konstas & Lapata, 2013), news written by "robot-journalists" (Clerwall, 2014) and neonatal intensive care reports for doctors and caregivers (Reiter, 2007; Portet et al., 2009). This thesis focused on two particular problems in the NLG process: how to generate more varied texts for the same communicative goal (Chapters 2-6) and what constitutes an appropriate semantic input to generate language from (Chapters 7-8).
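To make the data-to-text idea concrete, here is a minimal, purely illustrative sketch of an NLG system in the spirit of the weather-forecast applications mentioned above: structured, non-linguistic input is mapped to understandable text. The field names and the cloud-cover threshold are invented for this example; real systems such as those cited use far richer content selection and realization.

```python
# Illustrative sketch of data-to-text NLG: non-linguistic input (a dict of
# weather readings) is rendered as text via a template. Field names and the
# 0.5 cloud-cover threshold are hypothetical, chosen only for this example.

def generate_forecast(data: dict) -> str:
    """Render a one-sentence forecast from structured weather data."""
    sky = "cloudy" if data["cloud_cover"] > 0.5 else "clear"
    return (f"{data['city']} will be {sky} tomorrow, "
            f"with temperatures around {data['temp_c']} degrees Celsius.")

print(generate_forecast({"city": "Tilburg", "cloud_cover": 0.8, "temp_c": 12}))
# -> Tilburg will be cloudy tomorrow, with temperatures around 12 degrees Celsius.
```

Template-based generation like this is rigid by design; the variation problem studied in Chapters 2-6 arises precisely because such systems say the same thing the same way every time.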

For the first problem, we aimed to model linguistic variation in the NLG process, focusing on the generation of noun phrases, a task called Referring Expression Generation (REG). By collecting and analyzing new corpora of referring expressions (described in Chapters 2 and 4), we were able to develop new state-of-the-art data-driven models for two subtasks in modular REG systems: the choice of referential form (i.e., whether a reference in the text should be a proper name, a pronoun, a description, etc.; Chapter 3) and proper name generation (i.e., given that a reference has the proper name form, should it be the full name of the entity, the first name, the surname or another proper name variant?; Chapter 5). Additionally, we introduced an end-to-end approach, based on neural networks, which, unlike modular REG systems, generates varied referring expressions for a discourse entity, deciding on its referential form and content in one shot, without explicit feature extraction. Using a new delexicalized version of the WebNLG corpus (Gardent et al., 2017a,b), we showed that the neural model substantially improved over two strong baselines in terms of accuracy of the referring expressions and fluency of the lexicalized texts (Chapter 6).
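The two modular REG decisions described above can be sketched with a toy baseline (not the thesis models): the classic heuristic of using a full proper name on first mention and a pronoun afterwards, applied to a delexicalized template in the style of the WebNLG entity placeholders. The entity tag format, the name dictionary and the heuristic itself are all simplifying assumptions for illustration.

```python
# Toy sketch of modular REG on a delexicalized template: for each entity
# placeholder, decide the referential form (proper name vs. pronoun) and,
# for proper names, the name variant. The "full name first, pronoun later"
# rule is a classic baseline, not the data-driven models of the thesis.

import re

NAMES = {"ENT-1": {"full": "Barack Obama", "surname": "Obama", "pronoun": "he"}}

def lexicalize(template: str, names: dict) -> str:
    """Replace entity placeholders with referring expressions."""
    seen = set()
    def refer(match):
        entity = match.group(0)
        if entity not in seen:           # first mention: full proper name
            seen.add(entity)
            return names[entity]["full"]
        return names[entity]["pronoun"]  # later mentions: pronoun
    return re.sub(r"ENT-\d+", refer, template)

print(lexicalize("ENT-1 was born in Hawaii . ENT-1 studied law .", NAMES))
# -> Barack Obama was born in Hawaii . he studied law .
```

An end-to-end neural REG model, by contrast, would learn both decisions jointly from corpus data instead of applying hand-written rules in sequence.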

The second problem addressed in this thesis concerned the input to NLG systems. While there is broad consensus among scholars on the output of NLG systems (i.e., text or speech), there is far less agreement on what the input should be. To address the problem, researchers have started looking for candidate input formats that could be used more broadly within the community. In this thesis, we have looked in detail at two of them: Abstract Meaning Representation (AMR; Chapter 7) and RDF triples from the semantic web (Chapter 8), which have fundamental differences in terms of level of specification, limitations and availability of resources. To convert both meaning representations into text, we proposed NLG models based on a pipeline architecture as well as models that work in a less modular style, using methods from Statistical (Koehn et al., 2003) and Neural (Bahdanau et al., 2015) Machine Translation. We concluded that both representations are helpful for NLG research, and which one is preferable presumably depends on the specific goal of the NLG system to be developed or on the NLG problem to be addressed. For instance, to study the full textual realization process, working with RDF triples seems preferable to AMRs, while for text-to-text NLG approaches or for the study of specific issues, such as lexical choice or phrase ordering within a sentence, AMRs may be the better choice.
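To contrast the two input formats discussed above, the sketch below shows the same fact as an RDF triple and as an AMR graph (in PENMAN notation), together with a minimal per-predicate template realizer for the triple. The specific entities, the predicate name and the template are invented for illustration and are not drawn from the thesis systems.

```python
# Hedged sketch contrasting NLG inputs. The RDF triple and the AMR string
# below paraphrase the same fact; names and templates are hypothetical.

rdf_triple = ("John_E_Blaha", "birthPlace", "San_Antonio")

# Roughly equivalent AMR, serialized in PENMAN notation (for contrast only):
amr = """(b / bear-02
      :ARG1 (p / person :name (n / name :op1 "John" :op2 "Blaha"))
      :location (c / city :name (n2 / name :op1 "San" :op2 "Antonio")))"""

TEMPLATES = {"birthPlace": "{s} was born in {o}."}

def verbalize(triple: tuple) -> str:
    """Realize a single RDF triple with a per-predicate template."""
    subj, pred, obj = triple
    clean = lambda e: e.replace("_", " ")
    return TEMPLATES[pred].format(s=clean(subj), o=clean(obj))

print(verbalize(rdf_triple))
# -> John E Blaha was born in San Antonio.
```

Note how the triple leaves every realization decision (word choice, ordering, referring expressions) to the generator, whereas the AMR already fixes much of the lexical and argument structure; this is the difference in level of specification that makes RDF attractive for studying full realization and AMR attractive for studying issues such as lexical choice.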

In conclusion, this thesis has focused on the automatic generation of more varied output texts, based on various semantic input representations. We hope to have contributed to a better understanding of the NLG process, paving the way for improved and more engaging automatically generated text.
Original language: English
Qualification: Doctor of Philosophy
Supervisors/Advisors
  • Krahmer, Emiel, Promotor
  • Wubben, Sander, Co-promotor
  • van den Bosch, Antal, Member PhD commission, External person
  • Gatt, Albert, Member PhD commission, External person
  • Gardent, Claire, Member PhD commission, External person
  • Schilder, Frank, Member PhD commission, External person
  • Goudbeek, Martijn, Member PhD commission
Award date: 19 Sep 2018
Place of Publication: Enschede
Publisher: Ipskamp
Publication status: Published - 2018


Cite this

Castro Ferreira, Thiago. Advances in Natural Language Generation: Generating Varied Outputs from Semantic Inputs. Enschede: Ipskamp, 2018. 236 p. (TiCC Ph.D. Series; Vol. 64).