Advances in Natural Language Generation: Generating Varied Outputs from Semantic Inputs

Thiago Castro Ferreira

    Research output: Thesis › Doctoral Thesis


    Abstract

    Natural Language Generation (NLG) -- also known as Automatic Text Generation -- is the computational process of generating understandable natural language text from non-linguistic input data (Reiter & Dale, 2000; Gatt & Krahmer, 2018). Practical applications of the process include automatically generated weather forecasts (Goldberg et al., 1994; Sripada et al., 2004; Belz, 2008; Konstas & Lapata, 2013), news written by "robot-journalists" (Clerwall, 2014) and neonatal intensive care reports for doctors and caregivers (Reiter, 2007; Portet et al., 2009). This thesis focused on two particular problems in the NLG process: how to generate more varied texts to describe the same communicative goal (Chapters 2-6) and what is an appropriate semantic input to generate language from (Chapters 7-8).

    For the first problem, we aimed to model linguistic variation in the NLG process, focusing on the generation of noun phrases, a task called Referring Expression Generation (REG). By collecting and analyzing new corpora of referring expressions (described in Chapters 2 and 4), we were able to develop new state-of-the-art data-driven models for two subtasks in modular REG systems: the choice of referential form (i.e., whether a reference in the text should be a proper name, a pronoun, a description, etc.; Chapter 3) and proper name generation (i.e., given that a reference takes the proper name form, should it be the entity's full name, first name, surname, or another proper name variant?; Chapter 5). Additionally, we introduced an end-to-end approach, based on neural networks, which, unlike modular REG systems, generates varied referring expressions for a discourse entity, deciding on its referential form and content in one step, without explicit feature extraction. Using a new delexicalized version of the WebNLG corpus (Gardent et al., 2017a,b), we showed that the neural model substantially improved over two strong baselines in terms of accuracy of the referring expressions and fluency of the lexicalized texts (Chapter 6).
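    The modular REG setup described above can be illustrated with a minimal sketch. All names, the template format, and the recency heuristic here are hypothetical simplifications: the thesis's actual models are data-driven, whereas this toy rule merely shows the two steps of choosing a referential form for each entity slot in a delexicalized template and then filling the slot with a concrete expression.

    ```python
    # Toy sketch (hypothetical names and data) of two REG steps:
    # (1) choose a referential form for each delexicalized entity slot,
    # (2) replace the slot with a concrete referring expression.

    def choose_form(entity, previously_mentioned):
        """Simplistic form choice: pronoun for a repeated mention,
        proper name for a first mention. Real REG models learn this
        choice from corpus data rather than applying a fixed rule."""
        return "pronoun" if entity in previously_mentioned else "name"

    def lexicalize(template, lexicon):
        """Fill ENTITY-n slots left to right, tracking prior mentions."""
        mentioned = set()
        tokens = []
        for token in template.split():
            if token in lexicon:
                form = choose_form(token, mentioned)
                tokens.append(lexicon[token][form])
                mentioned.add(token)
            else:
                tokens.append(token)
        return " ".join(tokens)

    lexicon = {
        "ENTITY-1": {"name": "Alan Bean", "pronoun": "he"},
    }
    template = "ENTITY-1 was born in Wheeler ; ENTITY-1 joined NASA in 1963 ."
    print(lexicalize(template, lexicon))
    # first slot receives the proper name, the second a pronoun
    ```

    An end-to-end neural model, by contrast, would map the delexicalized input directly to the surface string, folding both decisions into a single learned step.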

    The second problem addressed in this thesis concerned the input to NLG systems. While there is broad consensus among scholars on the output of NLG systems (i.e., text or speech), there is far less agreement on what the input should be. To address the problem, researchers have started looking for candidate input formats that could be used more broadly within the community. In this thesis, we have looked in detail at two of them: Abstract Meaning Representation (AMR; Chapter 7) and RDF triples from the semantic web (Chapter 8), which differ fundamentally in level of specification, limitations, and availability of resources. To convert both meaning representations into text, we proposed NLG models based on a pipeline architecture as well as models that work in a less modular style, using methods from Statistical (Koehn et al., 2003) and Neural (Bahdanau et al., 2015) Machine Translation. We concluded that both representations are helpful for NLG research; which is preferable presumably depends on the specific goal of the NLG system to be developed or on the NLG problem to be addressed. For instance, to study the full textual realization process, working with RDF triples seems preferable to AMRs, while for text-to-text NLG approaches or for the study of specific issues, such as lexical choice or phrase ordering within a sentence, AMRs may be the better choice.
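    To make the RDF-triple input format concrete, the sketch below shows a single (subject, predicate, object) triple and one common preprocessing step for the MT-style models mentioned above: flattening the triples into a token sequence that an encoder-decoder can consume. The entity names and the particular linearization scheme (the `<s>`, `<p>`, `<o>` markers) are illustrative assumptions, not the thesis's exact format.

    ```python
    # Illustrative RDF triple: (subject, predicate, object).
    # Names and markers are hypothetical, chosen only for the example.
    triple = ("Alan_Bean", "birthPlace", "Wheeler,_Texas")

    def linearize(triples):
        """Flatten a set of triples into one token sequence, a typical
        preprocessing step before a neural MT-style encoder-decoder."""
        return " ".join("<s> {} <p> {} <o> {}".format(*t) for t in triples)

    print(linearize([triple]))
    ```

    A pipeline model would instead pass such triples through explicit intermediate stages (e.g., ordering, lexicalization, referring expression generation) before surface realization.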

    In conclusion, this thesis has focused on the automatic generation of more varied output texts, based on various semantic input representations. We hope to have contributed to a better understanding of the NLG process, paving the way for improved and more engaging automatically generated text.
    Original language: English
    Qualification: Doctor of Philosophy
    Supervisors/Advisors:
    • Krahmer, Emiel, Promotor
    • Wubben, Sander, Co-promotor
    • van den Bosch, Antal, Member PhD commission, External person
    • Gatt, Albert, Member PhD commission, External person
    • Gardent, Claire, Member PhD commission, External person
    • Schilder, Frank, Member PhD commission, External person
    • Goudbeek, Martijn, Member PhD commission
    Award date: 19 Sept 2018
    Place of publication: Enschede
    Publication status: Published - 2018
