Leveraging Large Language Models as Simulated Users for Initial, Low-Cost Evaluations of Designed Conversations

Research output: Chapter in Book/Report/Conference proceeding › Chapter › Scientific › peer-review

Abstract

In this paper, we explore the use of large language models, in this case the ChatGPT API, as simulated users to evaluate designed, rule-based conversations. This type of evaluation can be introduced as a low-cost method to identify common usability issues prior to testing conversational agents with actual users. Preliminary findings show that ChatGPT is good at playing the part of a user, providing realistic testing scenarios for designed conversations even if these involve certain background knowledge or context. GPT-4 shows vast improvements over ChatGPT (3.5). In future work, it is important to evaluate the performance of simulated users in a more structured, generalizable manner, for example by comparing their behavior to that of actual users. In addition, ways to fine-tune the LLM could be explored to improve its performance, and the output of simulated conversations could be analyzed to automatically derive usability metrics such as the number of turns needed to reach the goal. Finally, the use of simulated users with open-ended conversational agents could be explored, where the LLM may also be able to reflect on the user experience of the conversation.
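To make the setup concrete, the following Python sketch (illustrative only, not taken from the paper) prompts the ChatGPT API to role-play a user persona against a toy rule-based conversation flow. The persona text, the library-renewal scenario, the rule_based_bot helper, and the model name are assumptions introduced for this example.

# Minimal sketch: an LLM-simulated user driving a toy rule-based bot.
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

# Hypothetical persona prompt; the paper's actual prompts are not reproduced here.
PERSONA = (
    "You are role-playing a user of a library chatbot. You want to renew a book "
    "that is due tomorrow. Reply briefly and naturally, as a real user would."
)

def rule_based_bot(user_msg: str, state: dict) -> str:
    """Toy designed conversation standing in for the rule-based flows under test."""
    if "step" not in state:
        state["step"] = "ask_card"
        return "Welcome to the library bot. What is your library card number?"
    if state["step"] == "ask_card":
        state["step"] = "ask_book"
        return "Thanks. Which book would you like to renew?"
    return "Your renewal has been registered. Anything else?"

def simulate(max_turns: int = 4) -> list[str]:
    state: dict = {}
    transcript: list[str] = []
    messages = [{"role": "system", "content": PERSONA}]
    bot_msg = rule_based_bot("", state)  # the bot opens the conversation
    for _ in range(max_turns):
        transcript.append(f"BOT:  {bot_msg}")
        # The bot's turn is presented to the LLM as the "user" role, so the
        # LLM's "assistant" reply plays the part of the human user.
        messages.append({"role": "user", "content": bot_msg})
        response = client.chat.completions.create(model="gpt-4", messages=messages)
        user_msg = response.choices[0].message.content
        messages.append({"role": "assistant", "content": user_msg})
        transcript.append(f"USER: {user_msg}")
        bot_msg = rule_based_bot(user_msg, state)
    return transcript

if __name__ == "__main__":
    print("\n".join(simulate()))

Transcripts produced this way could then be inspected for the usability issues the paper targets, for example by counting the number of turns the simulated user needs to reach the goal.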
Original language: English
Title of host publication: Chatbot Research and Design
Publisher: Springer International
Chapter: 5
Pages: 77-93
Number of pages: 17
ISSN (Electronic): 1611-3349
DOIs
Publication status: Published - 2024
Event: CONVERSATIONS 2023: 7th International Workshop - Oslo, Norway
Duration: 22 Nov 2023 - 23 Nov 2023
Conference number: 7

Publication series

Name: Lecture Notes in Computer Science (LNCS)
Publisher: Springer
Volume: 14524

Conference

Conference: CONVERSATIONS 2023
Country/Territory: Norway
City: Oslo
Period: 22/11/23 - 23/11/23

Keywords

  • conversational agents
  • large language models
  • automatic evaluations
