Abstract
In this paper, we explore the use of large language models, in this case the ChatGPT API, as simulated users to evaluate designed, rule-based conversations. This type of evaluation can serve as a low-cost method for identifying common usability issues before testing conversational agents with actual users. Preliminary findings show that ChatGPT performs well in the role of a user, providing realistic test scenarios for designed conversations, even when these involve specific background knowledge or context. GPT-4 shows substantial improvements over ChatGPT (3.5). In future work, the performance of simulated users should be evaluated in a more structured, generalizable manner, for example by comparing their behavior to that of actual users. In addition, ways to fine-tune the LLM could be explored to improve its performance, and the output of simulated conversations could be analyzed to automatically derive usability metrics, such as the number of turns needed to reach the goal. Finally, the use of simulated users with open-ended conversational agents could be explored, where the LLM may also be able to reflect on the user experience of the conversation.
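The record contains no code, but the setup described in the abstract can be sketched roughly as follows. This is a minimal, illustrative sketch rather than the authors' implementation: the persona, goal, `rule_based_agent` stub, turn limit, and model name are assumptions made for the example; only the general idea of pairing the ChatGPT API with a designed, rule-based conversation comes from the abstract.

```python
# Illustrative sketch (not the authors' code): an LLM acts as a simulated
# user talking to a rule-based conversation under test.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PERSONA = (
    "You are a simulated user testing a customer-service chatbot. "
    "Your goal: reschedule a delivery to next Friday. "
    "Reply with one short, natural user message per turn."
)

def rule_based_agent(user_message: str) -> str:
    """Placeholder for the designed, rule-based conversation under test."""
    if "reschedule" in user_message.lower():
        return "Sure, which day would you like the delivery instead?"
    return "Hello! How can I help you with your delivery today?"

history = [{"role": "system", "content": PERSONA}]
agent_utterance = "Hello! How can I help you with your delivery today?"

for turn in range(5):  # cap the length of the simulated dialogue
    # From the LLM's perspective, the agent's utterance is the incoming
    # message, so it is passed in the "user" role.
    history.append({"role": "user", "content": agent_utterance})
    reply = client.chat.completions.create(model="gpt-4", messages=history)
    simulated_user_msg = reply.choices[0].message.content
    history.append({"role": "assistant", "content": simulated_user_msg})
    print(f"USER (simulated): {simulated_user_msg}")
    agent_utterance = rule_based_agent(simulated_user_msg)
    print(f"AGENT: {agent_utterance}")
```

The transcript produced by such a loop could then be inspected for the kinds of usability issues the paper targets, for example whether and in how many turns the simulated user reaches its goal.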
Original language | English |
---|---|
Title of host publication | Chatbot Research and Design |
Publisher | Springer International Publishing |
Chapter | 5 |
Pages | 77-93 |
Number of pages | 17 |
ISSN (Electronic) | 1611-3349 |
DOIs | |
Publication status | Published - 2024 |
Event | CONVERSATIONS 2023: 7th International Workshop, Oslo, Norway, 22 Nov 2023 → 23 Nov 2023 (Conference number: 7) |
Publication series
Name | Lecture Notes in Computer Science (LNCS) |
---|---|
Publisher | Springer |
Volume | 14524 |
Conference
Conference | CONVERSATIONS 2023 |
---|---|
Country/Territory | Norway |
City | Oslo |
Period | 22/11/23 → 23/11/23 |
Keywords
- conversational agents
- large language models
- automatic evaluations