Can Large Language Models Generate Role-Consistent Clinical Dialogue for Education? A Multi-agent Approach

David Power

Abstract

Background: The development of high-quality healthcare simulation scenarios and educational clinical dialogue is resource-intensive, limiting scalability in healthcare education. Large language models (LLMs) offer new opportunities for generating dynamic conversational simulations but raise concerns regarding role fidelity, realism, and evaluation.

Innovation: We describe a role-locked, multi-agent LLM system designed to generate realistic emergency department conversations involving a doctor, nurse, and patient. Separate LLM agents were assigned fixed clinical roles and interacted within a shared conversational environment, supported by explicit role constraints and rule-based role-guarding mechanisms. An independent LLM was used as an automated evaluator (“AI-as-judge”) to assess role fidelity, turn coherence, communication realism, and educational usability.

Evaluation: Twenty-five simulated conversations were generated and evaluated using the automated judge. A subset of ten conversations underwent independent human evaluation by two clinically experienced raters using aligned assessment domains. Automated evaluation demonstrated consistently high ratings across all domains, with all simulations judged educationally usable. Human evaluation showed substantial agreement for role fidelity and moderate agreement across other domains, providing expert plausibility benchmarking for the automated approach.

Implications: This work demonstrates the feasibility of a role-locked, multi-agent LLM architecture for generating educationally plausible conversational simulations. The combination of automated and limited human evaluation provides early validity evidence supporting feasibility and usability. This approach may support rapid prototyping and scalable development of simulation-based educational content.
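The abstract does not give implementation details, so the following is only a minimal sketch of what a role-locked turn loop with a rule-based role guard could look like. Everything in it is an assumption made for illustration: the names `violates_role`, `run_dialogue`, and `make_scripted_agent`, the forbidden-phrase rules, the retry count, and the use of scripted stand-ins instead of real LLM calls are hypothetical and are not taken from the preprint.

```python
import re
from typing import Callable, Dict, List, Tuple

# Hypothetical guard rules: phrases a given role should not utter.
FORBIDDEN: Dict[str, List[str]] = {
    "doctor": [],
    "nurse": [r"\bmy diagnosis is\b"],
    "patient": [r"\bmy diagnosis is\b", r"\bi will prescribe\b"],
}

# Matches a speaker label such as "Doctor:" at the start of any line.
SPEAKER_TAG = re.compile(
    r"^\s*(doctor|nurse|patient)\s*:", re.IGNORECASE | re.MULTILINE
)

Turn = Tuple[str, str]
Agent = Callable[[str, List[Turn]], str]


def violates_role(role: str, text: str) -> bool:
    """Return True if a generated turn breaks the role lock for `role`."""
    # Rule 1: an agent must not write lines on behalf of another speaker.
    for match in SPEAKER_TAG.finditer(text):
        if match.group(1).lower() != role:
            return True
    # Rule 2: an agent must not use phrases reserved for other roles.
    return any(re.search(p, text, re.IGNORECASE) for p in FORBIDDEN[role])


def run_dialogue(
    agents: Dict[str, Agent], order: List[str], max_retries: int = 2
) -> List[Turn]:
    """Ask each role for a turn in `order`, regenerating guarded turns."""
    transcript: List[Turn] = []
    for role in order:
        for _attempt in range(max_retries + 1):
            text = agents[role](role, transcript)
            if not violates_role(role, text):
                transcript.append((role, text))
                break
        else:
            # Every attempt was rejected by the guard.
            transcript.append((role, "[turn withheld by role guard]"))
    return transcript


def make_scripted_agent(replies: List[str]) -> Agent:
    """Stand-in for an LLM agent: returns canned replies in sequence."""
    remaining = iter(replies)

    def agent(role: str, transcript: List[Turn]) -> str:
        return next(remaining)

    return agent
```

In a real system each agent would wrap a model call with a role-specific system prompt, and the guard would sit between generation and the shared transcript so that an off-role turn is regenerated before any other agent sees it.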

Version published to 10.35542/osf.io/etv6d_v1 on OSF Preprints
Feb 23, 2026

Related articles

Balancing Safety and Educational Availability in a Large Language Model-Based Virtual Patient for Medical Interview Training: Robustness Evaluation Under Direct and Indirect Instructional Contamination

This article has 1 author:
1. Yuusuke Harada
This article has no evaluations
Latest version Mar 4, 2026

Deterministic Retrieval-Grounded Language Models for Clinical Counseling: Large-Scale Multilingual Evaluation with Cryptographically Verifiable Pipelines

This article has 1 author:
1. Panagiotis Karmiris
This article has no evaluations
Latest version Mar 17, 2026

Cognitively Diverse Multiple-Choice Question Generation: A Hybrid Multi-Agent Framework with Large Language Models

This article has 7 authors:
1. Yu Tian
2. Linh Huynh
3. Katerina Christhilf
4. Shubham Chakraborty
5. Micah Watanabe
6. Tracy Arner
7. Danielle McNamara
This article has no evaluations
Latest version Mar 13, 2026
