Development and Prospective Validation of CPX-MATE: An End-to-End Medical Education Platform Integrating Voice-Based Virtual Patient Simulation and Automated Real-time Evaluation
Abstract
Background: The Objective Structured Clinical Examination (OSCE; Clinical Performance Examination [CPX] in South Korea) is a high-stakes assessment of clinical performance, communication, and reasoning during time-limited patient encounters. As AI-enabled virtual standardized patient (VSP) simulation and automated scoring are introduced for OSCE-like training, prospective evidence is needed on how such systems perform, and how they are perceived, when embedded in real educational workflows.

Methods: We developed CPX with Medical students' Assistant for Training and Evaluation (CPX-MATE), a web-based platform integrating (1) CPX with Virtual Standardized Patient (CPX-VSP), which provides real-time voice dialogue with a VSP via speech-to-speech (STS) models, and (2) CPX with Real-Time Evaluator (CPX-RTE), which generates automated transcription, checklist-based scoring, and feedback from encounter audio using a speech-to-text (STT) model and a large language model (LLM). During an emergency medicine clerkship (Nov 2025–Jan 2026), 60 senior medical students completed two 12-minute CPX encounters (a VSP with acute pancreatitis and a human standardized patient [HSP] with a ureteral stone) with immediate CPX-RTE feedback. For CPX-VSP, students were assigned to either a full-capacity or a resource-limited STS configuration (n = 30 each). Dialogue fidelity was evaluated by turn-by-turn analysis of student–VSP exchanges, classifying responses into clinically meaningful error types (tangential, oversharing, role-breaking, and off-script). CPX-RTE performance was assessed by agreement (Gwet's AC1) with professors' real-time ratings and residents' video-based ratings on a 45-item checklist. Usability of CPX-VSP and CPX-RTE was surveyed, overall usability was measured with the System Usability Scale (SUS), and mean per-session costs for CPX-VSP and CPX-RTE were calculated.

Results: Across 3,282 dialogue turns, the overall error rate was 1.77% with the full-capacity STS configuration versus 9.43% with the resource-limited configuration (p < 0.001), a difference driven by fewer tangential and oversharing responses; no off-script errors were observed. The mean per-session cost was $0.12 for the resource-limited configuration and $0.78 for the full-capacity configuration. CPX-RTE showed high agreement with human ratings (AC1 = 0.916 vs professor; 0.916 vs resident), with modest variation across the four checklist sections, and high usability across all domains (mean scores, 4.65–4.92), at a per-session cost of $0.17. CPX-MATE demonstrated good overall usability (median SUS [IQR], 77.5 [70.0–85.0]).

Conclusions: Embedded within a prospective clinical clerkship, CPX-MATE demonstrated operational fidelity and human-level checklist agreement as an end-to-end, voice-based, AI-assisted OSCE platform. This real-world deployment supports scalable integration of such platforms as complementary assessment tools, while highlighting the importance of systematic validation and context-aware implementation in medical education.
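To make the CPX-RTE flow concrete, below is a minimal, hypothetical sketch of the transcription-then-LLM-judgment loop the abstract describes (audio transcript in, per-item checklist judgments out). Every identifier here (score_encounter, keyword_judge, the sample checklist items) is an illustrative stand-in, not CPX-MATE's actual code or API.

```python
# Hypothetical sketch of the CPX-RTE flow described in the abstract:
# encounter audio -> STT transcript -> per-item checklist judgments -> feedback.
# All names below are illustrative stand-ins, not CPX-MATE's real implementation.
from typing import Callable

CHECKLIST = [  # two of the 45 items, paraphrased purely for illustration
    "Asked about onset and duration of pain",
    "Explained the suspected diagnosis to the patient",
]

def score_encounter(
    transcript: str,
    judge: Callable[[str, str], bool],  # (transcript, item) -> performed?
) -> dict[str, bool]:
    """Return a performed / not-performed judgment for each checklist item."""
    return {item: judge(transcript, item) for item in CHECKLIST}

def keyword_judge(transcript: str, item: str) -> bool:
    # Crude stand-in for an LLM judge so the sketch runs offline:
    # count an item as performed if any longer content word appears.
    content = [w.lower() for w in item.split() if len(w) > 4]
    return any(w in transcript.lower() for w in content)

demo = "Doctor: When did the pain start, and how long has it lasted? Patient: About three hours ago."
print(score_encounter(demo, keyword_judge))
```

In a deployment like the one described, the keyword stub would be replaced by an LLM call that receives the transcript plus one checklist item and returns a judgment with a rationale for feedback.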
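For reference, the agreement statistic reported above, Gwet's AC1 for two raters scoring binary (performed / not performed) checklist items, is defined as:

```latex
\mathrm{AC1} = \frac{p_a - p_e}{1 - p_e},
\qquad
p_e = 2\,\hat{\pi}\,(1 - \hat{\pi}),
\qquad
\hat{\pi} = \frac{\pi_A + \pi_B}{2},
```

where p_a is the observed proportion of items on which the two raters agree, and pi_A and pi_B are the proportions of items each rater marks as performed. Unlike Cohen's kappa, the chance-agreement term p_e stays small when the marginal proportions are extreme, which makes AC1 better behaved on checklists where most items are performed.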
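The Results report p < 0.001 for the 1.77% versus 9.43% error-rate comparison without naming the test here. Below is a minimal sketch of a two-proportion z-test, assuming (purely for illustration) that the 3,282 turns split evenly between configurations; the per-arm counts are back-calculated from the reported rates, not taken from the study.

```python
# Minimal sketch of a two-proportion z-test for the STS error-rate comparison.
# The per-arm turn and error counts are ILLUSTRATIVE assumptions chosen to
# approximately reproduce the reported rates (1.77% vs 9.43%); the abstract
# does not state them.
from statsmodels.stats.proportion import proportions_ztest

errors = [29, 155]    # full-capacity vs resource-limited error counts (assumed)
turns = [1641, 1641]  # assumed even split of the 3,282 total dialogue turns

z, p = proportions_ztest(count=errors, nobs=turns)
print(f"rates: {errors[0]/turns[0]:.2%} vs {errors[1]/turns[1]:.2%}, "
      f"z = {z:.2f}, p = {p:.3g}")
```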
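The overall usability figure (median 77.5) uses the standard 10-item System Usability Scale. Its scoring rule, a property of SUS itself rather than of CPX-MATE, maps 1–5 Likert responses onto a 0–100 score:

```python
def sus_score(responses: list[int]) -> float:
    """Standard SUS scoring: 10 Likert items (1-5), alternating polarity.

    Odd-numbered items contribute (response - 1); even-numbered items
    contribute (5 - response); the sum is scaled by 2.5 to give 0-100.
    """
    assert len(responses) == 10 and all(1 <= r <= 5 for r in responses)
    odd = sum(r - 1 for r in responses[0::2])   # items 1, 3, 5, 7, 9
    even = sum(5 - r for r in responses[1::2])  # items 2, 4, 6, 8, 10
    return 2.5 * (odd + even)

# A hypothetical response set that happens to score 77.5, the reported median.
print(sus_score([4, 2, 4, 2, 4, 2, 4, 1, 4, 2]))  # -> 77.5
```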