Accuracy and Safety of an AI Ambient Scribe Compared with Handwritten Clinical Notes
Abstract
In a prospective real-world evaluation at Groote Schuur Hospital, South Africa, a large-language-model ambient scribe was compared with contemporaneous handwritten clinical notes. The system generated notes from raw audio without diarisation, transcript editing, or clinician review. Across 49 encounters, documentation quality was independently assessed using a SOAP-aligned rubric (0–5 per domain) and a symmetric severity-graded error taxonomy. AI-generated notes outperformed handwritten notes in 48 encounters and tied in one, with higher mean overall SOAP scores (4.9 vs 2.9) and a 97.1% posterior probability (95% credible interval, 91.0%–99.8%) of superior documentation quality. Posterior rates of moderate-to-severe hallucinations, distortions, omissions, and clinically significant errors were at least fivefold higher in handwritten notes. Hallucinations were not confined to AI outputs, challenging their framing as an AI-specific risk. In low- and middle-income country (LMIC) settings, ambient AI scribes could complement existing documentation workflows and may form part of a broader pathway toward scalable digital health infrastructure.