Accuracy and Safety of an AI Ambient Scribe Compared with Handwritten Clinical Notes
Abstract
In a prospective real-world evaluation at Groote Schuur Hospital, South Africa, a large-language-model ambient scribe was compared with contemporaneous handwritten clinical notes. The system generated notes from raw audio without diarisation, transcript editing, or clinician review. Across 49 encounters, documentation quality was independently assessed using a SOAP-aligned rubric (0–5 per domain) and a symmetric severity-graded error taxonomy. AI-generated notes outperformed handwritten notes in 48 encounters and tied in one, with higher mean overall SOAP scores (4.9 vs 2.9) and a 97.1% posterior probability (95% credible interval, 91.0%–99.8%) of superior documentation quality. Posterior rates of moderate-to-severe hallucinations, distortions, omissions, and clinically significant errors were at least fivefold higher in handwritten notes. Hallucinations were not confined to AI outputs, challenging their framing as an AI-specific risk. In low- and middle-income country (LMIC) settings, ambient AI scribes could complement existing documentation workflows and may form part of a broader pathway toward scalable digital health infrastructure.