AI-Generated Clinical Summaries: Errors and Susceptibility to Speech and Speaker Variability

Abstract

Summary Box

What is already known on this topic

  • Outputs from Clinical AI Scribes (CAIS) can contain errors, and the impact of human factors (e.g. communication style, accents, speech impairments) in clinical contexts remains under-characterised.

What this study adds

  • In controlled simulations, patient personality and accent did not significantly alter total CAIS errors, with omissions predominating and hallucinations/inaccuracies remaining low.

  • Speech-impairment effects varied widely: recognition was near-perfect for cleft palate and vowel disorders, whereas phonological impairment substantially reduced accuracy.

How this study might affect research, practice or policy

  • Supports clinician-in-the-loop deployment with local validation across representative accents and impairment profiles, prioritising detection of clinically critical errors.

  • Routine governance should include subgroup performance reporting (accents, impairments) and ongoing audit of error rates.

Objective

The study aims to evaluate whether variability in patients’ communication style (personality, international English accents, and speech impairments) affects the accuracy of a Clinical AI Scribe (CAIS), and to identify where performance degrades.

Method and Analysis

We conducted simulated primary-care consultations in a purpose-built lab using trained actors. To investigate personality types, four scenarios were enacted, each with five patient-personality types. For accents, human-verified consultation transcripts were used to generate all doctor/patient combinations of seven accents (including a synthetic reference voice) across five scenarios. The CAIS produced SOAP-structured (Subjective, Objective, Assessment, Plan) summaries, which were compared with the transcripts; errors were classified as omissions, factual inaccuracies, or hallucinations. For speech impairments, public recordings representing five impairment profiles were transcribed and word-recognition accuracy was calculated.
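
The abstract does not define the word-recognition metric; the minimal sketch below assumes accuracy is computed as one minus the word error rate (WER), i.e. the word-level edit distance between the reference transcript and the system output, normalised by reference length. The study may use a different definition.

    # A minimal sketch, assuming word-recognition accuracy = 1 - WER;
    # the study's exact metric is not specified in the abstract.
    def word_error_rate(reference: str, hypothesis: str) -> float:
        """Word-level Levenshtein distance, normalised by reference length."""
        ref, hyp = reference.split(), hypothesis.split()
        # dp[i][j] = edits needed to turn the first i reference words
        # into the first j hypothesis words
        dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            dp[i][0] = i
        for j in range(len(hyp) + 1):
            dp[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                               dp[i][j - 1] + 1,         # insertion
                               dp[i - 1][j - 1] + cost)  # substitution
        return dp[len(ref)][len(hyp)] / max(len(ref), 1)

    reference = "the patient reports a persistent dry cough"
    hypothesis = "the patient reports persistent dry cough"
    print(f"accuracy: {1 - word_error_rate(reference, hypothesis):.2%}")  # ~85.71%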

Results

Personality types showed no statistically significant differences in errors (all p > 0.05). Extraversion had the highest total errors (median 3.5), while conscientiousness and agreeableness were lower (medians 1.5 and 2.0, respectively). Across accents, both pairwise tests and group comparisons were non-significant for both patient and doctor voices (patients: p = 0.851; doctors: p = 0.98). Omissions predominated, with low rates of hallucinations and factual inaccuracies. Omissions were slightly higher for Chinese- and Indian-accented doctors (both medians 3.0). In contrast, speech impairments differed markedly: recognition was near-perfect for cleft palate and vowel disorders, whereas phonological impairment sharply reduced it (p < 0.001).
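
The abstract does not name the statistical tests behind the reported p-values; the sketch below assumes a Kruskal-Wallis comparison of per-consultation error counts across groups, a common nonparametric choice for small count data. The counts shown are illustrative only, not study data.

    # Hedged sketch: Kruskal-Wallis test across accent groups is assumed here,
    # as the abstract does not name the test. Counts are illustrative only.
    from scipy.stats import kruskal

    errors_by_accent = {
        "reference_voice": [2, 3, 1, 2, 3],
        "chinese":         [3, 2, 4, 3, 2],
        "indian":          [3, 3, 2, 4, 2],
    }
    stat, p = kruskal(*errors_by_accent.values())
    print(f"H = {stat:.2f}, p = {p:.3f}")  # a large p suggests no group difference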

Conclusions

Under controlled conditions, CAIS performance was broadly stable across communication styles and most accents but remained vulnerable to specific speech characteristics, particularly phonological impairment. Future evaluations using real-world, multi-speaker clinical audio are needed to confirm performance.
