Measuring the Quality of AI-Generated Clinical Notes: A Systematic Review and Experimental Benchmark of Evaluation Methods

Abstract

Background: High-quality clinical documentation is essential for safe, effective care, yet producing it is time-consuming and error-prone. Large language models (LLMs) can assist with note generation, but clinical adoption depends on the quality of the resulting notes. However, current evaluation practices vary, and their clinical relevance is unclear. Drawing on a multidisciplinary perspective, we examined how quality is assessed and how those assessments align with clinical demands.

Methods: We systematically searched Ovid Medline and Scopus on 10 April 2025 for peer-reviewed studies that used LLMs to generate clinical notes and included an evaluation of the quality of the resulting text. Screening followed PRISMA, and the protocol was preregistered in PROSPERO. Data on metrics and outcomes were synthesised narratively. Based on these findings, we designed an experimental setup to test the most common evaluation metrics and an LLM-as-evaluator, included for its scalability across large test sets. The experiment used synthetic cases with targeted perturbations.

Findings: Thirty-seven studies were included. Reporting was dominated by lexical overlap metrics, chiefly ROUGE and BLEU. Semantic similarity metrics, such as BERTScore and BLEURT, were less common. Human evaluation was frequent but heterogeneous, with criteria and methods defined in varying degrees of detail; the most common foci were correctness, fluency, and aspects of clinical acceptability. In our experimental setup, lexical overlap metrics detected deletions and modifications but penalised meaning-preserving paraphrases. Semantic metrics and the LLM-as-evaluator were more tolerant of paraphrased perturbations yet remained sensitive to relevant changes, with performance varying by model and language.

Interpretation: Current practice relies on lexical overlap metrics that are useful for cursory checks but insufficient as proxies for quality. We recommend a layered strategy that pairs semantic metrics with an LLM-as-evaluator for scalability and includes targeted human adjudication. Broader, safety-focused validation across institutions and languages is needed before routine deployment.

Funding: Business Finland through the GenAID research project. Personal grants are listed under Acknowledgments.
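
To illustrate the contrast described in the Findings, the sketch below scores a hypothetical reference note against a meaning-preserving paraphrase and a factually altered variant, using the rouge-score and bert-score Python packages as representative lexical and semantic metrics. The example sentences and the choice of packages are illustrative assumptions, not the study's actual evaluation pipeline.

```python
# Minimal sketch of a perturbation-style check (illustrative only; not the
# study's pipeline). Requires: pip install rouge-score bert-score
from rouge_score import rouge_scorer
from bert_score import score as bert_score

# Hypothetical reference note and two perturbations of it.
reference = "Patient reports chest pain radiating to the left arm since this morning."
paraphrase = "Since this morning the patient has had chest pain that radiates into the left arm."  # meaning preserved
modified = "Patient reports abdominal pain radiating to the right leg since last week."            # meaning changed

# Lexical overlap: ROUGE-L rewards shared word sequences, so it tends to
# penalise the paraphrase even though the clinical meaning is intact.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
for name, candidate in [("paraphrase", paraphrase), ("modified", modified)]:
    rouge_l = scorer.score(reference, candidate)["rougeL"].fmeasure
    print(f"ROUGE-L ({name}): {rouge_l:.2f}")

# Semantic similarity: BERTScore compares contextual embeddings, so the
# paraphrase should score high while the factually altered note drops.
candidates = [paraphrase, modified]
references = [reference, reference]
_, _, f1 = bert_score(candidates, references, lang="en", verbose=False)
print(f"BERTScore F1 (paraphrase): {f1[0].item():.2f}")
print(f"BERTScore F1 (modified):   {f1[1].item():.2f}")
```

In a layered strategy of the kind recommended above, such automated scores would serve as a first-pass screen, with flagged or borderline notes routed to human adjudication.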
