Real-World Evaluation of Large Language Models in Healthcare (RWE-LLM): A New Realm of AI Safety & Validation

Meenesh Bhimani
Alex Miller
Jonathan D. Agnew
Markel Sanz Ausin
Mariska Raglow-Defranco
Harpreet Mangat
Michelle Voisard
Maggie Taylor
Sebastian Bierman-Lytle
Vishal Parikh
Juliana Ghukasyan
Rae Lasko
Saad Godil
Ashish Atreja
Subhabrata Mukherjee

Abstract

Background

The deployment of artificial intelligence (AI) in healthcare necessitates robust safety validation frameworks, particularly for systems directly interacting with patients. While theoretical frameworks exist, there remains a critical gap between abstract principles and practical implementation. Traditional LLM benchmarking approaches provide very limited output coverage and are insufficient for healthcare applications requiring high safety standards.

Objective

To develop and evaluate a comprehensive framework for healthcare AI safety validation through large-scale clinician engagement.

Methods

We implemented the RWE-LLM (Real-World Evaluation of Large Language Models in Healthcare) framework, drawing inspiration from red teaming methodologies while expanding their scope to achieve comprehensive safety validation. Our approach emphasizes output testing rather than relying solely on input data quality across four stages: pre-implementation, tiered review, resolution, and continuous monitoring. We engaged 6,234 US licensed clinicians (5,969 nurses and 265 physicians) with an average of 11.5 years of clinical experience. The framework employed a three-tier review process for error detection and resolution, evaluating a non-diagnostic AI Care Agent focused on patient education, follow-ups, and administrative support across four iterations (pre-Polaris and Polaris 1.0, 2.0, and 3.0).
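The tiered review described above can be pictured as a small routing function. The following is only an illustrative sketch, not the authors' implementation: the names `Severity`, `Flag`, `nurse_review`, and `route` are hypothetical, and the escalation rule (severe flags always go to a physician) is an assumption, since the abstract only says that physician adjudication happens "when necessary".

```python
from dataclasses import dataclass, field
from enum import Enum


class Severity(Enum):
    """Hypothetical severity scale; the abstract names minor and severe harm."""
    NONE = 0
    MINOR = 1
    SEVERE = 2


@dataclass
class Flag:
    """One clinician-raised flag on a call (all field names are assumptions)."""
    call_id: str
    severity: Severity
    note: str
    history: list = field(default_factory=list)


def nurse_review(flag, confirmed, needs_physician):
    """Tier 2: internal nursing review gives the initial expert evaluation."""
    flag.history.append("internal nurse review")
    if not confirmed:
        return "dismissed"
    # Assumed rule: severe flags are always escalated to a physician.
    if needs_physician or flag.severity is Severity.SEVERE:
        return "escalate"
    return "resolve"


def route(flag, confirmed, needs_physician=False):
    """Walk a flag through the three tiers and return its final outcome."""
    flag.history.append("clinician flag")  # Tier 1: flag raised by a reviewer
    outcome = nurse_review(flag, confirmed, needs_physician)
    if outcome == "escalate":
        flag.history.append("physician adjudication")  # Tier 3
        outcome = "resolve"
    return outcome


flag = Flag("call-001", Severity.SEVERE, "dosage statement looked wrong")
print(route(flag, confirmed=True), flag.history)
```

In this sketch a confirmed flag always ends in "resolve", which stands in for the resolution stage that feeds the fix back into the next system iteration.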

Results

Over 307,000 unique calls were evaluated using the RWE-LLM framework. Each interaction was subject to potential error flagging across multiple severity categories, from minor clinical inaccuracies to significant safety concerns. The multi-tiered review system processed all flagged interactions, with internal nursing reviews providing initial expert evaluation followed by physician adjudication when necessary. The framework demonstrated effective throughput in addressing identified safety concerns while maintaining consistent processing times and documentation standards. Systematic improvements in safety protocols were achieved through a continuous feedback loop between error identification and system enhancement. Performance metrics demonstrated substantial safety improvements between iterations, with correct medical advice rates improving from ∼80.0% (pre-Polaris) to 96.79% (Polaris 1.0), 98.75% (Polaris 2.0), and 99.38% (Polaris 3.0). Incorrect advice resulting in potential minor harm decreased from 1.32% to 0.13% and then 0.07%, and severe harm concerns were eliminated (0.06%, then 0.10%, then 0.00%).
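To put the reported correct-advice rates in perspective, the short calculation below derives the share of advice not rated correct at each iteration and the relative drop in that share. The input figures come from the Results paragraph; the helper names `residual_rate` and `relative_reduction` are illustrative. The residual covers everything not rated correct, so it is broader than the minor-harm and severe-harm rates quoted above.

```python
# Correct-advice rates (%) as reported in the Results paragraph.
correct_rate = {
    "pre-Polaris": 80.0,   # reported as approximately 80.0%
    "Polaris 1.0": 96.79,
    "Polaris 2.0": 98.75,
    "Polaris 3.0": 99.38,
}


def residual_rate(correct_pct):
    """Share of advice not rated correct, in percent."""
    return round(100.0 - correct_pct, 2)


def relative_reduction(before_pct, after_pct):
    """Relative drop in the residual rate between two iterations, in percent."""
    before = residual_rate(before_pct)
    after = residual_rate(after_pct)
    return round(100.0 * (before - after) / before, 1)


residuals = {name: residual_rate(pct) for name, pct in correct_rate.items()}
for name, pct in residuals.items():
    print(f"{name}: {pct:.2f}% not rated correct")

# Drop in the residual rate from Polaris 1.0 to Polaris 3.0.
print(relative_reduction(correct_rate["Polaris 1.0"], correct_rate["Polaris 3.0"]))
```

On these figures the residual falls from 3.21% at Polaris 1.0 to 0.62% at Polaris 3.0, a relative reduction of roughly 81%.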

Conclusions

The successful nationwide implementation of the RWE-LLM framework establishes a practical model for ensuring AI safety in healthcare settings. Our methodology demonstrates that comprehensive output testing provides significantly stronger safety assurance than traditional input validation approaches used by horizontal LLMs. While resource-intensive, this approach proves that rigorous safety validation for healthcare AI systems is both necessary and achievable, setting a benchmark for future deployments.

Version published to 10.1101/2025.03.17.25324157v1 on medRxiv
Mar 18, 2025

A Bilingual On-premise AI agent for Clinical Drafting: Seamless EHR integration in the Y-KNOT Project

This article has 12 authors:
1. Hanjae Kim
2. So-Yeon Lee
3. Seng Chan You
4. Sookyung Huh
5. Jai-Eun Kim
6. Sung-Tae Kim
7. Dong-Ryul Ko
8. Ji Hoon Kim
9. Jae Hoon Lee
10. Joon Seok Lim
11. Moo Suk Park
12. Kang Young Lee
This article has no evaluations. Latest version: Apr 4, 2025

Grounding Large Language Model in Clinical Diagnostics

This article has 14 authors:
1. Jian Li
2. Xi Chen
3. Hanyu Zhou
4. Huahui Yi
5. Mingke You
6. Weizhi Liu
7. Li Wang
8. Hairui Li
9. Xue Zhang
10. Yingman Guo
11. Lei Fan
12. Qicheng Lao
13. Weili Fu
14. Kang Li
This article has no evaluations. Latest version: Apr 15, 2025

Real-World Usage Patterns of Large Language Models in Healthcare

This article has 4 authors:
1. Alyssa Unell
2. Mehr Kashyap
3. Michael Pfeffer
4. Nigam Shah
This article has no evaluations. Latest version: May 6, 2025
