Leveraging simulation to provide a practical framework for assessing the novel scope of risk of LLMs in healthcare
Abstract
Background
Large language models (LLMs) are rapidly entering clinical care, yet their inherently probabilistic outputs have already produced a range of grossly unsafe responses to users. The difficulty of quantifying and mitigating the novel risks posed by LLMs threatens to stall the regulatory evaluation and clinical deployment of LLM-based software as a medical device (LLM-SaMD). A practical, evidence-based framework is urgently needed to extend existing medical-device regulations to encompass LLM-SaMDs. Using synthetic interactions between a chatbot and a potentially suicidal user, we demonstrate a simulation-based framework that provides a reproducible and generalizable method for evaluating the novel risks of LLM-SaMDs.
Methods
We developed a framework integrating LLM performance testing into SaMD risk estimation. Fourteen open-source models ranging from 270 million to 70 billion parameters (Qwen, Gemma, and LLaMA families) were evaluated on three safety-classification tasks: suicidal-ideation detection, therapy-request detection, and therapy-like interaction detection. Synthetic datasets were generated by Gemini 2.5 Pro and verified by psychiatrists. Model false-negative rates informed probabilistic estimates of P₁, the likelihood of a hazard progressing to a hazardous situation, and P₂, the likelihood of that situation resulting in harm.
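As a rough illustration of how per-task false-negative rates could feed the P₁ and P₂ estimates, the sketch below combines hypothetical test counts with assumed base rates. The helper names, base rates, and combination rule are assumptions made for exposition, not the study's actual pipeline or values.

```python
# Illustrative sketch only: the base rates and the mapping from per-task
# false-negative rates to P1 and P2 are hypothetical assumptions.

def false_negative_rate(false_negatives: int, positives: int) -> float:
    """Fraction of true safety-relevant cases a model fails to flag."""
    return false_negatives / positives

def estimate_p1(detection_fnr: float, hazard_base_rate: float) -> float:
    """P1: a hazard (e.g., suicidal ideation in a chat) arises AND the
    safety classifier misses it, so a hazardous situation develops."""
    return hazard_base_rate * detection_fnr

def estimate_p2(escalation_fnr: float, harm_rate_if_unmitigated: float) -> float:
    """P2: the hazardous situation goes unmitigated (e.g., a therapy-like
    interaction is not flagged) and results in clinical harm."""
    return escalation_fnr * harm_rate_if_unmitigated

# Hypothetical numbers: 3 misses out of 500 positive test cases, etc.
p1 = estimate_p1(false_negative_rate(3, 500), hazard_base_rate=1e-3)
p2 = estimate_p2(false_negative_rate(10, 500), harm_rate_if_unmitigated=0.05)
print(f"P1 ≈ {p1:.1e}, P2 ≈ {p2:.1e}")
```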
Results
LLM success at generating synthetic safety datasets varied substantially by task, with strong performance for neutral and non-therapeutic content but frequent errors in suicidal-ideation and therapy-like interactions. Across 14 models (270 million–70 billion parameters), performance generally improved with size but included notable outliers. Estimated P₁ values (hazard to hazardous situation) ranged from 2.0×10⁻⁸ to 2.6×10⁻⁴ and P₂ (hazardous situation to harm) from 7.1×10⁻⁵ to 9.6×10⁻³, spanning up to four orders of magnitude.
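For orientation only: under the standard ISO 14971-style decomposition, in which the probability of harm is the product of P₁ and P₂, the reported extremes would bound the per-event probability of harm roughly as shown below. This assumes, purely for illustration, that the extreme P₁ and P₂ values could co-occur in a single model, which the results above do not state.

$$
P(\text{harm}) = P_1 \times P_2, \qquad
2.0\times10^{-8} \times 7.1\times10^{-5} \approx 1.4\times10^{-12}
\;\le\; P(\text{harm}) \;\le\;
2.6\times10^{-4} \times 9.6\times10^{-3} \approx 2.5\times10^{-6}.
$$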
Conclusion
Simulation extends existing device-safety frameworks to address the novel risks of large language models. Rather than replacing regulatory judgment, it provides a reproducible method for quantifying uncertainty, clarifying assumptions, and linking model failures to plausible harms. Our case example demonstrates a generalizable approach that can overcome current regulatory obstacles while remaining practical for manufacturers and regulators, supporting timely and transparent oversight that keeps patients safe without creating unnecessary barriers to delivering the clinical promise of LLM-based medical devices.
Brief Description
This study introduces a quantitative framework for evaluating and mitigating the unique risks that large language models (LLMs) pose in healthcare. By mapping the pathways from LLM-generated hazards to harms onto existing regulatory risk-analysis structures and estimating the probability of these transitions through computational simulation, the framework empirically bounds uncertainty and identifies where real-world evidence is needed to validate and monitor model performance before, during, and after clinical deployment.
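A minimal Monte Carlo sketch of this bounding step is shown below, assuming hypothetical per-model (P₁, P₂) estimates that merely span the ranges reported in the Results; the sampling scheme and values are illustrative and not the study's implementation.

```python
# Hypothetical Monte Carlo sketch of the hazard -> hazardous situation -> harm
# pathway: sample per-model (P1, P2) estimates and summarize the implied
# probability of harm. Values are illustrative assumptions only.
import random
import statistics

def simulated_harm_probabilities(p1_samples, p2_samples, n_draws=10_000):
    """Randomly pair P1 (hazard -> hazardous situation) with P2
    (hazardous situation -> harm) and return the implied P(harm) draws."""
    return [random.choice(p1_samples) * random.choice(p2_samples)
            for _ in range(n_draws)]

p1_est = [2.0e-8, 1.0e-6, 2.6e-4]   # illustrative per-model P1 estimates
p2_est = [7.1e-5, 1.0e-3, 9.6e-3]   # illustrative per-model P2 estimates
draws = simulated_harm_probabilities(p1_est, p2_est)
print(f"min={min(draws):.1e}  median={statistics.median(draws):.1e}  "
      f"max={max(draws):.1e}")
```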