Leveraging simulation to provide a practical framework for assessing the novel scope of risk of LLMs in healthcare
Abstract
Background
Large language models (LLMs) are rapidly entering clinical care, yet their inherently probabilistic outputs have already produced a range of grossly unsafe responses to users. The difficulty of quantifying and mitigating the novel risks posed by LLMs threatens to stall the regulatory evaluation and clinical deployment of LLM-based software as a medical device (LLM-SaMD). A practical, evidence-based framework is urgently needed to extend existing medical-device regulations to encompass LLM-SaMDs. Using synthetic interactions between a chatbot and a potentially suicidal user, we demonstrate a simulation-based framework that provides a reproducible and generalizable method for evaluating the novel risks of LLM-SaMDs.
Methods
We developed a framework integrating LLM performance testing into SaMD risk estimation. Fourteen open-source models ranging from 270 million to 70 billion parameters (Qwen, Gemma, and LLaMA families) were evaluated on three safety-classification tasks: suicidal-ideation detection, therapy-request detection, and therapy-like interaction detection. Synthetic datasets were generated by Gemini 2.5 Pro and verified by psychiatrists. Model false-negative rates informed probabilistic estimates of P₁, the likelihood of a hazard progressing to a hazardous situation, and P₂, the likelihood of that situation resulting in harm.
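As a rough illustration of how per-task false-negative rates could feed the P₁ and P₂ estimates, the sketch below combines hypothetical test counts with assumed base rates. The helper names, base rates, and combination rule are assumptions made for exposition, not the study's actual pipeline or values.

```python
# Illustrative sketch only: the base rates and the mapping from per-task
# false-negative rates to P1 and P2 are hypothetical assumptions.

def false_negative_rate(false_negatives: int, positives: int) -> float:
    """Fraction of true safety-relevant cases a model fails to flag."""
    return false_negatives / positives

def estimate_p1(detection_fnr: float, hazard_base_rate: float) -> float:
    """P1: a hazard (e.g., suicidal ideation in a chat) arises AND the
    safety classifier misses it, so a hazardous situation develops."""
    return hazard_base_rate * detection_fnr

def estimate_p2(escalation_fnr: float, harm_rate_if_unmitigated: float) -> float:
    """P2: the hazardous situation goes unmitigated (e.g., a therapy-like
    interaction is not flagged) and results in clinical harm."""
    return escalation_fnr * harm_rate_if_unmitigated

# Hypothetical numbers: 3 misses out of 500 positive test cases, etc.
p1 = estimate_p1(false_negative_rate(3, 500), hazard_base_rate=1e-3)
p2 = estimate_p2(false_negative_rate(10, 500), harm_rate_if_unmitigated=0.05)
print(f"P1 ≈ {p1:.1e}, P2 ≈ {p2:.1e}")
```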
Results
LLM success at generating synthetic safety datasets varied substantially by task, with strong performance for neutral and non-therapeutic content but frequent errors in suicidal-ideation and therapy-like interactions. Across 14 models (270 million–70 billion parameters), performance generally improved with size but included notable outliers. Estimated P₁ values (hazard to hazardous situation) ranged from 2.0×10⁻⁸ to 2.6×10⁻⁴ and P₂ (hazardous situation to harm) from 7.1×10⁻⁵ to 9.6×10⁻³, spanning up to four orders of magnitude.
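For orientation only: under the standard ISO 14971-style decomposition, in which the probability of harm is the product of P₁ and P₂, the reported extremes would bound the per-event probability of harm roughly as shown below. This assumes, purely for illustration, that the extreme P₁ and P₂ values could co-occur in a single model, which the results above do not state.

$$
P(\text{harm}) = P_1 \times P_2, \qquad
2.0\times10^{-8} \times 7.1\times10^{-5} \approx 1.4\times10^{-12}
\;\le\; P(\text{harm}) \;\le\;
2.6\times10^{-4} \times 9.6\times10^{-3} \approx 2.5\times10^{-6}.
$$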
Conclusion
Simulation extends existing device-safety frameworks to address the novel risks of large language models. Rather than replacing regulatory judgment, it provides a reproducible method for quantifying uncertainty, clarifying assumptions, and linking model failures to plausible harms. Our case example demonstrates a generalizable approach that can overcome current regulatory obstacles while remaining practical for manufacturers and regulators, supporting timely and transparent oversight that keeps patients safe without creating unnecessary barriers to delivering the clinical promise of LLM-based medical devices.
Brief Description
This study introduces a quantitative framework for evaluating and mitigating the unique risks that large language models (LLMs) pose in healthcare. By mapping the pathways from LLM-generated hazards to harms onto existing regulatory risk-analysis structures and estimating the probability of these transitions through computational simulation, the framework empirically bounds uncertainty and identifies where real-world evidence is needed to validate and monitor model performance before, during, and after clinical deployment.
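A minimal Monte Carlo sketch of this bounding step is shown below, assuming hypothetical per-model (P₁, P₂) estimates that merely span the ranges reported in the Results; the sampling scheme and values are illustrative and not the study's implementation.

```python
# Hypothetical Monte Carlo sketch of the hazard -> hazardous situation -> harm
# pathway: sample per-model (P1, P2) estimates and summarize the implied
# probability of harm. Values are illustrative assumptions only.
import random
import statistics

def simulated_harm_probabilities(p1_samples, p2_samples, n_draws=10_000):
    """Randomly pair P1 (hazard -> hazardous situation) with P2
    (hazardous situation -> harm) and return the implied P(harm) draws."""
    return [random.choice(p1_samples) * random.choice(p2_samples)
            for _ in range(n_draws)]

p1_est = [2.0e-8, 1.0e-6, 2.6e-4]   # illustrative per-model P1 estimates
p2_est = [7.1e-5, 1.0e-3, 9.6e-3]   # illustrative per-model P2 estimates
draws = simulated_harm_probabilities(p1_est, p2_est)
print(f"min={min(draws):.1e}  median={statistics.median(draws):.1e}  "
      f"max={max(draws):.1e}")
```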