Evaluating Large Language Models for Automatic Detection of In-Hospital Cardiac Arrest: Multi-Site Analysis of Clinical Notes
Abstract
In-hospital cardiac arrest (IHCA) affects over 200,000 patients annually in the United States, yet its detection through manual chart review remains resource-intensive and often delayed. We evaluated the performance of four open-source large language models (LLMs) and GPT-4o in identifying IHCA cases from 2,674 clinical notes across five hospitals. While GPT-4o achieved the highest performance (F1-score: 0.90, recall: 0.97), several open-source models demonstrated comparable capabilities, suggesting their viability for clinical applications. Our systematic analysis of model outputs revealed that performance was strongly influenced by site-specific documentation practices, with inter-site agreement rates varying by over 20%. Through detailed error analysis, we identified key challenges including medical terminology hallucinations and structural inconsistencies in model reasoning. These findings establish a framework for implementing LLM-based IHCA detection systems while highlighting critical considerations for their clinical deployment.
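The abstract reports F1-score and recall for GPT-4o but not precision. Because F1 is the harmonic mean of precision and recall, the two reported figures imply an approximate precision. The sketch below works through that arithmetic. The function names `f1_score` and `implied_precision` are illustrative and do not come from the study, and the result is only approximate because the inputs are rounded to two decimals.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def implied_precision(f1: float, recall: float) -> float:
    """Solve F1 = 2PR / (P + R) for P, given F1 and recall."""
    denominator = 2 * recall - f1
    if denominator <= 0:
        raise ValueError("F1 must be smaller than 2 * recall")
    return f1 * recall / denominator


# Values reported in the abstract for GPT-4o (rounded to two decimals).
precision = implied_precision(f1=0.90, recall=0.97)
print(f"implied precision: {precision:.2f}")  # prints "implied precision: 0.84"
```

Under these assumptions, a recall of 0.97 paired with a precision near 0.84 would mean the model misses few true IHCA cases at the cost of some false positives, a trade-off that usually suits screening ahead of manual chart review.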