Hazard-aware adaptations bridge the generalization gap in large language models: a nationwide study


Abstract

Despite growing excitement about deploying large language models (LLMs) in healthcare, most machine learning studies report success on the same few limited public data sources, and it is unclear whether and how these results generalize to real-world clinical settings. To measure and narrow this gap, we analyzed protected notes from over 100 Veterans Affairs (VA) sites, focusing on extracting smoking history, a persistent and clinically impactful problem in natural language processing (NLP). Here we applied adaptation techniques to an LLM on two institutional datasets, a popular public dataset (MIMIC-III) and our VA dataset, across five smoking history NLP tasks of varying complexity. We demonstrate that adapted prompts, engineered to address observed errors, generalize better across institutions than zero-shot prompts. We analyzed 2,955 notes and LLM outputs to codify errors in a hazard framework, identifying whether differences in error frequency between institutions stemmed from generalization failures or inherent data differences. While overall accuracy with the adapted prompt was similar between institutions (macro-F1 = 0.86 in VA, 0.85 in MIMIC), hazard distributions varied significantly. In some cases, a dataset had more errors in a specific category because the associated hazard was more prevalent, such as templated information in VA notes (adjusted p = 0.004). However, when task-specific requirements conflicted with pre-trained model behavior, errors in the untrained institution were more frequent despite similar hazard prevalence (adjusted p = 0.007), showing a limit of LLM generalizability. As a potential clinical application, our adapted LLM system identified lung cancer screening eligibility in 59% of Veterans who later developed the disease, compared with 8% using current national VA tools. Our results demonstrate LLM generalizability on real-world, national patient data while identifying hazards that must be addressed for improved performance and broader applicability.