irAE-GPT: Leveraging large language models to identify immune-related adverse events in electronic health records and clinical trial datasets
Abstract
Background
Large language models (LLMs) have emerged as transformative technologies, revolutionizing natural language understanding and generation across various domains, including medicine. In this study, we investigated the capabilities, limitations, and generalizability of Generative Pre-trained Transformer (GPT) models in analyzing unstructured patient notes from large healthcare datasets to identify immune-related adverse events (irAEs) associated with immune checkpoint inhibitor (ICI) therapy.
Methods
We evaluated the performance of GPT-3.5, GPT-4, and GPT-4o models on manually annotated datasets of patients receiving ICI therapy, sampled from two electronic health record (EHR) systems and seven clinical trials. A zero-shot prompt was designed to exhaustively identify irAEs at the patient level (main analysis) and the note level (secondary analysis). The LLM-based system followed a multi-label classification approach to identify any combination of irAEs associated with individual patients or clinical notes. System evaluation was conducted for each available irAE as well as for broader categories of irAEs classified at the organ level.
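The zero-shot, multi-label pipeline described above can be sketched as follows. This is a minimal illustration, not the study's actual system: the prompt wording, the `IRAE_LABELS` list, and the JSON response format are all assumptions made for the example, since the abstract does not specify them.

```python
import json

# Hypothetical irAE label set for illustration; the study's full label
# ontology is not given in the abstract.
IRAE_LABELS = ["pneumonitis", "colitis", "rash", "hepatitis"]

# Illustrative zero-shot prompt: no labeled examples are provided, and
# the model may return any combination of labels (multi-label setup).
ZERO_SHOT_PROMPT = (
    "You are reviewing clinical notes for a patient treated with immune "
    "checkpoint inhibitors (ICIs). From the labels {labels}, list every "
    "immune-related adverse event (irAE) causally attributable to ICI "
    "therapy. Respond with a JSON array of label strings; return [] if "
    "none apply.\n\nNotes:\n{notes}"
)

def build_prompt(notes: str) -> str:
    """Fill the zero-shot template with a patient's concatenated notes."""
    return ZERO_SHOT_PROMPT.format(labels=IRAE_LABELS, notes=notes)

def parse_labels(response_text: str) -> set[str]:
    """Parse the model's JSON reply into a set of valid irAE labels.

    Labels outside the predefined ontology are dropped so the
    multi-label output stays well-defined; malformed JSON yields the
    empty set (no irAEs predicted).
    """
    try:
        predicted = json.loads(response_text)
    except json.JSONDecodeError:
        return set()
    if not isinstance(predicted, list):
        return set()
    return {label for label in predicted if label in IRAE_LABELS}
```

In patient-level evaluation the notes for one patient would be pooled into a single prompt; in note-level evaluation each note would be classified separately with the same template.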
Results
Our analysis included 442 patients across three institutions. The most common irAEs manually identified in the patient datasets were pneumonitis (N=64), colitis (N=56), rash (N=32), and hepatitis (N=28). Overall, the GPT models achieved high sensitivity and specificity but only moderate positive predictive values, reflecting a potential bias towards overpredicting irAE outcomes. GPT-4o achieved the highest F1 and micro-averaged F1 scores in both the patient-level and note-level evaluations. The highest performance was observed in the hematological (F1 range=1.0-1.0), gastrointestinal (F1 range=0.81-0.85), and musculoskeletal and rheumatologic (F1 range=0.67-1.0) irAE categories. Error analysis uncovered substantial limitations of GPT models in handling textual causation: adverse events must not only be accurately identified in clinical text but also causally linked to immune checkpoint inhibitors.
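For readers unfamiliar with the micro-averaged F1 score reported above, the sketch below shows how it pools true positives, false positives, and false negatives across all irAE labels before computing precision and recall, rather than averaging per-label F1 scores. The counts are invented for illustration and do not come from the study.

```python
def micro_f1(label_counts: list[tuple[int, int, int]]) -> float:
    """Micro-averaged F1 over multi-label predictions.

    Each tuple is (tp, fp, fn) for one irAE label. Micro-averaging sums
    the counts across labels first, so frequent labels weigh more than
    rare ones, unlike a macro average of per-label F1 scores.
    """
    tp = sum(c[0] for c in label_counts)
    fp = sum(c[1] for c in label_counts)
    fn = sum(c[2] for c in label_counts)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Invented example: two labels, pooled tp=6, fp=1, fn=3
# -> precision 6/7, recall 6/9, micro-F1 = 0.75
score = micro_f1([(4, 1, 1), (2, 0, 2)])
```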
Conclusion
The GPT models demonstrated generalizable abilities in identifying irAEs across EHRs and clinical trial reports. Using GPT models to automate adverse event detection in large healthcare datasets could substantially reduce the burden on physicians and healthcare professionals by reducing the need for manual review. This would strengthen safety monitoring and lead to improved patient care.