Assessment of a zero-shot large language model in measuring documented goals-of-care discussions

Abstract

Importance

Goals-of-care (GOC) discussions and their documentation are important process measures in palliative care. However, existing natural language processing (NLP) models for identifying GOC documentation require costly training data that do not transfer to other constructs of interest. Newer large language models (LLMs) hold promise for measuring linguistically complex constructs with little or no task-specific training.

Objective

To evaluate the performance of a publicly available LLM with no task-specific training data (zero-shot prompting) in identifying GOC discussions documented in the electronic health record (EHR).

Design, Setting, and Participants

This diagnostic study compared the performance of two NLP models in identifying EHR-documented GOC discussions: Llama 3.3 using zero-shot prompting, and a task-specific BERT (Bidirectional Encoder Representations from Transformers)-based model trained on a corpus of 4,642 manually annotated notes. Models were evaluated using text corpora drawn from clinical trials enrolling adult patients with chronic life-limiting illness who were hospitalized at a US health system during 2018-2023.
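For illustration only, the sketch below shows how a single note might be classified with zero-shot prompting against an OpenAI-compatible chat endpoint serving Llama 3.3. The endpoint URL, model identifier, and prompt wording are assumptions made for this example; they are not the study's actual prompt or deployment.

```python
# Illustrative zero-shot note classification via an OpenAI-compatible chat API.
# The endpoint, model name, and prompt wording are assumptions for this sketch,
# not the prompt or infrastructure used in the study.
import requests

API_URL = "http://localhost:8000/v1/chat/completions"  # hypothetical local server

SYSTEM_PROMPT = (
    "You are reviewing a clinical note. A goals-of-care (GOC) discussion is a "
    "conversation about a patient's values, goals, and preferences for future "
    "medical care, such as code status, treatment limitations, or hospice. "
    "Answer only 'yes' if the note documents such a discussion, otherwise 'no'."
)

def classify_note(note_text: str) -> bool:
    """Return True if the model judges the note to document a GOC discussion."""
    payload = {
        "model": "llama-3.3-70b-instruct",  # assumed model identifier
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": note_text},
        ],
        "temperature": 0.0,  # deterministic output for measurement
    }
    response = requests.post(API_URL, json=payload, timeout=120)
    response.raise_for_status()
    answer = response.json()["choices"][0]["message"]["content"].strip().lower()
    return answer.startswith("yes")

if __name__ == "__main__":
    example = ("Family meeting held today to discuss goals of care; "
               "patient prefers comfort-focused treatment.")
    print(classify_note(example))
```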

Outcomes and Measures

The outcome was NLP model performance, evaluated by the area under the receiver operating characteristic curve (AUC), area under the precision-recall curve (AUPRC), and maximal F1 score. NLP performance was evaluated for both note-level and patient-level classification over a 30-day period.
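As a minimal sketch of these metrics, the example below computes AUC, AUPRC, and maximal F1 from per-note model scores using scikit-learn; the labels and scores shown are illustrative placeholders, not data from the study.

```python
# Sketch of the note-level evaluation metrics, assuming per-note probability
# scores from a model and binary gold-standard labels. Values are illustrative.
import numpy as np
from sklearn.metrics import (average_precision_score, precision_recall_curve,
                             roc_auc_score)

y_true = np.array([0, 0, 1, 0, 1, 1, 0, 0])                      # gold GOC labels
y_score = np.array([0.10, 0.30, 0.90, 0.20, 0.70, 0.95, 0.05, 0.40])  # model scores

auc = roc_auc_score(y_true, y_score)              # area under the ROC curve
auprc = average_precision_score(y_true, y_score)  # area under the precision-recall curve

# Maximal F1: sweep the precision-recall curve and keep the best operating point.
precision, recall, _ = precision_recall_curve(y_true, y_score)
f1 = 2 * precision * recall / np.clip(precision + recall, 1e-12, None)
max_f1 = f1.max()

print(f"AUC={auc:.3f}  AUPRC={auprc:.3f}  max F1={max_f1:.3f}")
```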

Results

Across three text corpora, GOC documentation represented <1% of EHR text and was found in 7.3-9.9% of notes for 23-37% of patients. In a 617-patient held-out test set, Llama 3.3 (zero-shot) and BERT (task-specific, trained) exhibited comparable performance in identifying GOC documentation. At the note level, Llama 3.3 identified GOC documentation with AUC 0.979, AUPRC 0.873, and F1 0.83; BERT with AUC 0.981, AUPRC 0.874, and F1 0.83. For the cumulative incidence of GOC documentation over the 30-day period, Llama 3.3 identified patients with GOC documentation with AUC 0.977, AUPRC 0.955, and F1 0.89; BERT with AUC 0.981, AUPRC 0.952, and F1 0.89.

Conclusions and Relevance

In identifying documented goals-of-care discussions, a zero-shot large language model with no task-specific training performs similarly to a task-specific, supervised BERT model trained on thousands of manually labeled EHR notes. These findings demonstrate promise for rigorous use of LLMs in measuring novel clinical trial outcomes.

KEY POINTS

Question

Can newer large language AI models accurately measure documented goals-of-care discussions without task-specific training data?

Findings

In this diagnostic study, a publicly available large language model prompted with an outcome definition and no task-specific training identified documented goals-of-care discussions with performance comparable to that of a previous deep-learning model trained on an annotated corpus of 4,642 notes.

Meaning

Natural language processing allows the measurement of previously inaccessible outcomes for clinical research. Compared with traditional natural language processing and machine learning methods, newer large language AI models allow investigators to measure novel outcomes without needing costly training data.
