Assessment of a zero-shot large language model in measuring documented goals-of-care discussions

Abstract

Importance

Goals-of-care (GOC) discussions and their documentation are important process measures in palliative care. However, existing natural language processing (NLP) models for identifying GOC documentation require costly training data that do not transfer to other constructs of interest. Newer large language models (LLMs) hold promise for measuring linguistically complex constructs with little or no task-specific training.

Objective

To evaluate the performance of a publicly available LLM with no task-specific training data (zero-shot prompting) in identifying GOC discussions documented in the electronic health record (EHR).

Design, Setting, and Participants

This diagnostic study compared the performance of two NLP models in identifying EHR-documented GOC discussions: Llama 3.3 using zero-shot prompting, and a task-specific BERT (Bidirectional Encoder Representations from Transformers)-based model trained on a corpus of 4,642 manually annotated notes. Models were evaluated using text corpora drawn from clinical trials enrolling adult patients with chronic life-limiting illness who were hospitalized at a US health system during 2018-2023.
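For illustration only, the sketch below shows how a single note might be classified with zero-shot prompting against an OpenAI-compatible chat endpoint serving Llama 3.3. The endpoint URL, model identifier, and prompt wording are assumptions made for this example; they are not the study's actual prompt or deployment.

```python
# Illustrative zero-shot note classification via an OpenAI-compatible chat API.
# The endpoint, model name, and prompt wording are assumptions for this sketch,
# not the prompt or infrastructure used in the study.
import requests

API_URL = "http://localhost:8000/v1/chat/completions"  # hypothetical local server

SYSTEM_PROMPT = (
    "You are reviewing a clinical note. A goals-of-care (GOC) discussion is a "
    "conversation about a patient's values, goals, and preferences for future "
    "medical care, such as code status, treatment limitations, or hospice. "
    "Answer only 'yes' if the note documents such a discussion, otherwise 'no'."
)

def classify_note(note_text: str) -> bool:
    """Return True if the model judges the note to document a GOC discussion."""
    payload = {
        "model": "llama-3.3-70b-instruct",  # assumed model identifier
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": note_text},
        ],
        "temperature": 0.0,  # deterministic output for measurement
    }
    response = requests.post(API_URL, json=payload, timeout=120)
    response.raise_for_status()
    answer = response.json()["choices"][0]["message"]["content"].strip().lower()
    return answer.startswith("yes")

if __name__ == "__main__":
    example = ("Family meeting held today to discuss goals of care; "
               "patient prefers comfort-focused treatment.")
    print(classify_note(example))
```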

Outcomes and Measures

The outcome was NLP model performance, evaluated by the area under the receiver operating characteristic curve (AUC), area under the precision-recall curve (AUPRC), and maximal F1 score. NLP performance was evaluated for both note-level and patient-level classification over a 30-day period.
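As a minimal sketch of these metrics, the example below computes AUC, AUPRC, and maximal F1 from per-note model scores using scikit-learn; the labels and scores shown are illustrative placeholders, not data from the study.

```python
# Sketch of the note-level evaluation metrics, assuming per-note probability
# scores from a model and binary gold-standard labels. Values are illustrative.
import numpy as np
from sklearn.metrics import (average_precision_score, precision_recall_curve,
                             roc_auc_score)

y_true = np.array([0, 0, 1, 0, 1, 1, 0, 0])                      # gold GOC labels
y_score = np.array([0.10, 0.30, 0.90, 0.20, 0.70, 0.95, 0.05, 0.40])  # model scores

auc = roc_auc_score(y_true, y_score)              # area under the ROC curve
auprc = average_precision_score(y_true, y_score)  # area under the precision-recall curve

# Maximal F1: sweep the precision-recall curve and keep the best operating point.
precision, recall, _ = precision_recall_curve(y_true, y_score)
f1 = 2 * precision * recall / np.clip(precision + recall, 1e-12, None)
max_f1 = f1.max()

print(f"AUC={auc:.3f}  AUPRC={auprc:.3f}  max F1={max_f1:.3f}")
```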

Results

Across three text corpora, GOC documentation represented <1% of EHR text and was found in 7.3-9.9% of notes for 23-37% of patients. In a 617-patient held-out test set, Llama 3.3 (zero-shot) and BERT (task-specific, trained) exhibited comparable performance in identifying GOC documentation. At the note level, Llama 3.3 identified GOC documentation with AUC 0.979, AUPRC 0.873, and F1 0.83; BERT with AUC 0.981, AUPRC 0.874, and F1 0.83. For the cumulative incidence of GOC documentation over the 30-day period, Llama 3.3 identified patients with GOC documentation with AUC 0.977, AUPRC 0.955, and F1 0.89; BERT with AUC 0.981, AUPRC 0.952, and F1 0.89.

Conclusions and Relevance

In identifying documented goals-of-care discussions, a zero-shot large language model with no task-specific training performs similarly to a task-specific, supervised BERT model trained on thousands of manually labeled EHR notes. These findings demonstrate promise for rigorous use of LLMs in measuring novel clinical trial outcomes.

KEY POINTS

Question

Can newer large language AI models accurately measure documented goals-of-care discussions without task-specific training data?

Findings

In this diagnostic study, a publicly available large language model prompted with an outcome definition and no task-specific training identified documented goals-of-care discussions with performance comparable to that of a previous deep-learning model trained on an annotated corpus of 4,642 notes.

Meaning

Natural language processing allows the measurement of previously inaccessible outcomes for clinical research. Compared with traditional natural language processing and machine learning methods, newer large language AI models allow investigators to measure novel outcomes without needing costly training data.
