Large language models for accurate disease detection in electronic health records

Nils Bürgisser
Etienne Chalot
Samia Mehouachi
Clement P. Buclin
Kim Lauper
Delphine S. Courvoisier
Denis Mongin

Abstract

Importance

The use of large language models (LLMs) in medicine is increasing, with potential applications in electronic health records (EHR) to create patient cohorts or identify patients who meet clinical trial recruitment criteria. However, significant barriers remain, including the extensive computer resources required, lack of performance evaluation, and challenges in implementation.

Objective

This study proposes and tests a framework to detect disease diagnoses using a recent lightweight LLM on French-language EHR documents. Specifically, it focuses on detecting gout (“goutte” in French), a ubiquitous French term that has multiple meanings beyond the disease. The study compares the performance of the LLM-based framework with traditional natural language processing techniques and tests its dependence on the parameters used.

Design

The framework was developed using a training and testing set of 700 paragraphs assessing “gout”, drawn from a random selection of retrospective EHR documents. All paragraphs were manually reviewed by two health-care professionals (HCP) and classified as disease (true gout) or non-disease; this classification served as the gold standard. The LLM’s accuracy was tested using few-shot and chain-of-thought prompting and compared to a regular expression (regex)-based method, focusing on the effects of model parameters and prompt structure. The framework was further validated on 600 paragraphs assessing “Calcium Pyrophosphate Deposition Disease (CPPD)”.
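
To illustrate what few-shot, chain-of-thought prompting can look like for this task, here is a minimal sketch. The instruction wording, the two example paragraphs, and the names `EXAMPLES` and `build_prompt` are all invented for illustration; they are not the study's prompt or data.

```python
from typing import List, Tuple

# Hypothetical worked examples (text, reasoning, answer), invented for
# illustration only; they do not come from the study's EHR documents.
EXAMPLES: List[Tuple[str, str, str]] = [
    (
        "Crise de goutte du gros orteil droit, traitée par colchicine.",
        "The text describes an acute joint attack treated with colchicine.",
        "yes",
    ),
    (
        "Instiller 2 gouttes dans chaque œil matin et soir.",
        "Here 'gouttes' refers to eye drops, not to the disease.",
        "no",
    ),
]


def build_prompt(paragraph: str) -> str:
    """Assemble a few-shot prompt that asks for reasoning before the answer."""
    parts = [
        "Decide whether the paragraph states that the patient has gout (the disease).",
        "Reason step by step, then answer 'yes' or 'no'.",
        "",
    ]
    # Few-shot part: each example shows the reasoning, then the label.
    for text, reasoning, answer in EXAMPLES:
        parts += [
            f"Paragraph: {text}",
            f"Reasoning: {reasoning}",
            f"Answer: {answer}",
            "",
        ]
    # Chain-of-thought part: the prompt ends where the model's reasoning starts.
    parts += [f"Paragraph: {paragraph}", "Reasoning:"]
    return "\n".join(parts)


prompt = build_prompt("Patient connu pour une goutte tophacée.")
```

The resulting string would then be sent to the model; how the answer is parsed and which sampling parameters are used are left open here.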

Setting

The documents were sampled from the electronic health records of a tertiary university hospital in Geneva, Switzerland.

Participants

Adults over 18 years of age.

Exposure

Meta’s Llama 3 8B LLM or a traditional method, against a gold standard.
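
As a rough picture of the traditional comparator, a regex-based classifier might look like the sketch below. The patterns and the name `regex_flags_gout` are assumptions made for illustration and are not the rules used in the study.

```python
import re

# Hypothetical patterns, NOT the study's rules.
# Any form of the term: goutte, gouttes, goutteux, goutteuse(s).
MENTION = re.compile(r"\bgoutt(?:e|eux|euse)s?\b", re.IGNORECASE)

# Common non-disease uses: "goutte à goutte" (drip), plural "gouttes"
# (drops of medication), "goutte de ..." / "goutte d'..." (a drop of something).
NON_DISEASE = re.compile(
    r"goutte\s+à\s+goutte|\bgouttes\b|\bgoutte\s+(?:de\b|d')",
    re.IGNORECASE,
)


def regex_flags_gout(paragraph: str) -> bool:
    """Return True if the paragraph mentions the term outside the listed non-disease uses."""
    if not MENTION.search(paragraph):
        return False
    return NON_DISEASE.search(paragraph) is None


print(regex_flags_gout("Patient connu pour une goutte traitée par allopurinol."))
print(regex_flags_gout("Collyre: 2 gouttes par jour dans l'œil droit."))
print(regex_flags_gout("Perfusion en goutte à goutte."))
```

A rule set like this has no notion of negation or context: a phrase such as "pas de goutte" would still be flagged, which is the kind of ambiguity that makes the term hard to handle with patterns alone.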

Main Outcomes and Measures

Positive and negative predictive value, as well as accuracy of tested models.
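
These three measures follow from the counts of true and false positives and negatives against the gold standard. The sketch below uses a toy set of five labels invented for illustration; the function name `predictive_values` is hypothetical and no confidence interval is computed.

```python
import math
from typing import Dict, Sequence


def _ratio(numerator: int, denominator: int) -> float:
    """Divide, returning NaN when the denominator is zero."""
    return numerator / denominator if denominator else math.nan


def predictive_values(gold: Sequence[bool], pred: Sequence[bool]) -> Dict[str, float]:
    """Compute PPV, NPV and accuracy of predictions against gold-standard labels."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = sum(g and p for g, p in zip(gold, pred))          # disease, flagged
    tn = sum(not g and not p for g, p in zip(gold, pred))  # no disease, not flagged
    fp = sum(not g and p for g, p in zip(gold, pred))      # no disease, flagged
    fn = sum(g and not p for g, p in zip(gold, pred))      # disease, missed
    return {
        "ppv": _ratio(tp, tp + fp),       # share of flagged paragraphs that are true disease
        "npv": _ratio(tn, tn + fn),       # share of unflagged paragraphs that are non-disease
        "accuracy": _ratio(tp + tn, len(gold)),
    }


# Toy labels, invented for illustration only.
gold = [True, True, False, False, False]
pred = [True, False, True, False, False]
print(predictive_values(gold, pred))
```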

Results

The LLM-based algorithm outperformed the regex method, achieving a 92.7% [88.7-95.4%] positive predictive value, a 96.6% [94.6-97.8%] negative predictive value, and an accuracy of 95.4% [93.6-96.7%] for gout. In the validation set on CPPD, accuracy was 94.1% [90.2-97.6%]. The LLM framework performed well over a wide range of parameter values.

Conclusions and Relevance

LLMs were able to accurately detect disease diagnoses from EHRs, even in non-English languages. They could facilitate creating large disease registries in any language, improving disease care assessment and patient recruitment for clinical trials.

Key points

Question

How accurate and efficient are large language models (LLMs) in detecting diseases from unstructured electronic health records (EHR) text compared to traditional natural language processing techniques?

Findings

This study proposes a framework based on Meta’s Llama 3 8B, a recent public LLM, that outperforms traditional natural language processing techniques in detecting gout and calcium pyrophosphate deposition disease in unstructured text. It achieves high positive and negative predictive values and accuracy. Performance was robust over a wide range of parameters.

Meaning

The proposed framework can ease the use of LLMs in effectively detecting disease in EHR data for various clinical applications.

Version published to 10.1101/2024.07.27.24311106 on medRxiv
Jul 29, 2024
