Extracting Pulmonary Embolism Diagnoses From Radiology Impressions Using GPT-4o: Large Language Model Evaluation Study

Mohammed Mahyoub
Kacie Dougherty
Ajit Shukla

Read the full article

Discuss this preprint

Start a discussion What are Sciety discussions?

Listed in

This article is not in any list yet, why not save it to one of your lists.

Abstract

Pulmonary embolism (PE) is a critical condition requiring rapid diagnosis to reduce mortality. Extracting PE diagnoses from radiology reports manually is time-consuming, highlighting the need for automated solutions. Advances in natural language processing, especially transformer models like GPT-4o, offer promising tools to improve diagnostic accuracy and workflow efficiency in clinical settings.

Objective

This study aimed to develop an automatic extraction system using GPT-4o to extract PE diagnoses from radiology report impressions, enhancing clinical decision-making and workflow efficiency.

Methods

In total, 2 approaches were developed and evaluated: a fine-tuned Clinical Longformer as a baseline model and a GPT-4o-based extractor. Clinical Longformer, an encoder-only model, was chosen for its robustness in text classification tasks, particularly on smaller scales. GPT-4o, a decoder-only instruction-following LLM, was selected for its advanced language understanding capabilities. The study aimed to evaluate GPT-4o’s ability to perform text classification compared to the baseline Clinical Longformer. The Clinical Longformer was trained on a dataset of 1000 radiology report impressions and validated on a separate set of 200 samples, while the GPT-4o extractor was validated using the same 200-sample set. Postdeployment performance was further assessed on an additional 200 operational records to evaluate model efficacy in a real-world setting.

Results

GPT-4o outperformed the Clinical Longformer in 2 of the metrics, achieving a sensitivity of 1.0 (95% CI 1.0-1.0; Wilcoxon test, P<.001) and an F1-score of 0.975 (95% CI 0.9495-0.9947; Wilcoxon test, P<.001) across the validation dataset. Postdeployment evaluations also showed strong performance of the deployed GPT-4o model with a sensitivity of 1.0 (95% CI 1.0-1.0), a specificity of 0.94 (95% CI 0.8913-0.9804), and an F1-score of 0.97 (95% CI 0.9479-0.9908). This high level of accuracy supports a reduction in manual review, streamlining clinical workflows and improving diagnostic precision.

Conclusions

The GPT-4o model provides an effective solution for the automatic extraction of PE diagnoses from radiology reports, offering a reliable tool that aids timely and accurate clinical decision-making. This approach has the potential to significantly improve patient outcomes by expediting diagnosis and treatment pathways for critical conditions like PE.

Version published to 10.2196/67706
Apr 9, 2025
Version published to 10.2196/preprints.67706
Oct 18, 2024
Version published to 10.1101/2024.10.14.24315482 on medRxiv
Oct 15, 2024

Accurate Clinical Entity Recognition and Code Mapping of Anatomopathological Reports Using BioClinicalBERT Enhanced by Retrieval-Augmented Generation: A Hybrid Deep Learning Approach

This article has 9 authors:
1. Hamida Abdaoui
2. Chamseddine Barki
3. Ismail Dergaa
4. Karima Tlili
5. Halil İbrahim Ceylan
6. Nicola Luigi Bragazzi
7. Andrea de Giorgio
8. Ridha Ben Salah
9. Hanene Boussi Rahmouni
This article has no evaluationsLatest version Dec 27, 2025
Machine Learning-Driven Probability Scoring Enhances Diagnostic Certainty and Reduces Costs in Suspected Periprosthetic Joint Infection

This article has 6 authors:
1. Jim Parr
2. Van Thai-Paquette
3. Amy Worden
4. James Baker
5. Paul Edwards
6. Krista O'Shaughnessey Toler
This article has no evaluationsLatest version Jan 19, 2026
Evaluation of the ‘qXR’ Software for the Detection of Pulmonary Nodules and Signs Suggestive of Heart Failure: A Comparative Analysis in a Latin American General Hospital

This article has 11 authors:
1. Adriana Anchía-Alfaro
2. Sebastián Arguedas-Chacón
3. Georgia Hanley-Vargas
4. Sofía Suárez-Sánchez
5. Luis Andrés Aguilar-Castro
6. Sergio Daniel Seas-Azofeifa
7. Kal Che Wong Hsu
8. Diego Quesada-Loría
9. María Felicia Montero-Arias
10. Juliana Salas-Segura
11. Esteban Zavaleta-Monestel
This article has no evaluationsLatest version Jan 7, 2026

Discuss this preprint

Listed in

Abstract

Objective

Methods

Results

Conclusions

Article activity feed

Related articles

Accurate Clinical Entity Recognition and Code Mapping of Anatomopathological Reports Using BioClinicalBERT Enhanced by Retrieval-Augmented Generation: A Hybrid Deep Learning Approach

Machine Learning-Driven Probability Scoring Enhances Diagnostic Certainty and Reduces Costs in Suspected Periprosthetic Joint Infection

Evaluation of the ‘qXR’ Software for the Detection of Pulmonary Nodules and Signs Suggestive of Heart Failure: A Comparative Analysis in a Latin American General Hospital