Evaluation of the Clinical Utility of DxGPT, a GPT-4 Based Large Language Model, through an Analysis of Diagnostic Accuracy and User Experience

Abstract

Importance

The time to accurately diagnose rare pediatric diseases often spans years. Assessing the diagnostic accuracy of an LLM-based tool on real pediatric cases can show whether such tools help reduce this time, providing quicker diagnoses for patients and their families.

Objective

To evaluate the clinical utility of DxGPT as a support tool for differential diagnosis of both common and rare diseases.

Design

Unicentric descriptive cross-sectional exploratory study. Anonymized data from the medical histories of 50 pediatric patients, covering common and rare pathologies, were used to generate clinical case notes. Each note included essential clinical data; some were expanded with the results of complementary tests.

Setting

This study was conducted at a reference pediatric hospital, Sant Joan de Déu Barcelona Children’s Hospital.

Participants

A total of 50 clinical cases were diagnosed by 78 volunteer doctors with varying levels of experience (the medical diagnostic team), each reviewing three clinical cases.

Interventions

Each clinician listed up to five diagnoses per clinical case note. Each note was also submitted to the DxGPT web platform to obtain its Top-5 diagnostic proposals. To evaluate DxGPT’s response variability, each note was queried three times.

Main Outcome(s) and Measure(s)

The study mainly focused on comparing diagnostic accuracy, defined as the percentage of cases with the correct diagnosis, between the medical diagnostic team and DxGPT. Other evaluation criteria included qualitative assessments. The medical diagnostic team also completed a survey on their user experience with DxGPT.
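
To make the primary measure concrete, a minimal sketch of the Top-5 accuracy computation is shown below. The `top5_accuracy` function, the example cases, and the exact-string matching are illustrative assumptions, not the study’s actual evaluation procedure, which would rely on clinical judgment of diagnostic equivalence.

```python
# Minimal sketch (not the study's code): Top-5 diagnostic accuracy, i.e. the
# percentage of cases whose reference diagnosis appears among the (up to)
# five diagnoses proposed for that case. Exact string matching stands in for
# the clinical equivalence judgment used in practice.

def top5_accuracy(cases):
    # `cases` is a list of (correct_diagnosis, proposed_diagnoses) pairs.
    hits = sum(1 for correct, proposals in cases if correct in proposals[:5])
    return 100 * hits / len(cases)

# Hypothetical example: the correct diagnosis appears in 2 of 3 proposal lists.
cases = [
    ("Kawasaki disease", ["scarlet fever", "Kawasaki disease", "measles"]),
    ("cystic fibrosis", ["asthma", "bronchiolitis", "pneumonia"]),
    ("appendicitis", ["gastroenteritis", "appendicitis", "mesenteric adenitis"]),
]
print(f"Top-5 accuracy: {top5_accuracy(cases):.1f}%")  # Top-5 accuracy: 66.7%
```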

Results

Top-5 diagnostic accuracy was 65% for clinicians and 60% for DxGPT, with no significant difference. Accuracy was higher for common diseases (clinicians: 79%, DxGPT: 71%) than for rare diseases (clinicians: 50%, DxGPT: 49%). Accuracy increased similarly in both groups when notes included expanded information, but the increase was statistically significant only for clinicians (simple 52% vs. expanded 69%; p = 0.03). DxGPT’s response variability affected fewer than 5% of clinical case notes. In a survey of 48 clinicians, the DxGPT platform was rated 3.9/5 overall, 4.1/5 for usefulness, and 4.5/5 for usability.
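
For readers wanting to reproduce this kind of comparison, a minimal sketch is shown below, assuming a standard two-proportion z-test; the abstract does not state which test the authors used, and the per-condition counts are hypothetical stand-ins chosen to roughly match the reported percentages.

```python
# Minimal sketch of one way to test the simple-vs.-expanded difference.
# Assumptions: a two-proportion z-test (the abstract does not name the test
# used), and hypothetical counts -- 25 simple and 25 expanded notes chosen so
# that 13/25 = 52% and 17/25 = 68% roughly match the reported percentages.
from statsmodels.stats.proportion import proportions_ztest

hits = [13, 17]    # notes with the correct diagnosis in the Top-5
totals = [25, 25]  # hypothetical number of simple / expanded notes

z_stat, p_value = proportions_ztest(count=hits, nobs=totals)
print(f"z = {z_stat:.2f}, p = {p_value:.3f}")
```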

Conclusions and Relevance

DxGPT showed diagnostic accuracy similar to that of the medical staff of a pediatric hospital, indicating its potential for supporting differential diagnosis in other settings. Clinicians praised its usability and simplicity. Tools of this kind could provide new insights for challenging diagnostic cases.

Key Points

Question

Is DxGPT, a large language model-based (LLM-based) tool, effective for differential diagnosis support, specifically in the context of a clinical pediatric setting?

Findings

In this unicentric cross-sectional study, diagnostic accuracy, measured as the proportion of clinical cases in which the correct diagnosis appeared among the five proposed diagnostic options, was comparable between clinicians and DxGPT: Top-5 accuracy was 65% for clinicians and 60% for DxGPT.

Meaning

These findings highlight the potential of LLM-based tools like DxGPT to support clinicians in making accurate and timely diagnoses, ultimately improving patient care.
