Close, but No Cigar: Comparative Evaluation of ChatGPT-4o and OpenAI o1-preview in Answering Pancreatic Ductal Adenocarcinoma-Related Questions
Abstract
Background
This study aimed to evaluate the effectiveness of ChatGPT-4o and OpenAI o1-preview in responding to pancreatic ductal adenocarcinoma (PDAC)-related queries. The study assessed both LLMs’ accuracy, comprehensiveness, and safety when answering clinical questions, based on the National Comprehensive Cancer Network® (NCCN) Clinical Practice Guidelines for PDAC.
Methods
The study used a 20-question dataset derived from clinical scenarios related to PDAC. Two board-certified surgeons independently evaluated the responses by ChatGPT-4o and OpenAI o1-preview for their accuracy, comprehensiveness, and safety using a Likert scale. Statistical analyses were conducted to compare the performances of the two models. We also analyzed the impact of OpenAI o1-preview’s Chain of Thought (CoT) technology.
Results
Both models demonstrated high median scores across all dimensions (5 out of 5). OpenAI o1-preview outperformed ChatGPT-4o in comprehensiveness (p = 0.026) and demonstrated superior reasoning ability, with a higher accuracy rate of 75% compared to 60% for ChatGPT-4o. OpenAI o1-preview generated more concise responses (median 64 vs. 82 words, p < 0.001). The CoT method in OpenAI o1-preview appeared to enhance its reasoning capabilities, particularly in complex treatment decisions. However, both models made critical errors in some complex clinical scenarios.
Conclusion
OpenAI o1-preview, with its CoT technology, demonstrated higher comprehensiveness than ChatGPT-4o and showed a tendency toward improved accuracy. However, both models still made critical errors that could potentially harm patients. Even the most advanced models are not yet suitable as reliable sources of medical information and cannot function as assistants for clinical decision-making.