Close, But no Cigar: Comparative Evaluation of ChatGPT-4o and OpenAI o1-preview in Answering Pancreatic Ductal Adenocarcinoma-Related Questions

Abstract

Background

This study aimed to evaluate the effectiveness of two large language models (LLMs), ChatGPT-4o and OpenAI o1-preview, in responding to pancreatic ductal adenocarcinoma (PDAC)-related queries. It assessed the accuracy, comprehensiveness, and safety of both LLMs when answering clinical questions, based on the National Comprehensive Cancer Network® (NCCN) Clinical Practice Guidelines for PDAC.

Methods

The study used a 20-question dataset derived from clinical scenarios related to PDAC. Two board-certified surgeons independently rated the responses from ChatGPT-4o and OpenAI o1-preview for accuracy, comprehensiveness, and safety using a Likert scale. Statistical analyses were conducted to compare the performance of the two models. The impact of OpenAI o1-preview’s Chain of Thought (CoT) technology was also analyzed.

Results

Both models achieved high median scores (5 out of 5) across all dimensions. OpenAI o1-preview outperformed ChatGPT-4o in comprehensiveness (p = 0.026) and demonstrated superior reasoning ability, with a higher accuracy rate (75% vs. 60% for ChatGPT-4o). OpenAI o1-preview also generated more concise responses (median 64 vs. 82 words, p < 0.001). The CoT method in OpenAI o1-preview appeared to enhance its reasoning capabilities, particularly in complex treatment decisions. However, both models made critical errors in some complex clinical scenarios.

Conclusion

OpenAI o1-preview, with its CoT technology, demonstrated higher comprehensiveness than ChatGPT-4o and showed a tendency toward improved accuracy. However, both models still make critical errors that can cause harm to patients. Even the most advanced models are not suitable for offering reliable medical information and cannot serve as assistants for clinical decision-making.
