Close, but No Cigar: Comparative Evaluation of ChatGPT-4o and OpenAI o1-preview in Answering Pancreatic Ductal Adenocarcinoma-Related Questions
Abstract
Background
This study aimed to evaluate the effectiveness of ChatGPT-4o and OpenAI o1-preview in responding to pancreatic ductal adenocarcinoma (PDAC)-related queries. The study assessed both LLMs’ accuracy, comprehensiveness, and safety when answering clinical questions, based on the National Comprehensive Cancer Network® (NCCN) Clinical Practice Guidelines for PDAC.
Methods
The study used a 20-question dataset derived from clinical scenarios related to PDAC. Two board-certified surgeons independently evaluated the responses by ChatGPT-4o and OpenAI o1-preview for their accuracy, comprehensiveness, and safety using a Likert scale. Statistical analyses were conducted to compare the performances of the two models. We also analyzed the impact of OpenAI o1-preview’s Chain of Thought (CoT) technology.
Results
Both models demonstrated high median scores across all dimensions (5 out of 5). OpenAI o1-preview outperformed ChatGPT-4o in comprehensiveness (p = 0.026) and demonstrated superior reasoning ability, with a higher accuracy rate of 75% compared to 60% for ChatGPT-4o. OpenAI o1-preview generated more concise responses (median 64 vs. 82 words, p < 0.001). The CoT method in OpenAI o1-preview appeared to enhance its reasoning capabilities, particularly in complex treatment decisions. However, both models made critical errors in some complex clinical scenarios.
Conclusion
OpenAI o1-preview, with its CoT technology, demonstrated higher comprehensiveness than ChatGPT-4o and showed a tendency toward improved accuracy. However, both models still made critical errors that could potentially harm patients. Even the most advanced models are not yet suitable as reliable sources of medical information and cannot function as assistants for clinical decision-making.