Evaluation of the Precision of a Surgery Subspecialty-Specific Large Language Model, AtlasGPT, in Relation to Standard Models Using Board-Like Questions


Abstract

Standard Large Language Models (sLLMs), such as GPT-4, exhibit notable accuracy when evaluated against multiple-choice questions (MCQs) from the Self-Assessment Neurosurgery Exam. However, because they are trained on broad, general-purpose corpora, sLLMs often fall short in capturing the nuanced context required in specialized areas like neurosurgery. Recently, a domain-specific Large Language Model, AtlasGPT, has shown enhanced accuracy compared to sLLMs like Gemini and GPT-4 by combining model fine-tuning with retrieval-augmented generation to extract relevant neurosurgical information from a dedicated database. Nonetheless, it remains uncertain whether such a model can surpass the current leading sLLMs in the critical domain of adversarial testing in medicine, or whether these models could effectively complement or replace existing examination preparation resources. This study explores these questions by evaluating the accuracy of four advanced sLLMs (GPT-3.5, Gemini, Claude 3.5 Sonnet, and Mistral) against AtlasGPT on a benchmark of 150 text-only, surrogate neurosurgical written board-style MCQs. An analysis of variance showed a significant difference in mean accuracy across models (p < 0.05): AtlasGPT achieved 96.7%, outperforming Claude (94.7%), Gemini (92.0%), Mistral (88.7%), and GPT-3.5 (74.7%). Post-hoc analysis with Bonferroni correction found the most statistically significant difference in mean accuracy between GPT-3.5 and AtlasGPT (p = 0.000000028), followed by GPT-3.5 and Claude (p = 0.000001), GPT-3.5 and Gemini (p = 0.000048), GPT-3.5 and Mistral (p = 0.0017), and Mistral and AtlasGPT (p = 0.0078).
These findings demonstrate the remarkable capabilities of the current leading sLLMs in the critical domain of adversarial testing in medicine. They also highlight the potential of medical subspecialty-focused models like AtlasGPT to outperform standard models and to improve medical knowledge, decision-making, and educational materials in complex fields like neurosurgery.
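The pairwise comparisons with Bonferroni correction described in the abstract can be sketched in code. The study reports an ANOVA with post-hoc Bonferroni-corrected comparisons; as an illustrative stand-in (not the authors' exact method), the sketch below runs two-sided two-proportion z-tests on per-model correct counts back-calculated from the reported accuracies (e.g. 96.7% of 150 ≈ 145 correct), then multiplies each raw p-value by the number of comparisons. All counts and the choice of test are assumptions for illustration.

```python
from itertools import combinations
from math import sqrt, erfc

# Hypothetical per-model correct counts out of 150 questions,
# back-calculated from the accuracies reported in the abstract.
correct = {
    "AtlasGPT": 145,           # 96.7%
    "Claude 3.5 Sonnet": 142,  # 94.7%
    "Gemini": 138,             # 92.0%
    "Mistral": 133,            # 88.7%
    "GPT-3.5": 112,            # 74.7%
}
N = 150  # questions per model

def two_proportion_p(x1: int, x2: int, n: int = N) -> float:
    """Two-sided two-proportion z-test p-value with pooled standard error."""
    p1, p2 = x1 / n, x2 / n
    pooled = (x1 + x2) / (2 * n)
    se = sqrt(pooled * (1 - pooled) * (2 / n))
    z = abs(p1 - p2) / se
    return erfc(z / sqrt(2))  # equals 2 * (1 - Phi(z))

pairs = list(combinations(correct, 2))
m = len(pairs)  # 10 pairwise comparisons among 5 models

# Bonferroni correction: multiply each raw p-value by m, cap at 1.
adjusted = {
    (a, b): min(1.0, m * two_proportion_p(correct[a], correct[b]))
    for a, b in pairs
}

for (a, b), p in sorted(adjusted.items(), key=lambda kv: kv[1]):
    print(f"{a} vs {b}: Bonferroni-adjusted p = {p:.3g}")
```

Consistent with the abstract, the largest accuracy gap (GPT-3.5 vs. AtlasGPT) yields the smallest adjusted p-value under this illustrative test as well.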
