Evaluation of the Precision of a Surgery Subspecialty-Specific Large Language Model, AtlasGPT, in Relation to Standard Models Using Board-Like Questions


Abstract

Standard Large Language Models (sLLMs), such as GPT-4, exhibit notable accuracy when evaluated against multiple-choice questions (MCQs) from the Self-Assessment Neurosurgery Exam. However, because they are trained on broad, general-purpose corpora, sLLMs often fall short in capturing the nuanced context required in specialized areas like neurosurgery. Recently, a domain-specific Large Language Model, AtlasGPT, has shown enhanced accuracy compared to sLLMs like Gemini and GPT-4 by combining model fine-tuning with retrieval-augmented generation to extract relevant neurosurgical information from a dedicated database. Nonetheless, it remains uncertain whether such a model can surpass the current leading sLLMs in the critical domain of adversarial testing in medicine, or whether these models could effectively complement or replace existing examination preparation resources. This study explores these questions by evaluating the accuracy of four advanced sLLMs (GPT-3.5, Gemini, Claude 3.5 Sonnet, and Mistral) against AtlasGPT on a benchmark of 150 text-only, surrogate neurosurgical written board-style MCQs. An analysis of variance showed a significant difference in mean accuracy across models (p < 0.05): AtlasGPT achieved 96.7%, outperforming Claude (94.7%), Gemini (92.0%), Mistral (88.7%), and GPT-3.5 (74.7%). Post-hoc analysis with Bonferroni correction found the most statistically significant difference in mean accuracy between GPT-3.5 and AtlasGPT (p = 0.000000028), followed by GPT-3.5 and Claude (p = 0.000001), GPT-3.5 and Gemini (p = 0.000048), GPT-3.5 and Mistral (p = 0.0017), and Mistral and AtlasGPT (p = 0.0078).
These findings demonstrate the remarkable capabilities of the current leading sLLMs in the critical domain of adversarial testing in medicine. They also highlight the potential of medical subspecialty-focused models like AtlasGPT to outperform standard models and to improve medical knowledge, decision-making, and educational materials in complex fields like neurosurgery.
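The pairwise comparisons with Bonferroni correction described in the abstract can be sketched in code. The study reports an ANOVA with post-hoc Bonferroni-corrected comparisons; as an illustrative stand-in (not the authors' exact method), the sketch below runs two-sided two-proportion z-tests on per-model correct counts back-calculated from the reported accuracies (e.g. 96.7% of 150 ≈ 145 correct), then multiplies each raw p-value by the number of comparisons. All counts and the choice of test are assumptions for illustration.

```python
from itertools import combinations
from math import sqrt, erfc

# Hypothetical per-model correct counts out of 150 questions,
# back-calculated from the accuracies reported in the abstract.
correct = {
    "AtlasGPT": 145,           # 96.7%
    "Claude 3.5 Sonnet": 142,  # 94.7%
    "Gemini": 138,             # 92.0%
    "Mistral": 133,            # 88.7%
    "GPT-3.5": 112,            # 74.7%
}
N = 150  # questions per model

def two_proportion_p(x1: int, x2: int, n: int = N) -> float:
    """Two-sided two-proportion z-test p-value with pooled standard error."""
    p1, p2 = x1 / n, x2 / n
    pooled = (x1 + x2) / (2 * n)
    se = sqrt(pooled * (1 - pooled) * (2 / n))
    z = abs(p1 - p2) / se
    return erfc(z / sqrt(2))  # equals 2 * (1 - Phi(z))

pairs = list(combinations(correct, 2))
m = len(pairs)  # 10 pairwise comparisons among 5 models

# Bonferroni correction: multiply each raw p-value by m, cap at 1.
adjusted = {
    (a, b): min(1.0, m * two_proportion_p(correct[a], correct[b]))
    for a, b in pairs
}

for (a, b), p in sorted(adjusted.items(), key=lambda kv: kv[1]):
    print(f"{a} vs {b}: Bonferroni-adjusted p = {p:.3g}")
```

Consistent with the abstract, the largest accuracy gap (GPT-3.5 vs. AtlasGPT) yields the smallest adjusted p-value under this illustrative test as well.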
