Evaluation of ChatGPT's Performance in Residency Training Progress Exams and Competency Exams in Orthopedics and Traumatology
Abstract
Background
Artificial intelligence (AI) technologies have rapidly expanded into medical education, offering innovative tools for training and assessment. This study evaluated the performance of the ChatGPT-3.5 language model on the "Residency Training Progress Examination" (UEGS) and the "Competency Examination" administered by the Turkish Society of Orthopedics and Traumatology (TOTBID). The objective was to determine whether ChatGPT performs comparably to orthopedic residents and whether it can achieve a passing score on the Competency Exam.

Methods
A total of 2,000 UEGS and 1,000 Competency Exam questions (2012–2023, excluding 2020) were presented to ChatGPT-3.5 using standardized prompts designed within the Role–Goals–Context (RGC) framework. The model's responses were statistically compared with those of orthopedic residents and specialists using the Mann–Whitney U and Kruskal–Wallis tests (p < 0.05).

Results
ChatGPT achieved its highest accuracy in the General Orthopedics category (62%) and its lowest in Adult Reconstructive Surgery (40%). It outperformed residents only in the Spine Surgery category (p < 0.05). ChatGPT passed four of the ten Competency Exams.

Conclusion
ChatGPT-3.5 demonstrated limited reliability and accuracy on orthopedic examinations and should be used cautiously as an educational support tool. Future studies involving newer multimodal versions of large language models may clarify their potential role in medical education and assessment.
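The Methods compare score distributions with the Mann–Whitney U and Kruskal–Wallis tests. The following is a minimal sketch of how such a nonparametric comparison could be run with scipy.stats; the score arrays are invented placeholders for illustration only, not data from this study.

```python
# Hypothetical sketch: nonparametric comparison of accuracy scores,
# mirroring the tests named in the Methods. Values below are invented.
from scipy.stats import mannwhitneyu, kruskal

# Placeholder per-category accuracy (proportion correct) for each group.
chatgpt_scores = [0.62, 0.55, 0.48, 0.40, 0.51]
resident_scores = [0.58, 0.60, 0.57, 0.59, 0.61]
specialist_scores = [0.70, 0.68, 0.72, 0.66, 0.71]

# Mann-Whitney U: two independent groups (ChatGPT vs. residents).
u_stat, p_mw = mannwhitneyu(chatgpt_scores, resident_scores, alternative="two-sided")
print(f"Mann-Whitney U = {u_stat:.1f}, p = {p_mw:.3f}")

# Kruskal-Wallis: three or more groups compared at once.
h_stat, p_kw = kruskal(chatgpt_scores, resident_scores, specialist_scores)
print(f"Kruskal-Wallis H = {h_stat:.2f}, p = {p_kw:.3f}")
```

Differences would be reported as significant at p < 0.05, consistent with the threshold stated in the Methods.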