Do They Learn When They Read? A Two-Stage Evaluation of AI Models’ Orthopedic Knowledge Using Orthobullets and Miller’s Review
Abstract
Objective: This study evaluates the orthopedic knowledge of large language model (LLM)-based artificial intelligence systems. ChatGPT (OpenAI), Gemini (Google), Claude (Anthropic), and Perplexity AI were compared on their responses to 110 multiple-choice questions sourced from the Orthobullets platform. The study also examined how each model's performance changed after it was granted access to Miller's Review of Orthopaedics, a widely recognized core reference in orthopedic education.

Methods: The study was conducted in two phases. In the first phase, the 110 orthopedic questions were posed to each AI model without any external reference material. In the second phase, each model's session was reset to clear prior context, and only the PDF version of Miller's Review of Orthopaedics was uploaded. The same 110 questions were then re-asked, this time with access to the textbook. The number of correct responses and the overall accuracy rate were calculated for each model, and performance differences between phases were analyzed.

Results: A comparative analysis of each model's performance before and after literature access is presented. Accuracy rates, topic-based performance distributions, and phase-over-phase improvements were reported using descriptive statistics. Several models showed a marked increase in accuracy after gaining access to the reference material.

Conclusion: When given access to orthopedic literature, AI models can substantially improve their accuracy, indicating their potential as tools for medical education and clinical decision support. By comparing the models' ability to learn from a structured medical source, this study contributes to the integration of LLMs into healthcare applications.
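For readers who want to reproduce the phase-over-phase comparison described in the Methods, a minimal scoring sketch is shown below. The CSV layout, column names (model, phase, correct), the file name graded_responses.csv, and the phase labels are illustrative assumptions, not the authors' actual pipeline; only the question count (110 per model per phase) and the accuracy computation come from the abstract.

```python
"""Minimal sketch: per-model accuracy before and after textbook access.

Assumes a hypothetical CSV with one graded row per (model, phase,
question): columns `model`, `phase` ("baseline" or "with_textbook"),
and `correct` ("1"/"0").
"""
import csv
from collections import defaultdict

N_QUESTIONS = 110  # questions posed to each model in each phase


def load_responses(path):
    """Read graded responses as a list of dict rows."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def accuracy_by_model_and_phase(rows):
    """Return {(model, phase): accuracy} as a fraction of N_QUESTIONS."""
    n_correct = defaultdict(int)
    for row in rows:
        if row["correct"].strip().lower() in {"1", "true", "yes"}:
            n_correct[(row["model"], row["phase"])] += 1
    return {key: n / N_QUESTIONS for key, n in n_correct.items()}


if __name__ == "__main__":
    rows = load_responses("graded_responses.csv")  # hypothetical file
    acc = accuracy_by_model_and_phase(rows)
    for model in sorted({model for model, _ in acc}):
        before = acc.get((model, "baseline"), 0.0)
        after = acc.get((model, "with_textbook"), 0.0)
        print(f"{model}: {before:.1%} -> {after:.1%} "
              f"(change {after - before:+.1%})")
```

Because each model answers the same 110 questions in both phases, the before/after difference printed here corresponds to the descriptive improvement the abstract reports; a paired analysis of the per-question outcomes would be a natural extension but is not claimed by the study.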