Do They Learn When They Read? A Two-Stage Evaluation of AI Models’ Orthopedic Knowledge Using Orthobullets and Miller’s Review
Abstract
Objective: This study evaluates the orthopedic knowledge of large language model (LLM)-based artificial intelligence systems. ChatGPT (OpenAI), Gemini (Google), Claude (Anthropic), and Perplexity AI were compared on their responses to 110 multiple-choice questions sourced from the Orthobullets platform. The study also examined how each model's performance changed after it was granted access to Miller's Review of Orthopaedics, a widely recognized core reference in orthopedic education.

Methods: The study was conducted in two phases. In the first phase, the 110 orthopedic questions were posed to each AI model without any external reference material. In the second phase, each model's session was reset to clear prior context, and only the PDF version of Miller's Review of Orthopaedics was uploaded. The same 110 questions were then re-asked, this time with access to the textbook. The number of correct responses and the overall accuracy rate were calculated for each model, and performance differences between phases were analyzed.

Results: A comparative analysis of each model's performance before and after literature access is presented. Accuracy rates, topic-based performance distributions, and phase-over-phase improvements were reported using descriptive statistics. Several models showed a marked increase in accuracy after gaining access to the reference material.

Conclusion: When given access to orthopedic literature, AI models can substantially improve their accuracy, indicating their potential as tools for medical education and clinical decision support. By comparing the models' ability to learn from a structured medical source, this study contributes to the integration of LLMs into healthcare applications.
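For readers who want to reproduce the phase-over-phase comparison described in the Methods, a minimal scoring sketch is shown below. The CSV layout, column names (model, phase, correct), the file name graded_responses.csv, and the phase labels are illustrative assumptions, not the authors' actual pipeline; only the question count (110 per model per phase) and the accuracy computation come from the abstract.

```python
"""Minimal sketch: per-model accuracy before and after textbook access.

Assumes a hypothetical CSV with one graded row per (model, phase,
question): columns `model`, `phase` ("baseline" or "with_textbook"),
and `correct` ("1"/"0").
"""
import csv
from collections import defaultdict

N_QUESTIONS = 110  # questions posed to each model in each phase


def load_responses(path):
    """Read graded responses as a list of dict rows."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def accuracy_by_model_and_phase(rows):
    """Return {(model, phase): accuracy} as a fraction of N_QUESTIONS."""
    n_correct = defaultdict(int)
    for row in rows:
        if row["correct"].strip().lower() in {"1", "true", "yes"}:
            n_correct[(row["model"], row["phase"])] += 1
    return {key: n / N_QUESTIONS for key, n in n_correct.items()}


if __name__ == "__main__":
    rows = load_responses("graded_responses.csv")  # hypothetical file
    acc = accuracy_by_model_and_phase(rows)
    for model in sorted({model for model, _ in acc}):
        before = acc.get((model, "baseline"), 0.0)
        after = acc.get((model, "with_textbook"), 0.0)
        print(f"{model}: {before:.1%} -> {after:.1%} "
              f"(change {after - before:+.1%})")
```

Because each model answers the same 110 questions in both phases, the before/after difference printed here corresponds to the descriptive improvement the abstract reports; a paired analysis of the per-question outcomes would be a natural extension but is not claimed by the study.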