Artificial Intelligence in Clinical Practice: Evaluating Chatbot Performance on Board-Level Questions in Geriatrics
Abstract
BACKGROUND
Artificial intelligence (AI) language models are increasingly being explored as tools to support medical education and clinical care. Evaluating their performance on valid and reliable assessments, such as board certification exams, may provide insight into their potential integration into real-world medical settings. This study evaluated the accuracy, consistency, and difficulty-rating ability of four advanced AI models on board-level geriatrics questions.

METHODS
Four AI models—Grok-3, ChatGPT-4o, Microsoft Copilot, and Google Gemini 2.0 Flash—were tested on 300 text-based multiple-choice questions from the BoardVitals geriatrics certification question bank. The questions were equally divided into easy, medium, and hard categories. Each model was asked to classify each question's difficulty and to answer it on two separate attempts. Model responses were evaluated for accuracy, consistency between attempts, quality of explanations, and alignment with the difficulty ratings predefined by BoardVitals.

RESULTS
GPT-4o demonstrated the highest overall accuracy (85.3%), followed by Grok-3 (82.0%), Copilot (78.7%), and Gemini (74.0%). All models performed best on easy questions, with accuracy decreasing as difficulty increased. GPT-4o exhibited the highest consistency (96.3%), followed by Grok-3 (95.0%), Copilot (90.7%), and Gemini (81.3%). Although overall model performance surpassed the average success rates of human users in the database, agreement between model-assigned and reference difficulty ratings was only moderate (mean κ = 0.41). GPT-4o received the highest mean explanation quality score (4.68 ± 0.84), followed by Grok-3 (4.59 ± 0.98), Copilot (4.30 ± 1.07), and Gemini (3.88 ± 1.53).

CONCLUSIONS
Advanced AI models demonstrate strong performance on geriatrics board-level content, with potential implications for education and decision support. However, difficulties with complex scenarios, inaccurate assessment of question difficulty, and inconsistent explanation quality limit the direct implementation of these tools in practice. A rigorous process, with experienced clinicians supervising every step, is essential for their safe and meaningful integration.
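The abstract does not include analysis code. As a rough illustration only, the sketch below shows one way the reported metrics might be computed for a single model: first-attempt accuracy, between-attempt consistency, and Cohen's κ between model-assigned and reference difficulty labels. The record structure and sample data are hypothetical assumptions, not taken from the study.

```python
# Illustrative sketch (not the authors' code): per-model accuracy,
# between-attempt consistency, and difficulty-rating agreement (Cohen's kappa),
# assuming hypothetical per-question records.
from sklearn.metrics import cohen_kappa_score

# Hypothetical records:
# (correct_key, answer_attempt1, answer_attempt2, model_difficulty, reference_difficulty)
records = [
    ("B", "B", "B", "easy",   "easy"),
    ("D", "D", "C", "medium", "hard"),
    ("A", "A", "A", "hard",   "hard"),
    # ... one entry per question in the 300-item bank
]

n = len(records)

# Proportion of questions answered correctly on the first attempt
accuracy = sum(a1 == key for key, a1, _, _, _ in records) / n

# Proportion of questions answered identically on both attempts
consistency = sum(a1 == a2 for _, a1, a2, _, _ in records) / n

# Agreement between model-assigned and reference difficulty labels
kappa = cohen_kappa_score(
    [m for *_, m, _ in records],
    [r for *_, r in records],
)

print(f"accuracy={accuracy:.3f}  consistency={consistency:.3f}  kappa={kappa:.3f}")
```

Under the study's design, this calculation would be repeated for each of the four models and stratified by the three BoardVitals difficulty categories.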