Psychometric Alignment Between Human and Artificial Intelligence Performance in Cardiology Residency In-Service Examinations
Abstract
Background: Large language models (LLMs) have demonstrated rapidly expanding capabilities across medical knowledge tasks, including professional examinations. However, most existing evaluations focus primarily on overall accuracy and provide limited insight into how AI performance relates to the psychometric structure of examination items.

Methods: We evaluated the performance of five LLMs on a dataset of 199 cardiology residency in-service examination questions. The models included three frontier general-purpose systems (Claude 4.6 Opus, Gemini 3.1 Flash-Lite, and GPT-5.4) and two medically oriented open-source models (MedQwen-2.5 and Qwen-3.5). Item-level analyses were conducted to examine the associations between AI accuracy and psychometric characteristics of exam questions, including human-defined item difficulty and item discrimination. Multivariable logistic regression was used to identify independent predictors of AI performance. Alignment between human and AI performance was assessed using Spearman correlation and distractor overlap analysis.

Results: Frontier models substantially outperformed medically oriented open-source models, achieving accuracies of 86.4% for Claude Opus, 82.9% for Gemini Flash-Lite, and 82.4% for GPT-5.4, compared with 53.3% for MedQwen and 18.6% for Qwen-3.5-35B. AI performance followed a clear gradient across human-defined difficulty levels, with frontier models answering 65–74% of hard questions and 92–96% of easy questions correctly. In multivariable analyses, item difficulty was the only psychometric factor consistently associated with AI success across frontier models (OR range 0.37–0.47, all p < 0.01). Human and AI performance were significantly correlated across items (Spearman ρ ≈ 0.25–0.30, p < 0.001). When AI models answered incorrectly, they frequently selected the same distractors as human examinees, with error overlap ranging from 31% to 53%.

Conclusions: Large language models demonstrate strong performance on cardiology residency examination questions and exhibit meaningful alignment with human-defined item difficulty and performance patterns. These findings suggest that AI performance on medical examinations is structured by the same psychometric characteristics that shape human assessment outcomes. Integrating AI benchmarking with psychometric analysis may provide a more informative framework for evaluating future AI systems in medical education and knowledge assessment.
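To make the item-level workflow described in the Methods concrete, the sketch below shows, in Python, how one might relate per-item AI correctness to psychometric characteristics via multivariable logistic regression, estimate human-AI alignment with a Spearman correlation, and compute a simple distractor overlap rate. This is an illustrative sketch only: the authors' actual code and data are not provided, so the simulated dataset, column names, and model specification are assumptions made for demonstration.

```python
# Minimal sketch of the item-level analyses described in the abstract.
# All data below are simulated; column names and specifications are assumed.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Hypothetical item-level table: one row per exam question.
# 'difficulty'      = human-defined item difficulty (share of examinees missing the item)
# 'discrimination'  = item discrimination index
# 'ai_correct'      = 1 if the model answered the item correctly
n_items = 199
items = pd.DataFrame({
    "difficulty": rng.uniform(0.1, 0.9, n_items),
    "discrimination": rng.uniform(0.0, 0.6, n_items),
})
items["human_accuracy"] = 1.0 - items["difficulty"]
# Simulated AI responses that become less accurate on harder items.
items["ai_correct"] = rng.binomial(1, 0.95 - 0.5 * items["difficulty"])

# Multivariable logistic regression: which psychometric characteristics
# independently predict whether the model answers an item correctly?
X = sm.add_constant(items[["difficulty", "discrimination"]])
logit = sm.Logit(items["ai_correct"], X).fit(disp=False)
print(logit.summary())
print("Odds ratios:\n", np.exp(logit.params))

# Human-AI alignment: Spearman correlation between human item accuracy
# and AI correctness across items.
rho, p = spearmanr(items["human_accuracy"], items["ai_correct"])
print(f"Spearman rho = {rho:.2f}, p = {p:.3g}")

# Distractor overlap: among items the model answered incorrectly, how often
# did it choose the same (here, randomly simulated) option as the most
# commonly selected human distractor?
ai_choice = rng.integers(0, 4, n_items)             # model's selected option index
top_human_distractor = rng.integers(0, 4, n_items)  # most popular wrong option among examinees
wrong = items["ai_correct"].to_numpy() == 0
overlap = (ai_choice[wrong] == top_human_distractor[wrong]).mean()
print(f"Distractor overlap on errors: {overlap:.0%}")
```

In this framing, the reported odds ratios below 1 for item difficulty correspond to the paper's finding that harder items (higher human miss rates) reduce the odds of a correct AI response, while the Spearman coefficient and overlap rate quantify how closely AI errors track human error patterns.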