Psychometric Alignment Between Human and Artificial Intelligence Performance in Cardiology Residency In-Service Examinations
Abstract
Background: Large language models (LLMs) have demonstrated rapidly expanding capabilities across medical knowledge tasks, including professional examinations. However, most existing evaluations focus primarily on overall accuracy and provide limited insight into how AI performance relates to the psychometric structure of examination items.

Methods: We evaluated the performance of five LLMs on a dataset of 199 cardiology residency in-service examination questions. The models included three frontier general-purpose systems (Claude 4.6 Opus, Gemini 3.1 Flash-Lite, and GPT-5.4) and two medically oriented open-source models (MedQwen-2.5 and Qwen-3.5). Item-level analyses were conducted to examine the associations between AI accuracy and psychometric characteristics of exam questions, including human-defined item difficulty and item discrimination. Multivariable logistic regression was used to identify independent predictors of AI performance. Alignment between human and AI performance was assessed using Spearman correlation and distractor overlap analysis.

Results: Frontier models substantially outperformed medically oriented open-source models, achieving accuracies of 86.4% for Claude Opus, 82.9% for Gemini Flash-Lite, and 82.4% for GPT-5.4, compared with 53.3% for MedQwen and 18.6% for Qwen-3.5-35B. AI performance followed a clear gradient across human-defined difficulty levels, with frontier models answering 65–74% of hard questions and 92–96% of easy questions correctly. In multivariable analyses, item difficulty was the only psychometric factor consistently associated with AI success across frontier models (OR range 0.37–0.47, all p < 0.01). Human and AI performance were significantly correlated across items (Spearman ρ ≈ 0.25–0.30, p < 0.001). When AI models answered incorrectly, they frequently selected the same distractors as human examinees, with error overlap ranging from 31% to 53%.

Conclusions: Large language models demonstrate strong performance on cardiology residency examination questions and exhibit meaningful alignment with human-defined item difficulty and performance patterns. These findings suggest that AI performance on medical examinations is structured by the same psychometric characteristics that shape human assessment outcomes. Integrating AI benchmarking with psychometric analysis may provide a more informative framework for evaluating future AI systems in medical education and knowledge assessment.
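To make the item-level workflow described in the Methods concrete, the sketch below shows, in Python, how one might relate per-item AI correctness to psychometric characteristics via multivariable logistic regression, estimate human-AI alignment with a Spearman correlation, and compute a simple distractor overlap rate. This is an illustrative sketch only: the authors' actual code and data are not provided, so the simulated dataset, column names, and model specification are assumptions made for demonstration.

```python
# Minimal sketch of the item-level analyses described in the abstract.
# All data below are simulated; column names and specifications are assumed.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Hypothetical item-level table: one row per exam question.
# 'difficulty'      = human-defined item difficulty (share of examinees missing the item)
# 'discrimination'  = item discrimination index
# 'ai_correct'      = 1 if the model answered the item correctly
n_items = 199
items = pd.DataFrame({
    "difficulty": rng.uniform(0.1, 0.9, n_items),
    "discrimination": rng.uniform(0.0, 0.6, n_items),
})
items["human_accuracy"] = 1.0 - items["difficulty"]
# Simulated AI responses that become less accurate on harder items.
items["ai_correct"] = rng.binomial(1, 0.95 - 0.5 * items["difficulty"])

# Multivariable logistic regression: which psychometric characteristics
# independently predict whether the model answers an item correctly?
X = sm.add_constant(items[["difficulty", "discrimination"]])
logit = sm.Logit(items["ai_correct"], X).fit(disp=False)
print(logit.summary())
print("Odds ratios:\n", np.exp(logit.params))

# Human-AI alignment: Spearman correlation between human item accuracy
# and AI correctness across items.
rho, p = spearmanr(items["human_accuracy"], items["ai_correct"])
print(f"Spearman rho = {rho:.2f}, p = {p:.3g}")

# Distractor overlap: among items the model answered incorrectly, how often
# did it choose the same (here, randomly simulated) option as the most
# commonly selected human distractor?
ai_choice = rng.integers(0, 4, n_items)             # model's selected option index
top_human_distractor = rng.integers(0, 4, n_items)  # most popular wrong option among examinees
wrong = items["ai_correct"].to_numpy() == 0
overlap = (ai_choice[wrong] == top_human_distractor[wrong]).mean()
print(f"Distractor overlap on errors: {overlap:.0%}")
```

In this framing, the reported odds ratios below 1 for item difficulty correspond to the paper's finding that harder items (higher human miss rates) reduce the odds of a correct AI response, while the Spearman coefficient and overlap rate quantify how closely AI errors track human error patterns.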