Comparison of AI-Generated and Clinician-Designed Multiple Choice Questions in Emergency Medicine Exam: A Psychometric Analysis
Abstract
Background/aim: This study compared the effectiveness and psychometric quality of artificial intelligence (AI)-generated multiple-choice questions (MCQs), specifically those produced by ChatGPT-4o, with clinician-designed MCQs in an emergency medicine residency program.

Methods: Eighteen emergency medicine residents completed an examination of 100 questions (50 AI-generated and 50 clinician-designed) based on core emergency medicine topics. Psychometric analysis assessed item difficulty, discrimination, and reliability using the point-biserial correlation coefficient (PBCC).

Results: There was no significant difference in discrimination indices between AI-generated and clinician-designed MCQs, indicating that both question sets were similarly effective at differentiating between high and low performers. However, AI-generated MCQs were significantly more difficult (mean item difficulty index 0.65 versus 0.76, where a lower index indicates a harder item; p = 0.02). Consistent with this, residents scored significantly higher on clinician-designed questions than on AI-generated ones (mean score, 76.8 versus 67.3; p = 0.003). Both question sets demonstrated comparable reliability in assessing resident knowledge, as indicated by similar PBCC values.

Conclusion: This study highlights the potential for AI-generated MCQs to effectively supplement clinician-designed assessments, demonstrating comparable psychometric properties and reliability. However, the higher difficulty of AI-generated questions underscores the need for expert review and oversight to ensure appropriateness and contextual accuracy. Further research with larger sample sizes and diverse medical settings is recommended to validate these findings and explore the broader implications of incorporating AI into medical education assessment strategies.
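The item difficulty index and point-biserial correlation named in the Methods are standard classical test theory statistics: the difficulty index is the proportion of examinees answering an item correctly, and the PBCC is the correlation between an item's score and the examinee's total score. The abstract does not describe the study's analysis code, so the following is a minimal illustrative sketch, assuming a binary examinee-by-item response matrix; all variable names and the simulated data are hypothetical, not taken from the study.

import numpy as np

def item_difficulty(responses: np.ndarray) -> np.ndarray:
    # Item difficulty index: proportion of examinees answering each item correctly.
    # responses is a binary matrix (examinees x items), 1 = correct, 0 = incorrect.
    # A lower value indicates a harder item.
    return responses.mean(axis=0)

def point_biserial(responses: np.ndarray) -> np.ndarray:
    # Point-biserial correlation between each item score and the rest of the test.
    # Higher values indicate items that better discriminate between
    # high- and low-performing examinees.
    total = responses.sum(axis=1)
    n_items = responses.shape[1]
    r_pb = np.empty(n_items)
    for j in range(n_items):
        item = responses[:, j]
        # Exclude the item itself from the total to avoid inflating the correlation.
        rest = total - item
        r_pb[j] = np.corrcoef(item, rest)[0, 1]
    return r_pb

# Hypothetical example: 18 examinees x 100 items of simulated binary responses,
# mirroring the study's dimensions only in shape.
rng = np.random.default_rng(0)
responses = rng.integers(0, 2, size=(18, 100))
print(item_difficulty(responses)[:5])
print(point_biserial(responses)[:5])

This sketch uses the corrected item-total correlation (each item is excluded from its own total) so that an item does not inflate its own discrimination estimate; whether the study applied this correction is not stated in the abstract.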