Large Language Models for the assessment of medical students’ clinical decision-making
Abstract
The assessment of medical students’ clinical decision-making (CDM) skills is fundamental to healthcare education: it identifies knowledge gaps, enables targeted feedback, and validates that graduates meet professional competency standards. However, traditional assessment methods relying on expert human raters are resource-intensive and difficult to scale. This study investigated whether Large Language Models (LLMs), specifically ChatGPT, could serve as automated assessors of medical students’ CDM skills. We compared LLM-generated assessments with ratings from two human raters across 21 medical student history-taking conversations using the Clinical Reasoning Indicator – Health Training Indicator (CRI-HTI). The results showed strong agreement between the human raters and the LLM (ICC = .675–.782, MAE = 0.343), with over 91% of ratings within 0.5 points of each other. Item-level analysis revealed moderate to excellent reliability across all eight CRI-HTI criteria. Additionally, we tested for gender bias by presenting identical transcripts with different gender designations (male, female, neutral) to the LLM. No significant differences were found between gendered prompts (p > .05), suggesting that the LLM maintained consistent evaluation standards regardless of the subject’s gender. These findings provide empirical evidence that LLMs could serve as consistent raters, unbiased with respect to gender, for supporting CDM assessment in medical education, potentially offering a scalable solution for providing timely feedback to medical students.
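For readers unfamiliar with the agreement statistics reported above, the sketch below shows one way the ICC, MAE, and within-0.5-points figures could be computed for a pair of raters. This is not the study’s analysis code: the rating values are purely illustrative, it assumes one paired score per transcript, and it uses the pingouin library’s intraclass_corr function for the ICC.

```python
# Minimal sketch (illustrative data, not the study's code): quantify
# agreement between a human rater and an LLM rater with ICC and MAE.
import numpy as np
import pandas as pd
import pingouin as pg

# Hypothetical paired scores, one per transcript, from each rater.
human = np.array([3.0, 2.5, 4.0, 3.5, 2.0, 4.5, 3.0, 2.5])
llm   = np.array([3.5, 2.5, 4.0, 3.0, 2.5, 4.0, 3.0, 3.0])

# Long-format frame, as required by pingouin.intraclass_corr.
df = pd.DataFrame({
    "transcript": np.tile(np.arange(len(human)), 2),
    "rater": ["human"] * len(human) + ["llm"] * len(llm),
    "score": np.concatenate([human, llm]),
})

# Intraclass correlation coefficients (all six standard forms).
icc = pg.intraclass_corr(data=df, targets="transcript",
                         raters="rater", ratings="score")
print(icc[["Type", "ICC", "CI95%"]])

# Mean absolute error and share of ratings within 0.5 points.
diffs = np.abs(human - llm)
print(f"MAE = {diffs.mean():.3f}, "
      f"within 0.5 points = {np.mean(diffs <= 0.5):.1%}")
```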