Large Language Models for the assessment of medical students’ clinical decision-making
Abstract
The assessment of medical students’ clinical decision-making (CDM) skills is fundamental to healthcare education: it identifies knowledge gaps, enables targeted feedback, and validates that graduates meet professional competency standards. However, traditional assessment methods relying on expert human raters are resource-intensive and difficult to scale. This study investigated whether Large Language Models (LLMs), specifically ChatGPT, could serve as automated assessors of medical students’ CDM skills. We compared LLM-generated assessments with ratings from two human raters across 21 medical student history-taking conversations using the Clinical Reasoning Indicator – Health Training Indicator (CRI-HTI). The results showed strong agreement between the human raters and the LLM (ICC = .675–.782, MAE = 0.343), with over 91% of ratings within 0.5 points of each other. Item-level analysis revealed moderate to excellent reliability across all eight CRI-HTI criteria. Additionally, we tested for gender bias by presenting identical transcripts with different gender designations (male, female, neutral) to the LLM. No significant differences were found between gendered prompts (p > .05), suggesting that the LLM maintained consistent evaluation standards regardless of the subject’s gender. These findings provide empirical evidence that LLMs could serve as consistent raters, unbiased with respect to gender, for supporting CDM assessment in medical education, potentially offering a scalable solution for providing timely feedback to medical students.
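For readers unfamiliar with the agreement statistics reported above, the sketch below shows one way the ICC, MAE, and within-0.5-points figures could be computed for a pair of raters. This is not the study’s analysis code: the rating values are purely illustrative, it assumes one paired score per transcript, and it uses the pingouin library’s intraclass_corr function for the ICC.

```python
# Minimal sketch (illustrative data, not the study's code): quantify
# agreement between a human rater and an LLM rater with ICC and MAE.
import numpy as np
import pandas as pd
import pingouin as pg

# Hypothetical paired scores, one per transcript, from each rater.
human = np.array([3.0, 2.5, 4.0, 3.5, 2.0, 4.5, 3.0, 2.5])
llm   = np.array([3.5, 2.5, 4.0, 3.0, 2.5, 4.0, 3.0, 3.0])

# Long-format frame, as required by pingouin.intraclass_corr.
df = pd.DataFrame({
    "transcript": np.tile(np.arange(len(human)), 2),
    "rater": ["human"] * len(human) + ["llm"] * len(llm),
    "score": np.concatenate([human, llm]),
})

# Intraclass correlation coefficients (all six standard forms).
icc = pg.intraclass_corr(data=df, targets="transcript",
                         raters="rater", ratings="score")
print(icc[["Type", "ICC", "CI95%"]])

# Mean absolute error and share of ratings within 0.5 points.
diffs = np.abs(human - llm)
print(f"MAE = {diffs.mean():.3f}, "
      f"within 0.5 points = {np.mean(diffs <= 0.5):.1%}")
```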