Large Language Models in Undergraduate Medical Education: A Scoping Review of Use Cases, Effectiveness, and Limitations
Abstract
Background Large Language Models (LLMs) such as ChatGPT have been identified as potential additions to undergraduate medical education (UGME). Their applications include assessment, simulation, and personalized learning, although their efficacy and risks remain poorly understood. The breadth, nature, and range of the available evidence are unclear, so a mapping of evidence on their utility, accuracy, and limitations is needed to inform evidence-based integration.

Methods A scoping review was performed to identify empirical literature on the use of LLMs in UGME. Eligible studies were experimental, cross-sectional, and qualitative designs that evaluated LLM use in formative and/or summative activities. Studies were included if they applied LLMs for assessment, simulation, or educational support among pre-clinical or clinical students. Study design, country/income level, LLM model, purpose, mode of use, student level, prompting mode, outcomes, and limitations were extracted.

Results A total of nine studies were included from seven countries (all high- or upper-middle-income). Designs ranged from cross-sectional trials and feasibility studies to qualitative focus groups and mixed-methods scoring analyses. The most frequently used LLM was ChatGPT (versions 3.5, 4, and 4o). Applications included MCQ generation, automated scoring (OSCEs and short answers), clinical simulation, revision support, and documentation feedback. Performance varied: 91% of MCQ templates were usable, LLM scores correlated highly with human scores (r = 0.599–0.732), and GPT-4-generated items were judged nearly equivalent to expert-written questions. Reported risks included hallucinations (38% success rate), content errors, lack of empathy, and biased answer generation. Prompt engineering and human oversight were necessary to ensure output quality.
Conclusion LLMs show moderate to high feasibility in UGME settings, particularly when combined with structured prompts and expert review. Despite their potential for formative use and scalability, rigorous psychometric validation and learner-centered research are still needed.