Quality assurance and validity of AI-generated Single Best Answer questions

Abstract

Background

Recent advancements in generative artificial intelligence have opened new avenues in educational methodologies, particularly in medical education. This study seeks to assess whether generative AI might be useful in addressing the depletion of assessment question banks, a challenge intensified during the COVID-19 era by the prevalence of open-book examinations, and to augment the pool of formative assessment opportunities available to students. While many recent publications have sought to ascertain whether AI can achieve a passing standard in existing examinations, this study investigates the potential for AI to generate the exam itself.

Summary of Work

This research used a commercially available AI large language model (LLM), OpenAI GPT-4, to generate 220 single best answer (SBA) questions adhering to Medical Schools Council Assessment Alliance guidelines and a selection of Learning Outcomes (LOs) of the Scottish Graduate-Entry Medicine (ScotGEM) program. The AI-generated questions underwent quality-assurance screening to ensure compliance with the stipulated guidelines and LOs. A subset of these questions was then incorporated into an examination alongside an equal number of human-authored questions, which was subsequently undertaken by a cohort of medical students. The performance of AI-generated and human-authored questions was compared, using facility and discrimination index as the key metrics.

Summary of Results

The screening process revealed that 69% of AI-generated SBAs were fit for inclusion in the examinations with little or no modification required. Modifications, when necessary, were predominantly due to the inclusion of "all of the above" options, use of American English spellings, and non-alphabetized answer choices. The remaining 31% of questions were rejected due to factual inaccuracies or non-alignment with students' learning. When the questions were included in an examination, post hoc statistical analysis indicated no significant difference between the AI-authored and human-authored questions in terms of facility or discrimination index.

Discussion and Conclusion

The outcomes of this study suggest that AI LLMs can generate SBA questions that are in line with best-practice guidelines and specific LOs. However, a robust quality assurance process is necessary to ensure that erroneous questions are identified and rejected. The insights gained from this research provide a foundation for further investigation into refining AI prompts, aiming for more reliable generation of curriculum-aligned questions. LLMs show significant potential in supplementing traditional methods of question generation in medical education. This approach offers a viable solution to rapidly replenish and diversify assessment resources in medical curricula, marking a step forward in the intersection of AI and education.
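For readers unfamiliar with the item-analysis metrics named above, the sketch below illustrates the conventional definitions of facility (proportion of candidates answering an item correctly) and discrimination index (difference in item facility between the highest- and lowest-scoring groups, here using the common upper/lower 27% split). This is a minimal illustrative example with hypothetical data; it does not reproduce the study's actual analysis or software.

```python
# Illustrative item-analysis metrics (hypothetical data, not the study's code).

def facility(item_scores):
    """Proportion of candidates who answered the item correctly (scores coded 0/1)."""
    return sum(item_scores) / len(item_scores)

def discrimination_index(item_scores, total_scores, group_fraction=0.27):
    """Difference in item facility between the top and bottom scoring groups,
    where groups are defined by candidates' total examination scores."""
    ranked = sorted(zip(total_scores, item_scores), key=lambda pair: pair[0], reverse=True)
    n_group = max(1, int(len(ranked) * group_fraction))
    upper = [item for _, item in ranked[:n_group]]   # highest total scores
    lower = [item for _, item in ranked[-n_group:]]  # lowest total scores
    return facility(upper) - facility(lower)

if __name__ == "__main__":
    # Hypothetical cohort: per-candidate score on one item (1 = correct) and total exam score.
    item = [1, 1, 0, 1, 0, 1, 1, 0, 1, 0]
    totals = [88, 75, 52, 91, 47, 69, 80, 55, 73, 40]
    print(f"Facility: {facility(item):.2f}")                      # 0.60
    print(f"Discrimination index: {discrimination_index(item, totals):.2f}")
```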
