Performance of Large Language Models in Nursing Licensure Examinations: A Systematic Review and Meta-Analysis
Abstract
Objectives: This systematic review and meta-analysis assessed the performance of large language models (LLMs) in nursing licensure examinations. Despite the increasing use of LLMs in healthcare education, their capabilities in nursing licensure examinations remain uncertain. This study provides evidence on the accuracy and limitations of LLMs to help guide their integration into nursing education and licensure.

Design: Systematic review and meta-analysis conducted in adherence to the PRISMA 2020 guidelines.

Data sources: PubMed, CINAHL, PsycINFO, EMCARE, and ERIC were searched from April to June 2025.

Eligibility criteria: Studies were eligible if they evaluated LLMs (e.g., GPT-4, ChatGPT, Qwen-2.5) using multiple-choice nursing licensure questions under exam-like conditions and reported quantitative accuracy. Open-ended items were excluded from the meta-analysis but narratively synthesised.

Review methods: Two reviewers independently screened studies, extracted data, and appraised risk of bias. A random-effects meta-analysis estimated pooled accuracy; subgroup and meta-regression analyses explored heterogeneity.

Results: Twelve studies assessed 13,870 multiple-choice questions (MCQs) across five exam systems and eight LLMs. Pooled accuracy was 69.6% (95% CI: 65.6–73.6%) with substantial heterogeneity (I² = 98%). GPT-4 outperformed GPT-3.5 (77.2% vs. 60.4%; six studies), and domain-customised and newer models reached 93.6%. LLMs excelled in general medicine and pharmacology but underperformed in ethics and psychosocial integrity. Accuracy differed significantly by exam system (p < 0.01) but not by question difficulty (p = 0.90) or format (p = 0.96). Translated NCLEX-RN items reduced accuracy (p = 0.03); the CNNLE was the only exam system with a significant positive effect (p < 0.001). Methodological variability and underreporting of model parameters were common.
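As a methodological aside, the random-effects pooling of accuracy proportions described in the Review methods can be sketched as below. This is a minimal illustration of the standard DerSimonian–Laird approach on the logit scale, not the review's actual analysis; the per-study counts are hypothetical placeholders.

```python
import math

# Hypothetical per-study data: (correct answers, total MCQs).
# Illustrative only -- not the counts extracted by the review.
studies = [(412, 600), (305, 500), (198, 250), (640, 900), (150, 220)]

# Logit-transform each study's accuracy; approximate variance of logit(p).
effects, variances = [], []
for k, n in studies:
    p = k / n
    effects.append(math.log(p / (1 - p)))
    variances.append(1 / k + 1 / (n - k))

# Inverse-variance (fixed-effect) weights give Cochran's Q statistic.
w = [1 / v for v in variances]
fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
Q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
df = len(studies) - 1

# DerSimonian-Laird estimate of between-study variance tau^2.
c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
tau2 = max(0.0, (Q - df) / c)

# Random-effects weights, pooled logit accuracy, and a 95% CI.
w_re = [1 / (v + tau2) for v in variances]
pooled = sum(wi * yi for wi, yi in zip(w_re, effects)) / sum(w_re)
se = math.sqrt(1 / sum(w_re))
lo, hi = pooled - 1.96 * se, pooled + 1.96 * se

inv_logit = lambda x: 1 / (1 + math.exp(-x))
I2 = max(0.0, (Q - df) / Q) * 100  # share of variability from heterogeneity

print(f"Pooled accuracy: {inv_logit(pooled):.1%} "
      f"(95% CI {inv_logit(lo):.1%}-{inv_logit(hi):.1%}), I² = {I2:.0f}%")
```

Back-transforming the pooled logit with the inverse-logit yields the pooled accuracy percentage and its confidence interval, while I² summarises how much of the observed variability reflects true between-study heterogeneity rather than sampling error.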
Conclusions: LLMs show promise for low-stakes educational applications, such as formative assessments within hybrid teaching models; however, they are unsuitable for unmoderated, high-stakes licensure decisions due to inconsistent performance. Regulatory guidelines, equitable access, and nursing-specific model development are needed to ensure fairness and validity. Research must prioritise standardised frameworks, error analysis, and broader geographic representation to address these limitations.