A Performance-Based Rubric for Generative AI Use in Medical Students’ Research Tasks: Development and Initial Psychometric Evaluation
Abstract
Background: As generative AI becomes embedded in medical training, patient safety depends on graduates' ability to recognize AI limitations and bias, document AI involvement transparently, and verify AI-generated information rather than accept it uncritically. We developed a performance-based rubric to assess observable generative AI (large language model, LLM) literacy behaviors within authentic coursework.

Methods: In a single-institution evaluation (Spring 2025), third-year medical students (n = 50 submissions) completed a structured research proposal and submitted the corresponding AI chat transcript and an AI-use disclosure. A four-domain rubric (AI Use Documentation, Prompt Generation, Verification, and Integration) was developed through three pilot–revise cycles. Each domain was scored 0–3 (total 0–12). Three educators independently scored all submissions. Inter-rater reliability was assessed with intraclass correlation coefficients (ICC; average-measures, agreement). Construct-relevant patterns were examined via domain score distributions (floor effects), performance bands (lower 25%, middle 50%, upper 25%), within-submission differences across domains (Friedman test with Bonferroni-adjusted Wilcoxon post hoc tests), inter-domain associations (Spearman correlations), and correlation with overall GPA (Spearman).

Results: Mean (SD) domain scores were AI Use Documentation 0.67 (1.08), Prompt Generation 1.33 (0.69), Verification 0.41 (0.71), and Integration 1.64 (0.67); the mean total score was 4.06 (1.80). Floor effects were substantial for AI Use Documentation (64% scored 0) and Verification (60% scored 0). Inter-rater reliability was high (ICC: Documentation 0.99, Prompt Generation 0.84, Verification 0.93, Integration 0.83). Verification scores were significantly lower than Prompt Generation and Integration scores (Bonferroni-adjusted p < 0.008). Inter-domain correlations were weak (ρ = −0.206 to 0.310). Total scores showed no significant association with GPA (ρ = 0.194, p = 0.201).

Conclusions: The rubric demonstrated strong scoring reliability and produced initial psychometric evidence consistent with measuring distinct, observable LLM-use competencies. Findings highlight prominent gaps in verification and transparent documentation, reinforcing competency guidance that emphasizes recognizing AI limitations and verifying AI output to protect patient safety. Further multi-site validation and implementation work is warranted.
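To make the analysis plan in the Methods concrete, the following is a minimal illustrative sketch of the described statistical workflow (ICC for inter-rater reliability, Friedman test with Bonferroni-adjusted Wilcoxon comparisons across domains, and Spearman correlations). It is not the authors' code: the synthetic data, variable names, and the choice of SciPy and Pingouin as libraries are assumptions made purely for illustration.

```python
# Illustrative sketch only: synthetic data and assumed libraries (NumPy, pandas,
# SciPy, Pingouin); this is not the authors' analysis code.
import numpy as np
import pandas as pd
import pingouin as pg
from scipy import stats

rng = np.random.default_rng(0)
n = 50  # number of submissions, as in the study

# Synthetic rubric scores (0-3 per domain) for one rater; real data would come
# from the scored research proposals.
domains = ["documentation", "prompting", "verification", "integration"]
scores = pd.DataFrame({d: rng.integers(0, 4, n) for d in domains})

# Within-submission comparison of the four domains: Friedman test, then
# Bonferroni-adjusted pairwise Wilcoxon signed-rank tests (6 pairs -> alpha ~0.008).
stat, p = stats.friedmanchisquare(*[scores[d] for d in domains])
print(f"Friedman: chi2 = {stat:.2f}, p = {p:.4f}")

alpha_adj = 0.05 / 6
for i, a in enumerate(domains):
    for b in domains[i + 1:]:
        w, p_pair = stats.wilcoxon(scores[a], scores[b])
        print(f"{a} vs {b}: p = {p_pair:.4f} (significant if < {alpha_adj:.4f})")

# Inter-domain associations and correlation with GPA: Spearman's rho.
rho, p_rho = stats.spearmanr(scores["verification"], scores["integration"])
gpa = rng.normal(3.4, 0.3, n)                      # synthetic GPA values
rho_gpa, p_gpa = stats.spearmanr(scores.sum(axis=1), gpa)

# Inter-rater reliability: ICC (average-measures, agreement) across three raters,
# computed from long-format (submission, rater, score) data.
long = pd.DataFrame({
    "submission": np.repeat(np.arange(n), 3),
    "rater": np.tile(["r1", "r2", "r3"], n),
    "score": rng.integers(0, 13, 3 * n),           # synthetic total scores (0-12)
})
icc = pg.intraclass_corr(data=long, targets="submission",
                         raters="rater", ratings="score")
print(icc.loc[icc["Type"] == "ICC2k", ["Type", "ICC", "CI95%"]])
```

In this sketch the average-measures agreement ICC (ICC2k in Pingouin's output) corresponds to reliability of the averaged ratings from the three educators, which is the form of ICC named in the Methods.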