Natural Language Processing and Generative AI in the Automated Scoring and Feedback of Reflective Writing in Medical Education: A Validity and Fairness Analysis
Abstract
The study explored the application of Natural Language Processing (NLP) and generative AI tools in assessing reflective writing submitted by medical students in Ghana. It evaluated the validity, fairness, and cultural alignment of AI-generated feedback by comparing AI scores with human rater assessments and analyzing discrepancies across demographic groups. A total of 180 reflective essays were sampled, with an equal number (n = 60) collected from each of the three participating universities. Quantitative methods included Cohen’s Kappa and Intraclass Correlation Coefficients (ICC) to assess inter-rater agreement, while logistic regression and multiple regression models examined potential biases across gender, university affiliation, and English proficiency. Qualitative data were gathered through interviews with students and faculty to explore perceptions of fairness, trust, and the AI’s capacity to capture cultural and linguistic nuances. Results indicated that the AI system demonstrated strong inter-rater reliability, with Cohen’s Kappa values of 0.74 (AI vs. Rater 1) and 0.76 (AI vs. Rater 2), and ICC values of 0.78 and 0.80, respectively. Human raters showed higher agreement with each other (Cohen’s Kappa = 0.81, ICC = 0.85). However, significant discrepancies emerged across demographic groups, particularly for English proficiency: students with lower proficiency tended to receive higher scores from the AI than from human raters (log-odds = 0.45, p = 0.001). Thematic analysis of the qualitative interviews revealed concerns over the lack of empathy in AI feedback, misalignment with cultural and linguistic nuances, and mixed levels of trust in AI-generated assessments. These findings suggest that while AI holds promise for improving assessment efficiency, careful attention must be given to its limitations in fairness and cultural sensitivity. The study concluded with recommendations for improving AI systems through contextual adaptation, hybrid assessment models, faculty training, and regular bias audits to ensure equitable and effective use of AI in educational settings.
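The agreement and bias analyses named above (Cohen’s Kappa, ICC, and logistic regression on demographic predictors) can be reproduced in outline with standard Python tooling. The sketch below is illustrative only: the data are synthetic, and the variable names (essay, rater, score, ai_higher, low_proficiency) are hypothetical stand-ins for the study’s actual variables; it assumes scikit-learn, pingouin, and statsmodels are available.

```python
# Minimal sketch of the abstract's statistical workflow.
# Data are synthetic; column names are hypothetical placeholders.
import numpy as np
import pandas as pd
import pingouin as pg
import statsmodels.formula.api as smf
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(42)
n = 180  # number of reflective essays in the study

# Synthetic ordinal scores on a 1-5 rubric (assumed scale), with the
# second rater and the AI perturbed around the first rater's scores.
rater1 = rng.integers(1, 6, size=n)
rater2 = np.clip(rater1 + rng.integers(-1, 2, size=n), 1, 5)
ai     = np.clip(rater1 + rng.integers(-1, 2, size=n), 1, 5)

# Pairwise chance-corrected agreement (Cohen's Kappa).
print("Kappa AI vs Rater 1:", cohen_kappa_score(ai, rater1))
print("Kappa AI vs Rater 2:", cohen_kappa_score(ai, rater2))
print("Kappa Rater 1 vs 2 :", cohen_kappa_score(rater1, rater2))

# ICC expects long format: one (essay, rater, score) row per rating.
long = pd.DataFrame({
    "essay": np.tile(np.arange(n), 3),
    "rater": np.repeat(["rater1", "rater2", "ai"], n),
    "score": np.concatenate([rater1, rater2, ai]),
})
icc = pg.intraclass_corr(data=long, targets="essay",
                         raters="rater", ratings="score")
print(icc[["Type", "ICC"]])

# Logistic regression for demographic discrepancies: is the AI more
# likely to score an essay above the human mean for some groups?
df = pd.DataFrame({
    "ai_higher": (ai > (rater1 + rater2) / 2).astype(int),
    "low_proficiency": rng.integers(0, 2, size=n),  # hypothetical flag
    "gender": rng.choice(["F", "M"], size=n),
})
model = smf.logit("ai_higher ~ low_proficiency + C(gender)",
                  data=df).fit(disp=0)
print(model.params)  # coefficients are log-odds, as in the abstract
```

On real data, a positive log-odds coefficient on the proficiency flag, analogous to the 0.45 reported in the study, would indicate that the AI is more likely than the human raters to award higher scores to essays from that group.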