Evaluation of AI-Generated Multiple-Choice Questions for Periodontology Exams: A Quality Assessment Study

Abstract

Background: This study evaluated the quality of multiple-choice questions (MCQs) generated by ChatGPT-4o against faculty-written items in periodontology, using the Integrated National Board Dental Examination (INBDE) rubric.

Methods: Thirty MCQs were assessed in a blinded cross-sectional comparison at Tufts University School of Dental Medicine. Fifteen questions were generated by ChatGPT-4o from course objectives and INBDE guidelines, and fifteen were randomly selected from the departmental exam bank. Fourteen periodontology faculty members rated each item on six INBDE criteria (clarity, content accuracy, distractor quality, fairness, curricular alignment, and grammar) using a five-point Likert scale from 1 (poor) to 5 (excellent). Composite scores were analyzed with a generalized linear mixed model.

Results: AI-generated items achieved significantly higher composite scores than faculty-written questions (20.7 ± 4.9 vs. 18.3 ± 5.1; p < 0.001). In descriptive comparisons, AI items also received higher ratings across all six domains, particularly in clarity and grammar. Reviewers could not reliably identify the source of the items, and 84.1% of AI-generated questions were judged suitable for exam use compared with 55.7% of faculty-written items.

Conclusions: ChatGPT-4o produced well-structured, high-quality MCQs that blinded reviewers frequently could not distinguish from faculty-written items. While these results highlight the potential value of AI-assisted assessment design, expert supervision remains essential to ensure accuracy, cognitive depth, and alignment with educational standards. AI should serve as a supportive tool that complements rather than replaces faculty expertise in question development.
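The abstract states that composite scores were analyzed with a generalized linear mixed model but does not specify the model family, software, or random-effects structure. As a rough illustration only, the sketch below simulates data matching the study's design (14 raters, 30 items) and the reported group means, then fits a Gaussian mixed model with statsmodels; every column name and modeling choice here is an assumption, not the authors' actual analysis.

```python
# Minimal sketch of the reported analysis: comparing composite MCQ scores
# between AI-generated and faculty-written items with a mixed model.
# The data below are SIMULATED from the abstract's summary statistics;
# all column names and the model specification are assumptions.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

# 14 raters x 30 items, as in the study design; composite = sum of six
# 1-5 Likert ratings, so each score lies in the 6-30 range.
raters = [f"r{i:02d}" for i in range(1, 15)]
items = [(f"q{j:02d}", "AI" if j <= 15 else "faculty") for j in range(1, 31)]

rows = []
for rater in raters:
    for item, source in items:
        mean = 20.7 if source == "AI" else 18.3  # group means from the abstract
        score = float(np.clip(rng.normal(mean, 5.0), 6, 30))
        rows.append({"rater": rater, "item": item,
                     "source": source, "composite": round(score)})
df = pd.DataFrame(rows)

# Gaussian mixed model with a random intercept per rater, accounting for
# each faculty member rating all 30 items. (MixedLM fits a linear mixed
# model; the study's GLMM may have used a different family or
# random-effects structure, which the abstract does not specify.)
model = smf.mixedlm("composite ~ source", data=df, groups=df["rater"])
result = model.fit()
print(result.summary())
```

A fuller specification might also include a crossed random intercept per item (e.g., via MixedLM's vc_formula), since every item in the study was rated by all fourteen reviewers.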
