Performance of ChatGPT 4.1 and Gemini 3 on Ocular Oncology Board-Style Questions: A Comparative Study

Abstract

Purpose: Large language models have shown promising performance in general medical education, but evidence regarding their accuracy in ocular oncology is limited. This study compared the performance of GPT‑4.1 and Gemini‑3 on ocular oncology board‑style questions.

Methods: Fifty‑eight board‑style questions on ocular tumours were obtained from an established ophthalmology question bank. Each question was independently entered into GPT‑4.1 and Gemini‑3 using identical prompts. Accuracy was assessed by ophthalmologists with ocular oncology expertise. Response length and response time were also recorded.

Results: GPT‑4.1 answered 63.8% of questions correctly and Gemini‑3 answered 65.5% correctly, with no statistically significant difference between the models. Gemini‑3 generated significantly longer responses than GPT‑4.1, whereas response times were comparable. No correlation was observed between response length and accuracy.

Conclusions: GPT‑4.1 and Gemini‑3 demonstrated comparable, moderate accuracy on ocular oncology board‑style questions. Increased verbosity did not improve accuracy, underscoring the need for expert oversight when using large language models for subspecialty ophthalmology education.
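The abstract does not name the statistical tests used. For paired binary outcomes on the same 58 questions, one plausible analysis is McNemar's exact test for the accuracy difference and a point-biserial correlation for response length versus correctness. The sketch below illustrates that analysis under those assumptions, using simulated placeholder data; the per-question outcomes and word counts are hypothetical, not the study's data.

    import numpy as np
    from scipy.stats import binomtest, pointbiserialr

    rng = np.random.default_rng(0)
    n = 58  # number of board-style questions in the study

    # Hypothetical per-question outcomes (1 = correct, 0 = incorrect),
    # simulated at the reported accuracy rates; placeholders only.
    gpt = rng.binomial(1, 0.638, n)
    gem = rng.binomial(1, 0.655, n)
    gem_words = rng.normal(300.0, 80.0, n)  # hypothetical response lengths

    # McNemar's exact test uses only the discordant pairs: questions where
    # exactly one of the two models answered correctly.
    b = int(np.sum((gpt == 1) & (gem == 0)))  # GPT-4.1 right, Gemini-3 wrong
    c = int(np.sum((gpt == 0) & (gem == 1)))  # Gemini-3 right, GPT-4.1 wrong
    print("McNemar exact p =", binomtest(b, b + c, p=0.5).pvalue)

    # Point-biserial correlation: does longer output track with correctness?
    r, p = pointbiserialr(gem, gem_words)
    print(f"length vs. accuracy: r = {r:.2f}, p = {p:.3f}")

With 58 paired observations, the exact (binomial) form of McNemar's test is preferable to the chi-square approximation, since the number of discordant pairs can be small.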
