Performance of ChatGPT 4.1 and Gemini 3 on Ocular Oncology Board-Style Questions: A Comparative Study

Abstract

Purpose: Large language models have shown promising performance in general medical education, but evidence regarding their accuracy in ocular oncology is limited. This study compared the performance of GPT‑4.1 and Gemini‑3 on ocular oncology board‑style questions.

Methods: Fifty‑eight board‑style questions on ocular tumours were obtained from an established ophthalmology question bank. Each question was independently entered into GPT‑4.1 and Gemini‑3 using identical prompts. Accuracy was assessed by ophthalmologists with ocular oncology expertise. Response length and response time were also recorded.

Results: GPT‑4.1 answered 63.8% of questions correctly and Gemini‑3 answered 65.5% correctly, with no statistically significant difference between the models. Gemini‑3 generated significantly longer responses than GPT‑4.1, whereas response times were comparable. No correlation was observed between response length and accuracy.

Conclusions: GPT‑4.1 and Gemini‑3 demonstrated comparable, moderate accuracy on ocular oncology board‑style questions. Increased verbosity did not improve accuracy, underscoring the need for expert oversight when using large language models for subspecialty ophthalmology education.
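The abstract does not name the statistical tests used. For paired binary outcomes on the same 58 questions, one plausible analysis is McNemar's exact test for the accuracy difference and a point-biserial correlation for response length versus correctness. The sketch below illustrates that analysis under those assumptions, using simulated placeholder data; the per-question outcomes and word counts are hypothetical, not the study's data.

    import numpy as np
    from scipy.stats import binomtest, pointbiserialr

    rng = np.random.default_rng(0)
    n = 58  # number of board-style questions in the study

    # Hypothetical per-question outcomes (1 = correct, 0 = incorrect),
    # simulated at the reported accuracy rates; placeholders only.
    gpt = rng.binomial(1, 0.638, n)
    gem = rng.binomial(1, 0.655, n)
    gem_words = rng.normal(300.0, 80.0, n)  # hypothetical response lengths

    # McNemar's exact test uses only the discordant pairs: questions where
    # exactly one of the two models answered correctly.
    b = int(np.sum((gpt == 1) & (gem == 0)))  # GPT-4.1 right, Gemini-3 wrong
    c = int(np.sum((gpt == 0) & (gem == 1)))  # Gemini-3 right, GPT-4.1 wrong
    print("McNemar exact p =", binomtest(b, b + c, p=0.5).pvalue)

    # Point-biserial correlation: does longer output track with correctness?
    r, p = pointbiserialr(gem, gem_words)
    print(f"length vs. accuracy: r = {r:.2f}, p = {p:.3f}")

With 58 paired observations, the exact (binomial) form of McNemar's test is preferable to the chi-square approximation, since the number of discordant pairs can be small.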
