Testing new versions of ChatGPT in terms of physiology and electrophysiology of hearing: improved accuracy but not consistency

Abstract

Introduction

ChatGPT has transformed many aspects of modern life, including science. Since its launch, successive versions have been released and advertised as performing better. But is this true? This study aimed to assess the accuracy and consistency of six versions of ChatGPT (3.5, 4, 4o mini, 4o, 4o1 mini, and 4o1 preview). Of particular interest was the variability of responses when the same question was asked multiple times.

Methods

We evaluated six versions of ChatGPT based on their responses to 30 single-answer, multiple-choice exam questions from a 1-year course on objective methods of testing hearing. The questions were posed 10 times to each version of ChatGPT over two days (five times each day). The accuracy of the responses was assessed against an answer key. To evaluate the consistency (repeatability) of the responses over time, percent agreement and Cohen's kappa were calculated.
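As a point of reference (the abstract itself does not give the formula), Cohen's kappa is conventionally defined as the observed agreement corrected for chance:

\[ \kappa = \frac{p_o - p_e}{1 - p_e} \]

where \(p_o\) is the observed proportion of matching answers between two runs and \(p_e\) is the proportion of matches expected by chance from the marginal answer frequencies; \(\kappa = 1\) indicates perfect repeatability and \(\kappa = 0\) agreement no better than chance.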

Results

The overall accuracy of ChatGPT increased with each version, from around 53% for version 3.5 to 86% for version 4o1 preview. The greatest improvement in both accuracy and repeatability came with the introduction of version 4o. Repeatability also rose progressively with newer releases, with the exception of version 4o1 mini. While the newest version, 4o1 preview, had repeatability similar to 4o, the faster 4o1 mini had significantly lower repeatability than the older 4o mini.

Conclusion

Newer versions of ChatGPT generally improve in accuracy, but not in repeatability. Response variability is probably the main current limitation of ChatGPT for professional applications. Users should be especially careful with version 4o1 mini.
