How easily can AI chatbots spread misinformation in audiology and otolaryngology?

Abstract

Background

Chatbots powered by large language models (LLMs) have recently emerged as prominent sources of information. However, their potential to propagate misinformation alongside accurate information, particularly in specialized fields such as audiology and otolaryngology, remains underexplored. This study evaluated the accuracy of six popular chatbots – ChatGPT, Gemini, Claude, DeepSeek, Grok, and Mistral – in response to questions framed around a range of unproven methods in audiological and otolaryngological care.

Methods

A set of 50 questions was developed based on common conversations between patients and clinicians. Each question was posed to each of the six chatbots 10 times to account for response variability, yielding a total of 3,000 responses (50 questions × 6 chatbots × 10 trials). The responses were scored against reference answers established by the consensus of 11 professionals, and the consistency of repeated responses was evaluated with Cohen's kappa.
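The abstract does not specify exactly how Cohen's kappa was applied across the ten repetitions; as a rough illustration of the consistency measure itself, the sketch below (Python, with hypothetical "accurate"/"inaccurate" labels) computes kappa between two runs of categorical judgments, correcting observed agreement for agreement expected by chance.

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa: agreement between two sets of categorical labels,
    corrected for the agreement expected by chance alone."""
    assert len(ratings_a) == len(ratings_b)
    n = len(ratings_a)
    # Observed agreement: fraction of items labeled identically in both runs.
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement: sum over labels of the product of marginal frequencies.
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    labels = set(freq_a) | set(freq_b)
    p_e = sum((freq_a[l] / n) * (freq_b[l] / n) for l in labels)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical example: two repeated runs of one chatbot on five questions.
run_1 = ["accurate", "accurate", "inaccurate", "accurate", "accurate"]
run_2 = ["accurate", "accurate", "inaccurate", "inaccurate", "accurate"]
print(f"kappa = {cohens_kappa(run_1, run_2):.2f}")  # kappa = 0.55
```

A kappa of 1.0 indicates perfect agreement across runs, so values near the study's reported minimum of 0.96 reflect highly repeatable responses.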

Results

Most chatbot responses were deemed accurate. Grok performed best, with answers that aligned perfectly with the opinions of the experts. DeepSeek exhibited the lowest accuracy, scoring 95.8%, and Mistral the lowest consistency (κ = 0.96).

Conclusions

Although the evaluated chatbots generally avoided endorsing scientifically unsupported methods, some of their answers could mislead patients and facilitate the spread of misinformation. Grok was the best performer in the group, providing consistently accurate responses, which suggests potential for use in clinical and educational settings.