A Systematic Review of ChatGPT and Other Conversational Large Language Models in Healthcare

Leyao Wang
Zhiyu Wan
Congning Ni
Qingyuan Song
Yang Li
Ellen Wright Clayton
Bradley A. Malin
Zhijun Yin

Abstract

Background

The launch of the Chat Generative Pre-trained Transformer (ChatGPT) in November 2022 drew public attention and academic interest to large language models (LLMs) and spurred the emergence of many other innovative LLMs. These LLMs have been applied in various fields, including healthcare. Numerous studies have since examined how to employ state-of-the-art LLMs in health-related scenarios to assist patients, doctors, and public health administrators.

Objective

This review aims to summarize the applications of conversational LLMs in healthcare and the concerns they raise, and to provide an agenda for future research on LLMs in healthcare.

Methods

We used the PubMed, ACM, and IEEE digital libraries as primary sources for this review. We followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines to screen and select peer-reviewed research articles that (1) were related to both healthcare applications and conversational LLMs and (2) were published before September 1, 2023, the date when we started paper collection and screening. We examined these papers and classified them according to their applications and concerns.

Results

Our search initially identified 820 papers according to targeted keywords, of which 65 met our criteria and were included in the review. The most popular conversational LLM was ChatGPT from OpenAI (60), followed by Bard from Google (1), Large Language Model Meta AI (LLaMA) from Meta (1), and other LLMs (5). These papers were classified into four categories of applications: 1) summarization, 2) medical knowledge inquiry, 3) prediction, and 4) administration, and four categories of concerns: 1) reliability, 2) bias, 3) privacy, and 4) public acceptability. Forty-nine papers (75%) used LLMs for summarization and/or medical knowledge inquiry, and 58 papers (89%) expressed concerns about reliability and/or bias. We found that conversational LLMs exhibit promising results in summarization and in providing medical knowledge to patients with relatively high accuracy. However, conversational LLMs like ChatGPT are not able to provide reliable answers to complex health-related tasks that require specialized domain expertise. Additionally, none of the reviewed papers conducted experiments that carefully examined how conversational LLMs lead to bias or privacy issues in healthcare research.

Conclusions

Future studies should focus on improving the reliability of LLM applications in complex health-related tasks, as well as investigating the mechanisms by which LLM applications give rise to bias and privacy issues. Given the wide accessibility of LLMs, legal, social, and technical efforts are all needed to address concerns about LLMs and to promote, improve, and regularize their application in healthcare.

Version published to 10.1101/2024.04.26.24306390 on medRxiv
Apr 27, 2024

Related articles

Large Language Models in Healthcare Simulation Education: A Bibliometric Analysis with AI-Assisted Screening

This article has 5 authors:
1. Matthew Pears
2. Karan Wadhwa
3. Stephen R Payne
4. Stathis TH Konstantinidis
5. Chandra Shekhar Biyani
This article has no evaluations. Latest version: Jun 4, 2026

NigBench: A multilingual point-of-care medical query benchmarking study of large language models in Nigeria

This article has 18 authors:
1. Tobi Olatunji
2. Chinemelu Aka
3. Chibuzor Okocha
4. Emmanuel Ayodele
5. Jennifer Orisakwe
6. Toni Adekunle
7. Mardhiyah Sanni
8. Abdulameed Abiola
9. Tassallah Abdullahi
10. Oluwatomi Owopetu
11. Tolu Afolaranmi
12. Peter Suoyo Yougha
13. Mira Emmanuel-Fabula
14. Vaishnavi Menon
15. Alastair Denniston
16. Xiao Liu
17. Gwydion Williams
18. Bilal A. Mateen
This article has no evaluations. Latest version: Jul 10, 2026

Use of large language models by academic hospitalists: results of a multicenter survey

This article has 5 authors:
1. Eric Bressman
2. Andrew Auerbach
3. Angela Keniston
4. Caroline Jens
5. Sumant Ranji
This article has no evaluations. Latest version: May 29, 2026
