A Systematic Review of Large Language Models in Medical Specialties: Applications, Challenges and Future Directions

Asma Musabah Alkalbani
Ahmed Salim Alrawahi
Ahmad Salah
Venus Haghighi
Yang Zhang
Salam Alkindi
Quan Z Sheng

Abstract

Background: Large Language Models (LLMs) are artificial intelligence (AI) technologies used to understand and generate text, summarize information, and interpret contextual cues. Researchers have increasingly applied LLMs to medical tasks, but their effectiveness and limitations remain uncertain, especially across different medical specialties.
Objective: This review evaluates recent literature on how LLMs are used in research studies across 19 medical specialties. It also explores the challenges involved and suggests areas for future research.
Methods: Two researchers searched PubMed, Web of Science and Scopus for literature published from January 2021 to March 2024. Included studies used LLMs to perform medical tasks. Data were extracted and analyzed by five reviewers. Risk of bias was assessed using the revised tool for the quality assessment of artificial intelligence-centered diagnostic accuracy studies (QUADAS-AI).
Results: Results were synthesized through categorical analysis of evaluation metrics, impact types, and validation approaches across medical specialties. A total of 84 studies were included in this review, most of which originated from two countries: the USA (35/84) and China (16/84). Although the reviewed LLM applications spanned 19 medical specialties, 22 studies demonstrated multi-specialty applications. The aims for using LLMs included clinical natural language processing (31/84), medical decision support (20/84), medical education (15/84), diagnosis (15/84), and patient management and patient engagement (3/84). GPT-based and BERT-based LLMs were the most commonly used, appearing in 83 of 84 studies. Despite reported positive impacts such as improved efficiency and diagnostic accuracy, challenges related to reliability, accuracy and ethics remain. The overall risk of bias was low in 72 studies, high in 11 studies and not clear in 3 studies.
Conclusion: GPT-based and BERT-based LLMs dominate medical specialty applications, with 98.8% of reviewed studies using these models. Despite their potential benefits for the efficiency of medical processes and for diagnostics, a key accuracy-related challenge was the substantial variability in performance among LLMs. For instance, LLM accuracy ranged from 3% in diagnostic support to over 90% in some clinical NLP tasks. Heterogeneity in how LLMs were used across diverse medical tasks and contexts prevented meaningful meta-analysis, as the studies lacked standardized methodologies, outcome measures, and implementation approaches. Substantial room for improvement therefore remains in developing domain-specific LLMs trained on medical data and in establishing validation standards to ensure reliability and effectiveness.

Version published to 10.21203/rs.3.rs-5128451/v2 on Research Square
Apr 16, 2025
Version published to 10.21203/rs.3.rs-5128451/v1 on Research Square
Nov 25, 2024

