A Systematic Review of Large Language Models in Medical Specialties: Applications, Challenges and Future Directions

Abstract

Background: Large language models (LLMs) are an artificial intelligence (AI) technology used to understand and generate text, summarize information, and interpret contextual cues. Researchers have increasingly applied LLMs to medical tasks, but their effectiveness and limitations remain uncertain, especially across medical specialties.
Objective: This review evaluates recent literature on how LLMs are used in research studies across 19 medical specialties. It also examines the challenges involved and suggests areas for future research.
Methods: Two researchers searched PubMed, Web of Science and Scopus for literature published from January 2021 to March 2024. Included studies used LLMs to perform medical tasks. Data were extracted and analyzed by five reviewers. Risk of bias was assessed using the revised tool for the quality assessment of artificial intelligence-centered diagnostic accuracy studies (QUADAS-AI).
Results: Results were synthesized through categorical analysis of evaluation metrics, impact types, and validation approaches across medical specialties. The 84 included studies originated mainly from two countries: the USA (35/84) and China (16/84). Although the reviewed LLM applications spanned 19 medical specialties, 22 studies demonstrated multi-specialty applications. The aims of using LLMs included clinical natural language processing (NLP) (31/84), medical decision support (20/84), medical education (15/84), diagnosis (15/84), and patient management and engagement (3/84). GPT-based and BERT-based LLMs were the most widely used, appearing in 83/84 studies. Despite reported positive impacts such as improved efficiency and diagnostic accuracy, challenges related to reliability, accuracy and ethics remain. The overall risk of bias was low in 72 studies, high in 11 studies and unclear in 3 studies.
Conclusion: GPT-based and BERT-based LLMs dominate medical specialty applications, appearing in 98.8% (83/84) of the reviewed studies. Despite their potential benefits for process efficiency and diagnostics, a key accuracy-related challenge was the substantial variability in performance among LLMs. For instance, reported accuracy ranged from 3% in diagnostic support to over 90% in some clinical NLP tasks. Because the studies lacked standardized methodologies, outcome measures, and implementation approaches, heterogeneity in how LLMs were used across medical tasks and contexts prevented a meaningful meta-analysis. Considerable room therefore remains for developing domain-specific LLMs trained on medical data and for establishing validation standards to ensure reliability and effectiveness.
