Evaluating General-Purpose LLMs for Patient-Facing Use: Dermatology-Centered Systematic Review and Meta-Analysis

Abstract

Background

General-purpose large language models (LLMs) have rapidly evolved from experimental tools into widely adopted components of healthcare delivery. Their proliferation – accelerated by the “ChatGPT effect” – has sparked intense interest across patient-facing specialties. Among these, dermatology offers a high-visibility use case through which to assess LLM capabilities, evaluation practices, and adoption trends.

Objective

To systematically review and meta-analyze quantitative evaluations of general-purpose LLMs in dermatology, while extracting broader insights applicable to patient-centered use of AI across medical fields.

Methods

We conducted a multi-phase systematic review and meta-analysis, incorporating studies published through August 1, 2025. A total of 88 studies met inclusion criteria, covering over 100 dermatology-related tasks and yielding more than 2,500 normalized performance scores across metrics such as accuracy, sensitivity, readability, and clinical safety. This review also re-evaluates previously tested benchmarks to assess reproducibility and model improvement over time. Statistical analyses focused on heterogeneity (Cochran’s Q, I²), evaluator effects, and evolving methodological practices.
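
For readers unfamiliar with the heterogeneity statistics named above, the following is a minimal sketch of how Cochran’s Q and Higgins’ I² are conventionally computed from per-study effect estimates under an inverse-variance, fixed-effect formulation. The function name and the example effect sizes and variances are illustrative placeholders, not data from this review.

```python
# Minimal sketch of Cochran's Q and the I^2 heterogeneity statistic,
# as conventionally computed in meta-analysis. All values below are
# hypothetical, not results from this review.
import numpy as np
from scipy import stats

def cochran_q_i2(effects, variances):
    """Return Cochran's Q, its chi-square p-value, and I^2 (percent)."""
    effects = np.asarray(effects, dtype=float)
    variances = np.asarray(variances, dtype=float)
    k = len(effects)
    weights = 1.0 / variances                             # inverse-variance weights
    pooled = np.sum(weights * effects) / np.sum(weights)  # fixed-effect pooled estimate
    q = np.sum(weights * (effects - pooled) ** 2)         # Cochran's Q
    p_value = stats.chi2.sf(q, df=k - 1)                  # Q ~ chi^2 with k-1 df under homogeneity
    i2 = 0.0 if q == 0 else max(0.0, (q - (k - 1)) / q) * 100.0  # Higgins' I^2, floored at 0
    return q, p_value, i2

# Hypothetical example: five normalized accuracy scores and their variances.
q, p, i2 = cochran_q_i2([0.72, 0.81, 0.55, 0.90, 0.60],
                        [0.004, 0.006, 0.005, 0.003, 0.007])
print(f"Q = {q:.2f}, p = {p:.4f}, I^2 = {i2:.1f}%")
```

In this formulation, I² estimates the share of total variability in effect estimates attributable to between-study heterogeneity rather than sampling error; values near 90%, as reported below, indicate that most observed variation reflects genuine differences in study design, tasks, or evaluators.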

Results

LLM performance varied by architecture, prompt design, and task complexity. No single model demonstrated universal superiority, though retrieval-augmented and hybrid systems consistently outperformed others on complex reasoning tasks. Performance was also task-dependent: smaller models sometimes outperformed flagship models, and “thinking” modes occasionally over-reasoned. Dermatology-specific models excelled in narrow contexts but lacked generalizability. Evaluation practices matured over time – shifting from static benchmarks to multi-rubric frameworks and simulations – yet high heterogeneity persisted (I² ≈ 90%), driven by differences in study design and evaluator type.

Sentiment toward LLMs evolved from early skepticism (2022), to over-optimism (2023), to a more critical and diverse perspective by 2025. Preliminary ChatGPT-5 data, though limited to a small set of challenging conditions, suggest lower hallucination rates and better recognition of dermatological presentations on darker skin.

Conclusions

LLMs are entering clinical workflows rapidly, yet static evaluation methods often fail to keep pace. Our findings underscore the need for dynamic, modular, and evaluator-aware frameworks that reflect real-world complexity, patient interaction, and personalization. As traditional benchmarks lose relevance in the face of rapidly evolving model architectures, future evaluation strategies must embrace living reviews, human-in-the-loop simulations, and transparent meta-evaluation. Although dermatology serves as the focal domain, the challenges and recommendations articulated here are broadly applicable to all patient-facing fields in medicine.

Limitations

High heterogeneity, frequent model deprecation, and inconsistent study designs limit generalizability. While preliminary evidence from ChatGPT-5 shows improved performance for rare diseases and underrepresented skin tones, comprehensive, multi-model validation remains lacking. The reliance of LLMs on indexed literature continues to restrict incorporation of patient-led research and independent evidence.

Protocol Registration

PROSPERO registration no. CRD42023417336
