Integrating Expert Knowledge into Large Language Models Improves Performance for Psychiatric Reasoning and Diagnosis
Abstract
Purpose and Methods
The authors sought to evaluate the performance of common large language models (LLMs) in psychiatric diagnosis and the impact of integrating expert-derived reasoning on that performance. Clinical case vignettes and their associated diagnoses were retrieved from the DSM-5-TR Clinical Cases book. Diagnostic decision trees were retrieved from the DSM-5-TR Handbook of Differential Diagnosis and refined for LLM use. Three LLMs were prompted to provide diagnosis candidates for the vignettes, either by direct prompting or by using the decision trees. These candidates, and their diagnostic categories, were compared against the correct diagnoses. The positive predictive value (PPV), sensitivity, and F1 statistic were used to measure performance.
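As a rough illustration of how these metrics relate predicted diagnosis candidates to the correct diagnoses, the following is a minimal sketch, not the authors' evaluation code: it assumes micro-averaged counts across vignettes, and all function names and diagnosis labels are hypothetical.

```python
# Illustrative sketch: PPV, sensitivity, and F1 over per-vignette sets of
# predicted candidates vs. correct diagnoses (micro-averaged counts assumed).

def evaluate(cases):
    """cases: list of (predicted_diagnoses, correct_diagnoses) pairs of sets."""
    tp = fp = fn = 0
    for predicted, correct in cases:
        tp += len(predicted & correct)   # candidates matching a correct diagnosis
        fp += len(predicted - correct)   # candidates with no match (overdiagnosis)
        fn += len(correct - predicted)   # correct diagnoses the model missed
    ppv = tp / (tp + fp) if tp + fp else 0.0          # precision
    sensitivity = tp / (tp + fn) if tp + fn else 0.0  # recall
    f1 = (2 * ppv * sensitivity / (ppv + sensitivity)
          if ppv + sensitivity else 0.0)              # harmonic mean of the two
    return ppv, sensitivity, f1

# Hypothetical example labels, for illustration only:
cases = [
    ({"major depressive disorder", "generalized anxiety disorder"},
     {"major depressive disorder"}),
    ({"bipolar I disorder"},
     {"bipolar I disorder", "alcohol use disorder"}),
]
print(evaluate(cases))  # -> (0.667, 0.667, 0.667) approximately
```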
Principal Results
When directly prompted to predict diagnoses, the best LLM by F1 statistic (gpt-4o) had a sensitivity of 77.6% and a PPV of 43.3%. When the refined decision trees were used, PPV increased significantly (65.3%) without a significant reduction in sensitivity (71.8%). Across all experiments, use of the decision trees significantly increased the PPV, significantly increased the F1 statistic in 5 of 6 experiments, and significantly reduced sensitivity only in the category-based evaluation, in 2 of 3 experiments.
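Assuming the F1 statistic is the harmonic mean of PPV and sensitivity computed on the aggregate counts, the reported figures for gpt-4o imply approximately:

$$F_1 = \frac{2 \cdot \mathrm{PPV} \cdot \mathrm{Sens}}{\mathrm{PPV} + \mathrm{Sens}}, \qquad F_1^{\text{direct}} \approx \frac{2(0.433)(0.776)}{0.433 + 0.776} \approx 0.56, \qquad F_1^{\text{trees}} \approx \frac{2(0.653)(0.718)}{0.653 + 0.718} \approx 0.68.$$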
Major Conclusions
When used to predict psychiatric diagnoses from case vignettes, direct prompting of the LLMs yielded the most true positive diagnoses but led to substantial overdiagnosis. Integrating expert-derived reasoning into the process via decision trees improved LLM performance, primarily by suppressing overdiagnosis with minimal negative impact on sensitivity. This suggests that integrating clinical expert-derived reasoning could improve the performance of LLM-based tools in the behavioral health setting.