Large Language Models for Psychiatric Phenotype Extraction from Electronic Health Records

Clara Frydman-Gani
Alejandro Arias
Maria Perez Vallejo
John Daniel Londoño Martínez
Johanna Valencia-Echeverry
Mauricio Castaño
Alex A. T. Bui
Nelson B. Freimer
Carlos Lopez-Jaramillo
Loes M. Olde Loohuis

Abstract

The accurate detection of clinical phenotypes from electronic health records (EHRs) is pivotal for advancing large-scale genetic and longitudinal studies in psychiatry. Free-text clinical notes are an essential source of symptom-level information in this field, yet the automated extraction of symptoms from clinical text remains challenging.

Here, we tested 11 open-source generative large language models (LLMs) for their ability to detect 109 psychiatric phenotypes from clinical text, using annotated EHR notes from a psychiatric clinic in Colombia. The LLMs were evaluated both “out-of-the-box” and after fine-tuning, and were compared against a traditional natural language processing (tNLP) method developed from the same data. We show that while base LLM performance was poor to moderate (0.2-0.6 macro-F1 for zero-shot; 0.2-0.74 macro-F1 for few-shot), it improved significantly after fine-tuning (0.75-0.86 macro-F1), with several fine-tuned LLMs outperforming the tNLP method. In total, 100 phenotypes could be reliably detected (F1>0.8) using either a fine-tuned LLM or tNLP.
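The evaluation metrics mentioned above can be illustrated with a minimal sketch. It assumes each phenotype is scored as a binary label per note, that macro-F1 is the unweighted mean of per-phenotype F1 scores, and that a phenotype counts as reliably detected when the better of the available methods exceeds the threshold; the study's exact definitions may differ. The function names, phenotype names, and toy labels are hypothetical and are not taken from the study.

```python
from typing import Dict, List

Labels = Dict[str, List[int]]  # phenotype name -> one 0/1 label per note


def f1_score(y_true: List[int], y_pred: List[int]) -> float:
    """Binary F1 for a single phenotype; returns 0.0 when undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def per_phenotype_f1(gold: Labels, pred: Labels) -> Dict[str, float]:
    """F1 computed separately for every phenotype in the gold annotations."""
    return {name: f1_score(gold[name], pred[name]) for name in gold}


def macro_f1(scores: Dict[str, float]) -> float:
    """Unweighted mean of per-phenotype F1 (assumed definition of macro-F1)."""
    return sum(scores.values()) / len(scores)


def reliably_detected(score_sets: List[Dict[str, float]],
                      threshold: float = 0.8) -> List[str]:
    """Phenotypes whose best F1 across the given methods exceeds the threshold."""
    names = score_sets[0].keys()
    return [n for n in names if max(s[n] for s in score_sets) > threshold]


# Toy data: two hypothetical phenotypes annotated over five notes.
gold = {"insomnia": [1, 1, 0, 0, 1], "anhedonia": [1, 0, 0, 1, 0]}
llm_pred = {"insomnia": [1, 1, 0, 0, 1], "anhedonia": [1, 0, 1, 0, 0]}
tnlp_pred = {"insomnia": [1, 0, 0, 0, 1], "anhedonia": [1, 0, 0, 1, 0]}

llm_scores = per_phenotype_f1(gold, llm_pred)
tnlp_scores = per_phenotype_f1(gold, tnlp_pred)

print("LLM macro-F1:", macro_f1(llm_scores))
print("tNLP macro-F1:", macro_f1(tnlp_scores))
print("Reliably detected by either method:",
      reliably_detected([llm_scores, tnlp_scores]))
```

Taking the per-phenotype maximum across methods mirrors the "either a fine-tuned LLM or tNLP" wording: a phenotype that one method handles poorly can still count as detectable if the other method handles it well.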

To generate a fine-tuned LLM that can be shared with the scientific and medical community, we created a fully synthetic dataset that is free of patient information but based on the original annotations. We fine-tuned a top-performing LLM on this dataset, creating “Mistral-small-psych”, an LLM that can detect psychiatric phenotypes from Spanish text with performance comparable to that of LLMs trained on real EHR data (macro-F1=0.79).

Finally, the fine-tuned LLMs underwent external validation using data from a large psychiatric hospital in Colombia, the Hospital Mental de Antioquia, showing that most LLMs generalized well (a loss of 0.02-0.16 points in macro-F1). Our study underscores the value of domain-specific adaptation of LLMs and introduces a new model for accurate psychiatric phenotyping in Spanish text, paving the way for global precision psychiatry.

Version published to 10.1101/2025.08.07.25333172 on medRxiv
Aug 12, 2025

Related articles

Simulated Reasoning and Self-Verification in Generalist Large Language Models for Psychiatric Diagnostic Performance: Cross-Sectional Study

This article has 6 authors:
1. Karthik V Sarma
2. Kaitlin E Hanss
3. Andrew J M Halls
4. Daniel F Becker
5. Anne L Glowinski
6. Andrew Krystal
This article has no evaluations. Latest version: Sep 9, 2025

Characterizing Dementia Phenotypes from Unstructured EHR Notes with Generative AI and Interpretable Machine Learning

This article has 10 authors:
1. Alice S. Tang
2. Billy Z.D. Zeng
3. Katherine P. Rankin
4. Maria Luisa Giorno-Tempini
5. William W. Seeley
6. Howard J. Rosen
7. Gil D. Rabinovici
8. Tomiko T. Oskotsky
9. Marina Sirota
10. Pedro Pinheiro-Chagas
This article has no evaluations. Latest version: Oct 2, 2025

Benchmarking large language models for cell-free RNA diagnostic biomarker discovery

This article has 6 authors:
1. Hunter A. Gaudio
2. Andrew Bliss
3. Conor J. Loy
4. Daniel Eweis-LaBolle
5. Anne E. Gardella
6. Iwijn De Vlaminck
This article has no evaluations. Latest version: Aug 24, 2025
