Comparative Evaluation of Pretrained Large Language Models for Suicide Risk Prediction from Clinical Notes in U.S. Veterans

Joshua Levy
Maxwell Levis
Monica Dimambro
Luke Rozema
Siamack Ayandeh
Alos Diallo
Yefan Zhou
Siting Li
Weiyi Wu
Brian Shiner
Jiang Gui

Abstract

Background

Suicide remains a significant and potentially preventable cause of death among United States veterans. Predictive models based on structured electronic health record (EHR) data, including the U.S. Department of Veterans Affairs’ Recovery Engagement and Coordination for Health–Veterans Enhanced Treatment (REACH-VET) program, aim to identify individuals at elevated risk for enhanced monitoring and follow-up. Increasing evidence suggests that unstructured clinical narratives contain additional psychosocial information that may enhance risk prediction when analyzed using natural language processing (NLP). However, optimal approaches for representing clinical text remain uncertain. Recent advances in large language models (LLMs) enable contextual text representations that capture complex semantic relationships beyond traditional lexical methods.

Methods

We compared the predictive performance of pretrained LLMs with classical bag-of-words (BoW) representations for suicide risk prediction using clinical notes from 27,241 veterans receiving care in the Veterans Health Administration. Patients were stratified by REACH-VET risk tier (low, moderate, high), and models were evaluated across prediction windows defined by note look-back periods (<30, <90, and <270 days).
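The evaluation design above (three risk tiers crossed with three look-back windows, each scored by AUROC) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the record fields (`tier`, `days_before`, `text`, `label`), the `keyword_score` stand-in for a text model, and the toy records are all hypothetical, and each record is treated as a single note for simplicity. The toy values have no bearing on the reported results.

```python
from itertools import product

TIERS = ("low", "moderate", "high")   # REACH-VET risk tiers named in the abstract
WINDOWS = (30, 90, 270)               # note look-back periods in days


def auroc(labels, scores):
    """AUROC as the probability that a positive case outscores a negative one.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def evaluate_grid(records, score_fn):
    """Return {(tier, window): AUROC} for every tier and look-back window.

    `records` is a list of dicts with hypothetical keys: tier, days_before,
    text, label. `score_fn` maps note text to a risk score.
    """
    results = {}
    for tier, window in product(TIERS, WINDOWS):
        subset = [r for r in records
                  if r["tier"] == tier and r["days_before"] < window]
        labels = [r["label"] for r in subset]
        scores = [score_fn(r["text"]) for r in subset]
        results[(tier, window)] = auroc(labels, scores)
    return results


def keyword_score(text):
    """Toy stand-in for a text model: counts hypothetical risk-related terms."""
    terms = ("hopeless", "ideation")
    return sum(text.lower().count(t) for t in terms)


# Synthetic records, invented purely so the sketch runs end to end.
toy_records = [
    {"tier": tier, "days_before": days, "text": text, "label": label}
    for tier in TIERS
    for days, text, label in [
        (10, "reports hopeless mood and ideation", 1),
        (25, "routine medication refill", 0),
        (60, "feels hopeless about housing", 1),
        (200, "annual physical, no concerns", 0),
    ]
]

grid = evaluate_grid(toy_records, keyword_score)
```

Comparing two representations would then amount to calling `evaluate_grid` once per scoring function and counting the cells in which one exceeds the other, which is how a summary such as "seven of nine combinations" could be tallied.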

Results

LLM-based representations outperformed BoW approaches in seven of the nine risk tier–time window combinations, achieving a maximum AUROC of 0.644 when using text alone. Incorporating structured clinical variables further improved performance (AUROC = 0.748). Model interpretation identified suicide-related language, especially in notes documented within 30 days of the outcome among patients classified as high risk.

Conclusions

Pretrained LLMs can extract clinically meaningful information from narrative documentation. This provides a foundation for future work that adapts these models to additional clinical contexts and accounts for nuanced temporal associations to improve suicide risk prediction.

Version published to 10.64898/2026.06.16.26355804 on medRxiv
Jun 18, 2026
