Automated Detection Of Clinical High Risk Population Of Schizophrenia: Assessing The Generalizability Of NLP And LLM-Based Methods
Abstract
Background and Hypothesis: Research has indicated that linguistic features can be used for the early detection of schizophrenia. Because traditional clinician-based assessment can be labor-intensive and time-consuming, research has increasingly turned to automated means of extracting and analyzing the linguistic features of schizophrenia. However, most existing studies have focused chiefly on deploying large language models (LLMs) without comparison against more well-established natural-language-processing (NLP) methods. As a result, there is little insight into the utility of LLMs and whether the benefits of using LLMs for analysis outweigh the costs. Moreover, given LLMs' prompt sensitivity, there is also a lack of research investigating how different prompt-engineering methods affect model output across different settings. Another longstanding open question in the field is how best to objectively assess prodromal psychotic symptoms and how best to analyze the resulting transcripts. In this study, we systematically assess the efficacy of LLMs and NLP methods for automated linguistic analysis of clinical high risk (CHR) psychotic symptoms. We seek to understand the reliability of using LLMs to analyze patient transcripts for the early identification of CHR individuals, in comparison with more established NLP-based methods.
Study Design: We trained models using a large international dataset of 374 patients, of whom 331 are CHR and 43 are community controls (CC). Two types of interview were conducted: an open-ended interview and a semi-structured interview based on the Positive SYmptoms and Diagnostic Criteria for the CAARMS [73] Harmonized with the SIPS [74] (PSYCHS) protocol [32]. Trained research assistants carried out these interviews, which were audio- and video-recorded across different sites prior to October 13, 2024.
We used two different feature-extraction methods, principal component analysis (PCA) and feature selection (FS), and conducted experiments using four different machine learning (ML) models and two large language models (LLMs), namely Llama and Qwen. For each LLM, we used three different prompting strategies: a neutral prompt, an NLP-based prompt, and a PSYCHS interview-based prompt, to better understand each LLM's performance under different reasoning settings.
Results: Across both the open-ended and PSYCHS-based transcripts, the NLP-plus-ML methods, which rely on objective, quantifiable metrics, demonstrated fairly consistent results in the range 0.60–0.90. This is in contrast to the LLM-based methods, which produced highly variable results depending on the interview format and prompt used, with the lowest being 0.320 and the highest 0.880 across all experimental settings. In general, both categories of methods produced more accurate results on the PSYCHS-based transcripts. Llama generally performed better on tasks requiring semantic reasoning (e.g., the PSYCHS-based prompt) and yielded the highest accuracy and F1 of 0.880 and 0.930 when used on the PSYCHS-based interview transcripts. On the other hand, Qwen generally performed better on numerical-reasoning tasks (e.g., the NLP-based prompt) and performed best on the PSYCHS-based interview transcripts, with an accuracy and F1 of 0.880 and 0.930.
Conclusions: Overall, we find that NLP-based methods are more reliable and consistent. LLM-based methods are highly variable and do not demonstrate sufficient reliability: their output differs greatly depending on the input transcript and prompt type provided. We suggest that more emphasis should be placed on developing interpretable and clinically grounded methods for automating the linguistic analysis of schizophrenia.
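The two feature-extraction routes named above (PCA and feature selection, each feeding a downstream classifier) can be sketched as follows. This is a minimal illustration using scikit-learn on synthetic data: the actual linguistic features, feature counts, component/feature numbers, and the four ML models used in the study are not specified here, and logistic regression is a placeholder classifier.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the real dataset: 374 participants, each described
# by a vector of linguistic features (the feature count of 20 is arbitrary).
rng = np.random.default_rng(0)
X = rng.normal(size=(374, 20))
y = rng.integers(0, 2, size=374)  # 1 = CHR, 0 = community control (illustrative labels)

# Route 1: dimensionality reduction via PCA before classification.
pca_pipe = Pipeline([
    ("pca", PCA(n_components=5)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Route 2: univariate feature selection (keep the k most informative features).
fs_pipe = Pipeline([
    ("fs", SelectKBest(f_classif, k=5)),
    ("clf", LogisticRegression(max_iter=1000)),
])

for name, pipe in [("PCA", pca_pipe), ("FS", fs_pipe)]:
    pipe.fit(X, y)
    print(name, "training accuracy:", pipe.score(X, y))
```

In practice, either route plugs into the same downstream classifier, so the comparison isolates the effect of the feature-extraction step itself.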
Further experiments are needed before such models can be deployed in high-stakes use cases, and to identify more precise and automated methods for understanding how clinical features of schizophrenia are expressed linguistically.