Exploring the Generalizability and Explainability of LLMs in Detecting Suicidal Ideation: The Impact of Data Heterogeneity
Abstract
Objectives With recent advances in artificial intelligence (AI) and large language models (LLMs), text analysis is a promising tool for detecting suicidal ideation. However, the performance of such detection systems could be influenced by differences in language use arising from individuals’ alexithymic characteristics (difficulty expressing emotion, accompanied by distinctive language patterns), resulting in subgroup disparity. The current study explores the capability of a detection system in a clinical sample with heterogeneous language use (i.e., systematic differences in language use driven by patient characteristics and the language context).
Methods AI models (classifiers) were trained with 5-fold cross-validation on clinical transcripts of 299 individuals (193 with major depressive disorder and 106 controls without psychiatric problems) to detect suicidal ideation. Specifically, the topic-general classifier was trained on full clinical transcripts, while the topic-specific classifiers (i.e., factorization models) were trained on specific sections of the transcripts, focusing on either mood-related or suicide-specific topics. Classifier performance was assessed in both subgroups (alexithymia and non-alexithymia) and in the whole sample. Mediation analyses were conducted to further investigate the role of language features in explaining the subgroup disparity.
Results The topic-general classifier showed subgroup disparity between the alexithymia and non-alexithymia groups: membership in the alexithymia group was associated with a decreased likelihood of true detection of suicidal ideation (OR = 0.31, p < .001), and unique language features, such as family-related words (p = .02), played a mediating/explanatory role. Furthermore, the topic-specific classifiers demonstrated superior performance (AUC = 0.96) compared with the topic-general classifier (AUC = 0.83), and the subgroup disparity was largely reduced.
Conclusion Models trained on a heterogeneous clinical population may not be equitably effective in detecting suicidal ideation in patient groups with and without alexithymia. Developing a factorization model is pertinent to enhancing generalizability and equity, especially when patient characteristics are inaccessible or confidential for model training. Meanwhile, clinicians should interpret model predictions with caution, because patient characteristics may influence model performance.
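To make the notion of subgroup disparity in true detection concrete, the following is a minimal sketch in Python. It is not the authors' analysis: the abstract does not say how the reported odds ratio was estimated, so an unadjusted odds ratio from a 2x2 table is assumed here. The `Record` type, the function names, and the toy counts are all hypothetical and invented purely for illustration; they are not the study's data.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """One evaluated individual (hypothetical structure, not from the study)."""
    has_ideation: bool  # ground-truth label
    predicted: bool     # classifier decision
    alexithymia: bool   # subgroup membership


def detection_counts(records, alexithymia):
    """Return (detected, missed) among true ideation cases in one subgroup."""
    cases = [r for r in records if r.has_ideation and r.alexithymia == alexithymia]
    detected = sum(r.predicted for r in cases)
    return detected, len(cases) - detected


def sensitivity(counts):
    """True-positive rate: detected / (detected + missed)."""
    detected, missed = counts
    return detected / (detected + missed)


def odds_ratio(group_counts, reference_counts):
    """Unadjusted odds ratio of true detection, group versus reference."""
    a, b = group_counts      # detected, missed in the group of interest
    c, d = reference_counts  # detected, missed in the reference group
    if b == 0 or c == 0:
        # A zero cell makes the ratio undefined; a real analysis would need
        # a continuity correction or a regression model instead.
        raise ValueError("odds ratio undefined when a denominator cell is zero")
    return (a * d) / (b * c)


# Made-up toy data: every count below is an assumption for illustration only.
records = (
    [Record(True, True, True)] * 6       # alexithymia, ideation detected
    + [Record(True, False, True)] * 4    # alexithymia, ideation missed
    + [Record(True, True, False)] * 9    # non-alexithymia, ideation detected
    + [Record(True, False, False)] * 1   # non-alexithymia, ideation missed
    + [Record(False, False, True)] * 5   # no ideation; ignored by sensitivity
)

alex = detection_counts(records, alexithymia=True)
non_alex = detection_counts(records, alexithymia=False)
or_value = odds_ratio(alex, non_alex)

print("sensitivity (alexithymia):    ", round(sensitivity(alex), 2))
print("sensitivity (non-alexithymia):", round(sensitivity(non_alex), 2))
print("odds ratio of true detection: ", round(or_value, 2))
```

An odds ratio below 1 in this sketch means that true cases in the alexithymia group are less likely to be detected than true cases in the reference group, which is the direction of the disparity the abstract reports for the topic-general classifier.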