Using large language models to create lexicons for interpretable text models with high content validity: the Suicide Risk Lexicon
Abstract
Researchers often want to measure a variety of constructs such as anxiety, discrimination, or loneliness in text data from surveys, interviews, social media, and electronic health records. Using large language models (LLMs), while optimal for text classification, remains infeasible for many researchers due to concerns around computational expertise, cost, privacy, and compute requirements. Therefore, some researchers prefer lightweight models for large datasets or interpretable models to avoid mistakes in high-stakes scenarios such as suicide risk detection. Lexicons offer simple baselines to LLMs by searching for relevant phrases, and they can be used together with LLMs to guarantee that specific keywords are captured deterministically. However, building new lexicons is resource intensive. In this study, we found that GPT-4 Turbo was able to automatically create a lexicon for 49 known risk factors for suicidal thoughts and behaviors, which we release as the Suicide Risk Lexicon. This approach quickly measures most constructs relevant to this application, resulting in high content validity. The lexicon accurately predicted risk in crisis counseling conversations. After validation by clinical experts, the lexicon outperformed the LIWC lexicon, which has low content validity for mental illness, and performed similarly to some black-box deep learning models. Because the approach is interpretable and has high content validity, we were able to discover that active suicidal ideation and direct self-injury were stronger indicators of imminent risk than passive suicidal ideation and depressed mood in this ecological setting. To simplify creating new lexicons for other research domains, we introduce a Python package, construct-tracker, that works with a variety of LLMs. In sum, while we recommend using LLMs for text classification, they remain out of reach for many researchers. Our work demonstrates that LLMs, despite being black boxes that may be challenging to use, can counterintuitively create interpretable models by generating lexicons when this is preferred. Furthermore, we highlight the broader application of lexicons beyond measurement, including their use in benchmarking LLM performance.
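To illustrate the general idea of lexicon-based measurement described above, the following is a minimal Python sketch of counting lexicon phrase matches per construct in a piece of text. The construct names, phrase lists, and the count_construct_matches function are illustrative assumptions for this example only; they are not entries from the released Suicide Risk Lexicon and do not reflect the construct-tracker API.

```python
import re

# Hypothetical mini-lexicon: each construct maps to a list of indicative phrases.
# These entries are illustrative only, not the released Suicide Risk Lexicon.
lexicon = {
    "active_suicidal_ideation": ["want to die", "kill myself", "end my life"],
    "loneliness": ["so alone", "no one cares", "nobody to talk to"],
}

def count_construct_matches(text: str, lexicon: dict) -> dict:
    """Count how many lexicon phrases for each construct appear in the text."""
    text_lower = text.lower()
    counts = {}
    for construct, phrases in lexicon.items():
        counts[construct] = sum(
            len(re.findall(re.escape(phrase), text_lower)) for phrase in phrases
        )
    return counts

message = "I feel so alone lately and sometimes I just want to die."
print(count_construct_matches(message, lexicon))
# {'active_suicidal_ideation': 1, 'loneliness': 1}
```

Because matching is a deterministic phrase lookup, each score can be traced back to the exact phrases that triggered it, which is what makes lexicon-based models interpretable relative to black-box classifiers.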