Multimodal Speech and Text Models to Detect Suicidal Risks in Adolescents
Abstract
Background
Early detection of suicide risk in adolescents is crucial but faces challenges including stigma, reluctance to disclose suicidal thoughts, and limited accessibility of mental health resources. Traditional assessment methods may miss at-risk populations, particularly in community settings. This study aimed to explore whether multimodal analysis combining acoustic and linguistic features can improve prediction of suicide risk in adolescents.
Methods
Voice recordings and transcribed text from 600 Chinese adolescents (aged 10-18 years) were collected from 47 schools in Guangdong, China. Suicide risk labels were derived from the Mini International Neuropsychiatric Interview for Children and Adolescents (MINI-KID). The dataset included three voice tasks: answering an open-ended question about emotional regulation, reading a standard passage, and describing a face showing negative emotions. Features were extracted using pre-trained models (emotion2vec for acoustic features, Paraformer for speech-to-text conversion, and Tongyi Qianwen’s text-embedding-v3 for text features). We applied several machine learning classifiers, including Support Vector Machine, Multi-layer Perceptron, Random Forest, and XGBoost, to develop both single-modal and multimodal prediction models. Front-end fusion (FF) and back-end fusion (BF) techniques were employed to combine the acoustic and linguistic features.
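To make the two fusion strategies concrete, the following is a minimal Python sketch assuming the acoustic and text features have already been extracted into fixed-length vectors. The variable names, feature dimensions, and classifier choices here are illustrative assumptions, not the study's exact configuration.

```python
# Minimal sketch of front-end vs. back-end fusion (hypothetical setup; the
# feature dimensions and classifiers are assumptions, not the study's config).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
n = 600
X_acoustic = rng.normal(size=(n, 768))   # stand-in for emotion2vec embeddings
X_text = rng.normal(size=(n, 1024))      # stand-in for text-embedding-v3 vectors
y = rng.integers(0, 2, size=n)           # stand-in for MINI-KID-derived labels

idx_train, idx_test = train_test_split(
    np.arange(n), test_size=0.2, stratify=y, random_state=0
)

# Front-end fusion (FF): concatenate modalities into one joint feature vector,
# then train a single classifier on the combined representation.
X_ff = np.hstack([X_acoustic, X_text])
ff_model = make_pipeline(StandardScaler(), SVC(probability=True))
ff_model.fit(X_ff[idx_train], y[idx_train])
p_ff = ff_model.predict_proba(X_ff[idx_test])[:, 1]

# Back-end fusion (BF): train one classifier per modality and combine their
# predicted probabilities (here, a simple unweighted average).
acoustic_model = make_pipeline(StandardScaler(), SVC(probability=True))
text_model = RandomForestClassifier(n_estimators=300, random_state=0)
acoustic_model.fit(X_acoustic[idx_train], y[idx_train])
text_model.fit(X_text[idx_train], y[idx_train])
p_bf = (acoustic_model.predict_proba(X_acoustic[idx_test])[:, 1]
        + text_model.predict_proba(X_text[idx_test])[:, 1]) / 2
```

In this framing, FF lets a single classifier learn cross-modal interactions from the joint feature space, whereas BF keeps each modality's model independent and combines only their output probabilities; the study also evaluated models using both strategies together.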
Results
Fusion models combining acoustic and linguistic features consistently outperformed single-modal models. The model using both front-end and back-end fusion achieved the highest performance, with an accuracy of 0.73, precision of 0.70, recall of 0.80, and F1 score of 0.74. Front-end fusion alone achieved the highest Area Under the Receiver Operating Characteristic Curve (AUROC) of 0.767. Model performance was comparable across age groups but significantly better in females (AUROC = 0.72) than in males (AUROC = 0.46).
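For reference, the reported metrics, including the subgroup AUROC comparison by sex, can be computed with scikit-learn as in the brief sketch below; the arrays are placeholder values, not study data.

```python
# Computing the reported evaluation metrics with scikit-learn
# (placeholder arrays; not the study's data).
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

y_true = np.array([0, 1, 1, 0, 1, 0, 1, 0])            # ground-truth risk labels
p_pred = np.array([0.2, 0.8, 0.6, 0.4, 0.7, 0.3, 0.4, 0.1])  # predicted probabilities
sex = np.array(["F", "F", "M", "M", "F", "F", "M", "M"])

y_hat = (p_pred >= 0.5).astype(int)  # threshold probabilities at 0.5
print("accuracy :", accuracy_score(y_true, y_hat))
print("precision:", precision_score(y_true, y_hat))
print("recall   :", recall_score(y_true, y_hat))
print("F1       :", f1_score(y_true, y_hat))
print("AUROC    :", roc_auc_score(y_true, p_pred))

# Subgroup AUROC, e.g., to compare performance in females vs. males.
for g in ("F", "M"):
    mask = sex == g
    print(f"AUROC ({g}):", roc_auc_score(y_true[mask], p_pred[mask]))
```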
Conclusions
Multimodal analysis combining acoustic and linguistic features significantly improves predictive accuracy for adolescent suicide risk detection compared to single-modal approaches. This approach offers a promising method for early identification of at-risk adolescents in community settings, potentially enabling timely intervention. Further external validation with larger samples is needed to optimize these models for clinical application.