Identifying clusters of people with Multiple Long-Term Conditions using Large Language Models: a population-based study
Abstract
Background
Identifying clusters of people with similar patterns of Multiple Long-Term Conditions (MLTC) could help healthcare services tailor management to each group. Large Language Models (LLMs) can make use of complex longitudinal electronic health records (EHRs), which may enable deeper insights into patterns of disease. Here, we develop a pipeline, incorporating an LLM, to generate gender-specific clusters using clinical codes recorded in EHRs.
Methods
In this population-based study, we used EHRs from individuals aged ≥50 years from the Clinical Practice Research Datalink in the UK. Longitudinal sequences of medical histories, including diagnoses, diagnostic tests and medications, were used to pre-train an LLM based on DeBERTa. The LLM, called EHR-DeBERTa, includes embedding layers for age at diagnosis, calendar year of diagnosis, gender, and visit number, with a diagnosis vocabulary of 3776 tokens covering the entire ICD-10 hierarchy. We fine-tuned EHR-DeBERTa using contrastive learning and generated patient embeddings for all individuals. A bootstrapping clustering pipeline was applied separately for females and males, and the resulting gender-specific patient clusters were characterised by disease prevalence, ethnicity and deprivation.
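As an illustration of what a bootstrapping clustering step can look like, the sketch below refits a clustering on resampled data and measures how consistently the full dataset is partitioned. It is a minimal sketch under stated assumptions, not the study's pipeline: the abstract does not name the clustering algorithm or the stability measure, so k-means and the Rand index are stand-ins, the 2-D toy points stand in for patient embeddings, and every function name is hypothetical.

```python
import random
from itertools import combinations


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
            for p in points]


def kmeans(points, k, rng, iters=50):
    """Plain Lloyd's k-means (assumed stand-in); returns the fitted centroids."""
    centroids = rng.sample(points, k)
    for _ in range(iters):
        labels = assign(points, centroids)
        new = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new.append(centroids[c])  # keep the centroid of an empty cluster
        if new == centroids:
            break
        centroids = new
    return centroids


def rand_index(a, b):
    """Fraction of point pairs on which two labelings agree.

    A pair agrees when both labelings put it in the same cluster or both
    put it in different clusters, so the score ignores label permutations.
    """
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)


def bootstrap_stability(points, k, n_boot=20, seed=0):
    """Mean Rand index between a reference fit and fits on bootstrap resamples."""
    rng = random.Random(seed)
    reference = assign(points, kmeans(points, k, rng))
    scores = []
    for _ in range(n_boot):
        sample = [rng.choice(points) for _ in points]  # resample with replacement
        centroids = kmeans(sample, k, rng)
        scores.append(rand_index(reference, assign(points, centroids)))
    return sum(scores) / len(scores)


# Toy data: two well-separated blobs standing in for patient embeddings.
data_rng = random.Random(42)
toy_points = (
    [(data_rng.gauss(0, 0.5), data_rng.gauss(0, 0.5)) for _ in range(20)]
    + [(data_rng.gauss(10, 0.5), data_rng.gauss(10, 0.5)) for _ in range(20)]
)
score = bootstrap_stability(toy_points, k=2)
```

In a sketch like this, one would run the stability check separately for each gender and compare scores across candidate numbers of clusters; a value of `k` whose partitions barely change under resampling is a more defensible choice than one that fluctuates.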
Findings
A total of 5,846,480 patients were included. We identified fifteen clusters in females and seventeen clusters in males, grouped into five categories: i) low disease burden; ii) mental health; iii) cardiometabolic diseases; iv) respiratory diseases; and v) mixed diseases. Cardiometabolic and mental health conditions showed the strongest separation across clusters. People in low disease burden and mental health clusters were younger, whereas those in cardiometabolic clusters were older, with females in cardiometabolic clusters older than their male counterparts.
Interpretation
Using an LLM applied to longitudinal EHRs, we generated interpretable and gender-specific clusters of diseases, providing insights into patterns of diseases. Extending these methods in future to incorporate clinical outcomes could enable identification of high-risk patients and support precision-medicine approaches for managing MLTC.