A reproducible open-source framework for defining type 1 and type 2 diabetes research cohorts in routinely collected electronic health record data
Abstract
Background
Electronic health record (EHR) data provide an increasingly important resource for studying people living with diabetes and their clinical outcomes, but robustly and reproducibly coding analysis datasets from these data is challenging. We aimed to develop a standardised data-processing framework for defining cohorts of people with type 1 and type 2 diabetes using EHR data.
Methods
We first provide a standardised, generalisable procedure for developing clinically reviewed code lists that robustly define variables in EHR data. Using UK population-based data from primary care linked to hospital admission records (Clinical Practice Research Datalink [CPRD]), we then develop and demonstrate a data-processing pipeline applicable to raw EHR data. The pipeline uses clinical code lists to define a population of individuals with diabetes and sets each individual's diabetes diagnosis date to the earliest recorded observation of diabetes (clinical code, high HbA1c test result, or prescription for glucose-lowering therapy). Using a previously validated approach, we classify diabetes type (gold standard type 1, type 2) based on insulin prescriptions, diabetes type-specific clinical codes, and age at diagnosis. Finally, we demonstrate how multiple research cohorts can be defined from this diabetes population based on a specific index date, including a range of baseline features (sociodemographic and lifestyle factors, biomarkers, comorbidities, medications) and key outcomes relevant to the research question.
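As an illustration only, the diagnosis-date rule described above (earliest of a diabetes clinical code, a high HbA1c result, or a glucose-lowering prescription) can be sketched in Python. This is a minimal sketch under stated assumptions, not the code shared in the repositories: the `Event` record, the `kind` labels, and the function names are hypothetical, and the 48 mmol/mol cut-off is the commonly used diagnostic HbA1c threshold, assumed here because the threshold used by the framework is not stated in this abstract.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, Iterable, Optional

# Assumption: 48 mmol/mol is the commonly used diagnostic HbA1c threshold.
# The threshold applied by the framework is not given in the abstract.
HBA1C_THRESHOLD_MMOL_MOL = 48.0


@dataclass
class Event:
    """One EHR observation for a patient (hypothetical structure)."""
    patient_id: str
    event_date: date
    kind: str  # hypothetical labels: "code", "hba1c", or "prescription"
    value: Optional[float] = None  # HbA1c in mmol/mol when kind == "hba1c"


def is_diabetes_evidence(event: Event) -> bool:
    """Return True if the event counts as a recorded observation of diabetes."""
    if event.kind in ("code", "prescription"):
        # Assumes events were already filtered against diabetes code lists
        # and glucose-lowering medication lists.
        return True
    if event.kind == "hba1c":
        return event.value is not None and event.value >= HBA1C_THRESHOLD_MMOL_MOL
    return False


def diagnosis_dates(events: Iterable[Event]) -> Dict[str, date]:
    """Map each patient to the date of their earliest diabetes evidence."""
    earliest: Dict[str, date] = {}
    for event in events:
        if not is_diabetes_evidence(event):
            continue
        current = earliest.get(event.patient_id)
        if current is None or event.event_date < current:
            earliest[event.patient_id] = event.event_date
    return earliest


# Illustrative, made-up records (not real patient data).
example_events = [
    Event("p1", date(2015, 3, 1), "hba1c", 42.0),   # below threshold, ignored
    Event("p1", date(2016, 5, 10), "hba1c", 55.0),  # first qualifying evidence
    Event("p1", date(2017, 1, 1), "code"),
    Event("p2", date(2018, 2, 2), "prescription"),
]
example_result = diagnosis_dates(example_events)
```

In this sketch a normal HbA1c result never sets the diagnosis date, so a patient whose first record is a sub-threshold test is dated from the first qualifying code, test, or prescription instead.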
Results
Application of the framework identified an incident cohort at diabetes diagnosis (type 1 diabetes [T1D]: N = 10,480, mean age at diagnosis [SD] = 10.4 [4.8]; type 2 diabetes [T2D]: N = 726,800, mean age at diagnosis [SD] = 60.5 [13.4]), a prevalent cohort actively registered with their GP practice on 01/02/2020 (T1D: N = 9,514, T2D: N = 559,905), and a T2D cohort initiating treatment with glucose-lowering therapies (N = 769,394 treatment initiations, considering 7 major medication classes). We publicly share our code lists and data processing code, making our research as transparent and reproducible as possible (https://github.com/Exeter-Diabetes/CPRD-Cohort-scripts, https://github.com/Exeter-Diabetes/CPRD-Codelists/).
Conclusions
We have developed a flexible and reproducible framework to generate analysis-ready diabetes research cohorts in EHR data. The concepts of this framework are applicable to any EHR dataset and have been shared for use by other researchers. This approach could improve the quality and reproducibility of the diverse epidemiological and clinical diabetes studies that use EHR data worldwide.