Identification of Cohorts with Inflammatory Bowel Disease Amidst Fragmented Clinical Databases via Machine Learning



Abstract

Introduction: Inflammatory bowel disease (IBD) cohort identification typically relies on read/billing codes, which may miss some patients. Attempts have been made to add medication records and other datasets to improve cohort capture. However, a complete picture usually cannot be obtained because of database fragmentation and missingness. This study used novel cohort retrieval methods to identify the total IBD cohort from a large university teaching hospital with a specialist intestinal failure unit.

Methods: Eleven clinical databases covering 2008 to 2023 (ICD-10 codes, OPCS-4 codes, clinician-entry IBD registry, IBD patient portal, prescriptions, biochemistry, flare line calls, clinic appointments, endoscopy, histopathology, and clinic letters) were identified as having the potential to help identify local IBD patients. A gold-standard validation cohort was created through a manual chart review. A regex string search for normalised IBD terms was applied to the three free-text databases (endoscopy, histopathology, and clinic letters) to identify patients more likely to have IBD. The 11 databases were compared statistically, using cardinality and Jaccard similarity, to derive informed estimates of the total IBD population. A penalised logistic regression (LR) classifier was trained on 70% of the data and validated against a 30% holdout set to identify individual IBD patients.

Results: The gold-standard validation cohort comprised 2,800 patients: 2,180 (78%) with IBD and 619 (22%) non-IBD cases. The precision for IBD ranged from 0.75-1 to 0.18-1. All the databases contained unique patients that were not covered by the Casemix ICD-10 database. The Jaccard similarity estimation predicted 18,594 patients, but this is an overestimate. The penalised LR model (AUROC 0.85 on the validation set) confidently identified 8,060 patients with IBD (threshold: 0.586); at a lower threshold (0.25), the model identified 12,760 patients with a higher recall of 0.92. By combining the true-positive cases from the LR model with likely true-positive IBD clinic letters, a final estimate of 12,998 patients with IBD was obtained. True positives from ICD-10 codes combined with medication (n = 8,048) covered only 61.6% of the total local IBD population, indicating that current code- and medication-based methods missed up to 38.4% of IBD patients.

Conclusion: Diagnostic billing codes and medication data alone cannot accurately identify complete IBD cohorts. A multimodal cross-database model can partially compensate for this deficit. More robust natural language processing (NLP)-based mechanisms are required to improve IBD cohort identification further.
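
The database comparison described in the Methods can be illustrated with a minimal sketch. It assumes each database has been reduced to a set of patient identifiers; the database names, the toy identifiers, and the `jaccard` helper are hypothetical and are not taken from the study. It shows only the standard pairwise Jaccard similarity and the count of patients absent from the ICD-10 set, not the authors' population-estimation procedure.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B|; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical toy data: each database reduced to a set of patient IDs.
databases = {
    "icd10": {1, 2, 3, 4},
    "prescriptions": {3, 4, 5},
    "clinic_letters": {4, 5, 6, 7},
}

# Cardinality of each database.
cardinality = {name: len(ids) for name, ids in databases.items()}

# Pairwise Jaccard similarity between every pair of databases.
pairwise = {
    (x, y): jaccard(databases[x], databases[y])
    for x, y in combinations(databases, 2)
}

# Patients seen in any database, and those missing from the ICD-10 set.
all_patients = set().union(*databases.values())
not_in_icd10 = all_patients - databases["icd10"]

for pair, score in pairwise.items():
    print(pair, round(score, 3))
print("unique patients:", len(all_patients))
print("not covered by ICD-10:", sorted(not_in_icd10))
```

A low pairwise similarity indicates that two databases capture largely different patients, which is the situation in which relying on a single source undercounts the cohort.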
