Patient Deduplication in Uganda’s Electronic Medical Records System: A Comparison of Three Classification Algorithms
Abstract
Background: Duplicate patient records pose a significant challenge to healthcare registries and electronic medical record (EMR) systems in Uganda, primarily due to the absence of a national unique patient identifier. These duplicates lead to fragmented patient care, misallocation of resources, and inaccuracies in data reporting, which hinder effective monitoring of disease progression, disrupt continuity of care, and complicate efforts to track patient outcomes.

Objective: To evaluate the performance of three classification algorithms in identifying duplicate records of people living with HIV (PLHIV) and to determine a combination of variables that can uniquely identify a PLHIV.

Methods: The study used a six-step deduplication process involving dataset extraction, preprocessing, indexing, comparison, classification, and performance evaluation. Records of PLHIV who were active in care between June and November 2022 were extracted from the UgandaEMR system, an EMR installed at 15 public health facilities in six districts in the Rwenzori Region. The dataset included the following demographic variables: first name, middle name, last name, sex, age, date of birth, address, and phone number. Three classification algorithms were used to classify the client scores into matches, potential matches, and non-matches: i) a threshold-based algorithm, ii) a weighted average score-based algorithm, and iii) a decision tree. Due to the absence of a labeled dataset, the decision tree was trained on data labeled using the two rule-based methods and evaluated on a synthetic reference dataset. Performance of the algorithms was evaluated using sensitivity, specificity, and F-score metrics.

Results: A total of 44,717 records for PLHIV active in care in the Rwenzori region from June to November 2022 were extracted. The weighted average score-based algorithm identified 447 (5.8%) records as duplicates and 2,996 (10%) as potential duplicates. The threshold-based algorithm identified 118 (0.5%) duplicates and flagged 8,560 (21.0%) as potential duplicates. The weighted average score-based algorithm achieved the highest performance: sensitivity (99.0%), specificity (98.8%), and F-score (98.9%); followed by the threshold-based classification: sensitivity (95.3%), specificity (89.1%), and F-score (92.1%); and the decision tree algorithm: sensitivity (92.3%), specificity (93.9%), and F-score (93.1%).

Conclusions: The weighted average score-based algorithm achieved the best performance. Findings highlight that a combination of a few demographic variables can be employed to differentiate PLHIV. However, improving duplicate record detection at scale will require training these algorithms on a larger dataset that generalizes to the PLHIV population in Uganda.
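To make the classification and evaluation steps concrete, the following is a minimal sketch of how a weighted average score-based classifier and the three reported metrics could be computed. It is not the authors' implementation: the field weights, the two cut-off values, the use of `difflib.SequenceMatcher` as the string comparator, and all function names are assumptions chosen for illustration, since the abstract does not report them. The F-score is assumed here to be the F1 score (harmonic mean of precision and sensitivity).

```python
from difflib import SequenceMatcher

# Hypothetical field weights and cut-offs; the abstract does not report the values used.
WEIGHTS = {
    "first_name": 0.25,
    "last_name": 0.25,
    "sex": 0.10,
    "date_of_birth": 0.25,
    "phone_number": 0.15,
}
MATCH_CUTOFF = 0.90      # assumed: at or above this, the pair is a match
POTENTIAL_CUTOFF = 0.70  # assumed: at or above this, the pair is a potential match


def field_similarity(a, b):
    """Return a similarity in [0, 1]; a missing value on either side scores 0."""
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, str(a).lower(), str(b).lower()).ratio()


def weighted_score(rec_a, rec_b):
    """Weighted average of per-field similarities for one candidate record pair."""
    total_weight = sum(WEIGHTS.values())
    weighted_sum = sum(
        weight * field_similarity(rec_a.get(field), rec_b.get(field))
        for field, weight in WEIGHTS.items()
    )
    return weighted_sum / total_weight


def classify(score):
    """Map a pair score to one of the three classes named in the abstract."""
    if score >= MATCH_CUTOFF:
        return "match"
    if score >= POTENTIAL_CUTOFF:
        return "potential match"
    return "non-match"


def evaluate(predicted, actual):
    """Compute sensitivity, specificity, and F1 from parallel lists of booleans
    (True means the pair is a duplicate)."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)
    tn = sum(1 for p, a in pairs if not p and not a)
    fp = sum(1 for p, a in pairs if p and not a)
    fn = sum(1 for p, a in pairs if not p and a)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denominator = precision + sensitivity
    f_score = 2 * precision * sensitivity / denominator if denominator else 0.0
    return {"sensitivity": sensitivity, "specificity": specificity, "f_score": f_score}
```

In this sketch, a pair of records is scored with `weighted_score`, the score is passed to `classify`, and the resulting labels are compared against a reference set with `evaluate`. A threshold-based variant would instead apply a separate cut-off to each field similarity rather than to a single combined score.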