Data Duplication and Errors in Large Medical Datasets: A Case Study in the IRIS® Registry

Abstract

Purpose: To investigate entry errors and data duplication within the IRIS® Registry using cataract surgery (CS), YAG capsulotomy (YAG), age-related macular degeneration (AMD), and diabetic retinopathy (DR) records.
Design: Retrospective cohort study.
Participants: Patients in the American Academy of Ophthalmology IRIS Registry (Intelligent Research in Sight).
Methods: We collected records of CS and YAG with specified laterality within the IRIS Registry (2013–2023), identifying eyes having > 1 record and eyes having ≥ 1 record on a date after the first entry (different date duplication, Dd). Additionally, we collected records of DR and AMD and identified eyes that either (1) were first diagnosed at the more severe stage and later reverted to the less severe stage, or (2) transitioned to the more severe stage and were later diagnosed with the less severe stage. We defined both patterns as transition errors. We built classification models and evaluated permutation feature importance (PFI) to investigate any relationship between patient demographics, Dd, and transition errors.
Main Outcome Measures: For CS and YAG, we measured the proportion of eyes having > 1 procedure record, having > 1 record only on the initial procedure date, and having ≥ 1 procedure record on a date after the first entry. For DR and AMD, we measured the proportion of eyes reverting to an earlier stage after starting at a later stage and the proportion reverting to an earlier stage after transitioning to a later stage.
Results: Of 14,718,896 CS-treated eyes, 30.9% had duplicates, and 5.5% had Dd. Of 5,113,679 YAG-treated eyes, 29.1% had duplicates, and 4.1% had Dd. For AMD and DR, 7.7% of eyes exhibited transition errors. Models captured a weak relationship between patient geolocation and the data errors under study, as indicated by average F1 score losses under PFI of 0.04 (Dd model) and 0.014 (transition error model).
Conclusions: Data duplication in large medical data sets can obscure the true instances of procedures and diagnoses. We did not find conclusive evidence of a relationship between demographic factors and transition errors or Dd, though model results indicate a weak relationship with patient geolocation.
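
The duplication and transition-error definitions in the Methods can be illustrated with a minimal Python sketch. This is not the authors' implementation. The record layout (an eye identifier paired with a procedure date), the integer severity codes (higher means more severe), and the names `duplication_flags` and `transition_error_type` are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import date


def duplication_flags(records):
    """Flag duplication per eye.

    records: iterable of (eye_id, procedure_date) tuples (hypothetical layout).
    Returns {eye_id: {"duplicate", "same_day_only", "dd"}} where
      duplicate     = more than one record for the eye,
      same_day_only = more than one record, all on the initial procedure date,
      dd            = at least one record dated after the first entry.
    """
    dates_by_eye = defaultdict(list)
    for eye_id, proc_date in records:
        dates_by_eye[eye_id].append(proc_date)

    flags = {}
    for eye_id, dates in dates_by_eye.items():
        first = min(dates)
        has_duplicate = len(dates) > 1
        has_later = any(d > first for d in dates)
        flags[eye_id] = {
            "duplicate": has_duplicate,
            "same_day_only": has_duplicate and not has_later,
            "dd": has_later,
        }
    return flags


def transition_error_type(stages):
    """Classify a chronologically ordered list of severity codes.

    Assumes higher codes mean more severe disease (an illustrative encoding).
    Returns None if severity never decreases, "started_severe" if the eye's
    first record was already at the peak stage before the first reversion,
    and "progressed_then_reverted" if the eye reached the peak stage later.
    """
    peak = None
    for stage in stages:
        if peak is not None and stage < peak:
            if stages[0] == peak:
                return "started_severe"
            return "progressed_then_reverted"
        peak = stage if peak is None else max(peak, stage)
    return None


# Toy records, invented purely to exercise the functions.
example_records = [
    ("eye_A", date(2020, 1, 1)),
    ("eye_A", date(2020, 1, 1)),   # same-day duplicate
    ("eye_B", date(2020, 1, 1)),
    ("eye_B", date(2020, 3, 1)),   # later-date duplicate (Dd)
    ("eye_C", date(2021, 5, 5)),   # single record
]
example_flags = duplication_flags(example_records)
```

Under these assumptions, the reported proportions would correspond to the share of eyes with each flag set. A real analysis would also need to handle laterality, missing dates, and legitimate repeat procedures, none of which this sketch addresses.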
