A Generalized Data Quality Assessment Framework for Diverse Health Datasets with Varied Contradiction Rules
Abstract
Background: Data quality assessment (DQA) in health research is guided by rules that express indicators such as completeness, conformance, and plausibility. However, contradiction rules vary across health domains. For example, in cardiology it is established that diastolic blood pressure should not exceed the corresponding systolic measurement, whereas in a biobank it is an anomaly for a blood sample to remain stateless. Despite efforts to harmonize data quality indicators, DQA implementations are often limited by predefined rules and by the number of interdependent items they can evaluate, which makes the tools difficult to reuse. A generalized DQA framework that allows custom contradiction rules to be defined for datasets from varied health domains has yet to be reported.
Objective: The aim of this work is to develop a generalized DQA tool that can assess the quality of diverse health datasets, with validation focused on two cardiology datasets that differ in structure and rules: 1) the HiGHmed use case sensor data, and 2) the Medical Information Mart for Intensive Care - Electrocardiograms (MIMIC-IV-ECG) dataset.
Methods: A DQA tool that decentralizes assessment rule generation was developed in R. The requirements elicited for the tool include: 1) support for custom rule definition, 2) support for diverse datasets and formats, 3) a detailed and traceable assessment report, and 4) an interactive interface to aid usability. To test support for diverse data formats, the tool incorporates Fast Healthcare Interoperability Resources (FHIR) and open Electronic Health Record (openEHR) parsers in R that decompose the bundled resources within the FHIR documents and the openEHR compositions of the HiGHmed sensorik study. The generalizability of the DQA tool was tested using custom rules specific to the HiGHmed sensor data and the MIMIC-IV-ECG dataset. For usability, an R Shiny interface was implemented for interactive DQ assessment with datasets of the user's choice and user-defined rules.
Results: All elicited requirements, including those desired by target users, were fulfilled. The generalized framework has two segments: an overarching function that allows custom rule generation, and an inner function that evaluates the defined rules on the target dataset. Data from the HiGHmed cardiology use case were evaluated first, and the tool was then successfully reused for the MIMIC-IV-ECG dataset. In the interactive module, the tool also evaluates contradictions using custom rules defined by study personnel on imported study data.
Discussion: A reusable DQA tool is offered that analyzes health data sourced from different data models and produces comparable results across multiple study sites. The interactive DQA interface would assist study personnel in monitoring study data and addressing emerging data quality concerns. The tool's usability gives domain experts the freedom to define and integrate rules tailored to their domain.
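The two-segment design described in the Results (an outer function that generates a custom rule and an inner function that evaluates it on a dataset) can be illustrated with a minimal sketch. The tool itself was developed in R; the Python below is not its implementation, and every name in it (`make_contradiction_rule`, `evaluate`, the record fields, and the sample values) is a hypothetical chosen for illustration. The rule shown is the cardiology example from the Background: diastolic pressure should not exceed systolic pressure.

```python
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]


def make_contradiction_rule(
    name: str,
    items: List[str],
    is_contradiction: Callable[..., bool],
) -> Callable[[List[Record]], List[Dict[str, Any]]]:
    """Outer function (hypothetical): bind a user-defined rule to the
    interdependent items it inspects and return an evaluator."""

    def evaluate(records: List[Record]) -> List[Dict[str, Any]]:
        """Inner function: apply the bound rule to every record and
        return one traceable finding per contradiction."""
        findings = []
        for index, record in enumerate(records):
            values = [record.get(item) for item in items]
            # Assumption: a missing value is a completeness issue,
            # not a contradiction, so such records are skipped here.
            if any(value is None for value in values):
                continue
            if is_contradiction(*values):
                findings.append(
                    {"rule": name, "row": index, "values": dict(zip(items, values))}
                )
        return findings

    return evaluate


# Custom rule from the cardiology example: diastolic must not exceed systolic.
bp_rule = make_contradiction_rule(
    "diastolic_exceeds_systolic",
    ["systolic", "diastolic"],
    lambda systolic, diastolic: diastolic > systolic,
)

# Invented sample records, for illustration only.
records = [
    {"id": "p1", "systolic": 120, "diastolic": 80},
    {"id": "p2", "systolic": 90, "diastolic": 110},
    {"id": "p3", "systolic": 130, "diastolic": None},
]

findings = bp_rule(records)
for finding in findings:
    print(finding)
```

Because the rule is passed in as a plain predicate over named items, the same evaluator could in principle be reused for a different domain (for example, a biobank rule on sample state) by supplying different item names and a different predicate, which is the kind of reuse the abstract describes.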