Performance of Contradiction Rule Implementations in Health Data for Efficient Data Quality Assessments

Abstract

Introduction: Boolean rules are the building blocks of rule-based data quality assessment (DQA) in health research. While some DQA rules are generic, contradiction rules are derived from established facts supported by domain knowledge. A recent study reported that infrastructure performance degrades as DQA rules scale. DQA rules can be implemented in different ways. In this study, we examine the performance of different DQA rule implementations for contradictory dependencies among data items for cardiovascular disease assessment and propose an optimization method that integrates the strengths of the different approaches.
Methods: We compared three Boolean rule implementations for contradiction assessment of 12 cardiovascular disease items used in the cross-sectoral platform of the national COVID-19 cohort: 1) the raw domain rule-set joined using the Boolean OR operator; 2) two minimal Boolean rules derived from the twelve raw domain rules through rule reduction; and 3) atomic Boolean rules, one for each rule in the raw domain rule-set. The implementations were evaluated on execution speed and memory utilization using the original dataset of about 2000 subjects, amplified by factors of 2.5, 5, 10, 50, and 100. A two-step approach was adopted to integrate the fastest implementation with the atomic contradiction rules.
Results: On the largest dataset, the raw domain rule-set (1) was more than 100 times faster than the atomic rules (3) and 9 times faster than the minimal Boolean rules (2). However, it required about 3 times more memory than the other implementations. All implementations showed a linear dependency on dataset size, except that the minimal Boolean rules (2) showed a shallower slope in memory utilization. Two-step rule processing narrowed the speed advantage of the raw rule-set (1) over the atomic rules (3) from 100 times to just 3 times.
Discussion: Only the atomic rules (3) produce detailed and traceable DQA results, which are required for further inspection of the contradictions. Combined rule processing can bridge the speed gap between the raw rule-set and the atomic rules by executing the fastest rules on the entire dataset and the atomic rules only on the fraction of the data that contains contradictions, allowing for fast but detailed DQA.
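
The two-step idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the item names (`no_cvd`, `myocardial_infarction`, `heart_failure`), the two example rules, and the function names are hypothetical, and plain Python will not reproduce the timings reported in the abstract. The sketch only shows the control flow, in which a single OR-joined rule screens every record and the named atomic rules run only on the flagged records.

```python
from typing import Callable, Dict, List

Record = Dict[str, bool]
Rule = Callable[[Record], bool]

# Hypothetical atomic contradiction rules. The item names are illustrative
# only and are not taken from the study's rule-set.
ATOMIC_RULES: Dict[str, Rule] = {
    "no_cvd_but_infarction": lambda r: r["no_cvd"] and r["myocardial_infarction"],
    "no_cvd_but_heart_failure": lambda r: r["no_cvd"] and r["heart_failure"],
}


def combined_rule(record: Record) -> bool:
    """Step 1 screen: OR of all rules, stopping at the first rule that fires."""
    return any(rule(record) for rule in ATOMIC_RULES.values())


def two_step_assessment(records: List[Record]) -> Dict[int, List[str]]:
    """Return {record index: names of the atomic rules that the record violates}."""
    # Step 1: run the cheap combined rule on the entire dataset.
    flagged = [i for i, record in enumerate(records) if combined_rule(record)]
    # Step 2: run the traceable atomic rules only on the flagged records.
    return {
        i: [name for name, rule in ATOMIC_RULES.items() if rule(records[i])]
        for i in flagged
    }
```

Calling `two_step_assessment` on a list of records returns an entry only for records with at least one contradiction, so the per-rule detail is computed for that subset alone.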
