Advancing event log preparation with quality optimization for hospital process mining
Abstract
Background: Time-dependent clinical events collected from electronic health records (EHR), known as event logs, provide rich information, yet a systematic approach to addressing their quality problems is lacking. A multi-layer approach is proposed to enhance the quality of hospital event logs used to assess unplanned readmission risk in patients with heart failure (HF).

Method: Eligible patients were identified from DREAM, a multi-site hospital dataset encompassing routinely collected EHR data within a large metropolitan health system in Australia. At the source level, the Weiskopf and Weng framework was adopted to evaluate quality across five dimensions (currency, correctness, completeness, concordance, and plausibility), together with alignment with the study objective, at multiple analytical levels. Results were benchmarked against the publicly available Medical Information Mart for Intensive Care IV (MIMIC-IV) hospital database. A biodiversity framework was employed to assess quality at the log level, and the resulting measures were compared with those of MIMIC-IV logs.

Results: DREAM provided a timely, area-specific source of information with superior currency and source completeness compared with the benchmark database. Correctness and plausibility were comparable for both sources. Both datasets showed higher coverage than log completeness, with MIMIC-IV logs demonstrating greater complexity, reflected in higher diversity across all subpopulation groups.

Conclusion: This multi-layer approach aligned closely with the study objective, enabling domain-specific contextual awareness and mitigating bias at multiple levels. By incorporating an additional layer of event log quality evaluation based on biodiversity theory, the approach enhanced external validity and internal fairness, improving log comparability across data sources and within subpopulation groups.
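
To illustrate what log-level measures borrowed from biodiversity theory might look like, the sketch below treats each distinct trace variant (an ordered sequence of activities for one case) as a "species". The abstract does not name the estimators used, so the choices here are assumptions: Shannon entropy and its Hill-number equivalent for diversity, the Good-Turing estimator for coverage, and the ratio of observed to Chao1-estimated variant richness for completeness. The function name `log_diversity_summary` and the toy traces are hypothetical.

```python
import math
from collections import Counter


def log_diversity_summary(traces):
    """Summarise an event log with biodiversity-style measures.

    `traces` is a list of tuples, one per case, each holding the ordered
    activity labels of that case. Assumption: trace variants play the role
    of species; the estimators below are common ecological choices, not
    necessarily those used in the study.
    """
    variant_counts = Counter(traces)
    n = sum(variant_counts.values())      # number of cases (individuals)
    s_obs = len(variant_counts)           # observed variants (species)
    f1 = sum(1 for c in variant_counts.values() if c == 1)  # singletons
    f2 = sum(1 for c in variant_counts.values() if c == 2)  # doubletons

    # Shannon entropy over variant frequencies, and its effective number
    # of variants (Hill number of order 1).
    shannon = -sum((c / n) * math.log(c / n) for c in variant_counts.values())
    hill_1 = math.exp(shannon)

    # Good-Turing sample coverage: estimated share of behaviour in the
    # underlying process that belongs to variants already observed.
    coverage = 1.0 - f1 / n

    # Chao1 lower-bound estimate of total variant richness; the second
    # branch is the bias-corrected form used when there are no doubletons.
    if f2 > 0:
        s_chao1 = s_obs + (f1 * f1) / (2.0 * f2)
    else:
        s_chao1 = s_obs + f1 * (f1 - 1) / 2.0

    # Completeness: observed variants as a share of estimated variants.
    completeness = s_obs / s_chao1

    return {
        "cases": n,
        "observed_variants": s_obs,
        "shannon": shannon,
        "hill_1": hill_1,
        "coverage": coverage,
        "completeness": completeness,
    }


# Toy log with five cases and three variants (illustrative data only).
toy_traces = [
    ("Admit", "Echo"),
    ("Admit", "Echo"),
    ("Admit", "Discharge"),
    ("Admit", "Echo", "Discharge"),
    ("Admit", "Discharge"),
]
summary = log_diversity_summary(toy_traces)
```

Under these assumptions, the same function could be applied to each data source and to each subpopulation group separately, so that diversity, coverage, and completeness can be compared on a common scale.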