Data Quality Verification Metrics in Medicine: Experiments and Evaluations from the Perspectives of Safety and Utility
Abstract
Medical data is crucial not only for research but also for clinical decision support systems (CDSS). However, its use is often limited by strict privacy concerns and data scarcity, particularly for fields with small patient cohorts like rare diseases. Synthetic data is emerging as a promising solution, yet universal and standardized quality metrics are still lacking.

This study reviews and categorizes a range of metrics to evaluate the quality of medical synthetic data, followed by experimental validation using MIMIC-III admissions data (categorical) and AI-Hub dementia data (continuous). Safety was evaluated based on the risk of re-identification and membership inference attacks, while utility was assessed by measuring distributional similarity and the consistency of analytical results.

The synthetic categorical data (Admissions) demonstrated high utility and safety across most metrics. However, a low Nearest Neighbor Adversarial Accuracy (NNAA) score suggested a significant risk of the model overfitting to the original data. Conversely, the continuous data (Dementia) exhibited low utility and safety, confirming that generation methods must be tailored to data characteristics to preserve quality.

Ultimately, this study proposes a structured framework for evaluating medical synthetic data and highlights the critical need to select metrics appropriate for specific data types to ensure a reliable quality assessment.
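To make the NNAA result above easier to interpret, the following is a minimal sketch of the metric as it is commonly defined in the synthetic-data literature, not the study's own implementation. It assumes numeric feature vectors and Euclidean distance; the function names `nnaa` and `_min_dist` are hypothetical. Under this definition a score near 0.5 means real and synthetic records are about equally close to each other, a score near 0 means synthetic records sit almost on top of real ones (the overfitting signal the abstract refers to), and a score near 1 means the two sets are easily separated.

```python
import numpy as np


def _min_dist(a, b, exclude_self=False):
    """Distance from each row of `a` to its nearest row in `b`."""
    # Full pairwise Euclidean distance matrix (fine for small samples).
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    if exclude_self:
        # When a and b are the same set, ignore each point's distance to itself.
        np.fill_diagonal(d, np.inf)
    return d.min(axis=1)


def nnaa(real, synth):
    """Nearest Neighbor Adversarial Accuracy between two numeric samples."""
    real = np.asarray(real, dtype=float)
    synth = np.asarray(synth, dtype=float)
    d_rs = _min_dist(real, synth)                     # real -> nearest synthetic
    d_rr = _min_dist(real, real, exclude_self=True)   # real -> nearest other real
    d_sr = _min_dist(synth, real)                     # synthetic -> nearest real
    d_ss = _min_dist(synth, synth, exclude_self=True) # synthetic -> nearest other synthetic
    # Fraction of points whose nearest neighbor lies in their own set,
    # averaged over both directions.
    return 0.5 * (np.mean(d_rs > d_rr) + np.mean(d_sr > d_ss))


if __name__ == "__main__":
    real = np.array([[0.0, 0.0], [10.0, 10.0]])
    copied = real.copy()                              # synthetic data that memorized the originals
    distant = np.array([[100.0, 100.0], [101.0, 100.0]])
    print(nnaa(real, copied))
    print(nnaa(real, distant))
```

Categorical tables such as the admissions data would first need an encoding or a different distance (for example one-hot encoding or Hamming distance); that choice is an assumption left open here.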