Leveraging Unsupervised Learning for Automated Schema Matching and Data Harmonization in Multi-Source Electronic Health Record Integration
Abstract
Integrating electronic health records (EHRs) from disparate sources is a prerequisite for large-scale clinical analytics, population health management, and precision medicine initiatives. The process is hindered by the labor-intensive work of schema matching and data harmonization, which requires establishing semantic correspondences between heterogeneous database structures and medical terminologies. Traditional rule-based or supervised learning approaches are often brittle, require extensive domain expertise, and fail to scale to the dynamic nature of healthcare data environments. This research investigates unsupervised learning techniques for automating the discovery of semantic mappings across multi-source EHR schemas. It develops a scalable, domain-agnostic framework that reduces manual intervention by leveraging the intrinsic structure and latent features of the clinical data itself.
The proposed hybrid methodology combines distributional semantics with structural analysis. First, contextual embeddings, such as BERT-based models fine-tuned on biomedical text, generate vector representations of schema elements, including table names, attribute labels, and actual data instances such as clinical notes and lab values. Next, similarity clustering algorithms, specifically agglomerative hierarchical clustering and affinity propagation, group semantically related elements across source systems without requiring pre-labeled training data. This resolves terminological discrepancies where, for example, "Myocardial Infarction" in one system corresponds to "Heart Attack" in another. By relying on the inherent properties of the health records themselves rather than rigid external mappings, the approach autonomously identifies complex correspondences between diverse coding systems such as ICD-10 and SNOMED CT, enabling coherent data federation across hospital systems, research consortia, and public health databases. The findings demonstrate that unsupervised learning can significantly accelerate the interoperability lifecycle, reduce the overhead of manual schema curation, and enhance the quality of integrated clinical datasets, ultimately facilitating more robust secondary use of EHR data for improving patient outcomes and advancing biomedical discovery.
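
The embed-then-cluster pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. To stay dependency-free it substitutes character trigram counts for the BERT-based embeddings and implements only average-linkage agglomerative clustering, not affinity propagation. The function names (`embed`, `cosine_distance`, `cluster_schema_elements`), the sample attribute labels, and the distance threshold of 0.5 are all assumptions chosen for illustration.

```python
from collections import Counter
from itertools import combinations
import math


def embed(label, n=3):
    """Stand-in embedding (assumption): character n-gram counts of a label.

    A real pipeline would use a contextual biomedical language model here.
    """
    text = f" {label.lower().replace('_', ' ')} "
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine_distance(a, b):
    """Cosine distance between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return 1.0 - dot / norm if norm else 1.0


def cluster_schema_elements(elements, threshold=0.5):
    """Average-linkage agglomerative clustering of (source, label) pairs.

    Merging stops once the closest pair of clusters is farther apart
    than `threshold`, so no labeled training data is needed.
    """
    vectors = [embed(label) for _, label in elements]
    dist = {
        (i, j): cosine_distance(vectors[i], vectors[j])
        for i, j in combinations(range(len(elements)), 2)
    }

    def linkage(c1, c2):
        # Mean pairwise distance between members of the two clusters.
        total = sum(dist[min(i, j), max(i, j)] for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    clusters = [[i] for i in range(len(elements))]
    while len(clusters) > 1:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        if linkage(clusters[a], clusters[b]) > threshold:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [[elements[i] for i in c] for c in clusters]


if __name__ == "__main__":
    # Hypothetical attribute labels from two source systems.
    elements = [
        ("hospital_a", "patient_birth_date"),
        ("hospital_b", "birth_date"),
        ("hospital_a", "serum_glucose_level"),
        ("hospital_b", "glucose_level_serum"),
    ]
    for group in cluster_schema_elements(elements):
        print(group)
```

The trigram stand-in only captures lexical overlap: labels such as "myocardial_infarction" and "heart_attack" share no character trigrams and would remain in separate clusters. Bridging that kind of gap is the role the abstract assigns to contextual embeddings, and replacing `embed` with a dense model-based vector would leave the clustering step unchanged apart from the distance computation.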