Leveraging Unsupervised Learning for Automated Schema Matching and Data Harmonization in Multi-Source Electronic Health Record Integration

Faith Harris
James Mcburnie
Mike Edwards


Abstract

Integrating electronic health records (EHRs) from disparate sources is a prerequisite for large-scale clinical analytics, population health management, and precision medicine. The process is held back by the labor-intensive work of schema matching and data harmonization, which requires establishing semantic correspondences between heterogeneous database structures and medical terminologies. Traditional rule-based or supervised learning approaches are often brittle, require extensive domain expertise, and scale poorly as healthcare data environments change. This research investigates unsupervised learning techniques for automating the discovery of semantic mappings across multi-source EHR schemas. It develops a scalable, domain-agnostic framework that reduces manual intervention by exploiting the intrinsic structure and latent features of the clinical data itself.

The proposed hybrid methodology combines distributional semantics with structural analysis. First, contextual embeddings, such as BERT-based models fine-tuned on biomedical text, generate vector representations of schema elements, including table names, attribute labels, and data instances such as clinical notes and lab values. Next, similarity clustering algorithms, specifically agglomerative hierarchical clustering and affinity propagation, group semantically related elements across source systems without pre-labeled training data. This resolves terminological discrepancies where, for example, "Myocardial Infarction" in one system corresponds to "Heart Attack" in another. Because it relies on the inherent properties of the health records rather than on rigid external mappings, the approach autonomously identifies complex correspondences between coding systems such as ICD-10 and SNOMED CT, enabling coherent data federation across hospital systems, research consortia, and public health databases. The findings indicate that unsupervised learning can accelerate the interoperability lifecycle, reduce the overhead of manual schema curation, and improve the quality of integrated clinical datasets, facilitating more robust secondary use of EHR data for improving patient outcomes and advancing biomedical discovery.
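The embed-then-cluster idea described above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: character trigram counts stand in for the BERT-based contextual embeddings, average-linkage agglomerative clustering is written out by hand, and the function names, the 0.4 similarity threshold, and the sample attribute labels are all hypothetical. Trigram vectors only capture surface similarity, so they would not link true synonyms such as "Myocardial Infarction" and "Heart Attack"; that step would need a biomedical embedding model.

```python
import math
import re
from collections import Counter


def char_ngrams(label, n=3):
    """Stand-in embedding (assumption): character n-gram counts of a label."""
    # Split camelCase, lowercase, and collapse separators into single spaces.
    spaced = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", label)
    text = " " + re.sub(r"[^a-z0-9]+", " ", spaced.lower()).strip() + " "
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def agglomerative(labels, threshold=0.4):
    """Average-linkage agglomerative clustering over schema labels.

    Merging stops once the best remaining pair of clusters has an average
    similarity below `threshold` (a hypothetical cut-off, not from the paper).
    """
    vectors = [char_ngrams(label) for label in labels]
    clusters = [[i] for i in range(len(labels))]

    def linkage(c1, c2):
        sims = [cosine(vectors[i], vectors[j]) for i in c1 for j in c2]
        return sum(sims) / len(sims)

    while len(clusters) > 1:
        best_sim, a, b = max(
            (linkage(clusters[x], clusters[y]), x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        if best_sim < threshold:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(labels[i] for i in c) for c in clusters]


# Hypothetical attribute labels from two source systems.
source_a = ["patient_birth_date", "lab_result_val", "diagnosis_code"]
source_b = ["PatientBirthDate", "LabResultValue", "DiagnosisCode"]

for group in agglomerative(source_a + source_b):
    print(group)
```

No labeled mappings are supplied at any point; the groups emerge from pairwise similarity alone, which is the property the abstract attributes to the unsupervised approach. In practice the threshold would have to be tuned per dataset, and affinity propagation, the other algorithm named above, avoids a fixed cut-off by selecting exemplars instead.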

Version published to 10.14293/pr2199.003098.v1
Mar 6, 2026

Related articles

Leveraging Unsupervised Learning for Automated Schema Matching and Data Harmonization in Multi-Source Electronic Health Record Integration

This article has 2 authors:
1. Claura Reid
2. Jim Wills
This article has no evaluations. Latest version: Feb 28, 2026
From manual entry to machine precision: challenges and evolution of metadata schema development in collaborative research centers

This article has 7 authors:
1. Felix Engel
2. Claudia Giuliani
3. Manuel Watter
4. Aref Kalantari
5. Karin Schuller
6. Harald Binder
7. Klaus Kaier
This article has no evaluations. Latest version: Feb 10, 2026
Construction of Personal Health Knowledge Graphs for Clinical Data Harmonization in Breast Cancer

This article has 16 authors:
1. Wenjie Liang
2. Rutger van Mierlo
3. Anne-Lore Bynens
4. Remzi Celebi
5. Ensar Erol
6. Ömer Durukan Kılıç
7. Isabelle de Zegher
8. Katerina Serafimova
9. Todor Primov
10. Svetla Boytcheva
11. Michaela Kargl
12. Mall Maasik
13. Cecile J.A. Wolfs
14. Aiara Lobo Gomes
15. Andre Dekker
16. Petros Kalendralis
This article has no evaluations. Latest version: Feb 9, 2026
