Linking Patient Records at Scale with a Hybrid Approach Combining Contrastive Learning and Deterministic Rules
Abstract
Linking patient records across disparate healthcare systems is essential for building a comprehensive view of patient health, yet the task is complicated by inconsistent identifiers, data quality issues, and privacy constraints. Traditional deterministic and probabilistic methods are widely used for record linkage, but their performance is often limited when personally identifiable information (PII) is noisy or incomplete, and privacy-preserving variants commonly restrict matching to exact token equality. This work presents a hybrid record linkage approach that integrates a deep embedding model with deterministic rules, combining the flexibility and noise robustness of soft embeddings with the reliable, predictable baseline performance of deterministic rules. Using a large-scale real-world dataset, a BERT-based embedding model is fine-tuned in a siamese network with a contrastive loss to encode PII fields as numeric vectors. The system is implemented and evaluated on a commercial database of 250 million PII records, demonstrating its successful use in a real-world healthcare setting.
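To make the two ingredients named in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes the classic pairwise form of the contrastive loss (squared distance for matching pairs, squared margin shortfall for non-matching pairs) and a simple "rules first, embeddings as fallback" combination. The abstract does not specify the exact loss variant, the rule set, or how the two signals are combined, so the field names (`last_name`, `dob`, `zip`), the `margin`, and the `threshold` are all hypothetical. The embeddings are plain lists of floats standing in for the output of a fine-tuned encoder.

```python
import math
from typing import Mapping, Sequence


def contrastive_loss(u: Sequence[float], v: Sequence[float],
                     is_match: bool, margin: float = 1.0) -> float:
    """Classic pairwise contrastive loss on two embeddings (assumed variant).

    Matching pairs are penalized by their squared Euclidean distance;
    non-matching pairs are penalized only if closer than `margin`.
    """
    d = math.dist(u, v)
    if is_match:
        return d ** 2
    return max(0.0, margin - d) ** 2


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def deterministic_match(a: Mapping[str, str], b: Mapping[str, str]) -> bool:
    """Hypothetical rule: all key fields present and exactly equal."""
    keys = ("last_name", "dob", "zip")  # illustrative fields, not from the paper
    return all(a.get(k) and a.get(k) == b.get(k) for k in keys)


def hybrid_match(a: Mapping[str, str], b: Mapping[str, str],
                 emb_a: Sequence[float], emb_b: Sequence[float],
                 threshold: float = 0.9) -> bool:
    """Assumed combination: accept on an exact rule hit, else fall back
    to embedding similarity against a hypothetical threshold."""
    if deterministic_match(a, b):
        return True
    return cosine_similarity(emb_a, emb_b) >= threshold


# Usage: a typo in the surname defeats the exact rule, but nearby
# embeddings (stand-in values here) still link the two records.
rec_a = {"last_name": "smith", "dob": "1980-01-02", "zip": "02139"}
rec_b = {"last_name": "smyth", "dob": "1980-01-02", "zip": "02139"}
linked = hybrid_match(rec_a, rec_b, [1.0, 0.0], [1.0, 0.1])
```

In this arrangement the deterministic rule provides the predictable baseline, while the embedding comparison recovers pairs that exact matching would miss because of noise in the PII fields. Other combinations, such as requiring both signals to agree, are equally plausible readings of the abstract.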