Automated De-Identification, Consistent Obfuscation, and Regulatory Grade Validation of 2 Billion Patient Notes

Abstract

Large, diverse collections of anonymous patient data (including text, numbers, and images) are essential to advancing a broad range of causes, from clinical decision support and real-world evidence to population health and hospital operations. This study presents a novel system used to automatically de-identify unstructured clinical text from 2 billion patient notes, using consistent obfuscation and tokenization to link them into a unified longitudinal dataset. To the best of our knowledge, this is the first such system to be externally certified for regulatory-grade accuracy on real-world data at this scale. The system is based on proprietary medical language models and a modified version of Spark NLP, a distributed-computing NLP framework for efficient execution on large clusters of commodity hardware. It satisfies the Expert Determination de-identification criteria under the HIPAA (Health Insurance Portability and Accountability Act) Privacy Rule, with a baseline requirement of <5% Protected Health Information (PHI) prevalence both in aggregate and per record. It achieves 99% PHI obfuscation and 100% masking or shifting of target data fields. This level of accuracy surpasses even that of a triple manual review by 3 human annotators. Obfuscation adds another layer of protection: because detected PHI is replaced with realistic substitutes, the replaced elements are indistinguishable from any PHI elements the system missed. Name changes, date shifting, and identifier tokenization are applied consistently across all documents about the same patient. An equity analysis was performed to ensure the system is not biased across demographic groups by gender, age, ethnicity, or state. Finally, an independent audit including adversarial testing on 790 randomly selected patients was performed, in which a dedicated “red team” working for 3 months was not able to re-identify any of the patients.
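The abstract does not describe how consistency across a patient's documents is implemented. As a minimal illustrative sketch only, one common way to obtain the same token and the same date shift for every note about a patient is to derive both from a keyed hash of the patient identifier. Everything below is an assumption made for illustration, not the authors' method: the use of HMAC-SHA256, the `SECRET_KEY` value, the 30-day shift window, and the function names `patient_token`, `shift_days`, and `shift_date` are all hypothetical.

```python
import hashlib
import hmac
from datetime import date, timedelta

# Hypothetical secret for illustration; a real deployment would load it
# from a protected key store and never hard-code it.
SECRET_KEY = b"example-secret-not-for-production"


def _digest(patient_id: str, purpose: str) -> bytes:
    """Keyed hash of the patient ID, domain-separated by purpose."""
    msg = f"{purpose}:{patient_id}".encode("utf-8")
    return hmac.new(SECRET_KEY, msg, hashlib.sha256).digest()


def patient_token(patient_id: str) -> str:
    """Stable pseudonymous token: same patient ID gives the same token."""
    return _digest(patient_id, "token").hex()[:16]


def shift_days(patient_id: str, max_days: int = 30) -> int:
    """Per-patient day offset in [-max_days, -1] or [1, max_days]."""
    n = int.from_bytes(_digest(patient_id, "date")[:8], "big")
    offset = n % (2 * max_days) - max_days  # in [-max_days, max_days - 1]
    return offset if offset < 0 else offset + 1  # skip a zero shift


def shift_date(patient_id: str, d: date) -> date:
    """Shift a date by the patient's fixed offset."""
    return d + timedelta(days=shift_days(patient_id))
```

Because the offset depends only on the patient identifier and the key, two dates from different notes about the same patient move by the same number of days, so intervals between events are preserved: `shift_date(pid, d2) - shift_date(pid, d1)` equals `d2 - d1`. This is what allows shifted notes to remain usable as a longitudinal record.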
