Benchmarking transformer-based models for medical record deidentification: A single centre, multi-specialty evaluation
Abstract
Background
Robust de-identification is necessary to preserve patient confidentiality and maintain public acceptance of electronic health record (EHR) research. Manual redaction of personally identifiable information (PII) outside of structured data is time-consuming and expensive, limiting the scale at which data can be shared. Automated de-identification (DeID) could alleviate this burden, with competing approaches including task-specific models and generalist large language models (LLMs). We aimed to identify the optimal strategy for PII redaction, evaluating several task-specific transformer-architecture models and generalist LLMs using no- and low-adaptation techniques.
Methods
We evaluated the performance of four task-specific models (the Microsoft Azure DeID service, AnonCAT, OBI RoBERTa, and BERT i2b2 DeID) and five general-purpose LLMs (Gemma-7b-IT, Llama-3-8B-Instruct, Phi-3-mini-128k-instruct, GPT-3.5-turbo-0125, and GPT-4-0125) at de-identifying 3650 medical records from a UK hospital group, split into general and specialised datasets. Records were dual-annotated by clinicians for PII. The primary outcomes were the F1 score, precision, and recall of each comparator in classifying words as PII vs. non-PII. The secondary outcomes were performance per PII subtype per dataset, and the Levenshtein distance as a proxy for hallucination (the addition of extra text). We report untuned performance for the task-specific models and zero-shot performance for the LLMs. To assess sensitivity to data shifts between hospital sites, we undertook concept alignment and fine-tuning of one task-specific model (AnonCAT), and performed few-shot (1, 5, and 10 examples) in-context learning for each LLM using site-specific data.
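To make the word-level evaluation concrete, the following is a minimal Python sketch. The binary label encoding, the helper names (`pii_word_metrics`, `levenshtein`), and the tokenisation into pre-aligned word labels are illustrative assumptions, not the study's actual pipeline:

```python
from typing import List, Tuple

def pii_word_metrics(gold: List[int], pred: List[int]) -> Tuple[float, float, float]:
    """Word-level precision, recall, and F1 for PII (1) vs. non-PII (0) labels,
    assuming gold and predicted labels are already aligned word-for-word."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def levenshtein(a: str, b: str) -> int:
    """Edit distance between the source record and the model's redacted output.
    Distance beyond what the expected redaction substitutions account for
    suggests added (hallucinated) text."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution/match
        prev = curr
    return prev[-1]

# Example: "John saw Dr Smith" with "Smith" missed by the redactor.
# gold = [1, 0, 0, 1]; pred = [1, 0, 0, 0] -> precision 1.0, recall 0.5, F1 0.667
```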
Results
17,496/479,760 (3.65%) words were PII. Inter-annotator F1 for word-level PII was 0.977 (95% CI 0.957-0.991). The best-performing redaction tool was the Microsoft Azure de-identification service: F1 0.939 (0.934-0.944), precision 0.928 (0.922-0.934), recall 0.950 (0.943-0.958). The next-best tools were fine-tuned AnonCAT: F1 0.910 (0.905-0.914), precision 0.978 (0.973-0.982), recall 0.850 (0.843-0.858), and GPT-4-0125 (ten-shot): F1 0.898 (0.876-0.915), precision 0.874 (0.834-0.906), recall 0.924 (0.914-0.933). Phi-3-mini-128k-instruct and Llama-3-8B-Instruct produced hallucinatory output in the zero-, one-, and five-shot settings, and Gemma-7b-IT did so universally. Fine-tuning significantly improved AnonCAT's performance (F1 increased from 0.851 (0.843-0.859) to 0.910 (0.905-0.914)). All comparators consistently redacted names and dates; performance on other PII categories varied. Fine-tuned AnonCAT demonstrated the least performance shift across datasets.
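The abstract does not state how the 95% confidence intervals were derived; a common choice for metrics like these, sketched here purely as an assumption, is a non-parametric percentile bootstrap that resamples whole records and recomputes F1 with the `pii_word_metrics` helper above:

```python
import random
from typing import List, Tuple

def bootstrap_f1_ci(records: List[Tuple[List[int], List[int]]],
                    n_boot: int = 1000, alpha: float = 0.05,
                    seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap CI for word-level F1.

    `records` is a list of (gold_labels, pred_labels) pairs, one per record.
    This is an illustrative assumption, not the study's documented procedure.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_boot):
        # Resample records with replacement, then pool their word labels.
        sample = [records[rng.randrange(len(records))] for _ in records]
        gold = [g for gs, _ in sample for g in gs]
        pred = [p for _, ps in sample for p in ps]
        scores.append(pii_word_metrics(gold, pred)[2])
    scores.sort()
    lo = scores[int(alpha / 2 * n_boot)]
    hi = scores[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

Resampling at the record level (rather than the word level) respects the fact that words within one record are not independent observations.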
Conclusion
Automated EHR de-identification using transformer models could facilitate large-scale, domain-agnostic record sharing for medical research, alongside other safeguards to prevent re-identification. Low-adaptation strategies may improve the performance of both generalist LLMs and task-specific models.