Benchmarking transformer-based models for medical record deidentification: A single centre, multi-specialty evaluation

Abstract

Background

Robust de-identification is necessary to preserve patient confidentiality and maintain public acceptance of electronic health record (EHR) research. Manual redaction of personally identifiable information (PII) in unstructured text is time-consuming and expensive, limiting the scale at which data can be shared. Automated de-identification (DeID) could alleviate this burden; competing approaches include task-specific models and generalist large language models (LLMs). We aimed to identify the optimal strategy for PII redaction, evaluating several task-specific transformer-architecture models and generalist LLMs using no- and low-adaptation techniques.

Methods

We evaluated the performance of four task-specific models (the Microsoft Azure DeID service, AnonCAT, OBI RoBERTa, and BERT i2b2 DeID) and five general-purpose LLMs (Gemma-7b-IT, Llama-3-8B-Instruct, Phi-3-mini-128k-instruct, GPT-3.5-turbo-0125, and GPT-4-0125) at de-identifying 3650 medical records from a UK hospital group, split into general and specialised datasets. Records were dual-annotated for PII by clinicians. The primary outcomes were the F1 score, precision, and recall of each comparator in classifying words as PII vs. non-PII. The secondary outcomes were performance per PII subtype per dataset, and the Levenshtein distance as a proxy for hallucination (the addition of extra text). We report untuned performance for the task-specific models and zero-shot performance for the LLMs. To assess sensitivity to data shifts between hospital sites, we undertook concept alignment and fine-tuning of one task-specific model (AnonCAT), and performed few-shot (one-, five-, and ten-shot) in-context learning for each LLM using site-specific data.
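To make the outcome definitions concrete, the following is a minimal, illustrative Python sketch (not the authors' code) of the word-level precision/recall/F1 calculation and a plain Levenshtein edit distance of the kind that can flag added or hallucinated text. The data structures, tag strings, and example values are assumptions for demonstration only.

```python
def word_level_scores(gold: set[int], pred: set[int]) -> dict[str, float]:
    """Precision, recall, and F1 for binary PII vs. non-PII word labels,
    where gold/pred are sets of word indices flagged as PII."""
    tp = len(gold & pred)   # PII words correctly flagged
    fp = len(pred - gold)   # non-PII words wrongly redacted
    fn = len(gold - pred)   # PII words the tool missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}

def levenshtein(a: str, b: str) -> int:
    """Edit distance between the expected redacted text and the model
    output; a large distance suggests added (hallucinated) text."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

# Hypothetical example: 3 PII words in gold; the tool finds 2 plus 1 false positive.
print(word_level_scores(gold={4, 9, 10}, pred={4, 9, 17}))
print(levenshtein("Seen in clinic on [DATE].",
                  "Seen in clinic on [DATE] by Dr Smith."))
```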

Results

Of 479760 words, 17496 (3.65%) were PII. The inter-annotator F1 for word-level PII was 0.977 (95% CI 0.957-0.991). The best-performing redaction tool was the Microsoft Azure de-identification service: F1 0.939 (0.934-0.944), precision 0.928 (0.922-0.934), recall 0.950 (0.943-0.958). The next-best tools were fine-tuned AnonCAT: F1 0.910 (0.905-0.914), precision 0.978 (0.973-0.982), recall 0.850 (0.843-0.858); and ten-shot GPT-4-0125: F1 0.898 (0.876-0.915), precision 0.874 (0.834-0.906), recall 0.924 (0.914-0.933). Phi-3-mini-128k-instruct and Llama-3-8B-Instruct produced hallucinatory output at zero, one, and five shots, and Gemma-7b-IT did so universally. AnonCAT improved significantly with fine-tuning (F1 from 0.851 (0.843-0.859) to 0.910 (0.905-0.914)). Names and dates were consistently redacted by all comparators; performance on other PII categories was variable. Fine-tuned AnonCAT showed the smallest performance shift across datasets.
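For context on the zero- to ten-shot settings reported above, the sketch below shows one generic way a k-shot in-context redaction prompt might be assembled. The instruction wording, example pairs, and bracketed tag format are invented placeholders; the abstract does not specify the study's actual prompts.

```python
# Hypothetical worked examples: (original text, redacted text) pairs.
EXAMPLES = [
    ("Mr John Smith attended on 01/02/2023.",
     "Mr [NAME] attended on [DATE]."),
    ("Contact via 0117 000 0000.",
     "Contact via [PHONE]."),
]

def build_prompt(record: str, k: int) -> str:
    """Prepend k worked redaction examples before the record to redact."""
    parts = ["Redact all personally identifiable information, replacing it "
             "with bracketed category tags. Return the text otherwise "
             "unchanged."]
    for original, redacted in EXAMPLES[:k]:
        parts.append(f"Input: {original}\nOutput: {redacted}")
    parts.append(f"Input: {record}\nOutput:")
    return "\n\n".join(parts)

# Two-shot prompt for a hypothetical record.
print(build_prompt("Seen by Dr Jones at Southmead Hospital.", k=2))
```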

Conclusion

Automated EHR de-identification using transformer models could facilitate large-scale, domain-agnostic record sharing for medical research, alongside other safeguards against re-identification. Low-adaptation strategies may improve the performance of both generalist LLMs and task-specific models.
