Benchmarking transformer-based models for medical record deidentification: A single centre, multi-specialty evaluation
Abstract
Background
Robust de-identification is necessary to preserve patient confidentiality and maintain public acceptance of electronic health record (EHR) research. Manual redaction of personally identifiable information (PII) outside of structured data is time-consuming and expensive, limiting the scale at which data can be shared. Automated de-identification (DeID) could alleviate this burden, with competing approaches including task-specific models and generalist large language models (LLMs). We aimed to identify the optimal strategy for PII redaction, evaluating several task-specific transformer-architecture models and generalist LLMs using no- and low-adaptation techniques.
Methods
We evaluated the performance of four task-specific models (the Microsoft Azure DeID service, AnonCAT, OBI RoBERTa, and BERT i2b2 DeID) and five general-purpose LLMs (Gemma-7b-IT, Llama-3-8B-Instruct, Phi-3-mini-128k-instruct, GPT-3.5-turbo-0125, and GPT-4-0125) at de-identifying 3650 medical records from a UK hospital group, split into general and specialised datasets. Records were dual-annotated by clinicians for PII. The primary outcomes were the F1 score, precision, and recall of each comparator in classifying words as PII vs. non-PII. The secondary outcomes were performance per PII subtype per dataset, and the Levenshtein distance as a proxy for hallucination (the addition of extra text). We report untuned performance for the task-specific models and zero-shot performance for the LLMs. To assess sensitivity to data shifts between hospital sites, we undertook concept alignment and fine-tuning of one task-specific model (AnonCAT), and performed few-shot (1, 5, and 10 examples) in-context learning for each LLM using site-specific data.
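To make the word-level evaluation concrete, the following is a minimal Python sketch. The binary label encoding, the helper names (`pii_word_metrics`, `levenshtein`), and the tokenisation into pre-aligned word labels are illustrative assumptions, not the study's actual pipeline:

```python
from typing import List, Tuple

def pii_word_metrics(gold: List[int], pred: List[int]) -> Tuple[float, float, float]:
    """Word-level precision, recall, and F1 for PII (1) vs. non-PII (0) labels,
    assuming gold and predicted labels are already aligned word-for-word."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def levenshtein(a: str, b: str) -> int:
    """Edit distance between the source record and the model's redacted output.
    Distance beyond what the expected redaction substitutions account for
    suggests added (hallucinated) text."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution/match
        prev = curr
    return prev[-1]

# Example: "John saw Dr Smith" with "Smith" missed by the redactor.
# gold = [1, 0, 0, 1]; pred = [1, 0, 0, 0] -> precision 1.0, recall 0.5, F1 0.667
```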
Results
17,496/479,760 (3.65%) words were PII. Inter-annotator F1 for word-level PII was 0.977 (95% CI 0.957-0.991). The best-performing redaction tool was the Microsoft Azure de-identification service: F1 0.939 (0.934-0.944), precision 0.928 (0.922-0.934), recall 0.950 (0.943-0.958). The next-best tools were fine-tuned AnonCAT: F1 0.910 (0.905-0.914), precision 0.978 (0.973-0.982), recall 0.850 (0.843-0.858), and GPT-4-0125 (ten-shot): F1 0.898 (0.876-0.915), precision 0.874 (0.834-0.906), recall 0.924 (0.914-0.933). Phi-3-mini-128k-instruct and Llama-3-8B-Instruct produced hallucinatory output in the zero-, one-, and five-shot settings, and Gemma-7b-IT did so universally. Fine-tuning significantly improved AnonCAT's performance (F1 increased from 0.851 (0.843-0.859) to 0.910 (0.905-0.914)). All comparators consistently redacted names and dates; performance on other PII categories varied. Fine-tuned AnonCAT demonstrated the least performance shift across datasets.
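The abstract does not state how the 95% confidence intervals were derived; a common choice for metrics like these, sketched here purely as an assumption, is a non-parametric percentile bootstrap that resamples whole records and recomputes F1 with the `pii_word_metrics` helper above:

```python
import random
from typing import List, Tuple

def bootstrap_f1_ci(records: List[Tuple[List[int], List[int]]],
                    n_boot: int = 1000, alpha: float = 0.05,
                    seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap CI for word-level F1.

    `records` is a list of (gold_labels, pred_labels) pairs, one per record.
    This is an illustrative assumption, not the study's documented procedure.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_boot):
        # Resample records with replacement, then pool their word labels.
        sample = [records[rng.randrange(len(records))] for _ in records]
        gold = [g for gs, _ in sample for g in gs]
        pred = [p for _, ps in sample for p in ps]
        scores.append(pii_word_metrics(gold, pred)[2])
    scores.sort()
    lo = scores[int(alpha / 2 * n_boot)]
    hi = scores[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

Resampling at the record level (rather than the word level) respects the fact that words within one record are not independent observations.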
Conclusion
Automated EHR de-identification using transformer models could facilitate large-scale, domain-agnostic record sharing for medical research, alongside other safeguards to prevent re-identification. Low-adaptation strategies may improve the performance of both generalist LLMs and task-specific models.