Comparative evaluation of large-language models and purpose-built software for medical record de-identification

Abstract

Background: Robust de-identification is necessary to preserve patient confidentiality and maintain public acceptability of electronic health record (EHR) research. Manual redaction of personally identifiable information (PII) is time-consuming and expensive, limiting the scale of data-sharing. Automated de-identification could alleviate this burden, but the best strategy is unclear. Advances in natural language processing (NLP) and the emergence of foundational large language models (LLMs) show promise for performing clinical NLP tasks with no or limited training.

Methods: We evaluated two task-specific de-identification tools (the Microsoft Azure de-identification service and AnonCAT) and five general-purpose LLMs (Gemma-7b-IT, Llama-3-8B-Instruct, Phi-3-mini-128k-instruct, GPT-3.5-turbo-base, and GPT-4-0125) in de-identifying 3650 medical records from a UK hospital group, split into general and specialised datasets. Records were dual-annotated for PII by clinicians, and inter-annotator reliability was used to benchmark performance. The primary outcomes were F1 score, precision (positive predictive value), and recall (sensitivity) for each comparator in classifying words as PII vs. non-PII. Secondary outcomes were per-PII-subtype and per-dataset performance, and the presence of LLM hallucinations. We report outcomes for the LLMs under zero- and few-shot prompting, and for AnonCAT with and without fine-tuning.

Results: 17496/479760 (3.65%) words were PII. Inter-annotator F1 for word-level PII/non-PII classification was 0.977 (95% CI 0.957-0.991), precision 0.967 (0.923-0.993), and recall 0.986 (0.971-0.997). The best-performing redaction tool was the Microsoft Azure de-identification service: F1 0.933 (0.928-0.938), precision 0.916 (0.910-0.922), recall 0.950 (0.942-0.957). The next-best performers were fine-tuned AnonCAT (F1 0.873 (0.864-0.882), precision 0.981 (0.977-0.985), recall 0.787 (0.773-0.800)) and GPT-4-0125 with ten-shot prompting (F1 0.898 (0.876-0.915), precision 0.924 (0.914-0.933), recall 0.874 (0.834-0.905)). Phi-3-mini-128k-instruct and Llama-3-8B-Instruct produced hallucinatory output at zero, one, and five shots, and Gemma-7b-IT did so universally. Names and dates were consistently redacted by all comparators; performance on other PII categories was variable. Fine-tuned AnonCAT showed the least performance shift across datasets.

Conclusion: Automated EHR de-identification could facilitate large-scale, domain-agnostic record-sharing for medical research, alongside other safeguards to prevent patient reidentification.
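The word-level metrics reported above follow the standard definitions of precision, recall, and F1. As a minimal illustrative sketch only (assuming binary PII/non-PII labels per word, which matches the stated primary outcome but not necessarily the study's exact tokenisation or scoring code), they can be computed as follows in Python:

def word_level_scores(gold, pred):
    """Precision, recall, and F1 for binary PII (True) vs. non-PII (False) word labels."""
    tp = sum(g and p for g, p in zip(gold, pred))        # PII words correctly redacted
    fp = sum((not g) and p for g, p in zip(gold, pred))  # non-PII words wrongly redacted
    fn = sum(g and (not p) for g, p in zip(gold, pred))  # PII words missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0     # positive predictive value
    recall = tp / (tp + fn) if (tp + fn) else 0.0        # sensitivity
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)              # harmonic mean of the two
    return precision, recall, f1

# Hypothetical example: 10 words, 3 of them PII; a tool catches 2 and over-redacts 1.
gold = [True, True, True] + [False] * 7
pred = [True, True, False, True] + [False] * 6
print(word_level_scores(gold, pred))  # (0.667, 0.667, 0.667)

Note the trade-off visible in the results: a high-precision, lower-recall tool (such as fine-tuned AnonCAT here) rarely over-redacts but misses more PII, whereas a higher-recall tool redacts more PII at the cost of some non-PII words.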
