Comparative evaluation of large language models and purpose-built software for medical record de-identification
Abstract
Background: Robust de-identification is necessary to preserve patient confidentiality and maintain public acceptability of electronic health record (EHR) research. Manual redaction of personally identifiable information (PII) is time-consuming and expensive, limiting the scale of data sharing. Automated de-identification could alleviate this burden, but the best strategy is not yet clear. Advances in natural language processing (NLP) and the emergence of foundational large language models (LLMs) show promise for performing clinical NLP tasks with little or no task-specific training.

Methods: We evaluated two task-specific tools (the Microsoft Azure de-identification service and AnonCAT) and five general-purpose LLMs (Gemma-7b-IT, Llama-3-8B-Instruct, Phi-3-mini-128k-instruct, GPT3.5-turbo-base, and GPT-4-0125) on de-identifying 3650 medical records from a UK hospital group, split into general and specialised datasets. Records were dual-annotated for PII by clinicians, and inter-annotator reliability was used to benchmark performance. The primary outcome was F1 score, precision (positive predictive value), and recall (sensitivity) for each comparator in classifying words as PII vs. non-PII. Secondary outcomes were per-PII-subtype performance, per-dataset performance, and the presence of LLM hallucinations. We report outcomes with zero- and few-shot learning for the LLMs, and with and without fine-tuning for AnonCAT.

Results: 17496/479760 (3.65%) words were PII. Inter-annotator F1 for word-level PII/non-PII classification was 0.977 (95% CI 0.957-0.991), precision 0.967 (0.923-0.993), and recall 0.986 (0.971-0.997). The best-performing redaction tool was the Microsoft Azure de-identification service: F1 0.933 (0.928-0.938), precision 0.916 (0.910-0.922), recall 0.950 (0.942-0.957). The next best were fine-tuned AnonCAT, with F1 0.873 (0.864-0.882), precision 0.981 (0.977-0.985), recall 0.787 (0.773-0.800), and GPT-4-0125 (ten-shot), with F1 0.898 (0.876-0.915), precision 0.924 (0.914-0.933), recall 0.874 (0.834-0.905). Hallucinatory output occurred with Phi-3-mini-128k-instruct and Llama-3-8B-Instruct at zero, one, and five shots, and universally with Gemma-7b-IT. Names and dates were consistently redacted by all comparators; performance on the other PII categories was variable. Fine-tuned AnonCAT showed the least performance shift across datasets.

Conclusion: Automated EHR de-identification could facilitate large-scale, domain-agnostic record sharing for medical research, alongside other safeguards to prevent patient re-identification.
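The primary outcome described in the Methods reduces to binary classification metrics computed over aligned word sequences. The following is a minimal sketch of that computation, assuming the gold (annotator) and predicted (tool) labels are aligned one word to one label; the function and variable names are illustrative and not taken from the study's code.

```python
# Minimal sketch of the word-level PII metrics described above.
# Assumes each record is tokenised into words and that gold (annotator)
# and predicted (tool) binary labels are aligned one-to-one.

def pii_metrics(gold, pred):
    """Precision, recall, and F1 for binary PII/non-PII word labels."""
    tp = sum(1 for g, p in zip(gold, pred) if g and p)      # PII correctly flagged
    fp = sum(1 for g, p in zip(gold, pred) if not g and p)  # non-PII wrongly flagged
    fn = sum(1 for g, p in zip(gold, pred) if g and not p)  # PII missed
    precision = tp / (tp + fp) if tp + fp else 0.0  # positive predictive value
    recall = tp / (tp + fn) if tp + fn else 0.0     # sensitivity
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Example: 1 = word is PII, 0 = word is not PII
gold = [1, 0, 0, 1, 1, 0]
pred = [1, 0, 1, 1, 0, 0]
print(pii_metrics(gold, pred))  # (0.666..., 0.666..., 0.666...)
```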
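For the general-purpose LLMs, the zero- and few-shot settings differ only in how many worked redaction examples are prepended to the instruction. The abstract does not give the study's actual prompts, so the instruction wording, placeholder categories, and examples below are assumptions, purely to illustrate the setup.

```python
# Illustrative construction of an n-shot de-identification prompt.
# All wording and examples here are hypothetical, not the study's prompts.

FEW_SHOT_EXAMPLES = [
    ("Seen by Dr Jane Smith on 12/03/2021.",
     "Seen by Dr [NAME] on [DATE]."),
    ("Discharged to 42 Elm Road, Leeds.",
     "Discharged to [ADDRESS]."),
]

def build_prompt(record_text, n_shots=2):
    """Assemble an n-shot redaction prompt for a general-purpose LLM."""
    parts = ["Replace all personally identifiable information in the "
             "clinical note with category placeholders such as [NAME]."]
    for original, redacted in FEW_SHOT_EXAMPLES[:n_shots]:
        parts.append(f"Note: {original}\nRedacted: {redacted}")
    parts.append(f"Note: {record_text}\nRedacted:")
    return "\n\n".join(parts)

print(build_prompt("Mr John Doe attended clinic on 5 May 2023."))
```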