Uncertainty-Aware LLM Deidentification and Anonymization of Clinical Notes

Abstract

The increasing demand for privacy-preserving access to clinical data has catalyzed the development of synthetic Protected Health Information (PHI) corpora for evaluating Named Entity Recognition (NER) systems. In this study, we introduce a large-scale, high-fidelity synthetic clinical note dataset generated via prompt-based interactions with a lightweight large language model (ChatGPT4-mini). The dataset captures structural and semantic variability across nine distinct clinical note types and includes realistic PHI entities such as patient identifiers, institutional affiliations, and temporal markers. We systematically benchmarked a diverse set of transformer-based NER models, including domain-specific encoders (Bio_ClinicalBERT, PubMedBERT), general-purpose architectures (BERT, RoBERTa, DeBERTa), and decoder-only models adapted via parameter-efficient fine-tuning (Phi-3-mini, DeBERTa-LoRA). Custom architectural additions were made to each of these models to render them suitable for NER. Model training employed data augmentation, label alignment via token-character mapping, and mixed-precision optimization. Our results demonstrate that domain-pretrained models (Bio_ClinicalBERT, Phi-3-mini) outperform their general-purpose counterparts, achieving F1-scores above 0.988, while compact models such as DeBERTa-LoRA maintain strong performance with reduced computational overhead. Extensive ablation studies reveal the critical role of self-attention mechanisms in contextual encoding and validate the utility of mixed precision for resource efficiency. Although trained and evaluated on synthetic data, the models exhibited high accuracy and generalizability, affirming the utility of synthetic corpora for model prototyping and evaluation.
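The abstract mentions label alignment via token-character mapping, the standard way to project character-level PHI annotations onto subword tokens for NER training. The paper's exact procedure is not given here; the following is a minimal sketch of the usual offset-mapping approach, with illustrative function names and example data that are not taken from the paper:

```python
def align_labels(offsets, phi_spans):
    """Map character-level PHI spans to token-level BIO labels.

    offsets   -- list of (start, end) character offsets, one per token
                 (e.g. the offset_mapping a subword tokenizer returns)
    phi_spans -- list of (start, end, entity_type) character spans of PHI
    """
    labels = ["O"] * len(offsets)
    for span_start, span_end, entity_type in phi_spans:
        inside = False  # first overlapping token gets B-, the rest I-
        for i, (tok_start, tok_end) in enumerate(offsets):
            if tok_start < span_end and tok_end > span_start:  # overlap test
                labels[i] = ("I-" if inside else "B-") + entity_type
                inside = True
    return labels


# Hypothetical example: "Seen by Dr. Smith on 01/02"
offsets = [(0, 4), (5, 7), (8, 11), (12, 17), (18, 20), (21, 26)]
phi_spans = [(8, 17, "DOCTOR"), (21, 26, "DATE")]
print(align_labels(offsets, phi_spans))
# -> ['O', 'O', 'B-DOCTOR', 'I-DOCTOR', 'O', 'B-DATE']
```

The overlap test (`tok_start < span_end and tok_end > span_start`) also handles subword tokens that only partially cover an entity, which is why offset mapping is preferred over whitespace-based alignment.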
Future directions include domain adaptation to real-world datasets such as MIMIC-III and the deployment of lightweight, privacy-aware NER models in clinical NLP workflows. Uncertainty-aware LLM deidentification enables safer, more reliable anonymization of clinical notes by identifying and flagging ambiguous entities, enhancing patient privacy while preserving data utility for research and care innovation.
