Uncertainty-Aware LLM Deidentification and Anonymization of Clinical Notes

Abstract

The increasing demand for privacy-preserving access to clinical data has catalyzed the development of synthetic Protected Health Information (PHI) corpora for evaluating Named Entity Recognition (NER) systems. In this study, we introduce a large-scale, high-fidelity synthetic clinical note dataset generated via prompt-based interactions with a lightweight large language model (ChatGPT4-mini). The dataset captures structural and semantic variability across nine distinct clinical note types and includes realistic PHI entities such as patient identifiers, institutional affiliations, and temporal markers. We systematically benchmarked a diverse set of transformer-based NER models, including domain-specific encoders (Bio_ClinicalBERT, PubMedBERT), general-purpose architectures (BERT, RoBERTa, DeBERTa), and decoder-only models adapted via parameter-efficient fine-tuning (Phi-3-mini, DeBERTa-LoRA). Custom architectural additions were made to each of these models to render them suitable for NER. Model training employed data augmentation, label alignment via token-character mapping, and mixed-precision optimization. Our results demonstrate that domain-pretrained models (Bio_ClinicalBERT, Phi-3-mini) outperform their general-purpose counterparts, achieving F1-scores above 0.988, while compact models such as DeBERTa-LoRA maintain strong performance with reduced computational overhead. Extensive ablation studies reveal the critical role of self-attention mechanisms in contextual encoding and validate the utility of mixed precision for resource efficiency. Although trained and evaluated on synthetic data, the models exhibited high accuracy and generalizability, affirming the utility of synthetic corpora for model prototyping and evaluation.
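The abstract mentions label alignment via token-character mapping, the standard way to project character-level PHI annotations onto subword tokens for NER training. The paper's exact procedure is not given here; the following is a minimal sketch of the usual offset-mapping approach, with illustrative function names and example data that are not taken from the paper:

```python
def align_labels(offsets, phi_spans):
    """Map character-level PHI spans to token-level BIO labels.

    offsets   -- list of (start, end) character offsets, one per token
                 (e.g. the offset_mapping a subword tokenizer returns)
    phi_spans -- list of (start, end, entity_type) character spans of PHI
    """
    labels = ["O"] * len(offsets)
    for span_start, span_end, entity_type in phi_spans:
        inside = False  # first overlapping token gets B-, the rest I-
        for i, (tok_start, tok_end) in enumerate(offsets):
            if tok_start < span_end and tok_end > span_start:  # overlap test
                labels[i] = ("I-" if inside else "B-") + entity_type
                inside = True
    return labels


# Hypothetical example: "Seen by Dr. Smith on 01/02"
offsets = [(0, 4), (5, 7), (8, 11), (12, 17), (18, 20), (21, 26)]
phi_spans = [(8, 17, "DOCTOR"), (21, 26, "DATE")]
print(align_labels(offsets, phi_spans))
# -> ['O', 'O', 'B-DOCTOR', 'I-DOCTOR', 'O', 'B-DATE']
```

The overlap test (`tok_start < span_end and tok_end > span_start`) also handles subword tokens that only partially cover an entity, which is why offset mapping is preferred over whitespace-based alignment.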
Future directions include domain adaptation to real-world datasets such as MIMIC-III and the deployment of lightweight, privacy-aware NER models in clinical NLP workflows. Uncertainty-aware LLM deidentification enables safer, more reliable anonymization of clinical notes by identifying and flagging ambiguous entities, enhancing patient privacy while preserving data utility for research and care innovation.
