A False Sense of Privacy: Evaluating the Limits of Textual Data Sanitization for Privacy Protection

Abstract

The widespread use of textual data sanitization techniques, such as identifier removal and synthetic data generation, has raised questions about their effectiveness in preserving individual privacy. This study introduced a comprehensive evaluation framework designed to measure privacy leakage in sanitized datasets at a semantic level. The framework operated in two stages: linking auxiliary information to sanitized records using sparse retrieval, and evaluating semantic similarity between original and matched records using a language model. Experiments were conducted on two real-world datasets, MedQA and WildChat, to assess the privacy-utility trade-off across various sanitization methods. Results showed that traditional PII removal methods retained significant private information, with over 90% of original claims still inferable. Synthetic data generation demonstrated improved privacy performance, especially when enhanced with differential privacy, though often at the cost of downstream task utility. The evaluation also revealed that text coherence and the nature of auxiliary knowledge significantly influenced re-identification risks. These findings emphasized the limitations of current surface-level sanitization practices and highlighted the need for robust, context-aware privacy mechanisms that balance utility and protection in sensitive textual data releases.
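To make the two-stage design concrete, below is a minimal sketch of how such a leakage probe could be wired together. It is not the authors' implementation: the record texts and the auxiliary string are invented, BM25 (via the rank_bm25 package) stands in for the paper's sparse retriever, and cosine similarity over sentence embeddings (all-MiniLM-L6-v2) replaces the language-model judgment of semantic similarity.

```python
# Illustrative two-stage leakage probe (not the paper's exact pipeline).
# Stage 1: sparse retrieval (BM25) links auxiliary knowledge to sanitized records.
# Stage 2: semantic similarity between the original record and its matched
# sanitized record; the paper scores this with a language model, while this
# sketch substitutes sentence-embedding cosine similarity as a cheap proxy.
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

# Toy data standing in for MedQA/WildChat-style records (hypothetical).
sanitized = [
    "[NAME], a [AGE]-year-old patient, reports chest pain after exercise.",
    "[NAME] asked the assistant how to appeal a parking fine in [CITY].",
]
original = [
    "John Smith, a 54-year-old patient, reports chest pain after exercise.",
    "Alice Lee asked the assistant how to appeal a parking fine in Boston.",
]
auxiliary = "54-year-old man with exertional chest pain"  # attacker-side knowledge

# Stage 1: rank sanitized records against the auxiliary information.
bm25 = BM25Okapi([doc.lower().split() for doc in sanitized])
scores = bm25.get_scores(auxiliary.lower().split())
best = max(range(len(sanitized)), key=lambda i: scores[i])

# Stage 2: score semantic leakage between the matched pair.
model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode([original[best], sanitized[best]], convert_to_tensor=True)
leakage = util.cos_sim(emb[0], emb[1]).item()

print(f"matched record {best}, semantic similarity {leakage:.2f}")
```

A high similarity for the matched pair suggests the sanitized release still carries most of the original record's meaning; in the paper's setting, the second stage instead asks a language model how much of the original content remains inferable from the matched sanitized text.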
