LGPD Benchmark: A Legal Text Corpus for Evaluating Personal Data Pseudonymization in Brazilian Portuguese

Abstract

Compliance with data protection laws, such as Brazil's General Data Protection Law (LGPD), requires automated tools capable of identifying and processing personal information in legal texts. However, no public benchmark yet exists for systematically evaluating such solutions in the Brazilian context. This work introduces the LGPD Benchmark, the first foundational corpus for evaluating textual pseudonymization techniques in Portuguese legal language. The benchmark consists of 120 synthetic documents covering nine areas of law, annotated according to LGPD-based guidelines. We evaluate large language models (LLMs), including GPT, Gemini, Claude, and the Brazilian model Sabiá, on the recognition of personal and sensitive entities, using classical NER metrics with an emphasis on recall as a measure of privacy protection, since every entity a model misses is personal data left exposed. The results indicate that the international models achieve higher overall coverage, while the Brazilian model is competitive in formal and structured domains. The LGPD Benchmark provides a public and reproducible baseline for research on text anonymization and regulatory compliance, fostering the development of ethical and transparent solutions aligned with the LGPD.
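
To make the role of recall concrete, the following is a minimal sketch of entity-level precision, recall, and F1 under an exact-match criterion. The abstract does not specify the matching rule or the label set, so the exact-match assumption, the `Entity` type, the `ner_scores` function, and the labels `NAME`, `CPF`, and `HEALTH` are all hypothetical and used only for illustration.

```python
from typing import Iterable, NamedTuple


class Entity(NamedTuple):
    """A hypothetical annotated span: character offsets plus a label."""
    start: int
    end: int
    label: str


def ner_scores(gold: Iterable[Entity], predicted: Iterable[Entity]) -> dict:
    """Entity-level scores, assuming exact match on offsets and label."""
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    # Precision: share of predicted entities that are correct.
    precision = true_positives / len(pred_set) if pred_set else 0.0
    # Recall: share of gold entities that were found. A missed entity
    # is personal data that would remain unpseudonymized.
    recall = true_positives / len(gold_set) if gold_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Illustrative (made-up) annotations for a single document.
gold = [Entity(0, 11, "NAME"), Entity(20, 34, "CPF"), Entity(40, 50, "HEALTH")]
predicted = [Entity(0, 11, "NAME"), Entity(20, 34, "CPF")]

scores = ner_scores(gold, predicted)
print(scores)
```

In this made-up case every prediction is correct, so precision is perfect, yet one sensitive entity is missed and recall drops to two thirds. This is the kind of failure a recall-oriented evaluation is meant to surface.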
