LegiSubjects-Br: A classification corpus for estimating subjects of Brazilian legislative bills

Rafael Oleques Nunes
Andre Suslik Spritzer
Carla Maria Dal Sasso Freitas
Dennis Giovani Balreira


Abstract

This paper proposes a new corpus for estimating the subjects of bills proposed in the lower house of Brazil's legislature, the Chamber of Deputies. Politics and legislation inherently involve a vast number of jargon-filled documents. The legislative process, in particular, also involves several steps that bills must go through before they become law. This necessarily produces a large volume of human-entered data that often lacks important information. In the Brazilian Chamber of Deputies, around 75% of the bills from 1991 to 2022 have no subject classification in their associated metadata. Given the sheer number of bills, this scenario is well suited to a solution based on machine learning and natural language processing. We present and compare two BERT models adapted for the Portuguese language, using bill summaries and keywords, which provide brief descriptions or overviews of the main points of the documents. We obtained our best results with the BERTimbau model variation, achieving a weighted F1 score of 81.59% and a macro F1 score of 75.06%. To the best of our knowledge, this is the first work to propose a corpus and a model for predicting the subjects of bills proposed at the Brazilian Chamber of Deputies. Our approach encourages researchers to explore similar techniques for other legal documents and may help political scientists conduct more robust data analysis than was previously possible, given the frequent absence of metainformation in the existing data.
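The abstract reports both a weighted and a macro F1 score. The following is a minimal sketch of how the two averages can differ, assuming a single-label setting (the abstract does not say whether bills carry one subject or several). The subject labels and predictions are invented for illustration and are not taken from the corpus.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (macro_f1, weighted_f1) for single-label predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)  # number of true examples per label
    pairs = list(zip(y_true, y_pred))
    per_class = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        per_class[label] = 2 * tp / denom if denom else 0.0
    # Macro: every class counts equally, however rare it is.
    macro_f1 = sum(per_class.values()) / len(labels)
    # Weighted: each class counts in proportion to its support.
    weighted_f1 = sum(per_class[l] * support[l] for l in labels) / len(y_true)
    return macro_f1, weighted_f1


# Hypothetical toy data: one frequent subject and one rare subject.
y_true = ["health"] * 8 + ["education"] * 2
y_pred = ["health"] * 8 + ["health", "education"]

macro, weighted = f1_scores(y_true, y_pred)
print(f"macro F1 = {macro:.4f}, weighted F1 = {weighted:.4f}")
```

In this toy case the rare class is predicted less accurately, which pulls the macro average below the weighted one. A gap in the same direction appears in the reported scores, although the abstract itself does not attribute it to any particular cause.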

Version published to 10.21203/rs.3.rs-7024267/v1 on Research Square
Sep 1, 2025

Related articles

LGPD Benchmark: A Legal Text Corpus for Evaluating Personal Data Pseudonymization in Brazilian Portuguese

This article has 2 authors:
1. Marcelo Anselmo de Souza Filho
2. Bruno César Ribas
This article has no evaluations
Latest version Dec 12, 2025

CCF Database: A Machine-Learning-Annotated Corpus of 266,271 Canadian Climate Articles (1978–2024)

This article has 3 authors:
1. Antoine Claude Lemor
2. Alizée Pillod
3. Matthew Taylor
This article has no evaluations
Latest version Jan 27, 2026

Random forests in corpus research: A systematic review

This article has 1 author:
1. Lukas Sönning
This article has no evaluations
Latest version Jan 17, 2026
