LegiSubjects-Br: A classification corpus for estimating subjects of Brazilian legislative bills
Abstract
This paper proposes a new corpus for estimating the subjects of bills proposed in the lower house of Brazil's legislature, the Chamber of Deputies. Politics and legislation inherently involve a vast number of jargon-filled documents. The legislative process, in particular, requires bills to pass through several steps before they become law. This produces a large volume of human-entered data that often lacks important information. In the Brazilian Chamber of Deputies, around 75% of the bills proposed from 1991 to 2022 have no subject classification in their associated metadata. Given the sheer number of bills, this gap is well suited to a solution based on machine learning and natural language processing. We introduce and compare two BERT models adapted to the Portuguese language, using as input bill summaries (brief descriptions or overviews of the main points of the documents) and keywords. We obtained our best results with the BERTimbau model variation, achieving a weighted F1 score of 81.59% and a macro F1 score of 75.06%. To the best of our knowledge, this is the first work to propose a corpus and a model for predicting the subjects of bills proposed at the Brazilian Chamber of Deputies. Our approach encourages researchers to explore similar techniques for other legal documents, and it may help political scientists conduct more robust data analyses than were previously possible, given how often this metadata is missing.
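The two reported metrics weigh classes differently, which matters when subject labels are unevenly distributed. The following is a minimal sketch, not the evaluation code used in this work: it computes both scores from scratch on a toy example. The subject names and predictions are hypothetical and chosen only to show how a missed rare class lowers the macro score more than the weighted one.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1} for every label seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the label never occurs.
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1: every class counts equally."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(y_true, y_pred):
    """Mean of per-class F1 weighted by each class's support in y_true."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[label] * support[label] for label in scores) / len(y_true)


# Hypothetical toy data: one frequent subject, one mid-sized, one rare.
y_true = ["health"] * 6 + ["education"] * 3 + ["energy"] * 1
y_pred = ["health"] * 6 + ["education", "education", "health"] + ["education"]

print(f"macro F1:    {macro_f1(y_true, y_pred):.4f}")
print(f"weighted F1: {weighted_f1(y_true, y_pred):.4f}")
```

In this toy example the rare subject is never predicted correctly, so its F1 of zero counts for a full third of the macro score but only a tenth of the weighted score. A gap in the same direction between the two reported scores is consistent with weaker performance on less frequent subjects, although the abstract alone does not confirm that.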