Jabuticaba: The largest commercial corpus for LLMs in Portuguese

Marcellus Amadeus
William Alberto Cruz Castaneda
José Roberto Homeli da Silva
Rodrigo Scotti

Abstract

Large Language Models (LLMs) are a step towards intelligent communication systems: they draw on large repositories of written human knowledge to better predict and understand the world. Artificial Intelligence sovereignty, however, depends on quality data, because datasets are the foundational infrastructure that sustains LLM development. This paper presents the Jabuticaba dataset, the most extensive Portuguese-language corpus for LLMs, with a total size of 669 GB and over 139 billion tokens of clean, deduplicated words ready for use, including commercial use. Jabuticaba is comparable in size to, and in some cases larger than, state-of-the-art (SOTA) datasets in other languages. The paper details the methodological pipeline used to build it, both as a comprehensive reference for the research community in academia and industry and as a contribution to future studies. Resources are freely available on HuggingFace: https://huggingface.co/datasets/soberania/jabuticaba.

Version published to 10.1590/scielopreprints.12696 on SciELO Preprints
Aug 5, 2025

Related articles

LGPD Benchmark: A Legal Text Corpus for Evaluating Personal Data Pseudonymization in Brazilian Portuguese

This article has 2 authors:
1. Marcelo Anselmo de Souza Filho
2. Bruno César Ribas
This article has no evaluations. Latest version: Dec 12, 2025

Variability in Low-Resource Machine Translation Evaluation: Authentic vs. LLM-Generated Training Corpora

This article has 3 authors:
1. Sofía García González
2. German Rigau Claramunt
3. Jose Ramom Pichel Campos
This article has no evaluations. Latest version: Jan 21, 2026

Best Practices for Using Large Language Models at Scale

This article has 5 authors:
1. Bhargavee Kannikanti
2. Arjun Coimbatore Nagarasan
3. Alberto Rosas
4. Sriram Kothandaraman
5. Sravan Kumar Kannuri
This article has no evaluations. Latest version: Dec 12, 2025
