Efficient Deployment of a 685B-Parameter Open-Source LLM on the Brazilian Santos Dumont Supercomputer

Abstract

This brief communication presents the deployment of open-source large language models (LLMs) on the Santos Dumont Brazilian supercomputer to support the Laboratório Nacional de Computação Científica (LNCC) academic community. The solution, named Carcará, leverages key optimizations such as dynamic quantization to enable an accessible, scalable, and cost-effective implementation. As a result, a single, independent instance of a state-of-the-art model was deployed on a single node equipped with 4 × NVIDIA H100 GPUs. Each quantized model fits entirely within a node's VRAM, enabling horizontal scaling without the need for inter-node synchronization. In this way, we could provide access to these models for the entire LNCC academic community using only four computational nodes, demonstrating the efficiency and scalability of the approach. Crucially, our approach ensures data sovereignty: LNCC researchers and postgraduate students can use AI for sensitive research topics with full control over their data, free from the privacy risks associated with proprietary solutions. This initiative strengthens national scientific autonomy while providing secure and efficient AI tools for academic and research advancement. The code to deploy this solution is openly available, and we encourage other institutions to adapt it to support their own communities. Discussions with the Brazilian Ministry of Science, Technology and Innovation are underway to expand this strategic solution to other research centers and universities in Brazil.
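To illustrate the deployment pattern described above (a quantized model sharded across the four GPUs of one node, with each node serving an independent replica), the following is a minimal sketch using vLLM's Python API. It is not the Carcará deployment code, which is published separately; the model path, quantization choice, and sampling settings are illustrative placeholders.

    # Sketch: serve a quantized LLM on one node, sharded across its 4 GPUs.
    # Horizontal scaling is achieved by running one such independent
    # instance per node, with no inter-node synchronization required.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="/models/quantized-llm",   # hypothetical path to a quantized checkpoint
        tensor_parallel_size=4,          # shard weights across the node's 4 H100 GPUs
        gpu_memory_utilization=0.95,     # keep weights and KV cache within node VRAM
    )

    params = SamplingParams(temperature=0.7, max_tokens=256)
    outputs = llm.generate(["Explain data sovereignty in one paragraph."], params)
    print(outputs[0].outputs[0].text)

Because each replica is self-contained within a node's VRAM, adding capacity is a matter of launching the same configuration on additional nodes behind a load balancer.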
