Toward a GPU-enabled billionaire SVD in pyLOM

Arnau Miró
Benet Eiximeno
Lucas Gasparino
Nathan Kutz
Ivette Rodriguez
Oriol Lehmkuhl

Abstract

We develop and implement pyLOM, an accelerated, high-performance, open-source computing environment for model order reduction in fluid dynamics. It contains singular value decomposition-based algorithms implemented for massively parallel GPU architectures. The library is profiled in detail on the MareNostrum V supercomputer. The largest case, a matrix of a billion nodes by a thousand snapshots, was computed in under 20 s with 100 GPUs. A hybrid CPU-GPU parallel randomized QR factorization was found to be able to handle such large matrices. The largest speedup, a factor of 83, was found for the QR factorization, while the matrix–matrix multiplication showed a speedup factor of about 2. Additionally, two application examples are provided, the flow around a cylinder and the Windsor body, whose POD is computed in under 3 s with 100 GPUs. This showcases the efficiency of GPUs, resulting in a 97% reduction in energy to solution and a reduction of 0.11 kg of CO2 emissions. The scalability and efficiency achieved suggest that this framework can play a key role in reducing the energy demands and environmental impact of large-scale data analysis and model order reduction across a wide range of applications.
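
As an illustration of the kind of algorithm the abstract refers to, the following is a minimal single-process NumPy sketch of a randomized SVD used to compute POD modes of a snapshot matrix. It is not pyLOM's API and involves no GPU or MPI parallelism; the function name `randomized_pod`, its parameters, and the synthetic test matrix are assumptions made purely for illustration.

```python
import numpy as np

def randomized_pod(X, rank, oversample=10, seed=0):
    """Approximate POD (truncated SVD) of a snapshot matrix X (nodes x snapshots).

    Hypothetical helper for illustration only; not part of pyLOM.
    """
    rng = np.random.default_rng(seed)
    n_snapshots = X.shape[1]
    k = min(rank + oversample, n_snapshots)

    # Sketch the column space of X with a random Gaussian test matrix.
    Omega = rng.standard_normal((n_snapshots, k))
    Y = X @ Omega

    # Orthonormal basis for the sketched range via a (reduced) QR factorization.
    Q, _ = np.linalg.qr(Y)

    # Project X onto that basis and take the SVD of the small matrix.
    B = Q.T @ X
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)

    # Lift the left singular vectors back to the full space (the POD modes).
    U = Q @ Ub
    return U[:, :rank], s[:rank], Vt[:rank]

# Synthetic rank-5 "snapshot" matrix: 2000 nodes by 40 snapshots.
rng = np.random.default_rng(1)
X = rng.standard_normal((2000, 5)) @ rng.standard_normal((5, 40))

U, s, Vt = randomized_pod(X, rank=5)
rel_err = np.linalg.norm(X - (U * s) @ Vt) / np.linalg.norm(X)
print(f"relative reconstruction error: {rel_err:.2e}")
```

In this sketch the only operations touching the full-size matrix are two matrix–matrix products and one tall-skinny QR factorization, which are the kernels the abstract reports speedups for; how pyLOM distributes them across CPUs and GPUs is not shown here.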

Version published to 10.1007/s00707-025-04621-1
Jan 17, 2026
Version published to 10.21203/rs.3.rs-7678279/v1 on Research Square
Oct 10, 2025

Related articles

GPU-NTT and Karatsuba Co-Optimization for High-Throughput Polynomial Multiplication Acceleration

This article has 4 authors:
1. Ruwei Huang
2. Xiaolong Tang
3. Junjie Wang
4. Xuezheng Qin
This article has no evaluations
Latest version Jan 19, 2026

GPU-accelerated modeling of biological regulatory networks

This article has 7 authors:
1. Joyce Reimer
2. Pranta Saha
3. Chris Chen
4. Neeraj Dhar
5. Brook Byrns
6. Steven Rayan
7. Gordon Broderick
This article has no evaluations
Latest version Jan 5, 2026

Parallel Architectures for Large-Scale Document Processing: Integrating OCR and RAG Pipelines

This article has 4 authors:
1. Alejandro Jaime
2. Veronica Gil-Costa
3. Marcelo Errecalde
4. Leticia Cagnina
This article has no evaluations
Latest version Jan 19, 2026
