Parallel Architectures for Large-Scale Document Processing: Integrating OCR and RAG Pipelines

Alejandro Jaime
Veronica Gil-Costa
Marcelo Errecalde
Leticia Cagnina


Abstract

This paper shows that enterprise-scale OCR processing is achievable using consumer-grade hardware and open-source software, eliminating dependence on expensive cloud services. We present three parallel architectures for massive PDF document processing: (1) a Ray-based distributed pipeline with integrated RAG capabilities achieving 24.3x speedup with fault tolerance, (2) a local multi-process architecture using ProcessPoolExecutor that achieves 69.9x speedup, reducing processing time from 5 hours to 4.3 minutes for 11,368 pages, and (3) a hybrid design combining Ray orchestration with optimized local workers, projecting 199x speedup (1.5 minutes) with three GPUs. Experiments on banking documents using an Intel Core i9 with dual RTX 4090 GPUs ($5,000-7,000 USD) demonstrate super-linear scaling efficiency up to 1,531% in CPU+GPU configurations. Quality evaluation against Azure Document Intelligence establishes a 24.78% Character Error Rate for the open-source pipeline (PaddleOCR + fuzzy reconstruction), quantifying the fundamental speed-quality trade-off between 100 and 300 DPI processing. These results democratize capabilities previously exclusive to commercial cloud services, enabling organizations to process large document corpora at enterprise throughput without per-page API costs or vendor lock-in.
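As a rough illustration of the quality metric named in the abstract, the sketch below computes a Character Error Rate as the edit distance between a reference transcription and an OCR output, divided by the reference length. This is the common textbook definition and is an assumption here: the abstract does not state the exact formula or text normalization the authors used, and the function names `levenshtein` and `character_error_rate` are hypothetical.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn one string into the other."""
    # The distance is symmetric, so swap to keep the inner row short.
    if len(ref) < len(hyp):
        ref, hyp = hyp, ref
    previous = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        current = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Assumed definition: edit distance / number of reference characters."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)
```

Under this definition a CER of 24.78% would mean roughly one character edit for every four reference characters, with the commercial service's output treated as the reference text.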

Version published to 10.21203/rs.3.rs-8602947/v1 on Research Square
Jan 19, 2026

Related articles

Beyond All-Reduce: Event-Driven Model Parallelism Without Collective Communication Primitives (EBD2N)
Ernesto Leite, Fabrice Mourlin, Youakim Badr, Pierre Paradinas
Latest version Mar 5, 2026

Study and evaluation of many-core CPU offloading for computing processes in environment adaptive software
Yoji Yamato
Latest version Apr 12, 2026

I/O for LLM Inference: A Survey of Storage and Memory Bottlenecks
Rajarshi Chowdhury
Latest version Mar 19, 2026
