Parallel Architectures for Large-Scale Document Processing: Integrating OCR and RAG Pipelines

Abstract

This paper shows that enterprise-scale OCR processing is achievable on consumer-grade hardware with open-source software, without depending on expensive cloud services. We present three parallel architectures for massive PDF document processing: (1) a Ray-based distributed pipeline with integrated RAG capabilities that achieves a 24.3x speedup with fault tolerance; (2) a local multi-process architecture using ProcessPoolExecutor that achieves a 69.9x speedup, reducing processing time from 5 hours to 4.3 minutes for 11,368 pages; and (3) a hybrid design combining Ray orchestration with optimized local workers, projected to reach a 199x speedup (1.5 minutes) with three GPUs. Experiments on banking documents using an Intel Core i9 with dual RTX 4090 GPUs ($5,000-7,000 USD) demonstrate super-linear scaling efficiency of up to 1,531% in CPU+GPU configurations. Quality evaluation against Azure Document Intelligence establishes a 24.78% Character Error Rate for the open-source pipeline (PaddleOCR + fuzzy reconstruction), quantifying the fundamental speed-quality trade-off between 100 and 300 DPI processing. These results democratize capabilities previously exclusive to commercial cloud services, enabling organizations to process large document corpora at enterprise throughput without per-page API costs or vendor lock-in.
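The metrics quoted in the abstract (speedup, scaling efficiency, and Character Error Rate) can be illustrated with a minimal sketch. The definitions below are the conventional ones and are an assumption here: the abstract does not state the exact formulas the authors used, and all function names are hypothetical.

```python
def speedup(baseline_seconds: float, parallel_seconds: float) -> float:
    """Ratio of sequential to parallel wall-clock time."""
    return baseline_seconds / parallel_seconds


def scaling_efficiency(speedup_factor: float, workers: int) -> float:
    """Speedup per worker as a percentage; above 100% is super-linear.

    Assumed definition: the abstract does not say how its efficiency
    figure was normalized.
    """
    return 100.0 * speedup_factor / workers


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) via row-wise DP."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / reference length (assumed standard definition)."""
    if not reference:
        raise ValueError("reference text must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # 5 hours versus 4.3 minutes, using the rounded figures from the abstract.
    print(round(speedup(5 * 3600, 4.3 * 60), 1))
    # Toy OCR comparison: one substituted character out of four.
    print(character_error_rate("abcd", "abed"))
```

With the rounded timings from the abstract, the ratio comes out near 69.8x, slightly under the reported 69.9x, which would be consistent with a baseline a little over 5 hours. In this sketch the reference text for CER would be the Azure Document Intelligence output and the hypothesis the PaddleOCR output.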
