Parallel Architectures for Large-Scale Document Processing: Integrating OCR and RAG Pipelines
Abstract
This paper shows that enterprise-scale OCR processing is achievable with consumer-grade hardware and open-source software, removing the dependence on expensive cloud services. We present three parallel architectures for massive PDF document processing: (1) a Ray-based distributed pipeline with integrated RAG capabilities that achieves a 24.3x speedup with fault tolerance; (2) a local multi-process architecture using ProcessPoolExecutor that achieves a 69.9x speedup, reducing processing time from 5 hours to 4.3 minutes for 11,368 pages; and (3) a hybrid design combining Ray orchestration with optimized local workers, projected to reach a 199x speedup (about 1.5 minutes) with three GPUs. Experiments on banking documents using an Intel Core i9 with dual RTX 4090 GPUs ($5,000-7,000 USD) demonstrate super-linear scaling efficiency of up to 1,531% in CPU+GPU configurations. Quality evaluation against Azure Document Intelligence establishes a 24.78% Character Error Rate for the open-source pipeline (PaddleOCR + fuzzy reconstruction), quantifying the fundamental speed-quality trade-off between 100 and 300 DPI processing. These results make capabilities previously exclusive to commercial cloud services broadly accessible, enabling organizations to process large document corpora at enterprise throughput without per-page API costs or vendor lock-in.
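The headline figures in the abstract rest on three simple metrics: speedup, scaling efficiency, and Character Error Rate (CER). The following is a minimal sketch of how such metrics are commonly computed, not the authors' evaluation code. The function names are hypothetical, scaling efficiency is assumed to be speedup divided by worker count (the abstract does not state the definition used), and CER is assumed to be the Levenshtein edit distance normalized by the reference length.

```python
def speedup(baseline_minutes: float, parallel_minutes: float) -> float:
    """Ratio of sequential run time to parallel run time."""
    return baseline_minutes / parallel_minutes


def scaling_efficiency(speedup_factor: float, workers: int) -> float:
    """Efficiency in percent, ASSUMED here to be speedup / workers.

    Values above 100% indicate super-linear scaling.
    """
    return 100.0 * speedup_factor / workers


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: edit distance over reference length (assumed)."""
    if not reference:
        raise ValueError("reference text must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Rounded figures from the abstract: 5 hours (300 min) down to 4.3 min.
    s = speedup(300.0, 4.3)
    print(f"speedup from rounded times: {s:.1f}x")
    print(f"throughput: {11368 / 4.3:.0f} pages/min")
    # Toy strings, not data from the paper.
    print(f"toy CER: {cer('kitten', 'sitting'):.2f}")
```

With the rounded times quoted in the abstract, the speedup comes out at roughly 70x, in line with the reported 69.9x once rounding of the input times is taken into account. The worker count behind the 1,531% efficiency figure is not given in the abstract, so it is not reproduced here.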