The Joinless Partition Pattern: A Novel Approach to Efficient Multi-Dimensional Aggregations in Apache Spark

Jaime Ruben Alejandro


Abstract

Large-scale data aggregation pipelines in Apache Spark typically require multiple join operations to enrich fact tables with dimensional attributes before computing aggregates. Each join induces a costly shuffle operation that redistributes the entire fact table across the cluster. This paper introduces the Joinless Partition Pattern, a novel optimization technique that eliminates intermediate join shuffles by leveraging broadcast variables and partition-local lookups within mapPartitions. We demonstrate that for workloads involving multiple small-to-medium dimension tables (10 MB–1 GB each), our approach achieves a 3–12× speedup and a 20–400× reduction in shuffle volume compared to traditional join-based approaches. We provide formal correctness proofs, complexity analysis, and experimental validation on both synthetic benchmarks and three production workloads running on Azure Databricks. Our results show consistent performance improvements across varying data scales and cluster configurations, with particularly strong benefits for aggregation-heavy analytical queries. The pattern has been deployed in production for more than six months, with demonstrated annual savings of $15,000–$20,000 in compute costs.
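
As a rough illustration of the idea in the abstract, the following sketch mimics the partition-local lookup in plain Python, without a Spark cluster. All table names, keys, and values are hypothetical and are not taken from the paper. The dictionaries stand in for broadcast variables, and the per-partition function has the iterator-in, iterator-out shape that Spark's mapPartitions expects. Dropping rows with unmatched keys (inner-join behavior) is an assumption of this sketch, not a detail confirmed by the source.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical dimension tables, small enough to fit in memory on every executor.
# In Spark they would be shipped once via sc.broadcast(...) and read via .value.
PRODUCT_DIM: Dict[int, str] = {1: "toys", 2: "books"}
STORE_DIM: Dict[int, str] = {10: "north", 20: "south"}

Fact = Tuple[int, int, float]  # (product_id, store_id, amount)
Key = Tuple[str, str]          # (category, region)


def aggregate_partition(
    rows: Iterable[Fact],
    product_dim: Dict[int, str],
    store_dim: Dict[int, str],
) -> Iterator[Tuple[Key, float]]:
    """Enrich and pre-aggregate one partition using local dictionary lookups."""
    partial: Dict[Key, float] = defaultdict(float)
    for product_id, store_id, amount in rows:
        category = product_dim.get(product_id)
        region = store_dim.get(store_id)
        if category is None or region is None:
            continue  # assumed inner-join semantics: skip unmatched rows
        partial[(category, region)] += amount
    return iter(partial.items())


def merge_partials(partials: Iterable[Tuple[Key, float]]) -> Dict[Key, float]:
    """Combine per-partition sums; stands in for the single final shuffle."""
    total: Dict[Key, float] = defaultdict(float)
    for key, value in partials:
        total[key] += value
    return dict(total)


# Two hypothetical partitions of a fact table.
partitions: List[List[Fact]] = [
    [(1, 10, 5.0), (2, 10, 3.0), (1, 10, 2.0)],
    [(1, 20, 4.0), (2, 10, 1.0), (3, 10, 9.0)],  # product 3 has no dimension row
]

partials = [
    pair
    for part in partitions
    for pair in aggregate_partition(part, PRODUCT_DIM, STORE_DIM)
]
result = merge_partials(partials)
```

In an actual Spark job, one would presumably pass a function like aggregate_partition to rdd.mapPartitions, read the dictionaries from broadcast variables, and merge the partial sums with reduceByKey, so that only the small per-partition aggregates cross the network rather than the full fact table.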

Version published to 10.21203/rs.3.rs-8620327/v1 on Research Square
Jan 23, 2026

Related articles

Computation-Proximate Architecture: An Actor-Based Pattern for Real-Time Calculations Over Billion-Row Datasets

This article has 1 author:
1. Alejandro Jaime
This article has no evaluations
Latest version Feb 27, 2026

Beyond All-Reduce: Event-Driven Model Parallelism Without Collective Communication Primitives (EBD2N)

This article has 4 authors:
1. Ernesto Leite
2. Fabrice Mourlin
3. Youakim Badr
4. Pierre Paradinas
This article has no evaluations
Latest version Mar 5, 2026

Parallel Architectures for Large-Scale Document Processing: Integrating OCR and RAG Pipelines

This article has 4 authors:
1. Alejandro Jaime
2. Veronica Gil-Costa
3. Marcelo Errecalde
4. Leticia Cagnina
This article has no evaluations
Latest version Jan 19, 2026
