Mneme: A Parallel Preprocessing Framework for Large Tabular Datasets
Abstract
The rapid expansion of Machine Learning (ML) applications, especially within its subfield of Deep Learning (DL), has created an increasing demand for efficient preprocessing of large tabular datasets that exceed the memory capacity of single-node systems. This paper introduces Mneme, a parallel framework developed as a Python library and designed to efficiently preprocess large-scale tabular datasets for training Deep Neural Networks (DNNs). The library supports various data transformations, including normalization, categorical encoding, and missing-value imputation, leveraging parallel computing and chunk-based processing to handle massive datasets efficiently. By distributing preprocessing tasks across multiple cores and loading and processing data chunks in parallel without altering the original data file, the proposed library significantly reduces the time required for data preparation, which is often a critical bottleneck in modern ML pipelines. Experimental evaluation demonstrates substantial performance gains over conventional sequential approaches and state-of-the-art (SOTA) solutions. Furthermore, the library integrates seamlessly with widely adopted DL frameworks, providing a scalable and flexible High-Performance Computing (HPC) tool for data preprocessing in contemporary ML workflows.
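To make the pattern the abstract describes concrete, the following is a minimal sketch, not Mneme's actual API, of chunk-based parallel preprocessing built from pandas and the standard-library multiprocessing module. The function names (preprocess_chunk, parallel_preprocess), the file data.csv, and the per-chunk statistics are illustrative assumptions.

```python
# A minimal sketch (not Mneme's actual API) of chunk-based parallel
# preprocessing: the CSV is streamed in fixed-size chunks, each chunk is
# transformed in a worker process, and the source file is never modified.
from multiprocessing import Pool

import numpy as np
import pandas as pd


def preprocess_chunk(chunk: pd.DataFrame) -> pd.DataFrame:
    """Apply the transformations named in the abstract to one chunk."""
    chunk = chunk.copy()
    num_cols = chunk.select_dtypes(include=np.number).columns
    # Missing-value imputation: fill numeric gaps with a fixed value.
    chunk[num_cols] = chunk[num_cols].fillna(0.0)
    # Normalization: min-max scale each numeric column within the chunk
    # (guarding against zero-range columns).
    mins, maxs = chunk[num_cols].min(), chunk[num_cols].max()
    chunk[num_cols] = (chunk[num_cols] - mins) / (maxs - mins).replace(0, 1)
    # Categorical encoding: map string columns to integer codes.
    for col in chunk.select_dtypes(include="object").columns:
        chunk[col] = chunk[col].astype("category").cat.codes
    return chunk


def parallel_preprocess(path: str, chunksize: int = 100_000, workers: int = 4):
    """Stream the file in chunks and transform them on a process pool."""
    reader = pd.read_csv(path, chunksize=chunksize)  # lazy, read-only
    with Pool(processes=workers) as pool:
        # imap preserves chunk order while workers run in parallel.
        for processed in pool.imap(preprocess_chunk, reader):
            yield processed  # e.g., feed into a DNN training loop


if __name__ == "__main__":
    for batch in parallel_preprocess("data.csv", chunksize=50_000, workers=4):
        print(batch.shape)
```

A production pipeline would gather normalization statistics and category vocabularies in a preliminary pass so that all chunks are transformed consistently; the sketch keeps every statistic local to its chunk for brevity.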