Mneme: A Parallel Preprocessing Framework for Large Tabular Datasets
Abstract
The rapid expansion of Machine Learning (ML) applications, especially within its subfield of Deep Learning (DL), has created an increasing demand for efficient preprocessing of large tabular datasets that exceed the memory capacity of single-node systems. This paper introduces Mneme, a parallel framework developed as a Python library and designed to efficiently preprocess large-scale tabular datasets for training Deep Neural Networks (DNNs). The library supports various data transformations, including normalization, categorical encoding, and missing-value imputation, leveraging parallel computing and chunk-based processing to handle massive datasets efficiently. By distributing preprocessing tasks across multiple cores and loading and processing data chunks in parallel without altering the original data file, the proposed library significantly reduces the time required for data preparation, which is often a critical bottleneck in modern ML pipelines. Experimental evaluation demonstrates substantial performance gains over conventional sequential approaches and state-of-the-art (SOTA) solutions. Furthermore, the library integrates seamlessly with widely adopted DL frameworks, providing a scalable and flexible High-Performance Computing (HPC) tool for data preprocessing in contemporary ML workflows.
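To make the pattern the abstract describes concrete, the following is a minimal sketch, not Mneme's actual API, of chunk-based parallel preprocessing built from pandas and the standard-library multiprocessing module. The function names (preprocess_chunk, parallel_preprocess), the file data.csv, and the per-chunk statistics are illustrative assumptions.

```python
# A minimal sketch (not Mneme's actual API) of chunk-based parallel
# preprocessing: the CSV is streamed in fixed-size chunks, each chunk is
# transformed in a worker process, and the source file is never modified.
from multiprocessing import Pool

import numpy as np
import pandas as pd


def preprocess_chunk(chunk: pd.DataFrame) -> pd.DataFrame:
    """Apply the transformations named in the abstract to one chunk."""
    chunk = chunk.copy()
    num_cols = chunk.select_dtypes(include=np.number).columns
    # Missing-value imputation: fill numeric gaps with a fixed value.
    chunk[num_cols] = chunk[num_cols].fillna(0.0)
    # Normalization: min-max scale each numeric column within the chunk
    # (guarding against zero-range columns).
    mins, maxs = chunk[num_cols].min(), chunk[num_cols].max()
    chunk[num_cols] = (chunk[num_cols] - mins) / (maxs - mins).replace(0, 1)
    # Categorical encoding: map string columns to integer codes.
    for col in chunk.select_dtypes(include="object").columns:
        chunk[col] = chunk[col].astype("category").cat.codes
    return chunk


def parallel_preprocess(path: str, chunksize: int = 100_000, workers: int = 4):
    """Stream the file in chunks and transform them on a process pool."""
    reader = pd.read_csv(path, chunksize=chunksize)  # lazy, read-only
    with Pool(processes=workers) as pool:
        # imap preserves chunk order while workers run in parallel.
        for processed in pool.imap(preprocess_chunk, reader):
            yield processed  # e.g., feed into a DNN training loop


if __name__ == "__main__":
    for batch in parallel_preprocess("data.csv", chunksize=50_000, workers=4):
        print(batch.shape)
```

A production pipeline would gather normalization statistics and category vocabularies in a preliminary pass so that all chunks are transformed consistently; the sketch keeps every statistic local to its chunk for brevity.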