Advanced Data Cleaning Pipelines for Big Data Analytics

Arimondo Scrivano

Abstract

In the era of big data, the analysis of vast and complex datasets has become paramount for extracting valuable insights across diverse scientific domains. A critical component of the data analytics pipeline is data cleaning, an intricate process aimed at enhancing data quality through the rectification of inaccuracies and inconsistencies. This review focuses on the advanced methodologies involved in data cleaning, with an emphasis on exploration techniques, handling of missing values, and feature selection. Effective data cleaning pipelines are indispensable for ensuring the reliability and accuracy of downstream analytical processes. We explore contemporary strategies for data exploration that facilitate the discovery of data patterns and anomalies, enhancing the overall understanding of datasets. The review further discusses sophisticated techniques for managing missing data, emphasizing both imputation methods and model-based approaches. Additionally, we analyze methodologies for effective feature selection, describing how they can be leveraged to improve model performance by reducing dimensionality and eliminating redundant features. Through a comprehensive review of these advanced data cleaning techniques, this article highlights the necessity of robust cleaning strategies in the context of big data analytics, providing a roadmap for researchers and practitioners to enhance data quality and optimize analytical outcomes.

Version published to 10.20944/preprints202507.1524.v1
Jul 18, 2025

Related articles

Advancing Object-Centric Process Mining with Multi-Dimensional Data Operations

This article has 3 authors:
1. Shahrzad Khayatbashi
2. Najmeh Miri
3. Amin Jalali
This article has no evaluations
Latest version Jan 21, 2026

INTEGRATION OF DATA LAKES AND DATA WAREHOUSES FOR AI-DRIVEN HEALTHCARE ANALYTICS

This article has 6 authors:
1. Monalisa Dike
2. Chidera Theola Onuh
3. Felix Ikpoki Acha
4. Ebenezer Oseneboh
5. Tobiloba Johnson Ojo
6. Okungbowa Olayemi
This article has no evaluations
Latest version Dec 27, 2025

Ten Quick Tips for Biomedical Federated Learning

This article has 8 authors:
1. Kyle Ellrott
2. Venkat S. Maladi
3. Jean-Christophe Bélisle-Pipon
4. Emek Demir
5. Yael Bensoussan
6. Serghei Mangul
7. Alex A. T. Bui
8. Paul C. Boutros
This article has no evaluations
Latest version Jan 27, 2026
