Foundations of Intelligence: A Review of Data Preprocessing Pipelines in Machine Learning from Classical ML to LLMs, Agentic AI, and Multimodal Systems


Abstract

Modern AI systems often draw attention for their model architectures or massive parameter scales, yet a consistent pattern emerges across decades of progress: the quality of data preprocessing remains one of the strongest determinants of performance, robustness, and downstream reliability. This review examines how preprocessing practices have evolved from expert-driven feature engineering in classical machine learning to large-scale automated curation for LLMs and adaptive multimodal pipelines for agentic AI systems. By synthesizing insights from over 50 recent works, we highlight how preprocessing shapes every major AI milestone, from early statistical methods to today's transformer-based multi-agent systems. We trace this evolution across four eras, discuss the persistent challenges that continue to reappear in different forms, and identify recurring architectural principles that guide practical pipeline design. The result is a consolidated perspective on how to build cleaner, more scalable, and more responsible preprocessing systems that support trustworthy AI in real-world conditions.

Index Terms: Data preprocessing, machine learning, deep learning, LLMs, agentic AI, multimodal systems, data pipelines, feature engineering, automation.
