Foundations of Intelligence: A Review of Data Preprocessing Pipelines in Machine Learning from Classical ML to LLMs, Agentic AI, and Multimodal Systems


Abstract

Modern AI systems often draw attention for their model architectures or massive parameter scales, yet a consistent pattern emerges across decades of progress: the quality of data preprocessing remains one of the strongest determinants of performance, robustness, and downstream reliability. This review examines how preprocessing practices have evolved from expert-driven feature engineering in classical machine learning to large-scale automated curation for LLMs and adaptive multimodal pipelines for agentic AI systems. By synthesizing insights from over 50 recent works, we highlight how preprocessing shapes every major AI milestone, from early statistical methods to today's transformer-based multi-agent systems. We trace this evolution across four eras, discuss the persistent challenges that continue to reappear in different forms, and identify recurring architectural principles that guide practical pipeline design. The result is a consolidated perspective on how to build cleaner, more scalable, and more responsible preprocessing systems that support trustworthy AI in real-world conditions.

Index Terms: Data preprocessing, machine learning, deep learning, LLMs, agentic AI, multimodal systems, data pipelines, feature engineering, automation.
