Bridging the Gap Between Data Engineering and ML Operations: A Scalable Framework for Feature Curation, Discovery, and High-Throughput Serving

Bakhtiiar Tashbolotov
Burul Shambetova

Abstract

The transition of machine learning (ML) from experimental models to production-ready systems is hindered by the complexities of managing high-dimensional data and mitigating "train-serve skew." This paper presents an architectural framework for a high-performance Feature Store, designed as a centralized "missing data layer" that unifies feature engineering across the ML lifecycle. Utilizing a microservices approach, the system leverages Go for low-latency serving and Apache Spark for scalable distributed aggregations. We propose a dual-layer storage strategy integrating DragonflyDB for sub-millisecond online retrieval and Apache Iceberg for transactional offline persistence and historical time-travel. Experimental results demonstrate that this architecture achieves a p99 latency of less than 0.85ms at 50,000 requests per second while maintaining 100 percent data consistency. Finally, the research addresses the emerging shift toward embedding-centric pipelines, outlining the evolution required to manage high-dimensional vector spaces and drift in self-supervised models.
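The dual-layer idea in the abstract (an online layer holding the latest feature values, an offline layer keeping the full history for time-travel reads) can be illustrated with a minimal in-memory sketch. This is not the authors' implementation: plain Python dictionaries stand in for DragonflyDB and Apache Iceberg, and the class name `FeatureStore` and its methods `write`, `get_online`, and `get_as_of` are hypothetical names chosen for illustration only.

```python
from bisect import bisect_right
from dataclasses import dataclass, field


@dataclass
class FeatureStore:
    """Toy dual-layer store (hypothetical): latest values online, history offline."""

    # Online layer stand-in: (entity, feature) -> most recent value.
    online: dict = field(default_factory=dict)
    # Offline layer stand-in: (entity, feature) -> list of (timestamp, value),
    # kept sorted by timestamp so point-in-time reads are a binary search.
    offline: dict = field(default_factory=dict)

    def write(self, entity, feature, ts, value):
        """Record one observation in both layers from a single code path."""
        key = (entity, feature)
        history = self.offline.setdefault(key, [])
        idx = bisect_right([t for t, _ in history], ts)
        history.insert(idx, (ts, value))
        # The online layer always mirrors the newest entry in the history,
        # so a late-arriving (older) write never overwrites a fresher value.
        self.online[key] = history[-1][1]

    def get_online(self, entity, feature):
        """Serving path: return the latest value, or None if unknown."""
        return self.online.get((entity, feature))

    def get_as_of(self, entity, feature, ts):
        """Training path: return the value that was current at time `ts`."""
        history = self.offline.get((entity, feature), [])
        idx = bisect_right([t for t, _ in history], ts)
        return history[idx - 1][1] if idx else None
```

Because both layers are populated by the same `write` call, a point-in-time read at the newest timestamp returns the same value the serving path sees, which is the property that guards against train-serve skew in this toy setting. A real deployment would also need to handle partial failures between the two writes, which this sketch ignores.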

Version published to 10.20944/preprints202512.2561.v1
Dec 29, 2025

Related articles

Practical Event-Driven Microservices: A Database-Centric Alternative to Message Brokers An Architectural Framework for Moderate-Scale Systems
This article has 1 author:
1. Olatunji Ajayi
This article has no evaluations
Latest version Dec 11, 2025

The Joinless Partition Pattern: A Novel Approach to Efficient Multi-Dimensional Aggregations in Apache Spark
This article has 1 author:
1. Jaime Ruben Alejandro
This article has no evaluations
Latest version Jan 23, 2026

Predictive-LoRA: A Proactive and Fragmentation-Aware Serverless Inference System for LLMs
This article has 5 authors:
1. Yinan Ni
2. Xiao Yang
3. Zhimin Qiu
4. Chen Wang
5. Tingzhou Yuan
This article has no evaluations
Latest version Dec 24, 2025
