Bridging the Gap Between Data Engineering and ML Operations: A Scalable Framework for Feature Curation, Discovery, and High-Throughput Serving
Abstract
The transition of machine learning (ML) from experimental models to production-ready systems is hindered by the complexity of managing high-dimensional data and mitigating "train-serve skew." This paper presents an architectural framework for a high-performance Feature Store, designed as a centralized "missing data layer" that unifies feature engineering across the ML lifecycle. Built on a microservices approach, the system uses Go for low-latency serving and Apache Spark for scalable distributed aggregations. We propose a dual-layer storage strategy that integrates DragonflyDB for sub-millisecond online retrieval with Apache Iceberg for transactional offline persistence and historical time travel. Experimental results show that this architecture achieves a p99 latency below 0.85 ms at 50,000 requests per second while maintaining 100 percent data consistency. Finally, the paper addresses the emerging shift toward embedding-centric pipelines, outlining the evolution required to manage high-dimensional vector spaces and drift in self-supervised models.
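The dual-layer idea can be illustrated with a minimal in-memory sketch. It is not the paper's implementation: the class `DualLayerFeatureStore`, its method names, and the feature name `txn_count_7d` are hypothetical, a plain dictionary stands in for the online store (DragonflyDB in the paper), and an append-only, time-ordered list stands in for the offline table (Apache Iceberg in the paper). The sketch assumes that a single write path feeds both layers, which is one common way to keep serving and training values from diverging.

```python
from bisect import bisect_right
from collections import defaultdict


class DualLayerFeatureStore:
    """Toy stand-in for an online/offline feature store (illustrative only)."""

    def __init__(self):
        # Online layer: latest value per (entity, feature) key.
        self._online = {}
        # Offline layer: full history per key, kept sorted by event time.
        self._offline = defaultdict(list)

    def write(self, entity_id, feature, value, event_time):
        key = (entity_id, feature)
        history = self._offline[key]
        # Insert in event-time order so late-arriving rows keep lookups correct.
        idx = bisect_right([t for t, _ in history], event_time)
        history.insert(idx, (event_time, value))
        # The online layer mirrors the newest row by event time.
        self._online[key] = history[-1][1]

    def get_online(self, entity_id, feature):
        """Serving path: return the latest known value, or None."""
        return self._online.get((entity_id, feature))

    def get_as_of(self, entity_id, feature, as_of):
        """Training path: return the value that was valid at time `as_of`."""
        history = self._offline.get((entity_id, feature), [])
        idx = bisect_right([t for t, _ in history], as_of)
        return history[idx - 1][1] if idx else None


store = DualLayerFeatureStore()
store.write("user_1", "txn_count_7d", 3, event_time=100)
store.write("user_1", "txn_count_7d", 5, event_time=200)

latest = store.get_online("user_1", "txn_count_7d")            # 5
historical = store.get_as_of("user_1", "txn_count_7d", 150)    # 3
```

Because the as-of lookup only returns rows whose event time is at or before the requested timestamp, a training set built this way cannot see feature values from the future, which is the point-in-time property that time travel on the offline layer is meant to provide.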