Automated Deployment and Performance Benchmarking of Machine Learning Workloads on Hadoop Clusters Using Ansible

Rameez Rahaman


Abstract

As big data frameworks see wider adoption in cloud computing, efficient deployment and performance evaluation of distributed systems have become crucial. This project automates the deployment of machine learning benchmarks on Hadoop clusters using Ansible, a widely used IT automation tool. We use HiBench, a benchmarking suite developed by Intel, to measure the runtime and throughput of Naïve Bayes classification and K-Means clustering workloads implemented in Apache Mahout. The deployment process consists of setting up a virtual cluster, installing the required software packages, and configuring the Hadoop and Spark environments, all through Ansible playbooks. The study reports benchmarking results across different workload scales and analyzes execution time and throughput per node. Automating the entire setup simplifies the evaluation of large-scale machine learning tasks and makes it easier to assess and optimize Hadoop-based distributed computing environments.
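The two metrics named in the abstract, throughput and throughput per node, can be illustrated with a minimal sketch. It assumes throughput is defined as input data size divided by execution time, and throughput per node as that value divided by the number of cluster nodes. The class name `BenchmarkRun`, its fields, and all numbers below are hypothetical placeholders for illustration, not results or code from the study.

```python
from dataclasses import dataclass


@dataclass
class BenchmarkRun:
    """One benchmark execution (hypothetical structure, for illustration only)."""
    workload: str       # workload label, e.g. "kmeans" or "bayes"
    input_bytes: int    # size of the input data set in bytes
    duration_s: float   # wall-clock execution time in seconds
    nodes: int          # number of worker nodes in the cluster

    @property
    def throughput(self) -> float:
        """Bytes processed per second, assuming throughput = input size / duration."""
        return self.input_bytes / self.duration_s

    @property
    def throughput_per_node(self) -> float:
        """Throughput divided evenly across the cluster nodes."""
        return self.throughput / self.nodes


# Placeholder values chosen only to make the arithmetic easy to follow.
runs = [
    BenchmarkRun("kmeans", input_bytes=1_000_000, duration_s=20.0, nodes=4),
    BenchmarkRun("bayes", input_bytes=3_000_000, duration_s=60.0, nodes=4),
]

for run in runs:
    print(
        f"{run.workload}: {run.duration_s:.1f} s, "
        f"{run.throughput:.1f} bytes/s, "
        f"{run.throughput_per_node:.1f} bytes/s per node"
    )
```

Comparing the per-node figure across workload scales shows whether adding data or nodes changes how efficiently each node is used, which is the kind of analysis the abstract describes.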

Version published to 10.21203/rs.3.rs-7003490/v1 on Research Square
Jul 2, 2025

