Correlation Aware Dynamic Solution for Merging Small Files based on Similarity and Clustering in HDFS

Hanène Chettaoui
Farah Hkiri

Abstract

In distributed processing environments such as Hadoop, the Hadoop Distributed File System (HDFS) is a popular open-source solution for storing and managing a massive number of files. HDFS faces several issues when it comes to handling a large number of small files. To overcome this drawback, we propose a new dynamic strategy, CHAC (for Correlation and Hierarchical Ascending Clustering), that deals with the small files problem. The major contribution of our strategy is that correlation between small files is the criterion used to merge them into large files. For this purpose, several criteria, such as file size, number of requests and requesting clients, are taken into account by analyzing the user access pattern. In the current version of our proposal, a formula inspired by information retrieval is used to quantify the weight of each small file. Moreover, a cosine similarity-based method coupled with a hierarchical ascending clustering algorithm is used to group correlated small files. To demonstrate the effectiveness of CHAC, a series of experiments was performed over ten distinct datasets. Compared with other solutions, the results show that our proposal reduces the number of resulting large files as well as the NameNode memory consumption. CHAC also optimizes the use of DataNode storage space by increasing the average disk utilization of data blocks. Moreover, it reduces the time taken to store large files. As our solution is based on file correlations, an in-depth analysis of the quality of the resulting large files is carried out, highlighting the effectiveness of CHAC.
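The grouping idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' CHAC implementation: the abstract does not give the weighting formula, so a TF-IDF-style weight over requesting clients is assumed here, and the linkage rule (average), the similarity threshold, the size cap and all function and file names are hypothetical choices made only for illustration.

```python
import math
from itertools import combinations

def weight_vectors(access_counts):
    """Turn {file: {client: request_count}} into TF-IDF-style vectors.
    The weighting is an assumption; the abstract does not specify the formula."""
    n_files = len(access_counts)
    df = {}  # number of files each client has requested
    for counts in access_counts.values():
        for client in counts:
            df[client] = df.get(client, 0) + 1
    return {
        f: {c: n * math.log(1 + n_files / df[c]) for c, n in counts.items()}
        for f, counts in access_counts.items()
    }

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * \
           math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def cluster(access_counts, sizes, threshold=0.5, block_size=128 * 1024 * 1024):
    """Bottom-up (ascending) clustering with average linkage.
    Repeatedly merges the most similar pair of clusters, as long as the
    similarity is at least `threshold` and the merged size fits `block_size`."""
    vecs = weight_vectors(access_counts)
    clusters = [[f] for f in sorted(vecs)]

    def linkage(a, b):
        return sum(cosine(vecs[x], vecs[y]) for x in a for y in b) / (len(a) * len(b))

    def total(c):
        return sum(sizes[f] for f in c)

    while True:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            if total(clusters[i]) + total(clusters[j]) > block_size:
                continue  # merged file would exceed the size cap
            s = linkage(clusters[i], clusters[j])
            if s >= threshold and (best is None or s > best[0]):
                best = (s, i, j)
        if best is None:
            return clusters
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]

# Hypothetical access log: which clients requested which small files, and how often.
access = {
    "a.log": {"c1": 5, "c2": 1},
    "b.log": {"c1": 4},
    "c.csv": {"c3": 7},
    "d.csv": {"c3": 2, "c4": 1},
}
sizes = {f: 1024 * 1024 for f in access}  # 1 MiB each
groups = cluster(access, sizes)
print(groups)  # [['a.log', 'b.log'], ['c.csv', 'd.csv']]
```

In this toy run, files requested by the same clients end up in the same group, so each group would become one merged large file. The 128 MiB cap mirrors the default HDFS block size; lowering it below 2 MiB here would leave all four files unmerged.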

Version published to 10.21203/rs.3.rs-6944075/v1 on Research Square
Aug 27, 2025

Related articles

A Scalable System for Software Repository Analysis and Retrieva

This article has 1 author:
1. Shruti Hardia
This article has no evaluations
Latest version Sep 8, 2025

Ranking Methods for Skyline Queries

This article has 2 authors:
1. Mickaël Martin Nevot
2. Lotfi Lakhal
This article has no evaluations
Latest version Aug 27, 2025

Group Querying in Tridimensional Social Networks

This article has 3 authors:
1. Pedro Henrique B. Ruas
2. Rokia Missaoui
3. Mohamed Hamza Ibrahim
This article has no evaluations
Latest version Sep 12, 2025
