Correlation Aware Dynamic Solution for Merging Small Files based on Similarity and Clustering in HDFS
Abstract
In distributed processing environments such as Hadoop, the Hadoop Distributed File System (HDFS) is a popular open-source solution for storing and managing a massive number of files. However, HDFS faces several issues when handling a large number of small files. To overcome this drawback, we propose a new dynamic strategy, CHAC (Correlation and Hierarchical Ascending Clustering), that addresses the small files problem. The major contribution of our strategy is that correlations between small files serve as the criterion for merging them into large files. For this purpose, several criteria, such as file size, number of requests and requesting clients, are taken into account by analyzing the user access pattern. In the current version of our proposal, a formula inspired by information retrieval is used to quantify the weight of each small file. Moreover, a cosine similarity-based method coupled with a hierarchical ascending clustering algorithm is used to group correlated small files. To demonstrate the effectiveness of CHAC, a series of experiments was performed over ten distinct datasets. Compared with other solutions, the results show that our proposal performs well in reducing both the number of large files obtained and the NameNode memory consumption. CHAC also optimizes the use of DataNode storage space by increasing the average disk utilization of data blocks. Moreover, it reduces the time taken to store large files. As our solution is based on file correlations, an in-depth analysis of the quality of the obtained large files is carried out, highlighting the effectiveness of CHAC.
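The grouping step described above, cosine similarity followed by hierarchical ascending (agglomerative) clustering, can be illustrated with a minimal Python sketch. It is not the CHAC implementation: the abstract does not give the weighting formula, the linkage rule or the stopping criterion, so everything specific here is an assumption. Each small file is represented by a hypothetical vector of per-client request counts, clusters are compared with average linkage, and merging stops when no pair reaches a similarity `threshold` or when a merge would exceed `block_size`. The names `cosine` and `cluster_small_files` are illustrative only.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cluster_small_files(vectors, sizes, threshold=0.9,
                        block_size=128 * 1024 * 1024):
    """Greedy bottom-up clustering of small files (illustrative sketch).

    vectors: dict mapping file name to its access vector (assumed here to be
             per-client request counts).
    sizes:   dict mapping file name to its size in bytes.
    Returns a list of clusters; each cluster is a list of file names that
    would be merged into one large file.
    """
    clusters = [[name] for name in vectors]

    def average_similarity(a, b):
        # Average linkage: mean pairwise similarity across the two clusters.
        total = sum(cosine(vectors[x], vectors[y]) for x in a for y in b)
        return total / (len(a) * len(b))

    def cluster_size(c):
        return sum(sizes[name] for name in c)

    while True:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            # Assumed constraint: a merged file must fit in one block.
            if cluster_size(clusters[i]) + cluster_size(clusters[j]) > block_size:
                continue
            sim = average_similarity(clusters[i], clusters[j])
            if sim >= threshold and (best is None or sim > best[0]):
                best = (sim, i, j)
        if best is None:
            return clusters
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]


if __name__ == "__main__":
    # Hypothetical access pattern: files a and b are requested only by
    # client 1, files c and d only by client 2.
    access = {"a": [5, 0], "b": [3, 0], "c": [0, 4], "d": [0, 2]}
    file_sizes = {"a": 10, "b": 10, "c": 10, "d": 10}
    print(cluster_small_files(access, file_sizes))
```

In this toy input, files requested by the same client end up in the same group, which is the intuition behind merging correlated files. A production version would replace the raw counts with the paper's weighting formula and would likely use a library implementation of hierarchical clustering rather than this quadratic-per-step loop.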