Hybrid Deep Learning for Fail-Slow Disk Detection in the FSA Benchmark
Abstract
Fail-slow disks, whose performance degrades gradually before an outright failure, are increasingly common in large-scale cloud storage systems. Our work builds upon the FSA-benchmark dataset (PERSEUS), which contains approximately 100 billion data points collected from over 300,000 disks across 25 clusters. Initial experiments with traditional machine learning models such as XGBoost, Random Forest, and SVM, along with sequence models such as LSTM, have shown mixed results in detecting fail-slow conditions (failure rates ranging from 3.33% for the Autoencoder to 96.67% for the SVM). However, these approaches struggle to capture the complex, high-frequency correlations in disk metrics that precede a fail-slow event. This research proposes a hybrid deep learning framework that combines convolutional-recurrent layers with self-attention mechanisms to better model both spatial and temporal dependencies in the 15-second-interval performance metrics. The proposed architecture ingests multivariate time windows (look-back periods of 1-15 days) and outputs real-time probabilities of impending fail-slow conditions. We evaluate our approach on the same Cluster A and B splits used in the original PERSEUS study, using precision, recall, AUC-ROC, and Time-to-Alert as key metrics. Preliminary experiments demonstrate promising results, with the LSTM model achieving a 28% failure rate and the Autoencoder showing exceptional specificity (3.33% failure rate). The proposed hybrid architecture builds upon these foundations by integrating transformer-based mechanisms to better capture long-range dependencies in disk performance data.
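The abstract describes the hybrid convolutional-recurrent-attention architecture only at a high level. The following is a minimal PyTorch sketch of one plausible realization under stated assumptions: the metric count (n_metrics=8), channel and hidden sizes, and the 1-day example window are illustrative choices, not values taken from the paper.

```python
import torch
import torch.nn as nn

class HybridFailSlowDetector(nn.Module):
    """Sketch: Conv1d front-end -> bidirectional LSTM -> self-attention -> sigmoid head."""

    def __init__(self, n_metrics: int = 8, conv_channels: int = 32,
                 lstm_hidden: int = 64, n_heads: int = 4):
        super().__init__()
        # 1-D convolutions capture short-range correlations across the
        # 15-second samples within a window.
        self.conv = nn.Sequential(
            nn.Conv1d(n_metrics, conv_channels, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(conv_channels, conv_channels, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        # Recurrent layer models temporal dependencies over the look-back window.
        self.lstm = nn.LSTM(conv_channels, lstm_hidden, batch_first=True,
                            bidirectional=True)
        # Self-attention re-weights time steps so long-range precursors of a
        # fail-slow event are not washed out by the recurrence.
        self.attn = nn.MultiheadAttention(embed_dim=2 * lstm_hidden,
                                          num_heads=n_heads, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(2 * lstm_hidden, 1),
            nn.Sigmoid(),  # probability of an impending fail-slow condition
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, window_length, n_metrics)
        h = self.conv(x.transpose(1, 2)).transpose(1, 2)  # Conv1d wants (B, C, T)
        h, _ = self.lstm(h)
        h, _ = self.attn(h, h, h)
        return self.head(h.mean(dim=1)).squeeze(-1)       # (batch,) probabilities

# Example: a 1-day look-back window at 15-second intervals = 5760 samples.
model = HybridFailSlowDetector(n_metrics=8)
window = torch.randn(4, 5760, 8)   # batch of 4 hypothetical disk windows
print(model(window).shape)         # torch.Size([4]) -> per-disk fail-slow probabilities
```

In this sketch the attention block sits after the recurrence, so the pooled representation can still emphasize time steps far back in the window; the paper's transformer-based variant may arrange these components differently.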