Hard Disk Drive Failure Analysis and Prediction: An Industry View

IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)

Abstract

Storage media devices are fundamental to Meta’s hardware infrastructure, which supports a diverse family of applications such as Facebook, Instagram, and WhatsApp. Understanding the factors that impact the reliability of storage devices is important for setting application expectations on specifications such as throughput, latency, and read/write success rate. Improving hardware reliability helps us meet those expectations.

In this paper, we examine the impact that age and workload have on the annualized failure rate (AFR) of Hard Disk Drives (HDDs), one of the most widely used types of storage devices for Meta’s applications. We analyze these correlations using data collected from our production hardware fleet. In our datacenter environment, we observe that HDD AFR increases as either age or lifetime cumulative workload increases. We discuss how the observed AFR curves differ from the projections that manufacturers make using statistical modeling. Additionally, we use a decision tree-based predictive machine learning (ML) model, XGBoost, to analyze the correlation between the SMART (Self-Monitoring, Analysis, and Reporting Technology) metrics and the health of HDDs. Through this study, we observe that age- and workload-related SMART parameters are the most correlated with the health of a drive, based on the trained ML model. Moreover, we find that the change in SMART metrics over a 30-day time window could improve the prediction performance for HDD failures.
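
The two quantities named in the abstract, AFR and the 30-day change in a SMART metric, can be made concrete with a minimal Python sketch. It assumes the common industry definition of AFR (failures divided by drive-years of operation), which the abstract itself does not spell out, and it assumes one SMART reading per drive per day. The function names `annualized_failure_rate` and `smart_delta` are hypothetical and are not taken from the paper.

```python
from datetime import date, timedelta
from typing import Dict, Optional


def annualized_failure_rate(failures: int, drive_days: float) -> float:
    """Return AFR as a fraction: failures per drive-year of operation.

    Assumed definition (not stated in the abstract):
    AFR = failures / (drive_days / 365).
    """
    if drive_days <= 0:
        raise ValueError("drive_days must be positive")
    return failures / (drive_days / 365.0)


def smart_delta(
    readings: Dict[date, float], as_of: date, window_days: int = 30
) -> Optional[float]:
    """Return the change in one SMART counter over the trailing window.

    `readings` maps a calendar day to the counter value for one drive.
    Returns None when either endpoint of the window has no reading.
    """
    earlier = as_of - timedelta(days=window_days)
    if as_of not in readings or earlier not in readings:
        return None
    return readings[as_of] - readings[earlier]


# Illustrative numbers only; these are not data from the paper.
afr = annualized_failure_rate(failures=12, drive_days=365_000)
delta = smart_delta(
    {date(2024, 1, 1): 5.0, date(2024, 1, 31): 9.0}, as_of=date(2024, 1, 31)
)
```

In a setup like the one the abstract describes, a value such as `delta` would be supplied to the classifier as an additional feature alongside the raw SMART reading; how the authors actually construct their features is not specified here.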
