Understanding SSD Reliability in Large-Scale Cloud Systems

<span class="var-sub_title">Understanding SSD Reliability in Large-Scale Cloud Systems</span> SC18 Proceedings

PDSW-DISCS: Joint International Workshop on Parallel Data Storage and Data Intensive Scalable Computing Systems

Understanding SSD Reliability in Large-Scale Cloud Systems

Authors:

Abstract: Modern datacenters increasingly use flash-based solid state drives (SSDs) for high performance and low energy cost. However, SSDs introduce more complex failure modes compared to traditional hard disks. While great efforts have been made to understand the reliability of SSDs itself, it remains unclear how the device-level errors may affect upper layers, or how the services running on top of the storage stack may affect the SSDs.

In this paper, we take a holistic view to examine the reliability of SSD-based storage systems in Alibaba's datacenters, which covers about half-million SSDs under representative cloud services over three years. By vertically analyzing the error events across three layers (i.e., SSDs, OS, and the distributed file system), we discover a number of interesting correlations. For example, SSDs with UltraDMA CRC errors, while seems benign at the device level, are nearly 3 times more likely to lead to OS-level error events. As another example, different cloud services may lead to different usage patterns of SSDs, some of which are detrimental from the devices perspective.

Archive Materials

Back to PDSW-DISCS: Joint International Workshop on Parallel Data Storage and Data Intensive Scalable Computing Systems Archive Listing

Back to Full Workshop Archive Listing