RAID Disk Failure Calculator from Memset

This tool is provided to help you understand the risk of data loss from disk failure in commonly used RAID configurations.

Inputs

  • Total number of disks in the RAID array (including parity and hot spares)
  • Size of each disk in GB
  • MB/s the RAID controller can devote to rebuilding
  • Average lifetime of a disk in years
  • Days for a disk to be replaced

Disk size (GB)   Rebuild time           Time between disk failures
250              6 hours, 56 minutes    1 month, 2 weeks
500              13 hours, 53 minutes   1 month, 2 weeks
1000             1 day, 3 hours         1 month, 2 weeks
2000             2 days, 7 hours        1 month, 2 weeks

                          250 GB disks          500 GB disks          1000 GB disks         2000 GB disks
Configuration             Size   DLO/y          Size   DLO/y          Size   DLO/y          Size   DLO/y
raid0                     6      1 in 1.0       12     1 in 1.0       24     1 in 1.0       48     1 in 1.0
raid5                     5.8    1 in 2.3       11.5   1 in 2.2       23     1 in 2.1       46     1 in 2.0
raid5 with 1 hotspare     5.5    1 in 18.6      11     1 in 12.6      22     1 in 7.8       44     1 in 4.6
raid5 with 2 hotspares    5.3    1 in 38.3      10.5   1 in 19.9      21     1 in 10.4      42     1 in 5.6
raid5 with 3 hotspares    5      1 in 40.1      10     1 in 20.4      20     1 in 10.5      40     1 in 5.6
raid6                     5.5    1 in 33.7      11     1 in 31.3      22     1 in 27.3      44     1 in 21.5
raid6 with 1 hotspare     5.3    1 in 802.0     10.5   1 in 646.6     21     1 in 402.2     42     1 in 175.8
raid6 with 2 hotspares    5      1 in 10960.9   10     1 in 3902.3    20     1 in 1112.4    40     1 in 294.8
raid6 with 3 hotspares    4.8    1 in 18413.8   9.5    1 in 4686.2    19     1 in 1187.4    38     1 in 303.2

Size: Array size (TB)
DLO/y: Data loss odds (per year)
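
The rebuild time, time between disk failures and array size figures in the table can be approximated with simple arithmetic. The sketch below is an illustration only: the function names are hypothetical, and the input values used in the comments (24 disks, 10 MB/s rebuild rate, 3 year disk lifetime) are not stated on this page but are inferred because they are consistent with the table.

```python
def rebuild_seconds(size_gb, rebuild_mb_s):
    """Time to rewrite one whole disk, assuming 1 GB = 1000 MB."""
    return size_gb * 1000 / rebuild_mb_s


def days_between_failures(disks, lifetime_years):
    """Average gap between failures anywhere in the array, assuming
    failures are spread evenly over each disk's lifetime."""
    return lifetime_years * 365 / disks


def array_size_tb(disks, size_gb, parity, spares):
    """Usable capacity: parity disks and hot spares hold no user data."""
    return (disks - parity - spares) * size_gb / 1000


# Assumed inputs (inferred, not given in the text): 24 disks, 10 MB/s, 3 years.
print(rebuild_seconds(250, 10) / 3600)   # hours to rebuild a 250 GB disk
print(days_between_failures(24, 3))      # days between failures in the array
print(array_size_tb(24, 250, 2, 3))      # raid6 with 3 hot spares, in TB
```

With these assumed inputs a 250 GB disk takes 25000 seconds to rebuild (about 6 hours, 56 minutes) and the array sees a failure roughly every 45.6 days, in line with the first row of the table.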

Assumptions

  • We assume that the probability of failure is distributed evenly over the disk lifetime (not a brilliant assumption!)
  • We assume disk failures are independent. This is also not a good assumption, since the disks in an array are likely to come from the same manufacturer

Method

  • Calculate how many disks need to fail in order to cause data loss (1 for raid0, 2 for raid5, 3 for raid6)
  • Calculate the critical period in which these failures must happen to cause data loss. For arrays with no hot spares this is the time to replace a disk plus the rebuild time; for arrays with hot spares it is the rebuild time alone
  • Assuming the probability of disk failure is distributed evenly over the disk lifetime and that disk failures are not correlated, use the Poisson distribution to calculate the probability of that many disks failing in any given critical period
  • Now work out how many critical periods there are in a year, and use the Poisson distribution again to work out the probability of data loss in any given year
  • For arrays with hot spares, the calculation is done again assuming that all the parity disks, all the hot spares and one more disk fail within the time to replace plus the rebuild time, and the result is combined with the probability calculated above. This second probability is usually small compared to the first, but it can become significant.
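
The steps above can be sketched in Python as follows. This is one reading of the method, not the tool's actual code: the function names are hypothetical, "that many disks failing" is taken to mean at least that many failures in a critical period, and the two hot spare calculations are combined as if they were independent events. The inputs used in the example (24 disks, 10 MB/s, 3 year lifetime, 7 days to replace a disk) are likewise assumptions.

```python
import math

SECONDS_PER_YEAR = 365 * 24 * 3600


def poisson_tail(k, mu):
    """P(X >= k) for X ~ Poisson(mu), summed term by term so that
    very small tail probabilities keep their precision."""
    term = math.exp(-mu) * mu ** k / math.factorial(k)
    total = 0.0
    i = k
    while term > 0.0 and i < k + 100:
        total += term
        i += 1
        term *= mu / i
    return min(total, 1.0)


def yearly_loss(failures_needed, period_years, disks, lifetime_years):
    """Probability of at least one fatal critical period in a year."""
    # Expected disk failures in one critical period (even failure rate).
    mu = disks / lifetime_years * period_years
    p_period = poisson_tail(failures_needed, mu)
    # Second Poisson step: expected fatal periods per year.
    return 1.0 - math.exp(-p_period / period_years)


def data_loss_probability(disks, size_gb, rebuild_mb_s, lifetime_years,
                          replace_days, parity, spares):
    rebuild = size_gb * 1000 / rebuild_mb_s / SECONDS_PER_YEAR
    replace = replace_days / 365
    if spares == 0:
        return yearly_loss(parity + 1, replace + rebuild, disks, lifetime_years)
    # With hot spares the rebuild starts at once, so the short window applies.
    p_fast = yearly_loss(parity + 1, rebuild, disks, lifetime_years)
    # Second pass: parity disks, every spare and one more fail before replacement.
    p_slow = yearly_loss(parity + spares + 1, replace + rebuild,
                         disks, lifetime_years)
    # Assumption: combine the two as independent events.
    return 1.0 - (1.0 - p_fast) * (1.0 - p_slow)


# Assumed inputs: 24 disks of 250 GB, 10 MB/s, 3 year lifetime, 7 days to replace.
for name, parity, spares in [("raid0", 0, 0), ("raid5", 1, 0), ("raid6", 2, 0)]:
    p = data_loss_probability(24, 250, 10, 3, 7, parity, spares)
    print(f"{name}: 1 in {1 / p:.1f}")
```

With these assumed inputs the sketch gives odds in the same region as the 250 GB figures in the table, although it is not guaranteed to match the tool's exact arithmetic.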

There are several assumptions and approximations in the above method. However, its results clearly show how disk size, the number of disks in the array and the RAID type affect the probability of data loss.