Hydra 1303


Calculating vSAN Availability: Part II

from the Hydra High Council May 19th 2020


For this next part, let's use the specs posted by a vendor for a specific enterprise SSD to calculate single device availability. We have chosen the Kingston DC450R Enterprise SSD.
Note: Part I of this series can be found here: https://www.hydra1303.com/series/vsan-availability/

Incorporating the MTBF value of a drive into the equation

We're looking for the vendor's posted MTBF (mean time between failures) value. The spec sheet we used can be found here: https://www.kingston.com/us/ssd/dc450-data-center-solid-state-drive

To calculate the availability for this single drive, we use the formula Availability = Uptime / (Uptime + Downtime), or U / (U + D) for short.

In this example, U equals the MTBF value listed in the online spec sheet: 2,000,000 hours.

For D, we have to consider how much time it would take to have another disk installed and its data restored from backup. To be conservative, let's say that you're relying on a legacy tape backup. The tape is stored in a warehouse 50 miles away, and altogether it takes a full 24 hours to rebuild the drive. Therefore, D = 24.

Plugging the values into the formula, we get this:

Availability = 2,000,000 / (2,000,000 + 24) = 2,000,000 / 2,000,024

Then, we divide the top by the bottom:

Availability ≈ 0.99998
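For readers who want to rerun the single-drive calculation with their own numbers, here is a minimal Python sketch. The function and variable names are ours, chosen for illustration, and the 24-hour restore time is the assumption made above, not a vendor figure.

```python
def availability(uptime_hours: float, downtime_hours: float) -> float:
    """Availability = Uptime / (Uptime + Downtime)."""
    return uptime_hours / (uptime_hours + downtime_hours)


mtbf_hours = 2_000_000  # U: MTBF from the spec sheet
restore_hours = 24      # D: assumed time to replace the drive and restore from tape

drive_availability = availability(mtbf_hours, restore_hours)
print(f"{drive_availability:.8f}")  # 0.99998800
```

Swapping in a shorter restore time (say, 4 hours from a disk-based backup) raises the result, which is why the 24-hour figure counts as the conservative choice.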

Factoring in other operational dependencies

We've calculated availability for this capacity drive, but keep in mind that vSAN combines compute and storage into a single host. That host is contained in a rack and has a controller. Plus, there's an SSD for cache. All of these components are necessary for the data on the drive to remain available.

Therefore, you would need to do the same availability calculation for each component.

Let's say we calculated the following:

The host has an availability of three nines at 0.9997
The rack itself is calculated to have an availability of four nines at 0.99996

The controller has an availability of three nines at 0.9995

The SSD cache is an identical Kingston SSD, so it has the same availability as the capacity drive at four nines with 0.99998

The SSD capacity drive is what we originally calculated at 0.99998

We can determine the combined availability by multiplying the availability of each of these components together.

0.9997 × 0.99996 × 0.9995 × 0.99998 × 0.99998 ≈ 0.9991
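The multiplication can be sketched in a few lines of Python. The dictionary keys are illustrative labels of our own; the values are the five example availabilities listed above. Multiplying them assumes the components fail independently and that all of them must be up for the data to be reachable.

```python
import math

# Example availabilities from the list above (labels are illustrative).
component_availability = {
    "host": 0.9997,
    "rack": 0.99996,
    "controller": 0.9995,
    "ssd_cache": 0.99998,
    "ssd_capacity": 0.99998,
}

# Every component is required, so the availabilities multiply.
combined = math.prod(component_availability.values())
print(f"{combined:.4f}")  # 0.9991
```

Note that the combined figure is lower than the weakest single component (the controller at 0.9995), since every dependency in the chain takes its share.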

Keep in mind that this does not take any vSAN protections into consideration. What we've calculated is based on FTT = 0 (failures to tolerate), and you will remember from Part I of this series that FTT = 0 means there is only this one copy of the data. We can't tolerate a single failure with this design.

Yet even without those protections, we're at three nines (0.9991). Again referring to Part I, three nines is equivalent to about 8.77 hours of downtime a year.
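To see where the 8.77 hours comes from, here is a small sketch that converts an availability figure into expected downtime per year. It assumes an average year of 365.25 days (8,766 hours); the helper name is ours.

```python
HOURS_PER_YEAR = 365.25 * 24  # 8,766 hours, assuming an average year with leap days


def annual_downtime_hours(availability: float) -> float:
    """Expected hours of downtime per year for a given availability."""
    return (1 - availability) * HOURS_PER_YEAR


print(f"{annual_downtime_hours(0.999):.2f}")  # 8.77
```

The 8.77 figure corresponds to exactly three nines (0.999). Passing in the unrounded 0.9991 gives a slightly smaller number, so quoting the three-nines value errs on the cautious side.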

Better to lean conservative in estimations

When calculating availability, it is best to be a bit conservative. One trick for doing that is to only consider the number of nines and to throw away the remainders. For example, 0.9997 was the availability calculated for the host. Let's get rid of the 7, leaving us with a value of 0.999.

If we do that for all five of the values, we get this:

0.999 × 0.9999 × 0.999 × 0.9999 × 0.9999 ≈ 0.997

Now we're down to two nines, which is equivalent to about 3.65 days of downtime per year. That's not so good.
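The "keep only the nines" trick is easy to get subtly wrong with floating-point numbers, so the sketch below takes the five example values as strings and uses Python's `decimal` module. The helper name `truncate_to_nines` is our own invention for illustration.

```python
from decimal import Decimal


def truncate_to_nines(availability: str) -> Decimal:
    """Keep only the leading 9s after the decimal point, e.g. '0.9997' -> 0.999."""
    fraction = availability.split(".")[1]
    nines = len(fraction) - len(fraction.lstrip("9"))
    return Decimal("0." + "9" * nines) if nines else Decimal(0)


# Host, rack, controller, SSD cache, SSD capacity (the example values above).
values = ["0.9997", "0.99996", "0.9995", "0.99998", "0.99998"]
truncated = [truncate_to_nines(v) for v in values]

conservative = Decimal(1)
for value in truncated:
    conservative *= value

print([str(t) for t in truncated])  # ['0.999', '0.9999', '0.999', '0.9999', '0.9999']
print(f"{conservative:.4f}")        # 0.9977
```

Applying the same truncation to the product itself (0.9977) leaves 0.99, which is two nines.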

vSAN to the rescue: FTT=1

Let's consider an FTT of 1. For this to happen, we need two copies plus one witness. The witness is an availability mechanism keeping an eye on the two copies in the cluster. It's super tiny (2 MB) and only contains metadata. The copies and the witness are each placed on separate hosts. This way, if any one of the three hosts fails, a full copy is still available.

FTT=1 means we can tolerate one failure, but not two. If we consider the copies and the witness as objects, we have a total of three objects. In order for the data to become unavailable with this level of protection, two of these objects would have to fail.

We conservatively calculated 0.997 as the availability for one object. In other words, there is a 99.7% chance that the object will be available, and a 0.003 (0.3%) chance that it will be unavailable. We get this by subtracting: 1 - 0.997 = 0.003

For 1 object, the chance it will be unavailable is 0.003

For 2 objects, the chance that both will be unavailable at the same time is 0.003 × 0.003 = 0.000009

However, we're trying to determine the availability, and this gives us the probability of failure.

No problem, we simply need to subtract our answer from 1 to get the availability: 1 - 0.000009 = 0.999991
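Here is the FTT=1 arithmetic as a short Python sketch. It follows the simplified model used in this article, where the data becomes unavailable only when two objects are down at once and failures are assumed to be independent. The variable names are ours.

```python
object_availability = 0.997               # conservative single-object figure

p_unavailable = 1 - object_availability   # about 0.003
p_two_unavailable = p_unavailable ** 2    # about 0.000009, two objects down together
ftt1_availability = 1 - p_two_unavailable

print(f"{ftt1_availability:.6f}")  # 0.999991
```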

So with an FTT of 0, we had two nines availability, but with the vSAN FTT protection level of 1, we are up to five nines and this is with a conservative estimate.

In this example, we based our availability calculation on each VM having one object. If we assume most VMs will contain more than one object, we can account for this as well.

Let's say a VM has 8 objects. All 8 need to be available for the VM to be available, so we simply take the availability we calculated (0.999991) and raise it to the 8th power: 0.999991^8 ≈ 0.999928
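A quick sketch of the per-VM calculation, assuming every object in the VM must be available and that objects fail independently. The function name `vm_availability` is hypothetical.

```python
def vm_availability(object_availability: float, objects_per_vm: int) -> float:
    """All objects must be available, so raise to the number of objects."""
    return object_availability ** objects_per_vm


ftt1_object = 0.999991  # FTT=1 availability for a single object

for count in (1, 2, 4, 8):
    print(count, f"{vm_availability(ftt1_object, count):.6f}")
# The last line printed is: 8 0.999928
```

Looping over a few object counts shows how the figure slides from five nines toward four as the number of objects per VM grows.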

This basically means that depending on how many objects there are per VM, you'll get between 4 nines and 5 nines.

FTT=2

Let's do one more. This time, we'll use an FTT of 2. When this vSAN protection level is used, there are 3 copies and 2 witnesses per object. This means we can tolerate 2 failures, but not 3.

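Recall from earlier that we conservatively calculated 0.997 as the availability for one object, making the probability of failure for one object 0.003. With FTT=2, three objects would have to fail before the data becomes unavailable: 0.003 × 0.003 × 0.003 = 0.000000027. Subtracting from 1 gives an availability of 0.999999973, or roughly 0.99999998.

But remember, that's based on each VM containing a single object. By taking the availability number (0.99999998) to the 8th power, representing 8 objects per VM, we get approximately 0.9999998.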
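The same steps for FTT=2 can be sketched as follows, again under this article's simplified model: independent failures, data unavailable only when three objects are down at once, and 8 objects per VM. The variable names are ours.

```python
object_availability = 0.997
p_unavailable = 1 - object_availability      # about 0.003

# Three simultaneous failures are needed to lose access with FTT=2.
ftt2_object = 1 - p_unavailable ** 3
ftt2_vm = ftt2_object ** 8                   # 8 objects per VM

print(f"{ftt2_object:.9f}")  # 0.999999973
print(f"{ftt2_vm:.9f}")      # 0.999999784
```

The unrounded per-object figure (0.999999973) rounds to the 0.99999998 quoted in the text, and the per-VM result still starts with six nines.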

So with vSAN, even when we go conservative and even when accounting for multiple objects per VM, we still get an availability of six nines when FTT is set to 2.