Hydra 1303



Calculating vSAN Availability: Part I

from the Hydra High Council Mar 24th 2020

VMware vSAN is software-defined storage that combines storage and compute. The storage devices are directly attached to each ESXi host, but unlike other direct-attached solutions, the capacity is distributed across a cluster, creating a shared datastore.

Beyond the cost savings of vSAN compared to Fibre Channel, customers often ask how the different levels of vSAN protection relate to availability (percentage of uptime).

The Math Behind the 9s

When designing a storage solution, it's important to know how much downtime you can sustain. Is being up 99% (also referred to as two 9s) of the time acceptable? It may seem so until you do the math. When calculated for a year, 99% uptime means about three and a half days of downtime. For most organizations, that's unacceptable. In comparison, availability at 99.9999% (six 9s) equates to about 30 seconds of downtime per year.

According to published marketing numbers, vSAN achieves six 9s when choosing an FTT of 2. (Don't worry if FTT is a new term, we'll cover that in a bit.) Instead of accepting this statistic as fact or rejecting it as simply marketing, the best way to know for sure is to do the underlying math yourself. You can then determine with precision how your choice of drives and selected level of vSAN protection translate to four 9s (99.99% uptime), five 9s (99.999% uptime), six 9s (99.9999% uptime), and so forth.

In this series, Calculating vSAN Availability, we'll dive into the details, explaining what everything means along the way. By the end, you will be able to plug in your own numbers, based on the vSAN drives you choose and the level of protection that meets your business requirements, to determine how your data center storage design matches up.

vSAN vs Traditional Storage

A vSAN differs from traditional SANs in a number of ways. A traditional SAN, such as a Fibre Channel array, is an external storage solution. With vSAN, the storage is local to the ESXi host and is tightly integrated with vSphere. It's also simpler to allocate and maintain. So much so that the tasks typically requiring a SAN administrator can be easily accomplished by the vSphere admin without her having to learn what a LUN is or how iSCSI works.

With vSAN, the physical storage on an ESXi host is pooled together with the physical storage on the other ESXi hosts and distributed. That common pool allows the administrator to carve out what is needed per VM or application. Once storage policies are defined, they are automatically applied to every newly created VM. Adding capacity is also a snap. By simply adding another ESXi server, your compute and storage automatically scale, and behind the scenes vSAN rebalances the distributed data automatically, improving performance. Rack it and you're done.

vSAN Failures to Tolerate (FTT)

Before getting into the math, it's important to understand the selectable levels of protection. vSAN defines the degree of protection as the FTT (Failures to Tolerate). The FTT number is simply how many disks or hosts can fail while still having all of your data available to access.

For example, if you stored your data on a single drive and that drive failed, the number of drive failures you can tolerate is zero. None! There is only one copy, one replica. Saying copy or replica may sound like there is an original plus one more, but that's not the case: the count includes the original. If you were to mirror your data, we would say that in total, there are two copies or two replicas.

  • FTT=0 means that there is only one copy, so we can't tolerate a single failure.
  • FTT=1 means that there are two copies. We could lose one and we still have access to the data.
  • FTT=2 means that there are three copies. We could lose two concurrently and still have access to the data.

So when you see an FTT value, it tells you two things:
Number of nodes that can fail = N
Number of copies = N+1

For example, take FTT=2 with mirroring.
How many nodes can fail allowing you to still access data? Answer: 2
How many copies of the data exist? Answer: 3
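The FTT-to-copies relationship above can be sketched in a couple of lines of Python (a minimal illustration with our own function names, not part of any vSAN API):

```python
def copies_for_ftt(ftt):
    """With mirroring, vSAN keeps FTT + 1 copies (replicas) of the data."""
    return ftt + 1

def failures_tolerated(copies):
    """Inverse: N copies let you lose N - 1 of them and still access the data."""
    return copies - 1

# FTT=2 with mirroring: 2 hosts can fail, and 3 copies exist.
print(copies_for_ftt(2))       # 3
print(failures_tolerated(3))   # 2
```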

Mean Time Between Failures (MTBF)

Does this mean that we can't calculate availability if we have FTT=0? No. It's still possible to calculate. It's based on the MTBF (Mean Time Between Failures) value of the drive storing the data. When purchasing a drive, you can find the MTBF number in the documentation. The MTBF is a statistical value indicating how long the device is likely to work before failing, expressed in hours.

For example, a drive with an MTBF of 62,000 hours translates to approximately 7 years.

62,000 (MTBF hours) / 24 (hours per day) = the number of days: 2,583
2,583 (days) / 365 (days in a year) = the number of years: 7.07

However, MTBF numbers for enterprise HDDs and SSDs are typically between 1,000,000 and 2,000,000 hours, which is a range of roughly 114 to 228 years. The overall goal here is to take into consideration the MTBF of the drive and the selected FTT level of protection, and calculate the uptime probability in terms of how many 9s.
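The hours-to-years conversion above is easy to wrap in a small helper. Here's a Python sketch (names are our own) using the same 24 x 365 arithmetic as the example:

```python
HOURS_PER_YEAR = 24 * 365  # 8,760

def mtbf_years(mtbf_hours):
    """Convert a drive's published MTBF (hours) to approximate years."""
    return mtbf_hours / HOURS_PER_YEAR

print(round(mtbf_years(62_000), 2))    # ~7 years, as in the example
print(round(mtbf_years(1_000_000)))   # ~114 years
print(round(mtbf_years(2_000_000)))   # ~228 years
```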

Five 9s, Six 9s... Before the Math, A Peek at the Answers

Two 9s (being up 99% of the time) is 3.69 days of downtime a year.
Three 9s (up 99.9% of the time) is 8.77 hours of downtime a year.
Four 9s (up 99.99% of the time) is 52.57 minutes of downtime a year.
Five 9s (up 99.999% of the time) is 5.26 minutes of downtime a year.
Six 9s (up 99.9999% of the time) is 31.54 seconds of downtime a year.

Breaking It Down: Plug in Downtime, Figure the 9s

To calculate these availability numbers, it's easiest to understand with an example. Let's say that you determine that you can accept 5 minutes of downtime per year.

Availability = total possible uptime / (total possible uptime + total acceptable downtime), or to say the same thing in shorthand:

"Availability is total UP divided by (total UP + total DOWN)," or
expressed even shorter in a formula:

A = U / (U + D)

Let's figure out each variable.
First, for U (total possible uptime), we need to figure out how many minutes there are total in a year:

365 (days) x 24 (hours) x 60 (minutes) = 525,600

Now we have a value for Total Possible Uptime: U = 525,600 minutes

Next, we need a number for total acceptable downtime. We said that for this example, we could tolerate 5 minutes of downtime per year. Total Acceptable Downtime: D = 5 minutes.

Those are all the variables we need to plug into the formula:

A = 525,600 / (525,600 + 5) = 525,600 / 525,605 = 0.999990...

Expressed as a percentage, that is 99.999% availability, or five 9s.
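The same plug-and-check works in a few lines of Python (a sketch; the variable and function names are ours):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def availability(downtime_minutes):
    """A = U / (U + D), where U is total possible uptime per year in minutes."""
    u = MINUTES_PER_YEAR
    return u / (u + downtime_minutes)

# 5 minutes of acceptable downtime per year:
print(f"{availability(5) * 100:.3f}%")  # 99.999%
```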

Breaking It Down: Plug in Desired 9s, Figure the Downtime

If you instead want to calculate how many minutes of downtime a given level of availability allows, we can rearrange that formula with some 9th-grade algebra.
A = U / (U + D), but we want to solve for D (downtime).

So first, we multiply both sides by (U + D):

A(U + D) = U

That can be expressed as:

AU + AD = U

Next, to get AD by itself, subtract AU from both sides:

AD = U - AU

Then finally, to get D by itself, divide both sides by A:

D = (U - AU) / A

And then to simplify so that we don't have to enter the value for U twice, we can factor out U using the distributive property (at least that's what I remember it being called from 9th grade) to get:

D = U(1 - A) / A

Checking the Math

Let's plug in some numbers to see if that shakes out.
If you wanted to figure out how many minutes of downtime for five 9s of availability, the variables are:
U= 525,600 (total possible uptime)
A= .99999 (availability)
D = amount of downtime (what we're trying to calculate)

Downtime = 525,600(1-.99999)/.99999
Downtime = 525,600(.00001)/.99999
Downtime = 5.256/.99999
Downtime = 5.256052560525605 minutes per year. Yep, it checks out.

Five 9s (up 99.999% of the time) is 5.26 minutes of downtime a year.

If you were to substitute the five 9s here with six 9s (.999999), you will get .5256 minutes. To change that to seconds, just multiply it by 60.
.5256 x 60 = 31.53603153603154 seconds. Yep, that checks out, too.
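Both checks can be automated with the rearranged formula. Here's a Python sketch (names are ours):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_per_year(availability):
    """D = U(1 - A) / A: downtime in minutes per year for availability A."""
    u = MINUTES_PER_YEAR
    return u * (1 - availability) / availability

print(downtime_per_year(0.99999))        # five 9s: ~5.256 minutes
print(downtime_per_year(0.999999) * 60)  # six 9s: ~31.536 seconds
```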

VMware Marketing Material: Can We Prove It?

When you look up vSAN availability, the marketing literature says that if you select an FTT of 2 (meaning that we will have 3 copies and can therefore lose 2 concurrently while still having access to the data), it equates to six 9s (99.9999%) of availability. If having data unavailable for 31.54 seconds per year is acceptable, you can design around this solution when using vSAN. We haven't yet gone into the details of FTT, so for now, just keep these numbers in mind and stay tuned for Calculating vSAN Availability: Part II, where we will check their math.

What's next?

Here's the to-do list for our upcoming articles in this series (spoiler: it involves more math!):

  • Incorporating the MTBF value of a drive into the equation
  • Incorporating MTBF values for other components that can fail, leading to the drive being inaccessible (rack failure, host failure, controller failure + taking into consideration a cache drive failure or a capacity drive failure)
  • Incorporating the number of stored objects for a VM
  • Calculating for FTT=0, FTT=1, FTT=2
  • Entering the MTBF of an example enterprise SSD you can buy today
  • Entering the MTBF of your storage device along with an FTT value to determine the availability