Thoughts on Availability Design

A few recent engagements have thrown up some interesting conversations around design and particularly availability design, which I thought it would be worthwhile capturing here.

First of all there are a couple of principles which should be considered:

  1. Any design is only as good as the requirements which it is built on. Taking time to properly capture and validate all of the requirements with the correct stakeholders pays dividends in the end.
  2. Any design decision should ultimately be based on a business need or requirement, rather than simply a showcase of product features or functionality, no matter how impressive they may be! Ask yourself: ‘why do I need to do this?’

In the VMware space in particular, clustering has become more accessible than it once was. For example, not so long ago stretched clusters were expensive and complex: they needed specialised storage solutions and more than likely involved multiple parts of the organisation, such as storage and networks as well as virtualisation. With vSAN this is no longer the case; stretched clusters can be set up easily with aggregated local storage. Further up the stack, vCenter can be protected through native clustering, where once Microsoft Failover Clustering might have been needed, which, again, was much more complicated to configure and support.

Does that mean that these technologies should now be considered a ‘default’ option? In my view, no. Although simpler than in the past, clustering inevitably involves some additional complexity, which needs to be weighed up against the advantages it brings (or the requirements it helps to meet).

One requirement of most clustering solutions that is often misunderstood (or at least not fully appreciated) is the need for a witness, which allows the surviving nodes to form a majority if a node is lost. The first question to ask is where this witness will be located; most often the answer is the primary or secondary site, as these are the only sites available.


Suppose the witness sits in site A alongside one of the cluster nodes. If site A fails, does everything seamlessly continue in site B? Nope: without the witness, the surviving nodes in site B cannot form a majority. Even taking mitigating steps such as placing the witness on a different host/cluster/rack (delete as appropriate) within the same site does not protect against the disaster scenario it might be intended to. It sounds simple but it is often not accounted for.
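The witness placement problem can be sketched with a simple vote count. This is a minimal illustration of generic majority voting, not a model of how vSAN or vCenter HA actually implement quorum; the function name, the site labels and the one-vote-per-member assumption are all hypothetical.

```python
def survivors_have_quorum(votes_by_site: dict[str, int], failed_site: str) -> bool:
    """Return True if the members outside `failed_site` still hold a strict majority.

    Assumes every node and the witness carry exactly one vote each
    (an assumption made for illustration only).
    """
    total_votes = sum(votes_by_site.values())
    surviving_votes = total_votes - votes_by_site.get(failed_site, 0)
    # A strict majority is required: exactly half of the votes is not enough.
    return surviving_votes * 2 > total_votes


# Two sites only: one node per site, witness co-located with the node in site A.
witness_in_site_a = {"A": 2, "B": 1}

# Three sites: one node in A, one node in B, witness in an independent site C.
witness_in_site_c = {"A": 1, "B": 1, "C": 1}

for label, layout in [("witness in A", witness_in_site_a),
                      ("witness in C", witness_in_site_c)]:
    for site in ("A", "B"):
        ok = survivors_have_quorum(layout, site)
        print(f"{label}: site {site} fails -> quorum retained: {ok}")
```

With the witness in site A, losing site B is survivable but losing site A is not, which is exactly the scenario that tends to be overlooked. Only a witness in an independent third location survives the loss of either site in this simplified model.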

What happens if someone accidentally messes up a cluster? If that cluster is stretched, are both sites potentially affected, since they form a single entity? What about common dependencies, such as vCenter? How do these scenarios weigh up against a non-clustered solution with redundant, non-dependent components? Do they outweigh other design goals, such as operational simplicity?

This is not to say that these technologies do not have their place, simply that all of the implications need to be considered. There is no such thing as a free lunch, right?!

If a group of applications requires multiple 9s of uptime, and this can’t be achieved at the application level, then a stretched cluster solution may well be appropriate for that use case. If a cloud portal requires provisioning or management activities to be available at all times, then vCenter HA may well contribute to that solution. Otherwise there are easier alternatives which, as long as they can be justified, will probably result in less pain in the long run!
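To make ‘multiple 9s’ concrete when capturing requirements, it can help to translate an availability target into the downtime it actually permits. The sketch below is plain arithmetic; the function name is hypothetical and a 365-day year is assumed.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a 365-day year


def allowed_downtime_minutes_per_year(nines: int) -> float:
    """Permitted downtime per year for an availability of N nines.

    For example, nines=3 means 99.9% availability, i.e. 0.1% unavailability.
    """
    if nines < 1:
        raise ValueError("nines must be a positive integer")
    unavailability = 10 ** (-nines)
    return unavailability * MINUTES_PER_YEAR


for n in range(2, 6):
    minutes = allowed_downtime_minutes_per_year(n)
    print(f"{n} nines: about {minutes:.1f} minutes of downtime per year")
```

Three nines allows roughly 8.8 hours of downtime a year, while five nines allows only a little over five minutes. Putting the requirement in these terms makes it easier to judge whether the extra complexity of a stretched cluster is justified by the business need.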

This entry was posted in Architecture.