HA Cluster: The State of Play

This blog post summarises the current state of my HA (High Availability) cluster project, and details my plans and the scope of the project.

The cluster has four types of node: web servers, which serve the content of this blog; load-balancers, which sit in front of the web servers and distribute requests between them; mail servers, which send and receive electronic mail for the cluster; and code repositories, which store any custom packages required for the cluster to build nodes. Currently the cluster has one of each of these node types. To remove all single points of failure, at least two of each node type will be required.


I am using Puppet for configuration management, with the Puppet configuration stored in a Git repository. This lets me boot DigitalOcean droplets from a single snapshot that, on boot, automatically pulls in and applies the Puppet configuration, so specialised nodes can be created automatically from that one snapshot. A pre-defined hostname-to-role mapping determines the configuration each node receives.
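The hostname-to-role mapping can be as simple as encoding the role in the hostname itself. A minimal sketch, assuming a hypothetical naming scheme of `<role><index>.example.com` and the four role names described above (the actual scheme and role names are placeholders, not the cluster's real ones):

```python
import re

# Hypothetical scheme: "web1.example.com", "lb2.example.com", etc.
ROLE_PATTERN = re.compile(r"^(web|lb|mail|repo)(\d+)\.")

def role_for_hostname(hostname):
    """Return the role implied by a node's hostname, or None if unrecognised."""
    match = ROLE_PATTERN.match(hostname)
    return match.group(1) if match else None

print(role_for_hostname("web1.example.com"))  # -> "web"
print(role_for_hostname("mystery.example.com"))  # -> None
```

On boot, the node would look up its own hostname, resolve it to a role, and apply the matching Puppet role class.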

The plan

Stage 1: No single point of failure

The requirements for stage one are that any one node can be down without any significant service disruption, and that the cluster can automatically recover to a state where it is permissible for a subsequent node to go down. There are four important things that will be needed in order to meet these requirements. Firstly, there will need to be at least two copies of each node type. Secondly, there will need to be a mechanism to allow automatic fail-over to the backup of each node type when a failure is detected. Thirdly, for node types that require it, there will need to be a mechanism for synchronising data between nodes: it should be possible to erase a droplet completely without losing any data or state. Finally, the backup node of each type needs to be able to restore the functionality of the primary node.
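The fail-over mechanism boils down to a health check plus a rule for promoting the backup. A minimal sketch, assuming each node exposes an HTTP health endpoint (the endpoint and the `pick_active` helper are illustrative, not part of the real cluster):

```python
import urllib.request

def node_is_healthy(url, timeout=2.0):
    """Probe a node's health endpoint; any HTTP error or timeout counts as down."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        return False

def pick_active(nodes, is_healthy):
    """Return the first healthy node in priority order, so the backup
    is promoted automatically when the primary fails the check."""
    for node in nodes:
        if is_healthy(node):
            return node
    return None  # every copy of this node type is down
```

In practice the promotion step would also move a floating IP or update DNS to point at the newly active node.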

To test that the requirements for this stage have been met, I will write a script to automatically destroy a node at random on a daily basis.
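The destruction script only needs to enumerate the cluster's droplets and pick a victim at random. A sketch of the selection logic, with a hard-coded inventory standing in for a real listing from the DigitalOcean API (node names and the commented-out `doctl` call are assumptions for illustration):

```python
import random

# Assumed inventory; a real script would list droplets via the provider's API.
NODES = ["web1", "web2", "lb1", "lb2", "mail1", "mail2", "repo1", "repo2"]

def choose_victim(nodes, rng=random):
    """Pick one node, uniformly at random, to destroy."""
    return rng.choice(nodes)

victim = choose_victim(NODES)
print(f"Would destroy: {victim}")
# The real script might then delete the droplet, e.g. via the provider's CLI:
# subprocess.run(["doctl", "compute", "droplet", "delete", victim, "--force"])
```

Run daily from cron, this gives continuous evidence that the stage-one recovery actually works, rather than a one-off test that silently rots.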

Stage 2: Recovery from a single node

The requirements for stage two are that the cluster can automatically recover from the simultaneous destruction of all but one node; it is not a requirement that there is no downtime in that scenario, but the recovery should be reasonably prompt. There are two primary capabilities that the system will need to meet these requirements. Firstly, every node will need to be able to monitor and restore the entire cluster. Secondly, as it will be possible for all copies of a given node type to be destroyed, rebuilt nodes will need to be able to restore data from remote backups.

Stage 3: The beachhead

The requirements for stage three are that the cluster can automatically recover from the destruction of every node, and can automatically rebuild itself in a new location, with a different hosting provider.

This will require an external node with a different hosting provider. This node would monitor the cluster and detect if it has been completely destroyed. In case of total destruction, it would then attempt to build a new cluster with the primary hosting provider, and failing that, would build a new cluster locally.
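The external node's decision logic is deliberately simple: only a total loss triggers it (anything less is stage-two territory), and it prefers the primary provider before falling back. A sketch with the probing and provisioning left abstract (the function names and the "primary"/"local" labels are illustrative):

```python
def cluster_destroyed(probe, nodes):
    """True only when no node responds at all; a single survivor means the
    stage-two recovery path applies instead, and this node stays out of it."""
    return not any(probe(node) for node in nodes)

def rebuild_location(primary_provider_available):
    """Prefer rebuilding with the primary hosting provider; if it is also
    unreachable, fall back to building the new cluster locally."""
    return "primary" if primary_provider_available else "local"
```

The probe would be a health check like the stage-one one, run from outside the primary provider's network so that a provider-wide outage is still visible.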

In future blog posts, I will go into further details about the constituent parts of the cluster.