Cluster
The cluster implementation of the Software Appliance uses WireGuard connections for all cluster communication. This means that the cluster nodes do not need to be physically located close to each other as long as they have good network connectivity. However, this also means that a node cannot distinguish between the failure of another node and an interrupted network connection to that node. To avoid cluster nodes operating independently and receiving different data sets (a so-called split-brain situation), the cluster nodes coordinate and stop operating if they do not belong to the majority of connected nodes. This ensures that only one data set can be updated at a time. After a temporary network outage, the disconnected nodes can easily synchronize their data with the majority data set and continue to operate.
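The following sketch illustrates this majority rule in simplified form (hypothetical Python, not the appliance's actual implementation; the has_quorum helper and node names are invented for illustration):

    # Illustrative only: a node may keep operating while the partition it can
    # reach (including itself) contains a strict majority of all cluster nodes.
    def has_quorum(reachable_nodes: set, all_nodes: set) -> bool:
        """Return True if this partition holds a strict majority of the cluster."""
        return len(reachable_nodes) > len(all_nodes) / 2

    cluster = {"node1", "node2", "node3"}

    # A partition of two out of three nodes keeps operating ...
    print(has_quorum({"node1", "node2"}, cluster))  # True
    # ... while the isolated third node stops to avoid a split-brain situation.
    print(has_quorum({"node3"}, cluster))           # False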
The options on the Software Appliance Cluster page allow you to add cluster nodes, monitor an existing cluster, and manage its nodes. You can find detailed information about the cluster members and their current status. In addition, an easy-to-use locking function prevents editing conflicts.
Definition of Availability
Availability is defined as the ability to keep the service running, with full data integrity, for the applications running on the Software Appliance.
Levels of Availability
Stand-alone instance
This is a basic single-node installation of the Software Appliance. In case of a node failure, a new Software Appliance needs to be reinstalled from a backup. All data between the time of the last backup and the failure will be lost. If no cold-standby (spare) Software Appliance is available, the time required to provision a new VM must be taken into account when calculating the acceptable downtime.
Hot standby with manual fail-over
In this configuration, two nodes are connected to form a cluster. The node installed first has a higher quorum vote weight than the second node.
If the second node fails, the first node continues to operate and the second node is set to maintenance mode. If the first node fails, the second node stops operating and is set to maintenance mode as well.
To bring the second node back into operation, manual interaction via the Software Appliance's administrative interface (WebConf) is required.
Manual intervention is also required to avoid data loss: the second node should only be forced into Primary if the first node has really failed and cannot be recovered.
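As a worked example (assuming the vote weight scheme described under High Availability below, Weight = 128 − NodeNumber; the figures are illustrative, not read from a running system), the asymmetry follows from the vote arithmetic:

    # Hypothetical two-node cluster: node 1 weighs 127, node 2 weighs 126, so the
    # total vote weight is 253 and a strict majority requires more than 126.5.
    weights = {1: 128 - 1, 2: 128 - 2}    # {1: 127, 2: 126}
    majority = sum(weights.values()) / 2  # 126.5

    print(weights[1] > majority)  # True  -> node 1 alone keeps operating
    print(weights[2] > majority)  # False -> node 2 alone stops operating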
High Availability with automatic fail-over
This is a setup with three or more nodes. If one node fails, the remaining nodes can still form a cluster by a majority quorum vote and continue operation. If the failed Software Appliance is still switched on, it is set to maintenance mode.
To ensure that quorum votes never result in a tie, each node is assigned a unique quorum vote weight based on its node number (Weight = 128 − NodeNumber).
In a setup where an even number of nodes N is split equally between two sites, the site that should remain active when connectivity between the sites fails must have a larger sum of quorum vote weights than the other site.
Since cluster nodes with lower node numbers have a higher weighting, you should deploy nodes 1 to N/2 at the primary site.
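To make the arithmetic concrete, here is a small sketch (a hypothetical four-node, two-site layout; the quorum_weight helper is illustrative and not part of the product) that applies the documented formula:

    # Weight = 128 - NodeNumber, as documented above.
    def quorum_weight(node_number: int) -> int:
        return 128 - node_number

    # Hypothetical layout: N = 4 nodes split equally between two sites,
    # with nodes 1 to N/2 deployed at the primary site.
    primary_site = [1, 2]
    secondary_site = [3, 4]

    primary_weight = sum(quorum_weight(n) for n in primary_site)      # 127 + 126 = 253
    secondary_weight = sum(quorum_weight(n) for n in secondary_site)  # 125 + 124 = 249

    # If the link between the sites fails, the primary site holds the larger share
    # of the total vote weight and therefore keeps operating.
    print(primary_weight > secondary_weight)  # True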