High Availability and Clustering

For any PKI with availability requirements, whether on the CA, VA, or RA end, some form of redundancy needs to be factored in to ensure that the failure of a single instance does not result in downtime. While redundancy for VAs and RAs is usually achieved using multiple standalone instances, redundancy for CAs is mainly achieved using some kind of database clustering.

Clustering in EJBCA is defined as multiple independent EJBCA nodes accessing the same (highly available) database. EJBCA stores all operational data in the database, so the database is the central availability component. The EJBCA nodes themselves are replaceable and only need appropriate TLS certificates so that the web resources on the nodes can be accessed.

When load balancing is used in front of multiple EJBCA nodes, it is recommended to use sticky web connections, as web sessions to the Admin UI and RA UI may experience issues when switched between independent EJBCA nodes.
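As a rough illustration of what sticky routing means, the following Python sketch maps a session identifier to a node deterministically, so that repeated requests carrying the same identifier land on the same EJBCA node. The node names and the choice to key on a session identifier are assumptions made for illustration. Real load balancers usually offer persistence (for example cookie-based or source-IP-based) as a built-in setting, and that should be preferred over custom code.

```python
import hashlib

# Hypothetical node names, for illustration only.
NODES = ["ejbca-node-1", "ejbca-node-2", "ejbca-node-3"]


def pick_node(session_id: str, nodes: list[str]) -> str:
    """Return the same node every time for the same session identifier."""
    if not nodes:
        raise ValueError("no nodes available")
    # Hash the identifier so sessions spread evenly over the nodes.
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(nodes)
    return nodes[index]
```

Note that this simple mapping changes when the node list changes, so sessions may move to another node after a node is added or removed.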

Using a Disaster Recovery Site With Manual or Automatic Failover

This configuration requires only two datacenters and no load balancer, though it usually requires a manual failover and provides only a single layer of redundancy.

In this configuration:

All traffic is directed to the primary site during normal operation, while the disaster recovery site is online but idle.
Failover is typically handled manually: traffic is redirected to the disaster recovery site if the primary site fails.
The two sites are geographically separated.
The HSMs on each site are functionally identical, either operating in a cluster or manually kept in sync by creating a backup from the primary site and restoring it on the disaster recovery site.
The databases on each site are connected and configured as one primary and one replica, meaning that a write on the primary site is replicated to the disaster recovery site. The synchronization between the sites can be either synchronous or asynchronous. PrimeKey recommends synchronous replication between the sites to prevent data loss after a failover to the disaster recovery site.
The failover can be manual or automatic depending on the database cluster technology being used. An automatic failover or an alarm can be triggered using EJBCA Healthcheck.
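The failover decision described above can be sketched as a small monitor that switches to the disaster recovery site after several consecutive failed health probes. This is a minimal sketch under stated assumptions: the `probe` callable, the site labels, and the threshold of three are hypothetical, and the actual redirection of traffic (DNS change, routing change, or database promotion) is outside its scope.

```python
from typing import Callable


class FailoverMonitor:
    """Track consecutive probe failures and decide which site is active."""

    def __init__(self, probe: Callable[[], bool], threshold: int = 3):
        self.probe = probe          # returns True when the primary site is healthy
        self.threshold = threshold  # consecutive failures before failing over
        self.failures = 0
        self.active_site = "primary"

    def tick(self) -> str:
        """Run one probe cycle and return the currently active site."""
        if self.active_site == "primary":
            if self.probe():
                self.failures = 0
            else:
                self.failures += 1
                if self.failures >= self.threshold:
                    # Here a real setup would raise an alarm or trigger failover.
                    self.active_site = "disaster-recovery"
        # No automatic failback: returning to the primary is left to an operator.
        return self.active_site
```

Requiring several consecutive failures avoids failing over on a single transient error. Failback is deliberately manual in this sketch, since the primary database usually has to be resynchronized first.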

Using Multiple Sites With Load Balancing

A more advanced but more versatile solution is to perform load balancing between multiple sites that all are active at the same time. This can provide multiple layers of redundancy and performance benefits.

In this configuration:

The database instance used by each EJBCA node accepts both reads and writes as part of the cluster. A common database cluster technology used with EJBCA is Galera. If Galera is used, PrimeKey recommends three database nodes, where each node is located in its own datacenter.
The sites are geographically separated.
A load balancer in front of the database cluster can perform load balancing (e.g. round-robin) between the nodes. If one node fails, the load balancer can automatically evict traffic from the node using EJBCA Healthcheck.
The HSMs on each site are functionally identical, either operating in a cluster or manually kept in sync by creating a backup from one site and restoring it on the other site.
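The round-robin behavior with health-based eviction can be illustrated as follows. This is a conceptual sketch, not how any particular load balancer is implemented: the node names are hypothetical, and `is_healthy` stands in for whatever health check the load balancer performs.

```python
from typing import Callable


class RoundRobinBalancer:
    """Rotate over nodes in order, skipping any node reported as unhealthy."""

    def __init__(self, nodes: list[str], is_healthy: Callable[[str], bool]):
        self.nodes = list(nodes)
        self.is_healthy = is_healthy
        self.next_index = 0

    def next_node(self) -> str:
        """Return the next healthy node, or raise if none is available."""
        # Try each node at most once per call.
        for _ in range(len(self.nodes)):
            node = self.nodes[self.next_index]
            self.next_index = (self.next_index + 1) % len(self.nodes)
            if self.is_healthy(node):
                return node
        raise RuntimeError("no healthy nodes available")
```

Because health is evaluated on every call, a node that recovers is picked up again automatically on the next rotation.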

For a solution that is straightforward to set up, the EJBCA Hardware Appliance comes with Galera clustering built-in and is easy to configure.

Validation and Registration Authorities Using Peer Systems

This is an EJBCA Enterprise feature.

Redundant validation and registration authorities are typically set up using multiple standalone instances operating in parallel. By spreading out validation authorities over multiple geographical locations, very high availability and performance can be achieved, which is often crucial for OCSP responders. PrimeKey recommends the use of multiple validation and registration authorities, each connected to a single CA cluster using peer connectors.

In this configuration:

Each VA or RA is connected to a standalone database.
The load is spread across the VA or RA instances using a load balancer, DNS and/or anycast.
RA requests are fetched from the RA instance and processed by the CA.
Revocation information is published from the CA to each VA individually using peer systems.
A monitoring system can automatically evict traffic from faulty VA or RA instances using EJBCA Healthcheck.
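A monitoring system could probe each VA or RA instance along the following lines. The health check path and the `ALLOK` reply used here are assumed to be the defaults and should be verified against the configuration of the deployment in question. The function names are hypothetical.

```python
import http.client
import urllib.error
import urllib.request

# Assumed default path and success text; verify these for your deployment.
HEALTH_PATH = "/ejbca/publicweb/healthcheck/ejbcahealth"
OK_TEXT = "ALLOK"


def is_ok(status: int, body: str) -> bool:
    """Interpret a health check reply: HTTP 200 with the expected text."""
    return status == 200 and body.strip() == OK_TEXT


def probe(base_url: str, timeout: float = 5.0) -> bool:
    """Return True only if the instance answers the health check successfully."""
    url = base_url.rstrip("/") + HEALTH_PATH
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return is_ok(resp.status, body)
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Error statuses, timeouts, refused connections and malformed URLs
        # are all treated as "unhealthy".
        return False
```

Treating every error as unhealthy is the conservative choice here: an instance that cannot be reached should not receive traffic, whatever the cause.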

For very large installations, clustering may be used between VAs located on the same site. EJBCA can then publish to only a single VA on each site using a Multi Group Publisher to minimize the amount of outbound traffic.
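The traffic saving can be made concrete with a short calculation. The sketch below only counts outbound publishing operations per revocation update. The site layout is invented for illustration, and the function does not model how the Multi Group Publisher selects a VA.

```python
def outbound_publish_count(sites: dict[str, list[str]], one_per_site: bool) -> int:
    """Count publish operations the CA sends out for a single update."""
    if one_per_site:
        # One VA per non-empty site; the site's own cluster spreads the data.
        return sum(1 for vas in sites.values() if vas)
    # Otherwise every VA is published to individually.
    return sum(len(vas) for vas in sites.values())


# Hypothetical layout: two sites with three and two VAs respectively.
SITES = {"site-a": ["va-1", "va-2", "va-3"], "site-b": ["va-4", "va-5"]}
```

With this layout, publishing individually takes five outbound operations per update, while publishing to one VA per site takes two.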