6.2. Managing a high availability PSM cluster

High availability (HA) clusters can stretch across long distances, such as nodes across buildings, cities or even continents. The goal of HA clusters is to support enterprise business continuity by providing location-independent failover and recovery.

To set up a high availability cluster, connect two PSM units with identical configurations in high availability mode. This creates a master-slave (active-backup) node pair. Should the master node stop functioning, the slave node takes over the IP addresses of the master node's interfaces. Gratuitous ARP requests are sent to inform hosts on the local network that the MAC addresses behind these IP addresses have changed.
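To make the failover announcement more concrete, the following Python sketch builds the bytes of a gratuitous ARP request, the kind of frame mentioned above. It only constructs the frame and does not send anything. The helper name and the example MAC and IP addresses are illustrative assumptions; PSM performs this step internally and its actual implementation is not documented here.

```python
import socket
import struct

BROADCAST_MAC = b"\xff" * 6
ZERO_MAC = b"\x00" * 6


def build_gratuitous_arp(mac: str, ip: str) -> bytes:
    """Return an Ethernet frame carrying a gratuitous ARP request.

    Hypothetical helper for illustration only; not part of PSM.
    """
    sender_mac = bytes.fromhex(mac.replace(":", ""))
    sender_ip = socket.inet_aton(ip)
    # Ethernet header: broadcast destination, source MAC, EtherType 0x0806 (ARP).
    ethernet = BROADCAST_MAC + sender_mac + struct.pack("!H", 0x0806)
    # ARP header: hardware type 1 (Ethernet), protocol 0x0800 (IPv4),
    # address lengths 6 and 4, operation 1 (request).
    arp_header = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)
    # In a gratuitous ARP, the sender IP and the target IP are the same address.
    arp_body = sender_mac + sender_ip + ZERO_MAC + sender_ip
    return ethernet + arp_header + arp_body


# Example values (assumed): the new active node announces a taken-over address.
frame = build_gratuitous_arp("00:11:22:33:44:55", "192.0.2.10")
print(len(frame))  # 42 bytes: 14-byte Ethernet header + 28-byte ARP payload
```

Hosts that receive such a frame update their ARP cache so that the IP address maps to the MAC address of the node that is now active.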

The master node shares all data with the slave node using the HA network interface (labeled as 4 or HA on the PSM appliance). The disks of the master and the slave node must be synchronized for the HA support to operate correctly. Interrupting the connection between running nodes (unplugging the Ethernet cables, rebooting a switch or a router between the nodes, or disabling the HA interface) disables data synchronization and forces the slave to become active. This might result in data loss. You can find instructions to resolve such problems and recover a PSM cluster in Section 23.7, Troubleshooting a PSM cluster.

Note

HA functionality was designed for physical PSM units. If PSM is used in a virtual environment, use the fallback functionalities provided by the virtualization service instead.

The Basic Settings > High Availability page provides information about the status of the HA cluster and its nodes.

Figure 6.4. Basic Settings > High Availability — Managing a high availability cluster


The following information is available about the cluster:

  • Status: Indicates whether the PSM nodes recognize each other properly and whether they are configured to operate in high availability mode. For details, see Section 23.7.1, Understanding PSM cluster statuses.

  • Current master: The MAC address of the high availability interface (4 or HA) of the primary node. This address is also printed on a label on the top cover of the PSM unit.

  • HA UUID: A unique identifier of the HA cluster. Only available in High Availability mode.

  • DRBD status: The status of data synchronization between the nodes. For details, see Section 23.7.1, Understanding PSM cluster statuses.

  • DRBD sync rate limit: The maximum allowed synchronization speed between the primary and the secondary node. For details, see Section 6.2.1, Adjusting the synchronization speed.

The active (master) PSM node is labeled This node. This unit inspects the SSH traffic and provides the web interface. The PSM unit labeled Other node is the slave node, which is activated if the master node becomes unavailable.

The following information is available about each node:

  • Node ID: The MAC address of the HA interface of the node. This address is also printed on a label on the top cover of the PSM unit.

    For PSM clusters, the IDs of both nodes are included in the internal log messages of PSM. Note that if the central log server is a syslog-ng server, the keep-hostname option should be enabled on the syslog-ng server.

  • Node HA state: Indicates whether the PSM nodes recognize each other properly and whether they are configured to operate in high availability mode. For details, see Section 23.7.1, Understanding PSM cluster statuses.

  • Node HA UUID: A unique identifier of the cluster. Only available in High Availability mode.

  • DRBD status: The status of data synchronization between the nodes. For details, see Section 23.7.1, Understanding PSM cluster statuses.

  • RAID status: The status of the RAID device of the node. If it is not Optimal, there is a problem with the RAID device. For details, see Section 23.8, Understanding PSM RAID status.

  • Boot firmware version: Version number of the boot firmware.

    The boot firmware boots up PSM, provides high availability support, and starts the core firmware. The core firmware, in turn, handles everything else: provides the web interface, manages the connections, and so on.

  • HA link speed: The maximum allowed speed between the master and the slave node. The HA link's speed must exceed the DRBD sync rate limit; otherwise, the web UI might become unresponsive and data loss can occur.

  • Interfaces for Heartbeat: Virtual interface used only to detect that the other node is still available. This interface is not used to synchronize data between the nodes (only heartbeat messages are transferred).

    You can find more information about configuring redundant heartbeat interfaces in Procedure 6.2.2, Redundant heartbeat interfaces.

  • Next hop monitoring: IP addresses (usually next hop routers) to continuously monitor from both the master and the slave nodes using ICMP echo (ping) messages. If any of the monitored addresses becomes unreachable from the master node while being reachable from the slave node (in other words, more monitored addresses are accessible from the slave node), then it is assumed that the master node is unreachable and a forced takeover occurs – even if the master node is otherwise functional. For details, see Procedure 6.2.3, Next-hop router monitoring.
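The next-hop monitoring rule can be sketched in Python as follows. This is only one reading of the rule as described above (a takeover is forced when more monitored addresses are reachable from the slave node than from the master node). The function name and the example addresses are assumptions made for illustration, not PSM internals.

```python
def should_force_takeover(master_reachable: set, slave_reachable: set) -> bool:
    """Return True if next-hop monitoring would force a takeover.

    Each argument is the set of monitored addresses that answered
    ICMP echo requests from that node. Hypothetical helper; the real
    decision logic of PSM may differ in detail.
    """
    # Takeover only if the slave node reaches more monitored addresses.
    return len(slave_reachable) > len(master_reachable)


# Example with two monitored routers (documentation-range addresses).
routers = {"192.0.2.1", "198.51.100.1"}

# The master lost one router that the slave still reaches: takeover.
print(should_force_takeover({"192.0.2.1"}, routers))  # True

# Both nodes see the same routers: no takeover.
print(should_force_takeover(routers, routers))  # False

# A router is down for both nodes: no takeover, since the slave is no better off.
print(should_force_takeover({"192.0.2.1"}, {"192.0.2.1"}))  # False
```

Comparing the two nodes rather than looking at the master alone means that an outage affecting both nodes equally does not trigger a takeover.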

The following configuration and management options are available for HA clusters:

  • Set up a high availability cluster: You can find detailed instructions for setting up an HA cluster in Procedure 3.2, Installing two PSM units in HA mode in The Balabit’s Privileged Session Management, Shell Control Box 5 F3 Installation Guide.

  • Adjust the DRBD (master-slave) synchronization speed: You can change the limit of the DRBD synchronization rate. Note that this does not change the speed of normal data replication. For details, see Section 6.2.1, Adjusting the synchronization speed.

  • Configure redundant heartbeat interfaces: You can configure virtual interfaces for each HA node to monitor the availability of the other node. For details, see Procedure 6.2.2, Redundant heartbeat interfaces.

  • Configure next-hop monitoring: You can provide IP addresses (usually next hop routers) to continuously monitor from both the master and the slave nodes using ICMP echo (ping) messages. If any of the monitored addresses becomes unreachable from the master node while being reachable from the slave node (in other words, more monitored addresses are accessible from the slave node), then it is assumed that the master node is unreachable and a forced takeover occurs – even if the master node is otherwise functional. For details, see Procedure 6.2.3, Next-hop router monitoring.

  • Reboot the HA cluster: To reboot both nodes, click Reboot Cluster. To prevent takeover, a token is placed on the slave node. While this token persists, the slave node halts its boot process to make sure that the master node boots first. Following reboot, the master removes this token from the slave node, allowing it to continue with the boot process.

    If the token still persists on the slave node following reboot, the Unblock Slave Node button is displayed. Clicking the button removes the token, and reboots the slave node.

  • Reboot a node: Reboots the selected node.

    When rebooting the nodes of a cluster, reboot the other (slave) node first to avoid unnecessary takeovers.

  • Shut down a node: Forces the selected node to shut down.

    When shutting down the nodes of a cluster, shut down the other (slave) node first. When powering on the nodes, start the master node first to avoid unnecessary takeovers.

  • Manual takeover: To activate the other node and disable the currently active node, click Activate slave.

    Activating the slave node terminates all connections of PSM and might result in data loss. The slave node becomes active after about 60 seconds, during which the protected servers cannot be accessed.
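The boot-ordering token used by Reboot Cluster can be modeled with a short Python sketch. It is a simplified illustration of the sequence described above, not the actual PSM mechanism. The class, attribute, and function names are assumptions.

```python
from dataclasses import dataclass


@dataclass
class SlaveNode:
    # Models the token placed on the slave node (hypothetical attribute name).
    boot_blocked: bool = False

    def try_boot(self) -> str:
        # While the token persists, the slave halts its boot process.
        return "waiting" if self.boot_blocked else "booted"


def reboot_cluster(slave: SlaveNode) -> list:
    """Return the ordered events of a simulated cluster reboot."""
    events = []
    slave.boot_blocked = True
    events.append("token placed on slave")
    events.append(f"slave: {slave.try_boot()}")  # slave halts while blocked
    events.append("master: booted")              # master boots first
    slave.boot_blocked = False                   # master removes the token
    events.append("token removed by master")
    events.append(f"slave: {slave.try_boot()}")  # slave continues booting
    return events


for event in reboot_cluster(SlaveNode()):
    print(event)
```

In this model, a token that is never removed leaves the slave node in the waiting state. This corresponds to the situation in which the Unblock Slave Node button is displayed.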