23.7.3. Recovering from a split brain situation

A split brain situation is caused by a temporary failure of the network link between the cluster nodes, resulting in both nodes switching to the active (master) role while disconnected. This might cause new data (for example, audit trails) to be created on both nodes without being replicated to the other node. Thus, it is likely in this situation that two diverging sets of data have been created, which cannot be trivially merged.

Warning

Hazard of data loss! In a split brain situation, valuable audit trails might be available on both PSM nodes, so special care must be taken to avoid data loss.

The nodes of the PSM cluster automatically recognize the split brain situation once the connection between the nodes is reestablished, and do not perform any data synchronization to prevent data loss. When a split brain situation is detected, it is visible on the PSM system monitor, in the system logs (Split-Brain detected, dropping connection!), on the Basic Settings > High Availability page, and PSM sends an alert as well.

Once the network connection between the nodes has been re-established, one of the nodes becomes the active (master) node, while the other one remains passive (the slave node). This means that one node provides services similar to normal operation, while the other is kept passive to avoid network interference. Note that there is no synchronization between the nodes at this stage.

To recover a PSM cluster from a split brain situation, complete the following steps.

Warning

Do NOT shut down the nodes.

Data recovery

Purpose: 

In the procedure described here, data will be saved from the host currently acting as the slave host. This is required because data on this host will later be overwritten by the data available on the current master.

Note

During data recovery, there will be no service provided by PSM.

Steps: 

  1. Log in to the master node. If the Console menu does not appear after login, this is the slave node. Try the other node.

  2. Select Shells > Boot Shell.

  3. Enter /usr/share/heartbeat/hb_standby. This will change the current slave node to master and the current master node to slave (HA failover).

  4. Exit the console.

  5. Wait a few seconds for the HA failover to complete.

  6. Log in on the other host. If the Console menu does not appear, the HA failover has not completed yet. Wait a few seconds and try logging in again.

  7. Select Shells > Core Shell.

  8. Issue the systemctl stop zorp-core.service command to disable all traffic going through PSM.

  9. Save the files from /var/lib/zorp/audit that you want to keep. Use scp or rsync to copy data to your remote host.

    Tip

    To find the files modified in the last n*24 hours, use find . -mtime -n.

    To find the files modified in the last n minutes, use find . -mmin -n.
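For example, the tips above can be combined with -type f to list only regular files (the working directory is the audit directory from the previous step):

```shell
# From /var/lib/zorp/audit: list regular files modified in the
# last 2*24 hours.
find . -mtime -2 -type f

# List regular files modified in the last 90 minutes.
find . -mmin -90 -type f
```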

  10. Enter:

    pg_dump -U scb -f /root/database.sql

    Back up the /root/database.sql file.
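After copying the dump off the node, it is worth verifying that the copy is intact. The following is a self-contained sketch of that check using a stand-in file; in the real procedure the file would be /root/database.sql from the step above, and the "transfer" would be your scp or rsync copy:

```shell
# Stand-in for /root/database.sql from the previous step.
workdir=$(mktemp -d)
printf 'dump contents\n' > "$workdir/database.sql"

# Record the checksum before copying the file off the node.
before=$(sha256sum "$workdir/database.sql" | awk '{print $1}')

# Stand-in for the scp/rsync transfer: a local copy.
cp "$workdir/database.sql" "$workdir/received.sql"

# Recompute on the receiving side and compare.
after=$(sha256sum "$workdir/received.sql" | awk '{print $1}')
[ "$before" = "$after" ] && echo "backup verified"
```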

  11. Exit the console.

  12. Log in again, and select Shells > Boot Shell.

  13. Enter /usr/share/heartbeat/hb_standby. This will change the current slave node to master and the current master node to slave (HA failover).

  14. Exit the console.

  15. Wait a few minutes for the failover to complete, so that the node you were using becomes the slave node and the other node becomes the master node.

    The nodes are still in a split-brain state but now you have all the data backed up from the slave node, and you can synchronize the data from the master node to the slave node, which will turn the HA state from "Split-brain" to "HA". For details on how to do that, see Section HA state recovery.
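The shell portion of the data-recovery procedure above can be summarized as the following annotated sequence. This is a recap of the steps in this section, not a script to run unattended; the backup destination shown with rsync is a placeholder, not part of the original procedure.

```
# Boot Shell on the current master: trigger HA failover.
/usr/share/heartbeat/hb_standby

# Core Shell on the other node (now the master): stop traffic
# going through PSM.
systemctl stop zorp-core.service

# Save the audit trails you want to keep (placeholder destination).
rsync -av /var/lib/zorp/audit/ admin@backup.example.com:/backups/audit/

# Dump the database, then copy /root/database.sql off the node too.
pg_dump -U scb -f /root/database.sql

# Boot Shell: fail over again so the nodes return to their
# original roles.
/usr/share/heartbeat/hb_standby
```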

HA state recovery

Purpose: 

In the procedure described here, the "Split-brain" state will be turned to the "HA" state. Keep in mind that the data on the current master node will be copied to the current slave node and data that is available only on the slave node will be lost (as that data will be overwritten).

Steps — Swapping the nodes (optional): 

Note

If you completed the procedure described in Section Data recovery, you do not have to swap the nodes. You can proceed to the steps about data synchronization.

If you want to swap the two nodes to make the master node the slave node and the slave node the master node, perform the following steps:

  1. Log in to the master node. If the Console menu does not appear after login, this is the slave node. Try the other node.

  2. Select Shells > Boot Shell.

  3. Enter /usr/share/heartbeat/hb_standby. This will output:

    Going standby [all]
  4. Exit the console.

  5. Wait a few minutes for the failover to complete, so that the node you were using becomes the slave node and the other node becomes the master node.

Steps — Initializing data synchronization: 

To initialize data synchronization, complete the following steps:

  1. Log in to the slave node. If the Console menu appears, this is the master node. Try logging in to the other node.

  2. Enter the following commands. They make the slave node discard the data that is available only on this node.

    drbdadm secondary r0
    drbdadm connect --discard-my-data r0
  3. Log out of the slave node.

  4. Log in to the master node.

  5. Select Shells > Boot Shell.

  6. Enter:

    drbdadm connect r0
  7. Exit the console.

  8. Check the High Availability state on the web interface of PSM, in the Basic Settings > High Availability > Status field. During synchronization, the status will say Degraded Sync, and after the synchronization completes, it will say HA.
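Taken together, the synchronization commands above form the following sequence; this is a recap of the steps in this section, not an alternative procedure. The --discard-my-data flag tells DRBD that this node's diverging data should be thrown away in favor of the peer's copy.

```
# On the slave node: demote the DRBD resource and reconnect,
# discarding the data that exists only on this node.
drbdadm secondary r0
drbdadm connect --discard-my-data r0

# On the master node (Boot Shell): reconnect to start
# synchronizing data to the slave.
drbdadm connect r0
```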