23.7.3. Procedure – Recovering from a split brain situation

Purpose: 

A split brain situation is caused by a temporary failure of the network link between the cluster nodes, resulting in both nodes switching to the active (master) role while disconnected. This might cause new data (for example, audit trails) to be created on both nodes without being replicated to the other node. It is therefore likely that two diverging sets of data have been created, which cannot be trivially merged.

Warning

Hazard of data loss! In a split brain situation, valuable audit trails might be available on both PSM nodes, so special care must be taken to avoid data loss.

The nodes of the PSM cluster automatically recognize the split brain situation once the connection between the nodes is reestablished, and, to prevent data loss, do not perform any data synchronization. When a split brain situation is detected, it is indicated on the PSM system monitor and in the system logs (Split-Brain detected, dropping connection!), and PSM also sends an alert.
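
The split brain state can also be confirmed from the command line. The following is a minimal sketch, assuming that the standard DRBD tools are available from the Console menu's Core Shell and that the replicated DRBD resource is named r0 (the same resource name used in the recovery commands below):

  drbdadm cstate r0
  cat /proc/drbd

In a split brain situation, the connection state of the resource is typically reported as StandAlone (or WFConnection) instead of Connected.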

To recover a PSM cluster from a split brain situation, complete the following steps.

Warning

Do NOT shut down the nodes.

Steps: 

  1. Temporarily disable all traffic going through PSM. Navigate to Basic Settings > System > Traffic control and click Stop in the All services field.

    If the web interface is not accessible or unstable, complete the following steps:

    1. Log in to PSM as root locally (or remotely using SSH) to access the Console menu.

    2. Select Shells > Core Shell, and issue the systemctl stop zorp-core.service command.

    3. Issue the date command and check the system date and time. If it is incorrect (for example, it displays January 2000), replace the system battery. For details, see the hardware manual of the appliance.

    4. Repeat the above steps on the other PSM node.
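
    As a minimal sketch, the console part of this step amounts to the following commands, issued on each node (the zorp-core.service name is taken from the sub-steps above):

      systemctl stop zorp-core.service
      date

    If date prints an incorrect value, replace the system battery as described in the hardware manual before continuing.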

  2. Optional step for data recovery: Check the audit trails saved on the PSM nodes.

    1. Log in to the node from a local console.

    2. Select Shells > Core Shell and enter cd /var/lib/zorp/audit. The audit trails are located under this directory.

    3. Find which files were modified since the split brain situation occurred. Use the find . -mtime -n command to find the files modified during the last n*24 hours, or the find . -mmin -n command to find the files modified during the last n minutes.
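
    For example, to list the audit trail files modified in the last two days (a sketch, assuming the split brain situation occurred within that period):

      cd /var/lib/zorp/audit
      find . -type f -mtime -2

    To list the files modified in the last 90 minutes instead, use find . -type f -mmin -90.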

  3. Decide which node should be the master node from now on, then perform the following steps on the to-be-slave node:

    1. Log in to the to-be-slave node from a local console.

    2. Optional step for data recovery: Select Shells > Core Shell and enter cd /var/lib/zorp/audit. The audit trails are located under this directory.

    3. Optional step for data recovery: Back up the audit trails that were modified since the split brain situation occurred (see the example after these sub-steps).

      Warning

      This data will be deleted from the PSM node when the split-brain situation is resolved. There is no way to import this data back into the database of PSM; it will be available only for offline use.

    4. Optional step for data recovery: To save the corresponding information that can be seen on the Search page, export the connection database using the pg_dump -U scb -f /root/database.sql command, then back up the /root/database.sql file.

      After you have backed up the /root/database.sql file, delete it from PSM using the rm -fv /root/database.sql command.

    5. Optional step for data recovery: Type exit to return to the console menu.

    6. Select Shells > Boot shell. If the to-be-slave node is not already the slave node, fail over the cluster to the other node manually by issuing the /usr/share/heartbeat/hb_standby command.

    7. Stop the core firmware. Issue the systemctl stop boot-xcb.service command.

    8. Invalidate the DRBD. Issue the following commands (the ssh scb-other command logs in to the other node, that is, the current master node, where the final drbdadm connect r0 command must be issued):

      drbdadm secondary r0

      drbdadm connect --discard-my-data r0

      ssh scb-other

      drbdadm connect r0
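
    As an example for the optional data recovery sub-steps above, the modified audit trails and the exported connection database could be collected and copied to another host before they are discarded (a sketch; the archive name split-brain-backup.tar.gz, the admin user name, and the backup.example.com host are placeholders for your own backup destination):

      cd /var/lib/zorp/audit
      # Archive the audit trails modified in the last two days (adjust -mtime as needed)
      find . -type f -mtime -2 | tar czf /root/split-brain-backup.tar.gz -T -
      # Copy the archive and the database dump off the node, then remove the local archive
      scp /root/split-brain-backup.tar.gz /root/database.sql admin@backup.example.com:
      rm -fv /root/split-brain-backup.tar.gz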

  4. Now the cluster should be in Degraded Sync state, the master being SyncSource and the slave being SyncTarget. The master node should start synchronizing its data to the slave node. Depending on the amount of data, this can take a long time. To adjust the speed of the synchronization, see Section 6.2.1, Adjusting the synchronization speed.
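
    For example, the progress of the synchronization can be followed from the Core Shell of either node (a sketch; the exact output format depends on the DRBD version installed on the appliance):

      cat /proc/drbd
      drbdadm cstate r0

    While the nodes are synchronizing, /proc/drbd typically shows cs:SyncSource on the master and cs:SyncTarget on the slave, together with a progress percentage. Once the synchronization has finished, both nodes report cs:Connected and ds:UpToDate/UpToDate.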