Total cluster failure in the lab

Taledo

Active Member
Nov 20, 2020
Hey all,
I'm currently testing Ceph and Proxmox HA in a multi-datacenter configuration (linked by dark fiber, so sub-millisecond latency). I tried to recreate the worst possible scenario: what if all nodes lose the corosync layer? This should never happen, but that's what the lab is for.

I'm using Proxmox 8.2.2 and the latest Ceph version.

Upon removing the corosync link, all nodes rebooted as expected and were left in a no-quorum state. After I restored the corosync link, one of my nodes flat-out refused to reconnect to the cluster. That's not an issue in itself, as I can remove it and rejoin it to the cluster. However, it led me to discover another issue: one of the two remaining nodes is flooding the other with corosync packets, causing packet loss and BIG cluster instability. (I'm seeing about 16000 packets in a 10-second tcpdump session, so roughly 1600 packets per second.)
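
In case anyone wants to reproduce the count, here is a rough sketch of how the capture could be tallied per sender. It is not the exact method I used. It assumes the capture was taken with `tcpdump -nn` (plain text, numeric addresses), that the traffic is IPv4, and that corosync is on UDP port 5405, which is only the usual default and may differ on your setup. The function names and sample addresses are made up for illustration.

```python
import re
import sys
from collections import Counter

# Matches tcpdump -nn text lines for IPv4 UDP, for example:
# 12:00:00.000001 IP 10.0.0.1.5405 > 10.0.0.2.5405: UDP, length 80
LINE_RE = re.compile(
    r"^\S+ IP (\d+\.\d+\.\d+\.\d+)\.(\d+) > (\d+\.\d+\.\d+\.\d+)\.(\d+): UDP"
)


def packets_per_sender(lines, port=5405):
    """Count UDP packets per source IP where either end uses `port`.

    `port` defaults to 5405 as an assumption; check your corosync.conf.
    """
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # not an IPv4 UDP line, skip it
        src_ip, src_port, _dst_ip, dst_port = match.groups()
        if int(src_port) == port or int(dst_port) == port:
            counts[src_ip] += 1
    return counts


def rate_per_second(counts, capture_seconds):
    """Turn packet counts into packets per second for a capture window."""
    return {ip: n / capture_seconds for ip, n in counts.items()}


if __name__ == "__main__":
    # Reads tcpdump text output from stdin; 10.0 matches a 10-second capture.
    totals = packets_per_sender(sys.stdin)
    for ip, rate in sorted(rate_per_second(totals, 10.0).items()):
        print(f"{ip}: {rate:.0f} packets/s")
```

Saved under a name of your choosing (say `count_senders.py`), it could be fed with something like `timeout 10 tcpdump -nn -l -i <iface> udp port 5405 | python3 count_senders.py`, replacing `<iface>` with the corosync interface. A heavily lopsided result between the two nodes would confirm which one is doing the flooding.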

Any idea what exactly is happening here?

Cheers,

Taledo
 
Well, this didn't last long, as the watchdogs did their job... The whole thing rebooted and decided it wanted to behave again. Cluster is now up and running again.

Weird behaviour, though I kind of expected it before unplugging the whole thing.
 
Yes, that's the idea. A complete corosync failure should in theory never happen, but I've learned in this line of work that you don't say "if it happens" but "when it happens". So testing in the lab is the best way to prepare for unplanned network disassembly at 4am on a Saturday :D
 