Total cluster failure in the lab

Taledo · May 2, 2024

Hey all,
I'm currently testing Ceph and Proxmox HA in a multi-datacenter configuration (over dark fiber, so sub-ms latency). I tried to recreate the worst possible scenario: what if all nodes lose the corosync layer? This should never happen, but that's what the lab is for.

I'm using Proxmox 8.2.2 and the latest version of Ceph.

Upon removing the corosync link, all nodes rebooted as expected and were left in a no-quorum state. When I restored the corosync link, one of my nodes flat out refused to reconnect to the cluster. That isn't an issue by itself, as I can remove it and rejoin it to the cluster. However, this led me to discover another issue: one of the two remaining nodes is flooding the other one with corosync packets, causing packet loss and BIG cluster instability. (I'm seeing about 16000 packets in a 10 second tcpdump session, so roughly 1600 packets per second.)
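For anyone wanting to put numbers on a flood like this, here is a minimal sketch that counts packets per source address from saved `tcpdump -nn` text output. It assumes corosync is on its default UDP port 5405 and that the lines look like the usual IPv4 tcpdump format; the function names and the sample lines are made up for illustration and are not from my capture.

```python
import re
from collections import Counter

# Matches the "IP <src>.<sport> > <dst>.<dport>:" part of a `tcpdump -nn` line (IPv4 only).
LINE_RE = re.compile(
    r"IP (\d+\.\d+\.\d+\.\d+)\.(\d+) > (\d+\.\d+\.\d+\.\d+)\.(\d+):"
)

def packets_per_source(lines, port=5405):
    """Count packets per source IP where the source or destination port is `port`.

    5405 is assumed here as the corosync port; change it if your cluster differs.
    """
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match and port in (int(match.group(2)), int(match.group(4))):
            counts[match.group(1)] += 1
    return counts

def packets_per_second(count, seconds):
    """Average packet rate over the capture window."""
    return count / seconds

# Hypothetical sample lines, only to show the expected input shape.
sample = [
    "12:00:00.000001 IP 10.0.0.1.5405 > 10.0.0.2.5405: UDP, length 80",
    "12:00:00.000002 IP 10.0.0.1.5405 > 10.0.0.2.5405: UDP, length 80",
    "12:00:00.000003 IP 10.0.0.2.5405 > 10.0.0.1.5405: UDP, length 80",
    "12:00:00.000004 IP 10.0.0.3.22 > 10.0.0.1.50000: Flags [P.], length 36",
]

counts = packets_per_source(sample)
for source, count in counts.most_common():
    print(source, count)
```

With a real capture you would feed the lines of the tcpdump output file instead of `sample`, then pass the per-node count and the capture duration to `packets_per_second` to see which node is doing the flooding.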

Any idea on what exactly is happening here?

Cheers,

Taledo

Taledo · May 2, 2024

Well, this didn't last long, as the watchdogs did their job... The whole thing rebooted and decided it wanted to behave again. Cluster is now up and running again.

Weird behaviour, though I kind of expected it before unplugging the whole thing.

LnxBil · May 3, 2024

Do you have multiple dark fibers, and could you use a redundant corosync setup (bonded or completely separated links)?

Taledo · May 3, 2024

Yes, that's the idea. A complete corosync failure should in theory never happen, but I've learned in this line of work that you don't say "if it happens" but "when it happens". So testing in the lab is the best way to prepare for unplanned network disassembly at 4am on a Saturday.
