8-node cluster, all went offline at the same time

DLinkOZ
New Member
Sep 2, 2024
I installed this cluster about a year and a half ago. It hasn't been patched (don't yell at me, not my call), but it has been rock solid. Today, that all changed. Everything went offline at once, with no links on the NICs (bonded pair per server). All other hardware is fine: the storage server, firewalls, etc. are on the same switch and not experiencing any issues. We've tried knocking the network config down to a single interface (removing the bond) and breaking the LAGs on the switch. Bringing up the NICs manually does cause them to show UP, but they're not really up, and vmbr0 never comes online.
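Roughly what we've been trying by hand is sketched below (eno1 stands in for the actual NIC names, and ifreload assumes ifupdown2 is installed):

```
# Bring a single NIC up by hand
ip link set eno1 up

# "UP" here is only the admin state, not proof of carrier
ip -br link show eno1

# Check the physical carrier state
ethtool eno1 | grep -i "link detected"

# See whether the bridge ever gets an address
ip -br addr show vmbr0

# Reapply /etc/network/interfaces after editing (ifupdown2)
ifreload -a
```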

A sample from one of them (excuse the screenshots; this is from a really old iDRAC):

[screenshot]

On boot:

[screenshots]
Rebooting always shows this type of error:

[screenshot]

Any USB devices hooked up? :oops: Check your hardware; maybe udev has issues removing a device.

Well, this looks like a classic case of NOT HAVING SEPARATED YOUR COROSYNC LINKS. Link saturation (among other things) can cause loss of quorum, and with HA enabled the nodes will self-fence, which would explain all eight going down at once.
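As a rough sketch (node names and addresses below are made up), a dedicated second corosync link in /etc/pve/corosync.conf looks something like this; remember to bump config_version whenever you edit it:

```
nodelist {
  node {
    name: pve1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.10.1    # dedicated corosync network
    ring1_addr: 192.168.1.1   # fallback over the regular LAN
  }
  # ... one node {} block per cluster node
}

totem {
  cluster_name: mycluster
  config_version: 2
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
  ip_version: ipv4-6
  secauth: on
  version: 2
}
```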

Is balance-rr a result of the recovery effort, or has it always been like that? This mode is fine for sending over both links, but not good for receiving traffic. You'd want to use balance-alb, or better, just active-backup (simpler; see the sketch below) [0]. That balance-rr mode may have contributed to the failure.

[0] https://pve.proxmox.com/pve-docs/pve-admin-guide.html#sysadmin_network_bond
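For illustration, a minimal active-backup bond plus bridge in /etc/network/interfaces might look like the sketch below (interface names and addresses are placeholders, not your actual config):

```
auto bond0
iface bond0 inet manual
    bond-slaves eno1 eno2
    bond-mode active-backup
    bond-primary eno1
    bond-miimon 100

auto vmbr0
iface vmbr0 inet static
    address 10.10.10.2/24
    gateway 10.10.10.254
    bridge-ports bond0
    bridge-stp off
    bridge-fd 0
```

With active-backup only one slave carries traffic at a time, so there's nothing for the switch to negotiate and no LAG to misbehave, which is why it's the simplest mode to rule out during recovery.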