4.4 no-subscription cluster quorum lost, nodes red

locusofself

Member
Mar 29, 2016
Hello,

I have a cluster running 4.4 no-subscription. Last night my NFS backup failed (all nodes backing up at once, on the same network as the cluster communication). Now, from each node's web UI, all the other nodes show red, and I cannot edit the backup job because writes to /etc/pve fail with "permission denied".
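In case it helps anyone debugging the same symptoms: the "Quorate" line in pvecm status is the quick check here, and as I understand it pmxcfs deliberately makes /etc/pve read-only once quorum is lost, which is where the permission denied comes from.

# run on any node: look for "Quorate: No" in the votequorum section
pvecm status
# list which nodes corosync currently sees
pvecm nodes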

All the nodes and VMs are working individually, and I can ping between hosts. It seems that only the cluster communication has broken down.
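Worth noting for anyone else: corosync uses multicast by default, so unicast ping succeeding proves very little. The multicast page on the wiki suggests verifying with omping, run simultaneously on all nodes, roughly like this (hostnames are placeholders):

omping -c 10000 -i 0.001 -F -q node1 node2 node3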



This was working fine for some time, but I think I reached a tipping point, and I know I need to move the backup storage onto a separate network. I use only local storage for the actual VMs.
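As a stopgap until that separate network exists, throttling vzdump might keep a backup from starving corosync; if I read the docs right, a global cap goes in /etc/vzdump.conf (value in KB/s; the number here is just an illustration, not a recommendation):

# /etc/vzdump.conf
bwlimit: 50000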



Some of the nodes are on slightly different versions because I have been waiting a time when I can reboot.

pve-manager/4.4-5/c43015a5 (running kernel: 4.4.35-2-pve)
pve-manager/4.4-12/e71b7a74 (running kernel: 4.4.35-2-pve)
pve-manager/4.4-12/e71b7a74 (running kernel: 4.4.21-1-pve)
pve-manager/4.4-12/e71b7a74 (running kernel: 4.4.40-1-pve)
pve-manager/4.4-12/e71b7a74 (running kernel: 4.4.21-1-pve)
pve-manager/4.4-12/e71b7a74 (running kernel: 4.4.35-2-pve)
pve-manager/4.4-12/e71b7a74 (running kernel: 4.4.35-2-pve)
pve-manager/4.4-12/e71b7a74 (running kernel: 4.4.35-2-pve)
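For reference, that list came from running the following on each node; pveversion -v prints the full package versions if more detail helps:

pveversion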



Help *much* appreciated!
 
Alright, so this is resolved for now. Last week we changed network switches. Everything was working fine, but after the backup failed I could not get quorum again. After reading the page about multicast on the wiki, I asked my coworker to check the IGMP settings on the new switch, and when he changed them all the nodes went green a moment later.

No idea why this worked for a week before failing. I was working on the cluster a lot, so it wasn't that I simply didn't notice.
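My best guess after reading around: with IGMP snooping enabled but no IGMP querier on the segment, a switch keeps forwarding a multicast group only until its membership timers expire, so a corosync cluster can run fine for a while and then drop off. The usual fixes are enabling a querier or disabling snooping on the cluster VLAN. Purely as an illustration (our change was on the physical switch, and every vendor's syntax differs), the equivalent knob on a Linux bridge looks like:

# check whether snooping is active on the bridge (1 = enabled)
cat /sys/class/net/vmbr0/bridge/multicast_snooping
# disable it for this bridge
echo 0 > /sys/class/net/vmbr0/bridge/multicast_snooping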

Before we checked the IGMP setting on the switch, I had tried restarting pve-cluster and pvestatd, and even rebooted all the nodes after shutting down the VMs, all to no effect.
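For completeness, the restarts were along these lines (corosync itself is also worth bouncing in this situation, though in our case nothing helped until the switch config changed):

systemctl restart pve-cluster
systemctl restart pvestatd
systemctl restart corosync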