I have 9 nodes in a cluster: pve1 through pve9.
"pvecm nodes" on all members shows all 9 members as "Members".
"pvecm status" on all members shows the membership as "Cluster-Member".
"ceph -s" shows the 9-member Ceph cluster as being healthy.
In the GUI and in pvesh, however, I see a different story.
pve1, 2, 4, 6, 7 and 8 show each other as online (green) and pve3, 5 and 9 as offline (red).
pve3, 5 and 9 all show *only themselves* as online in the GUI.
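To make the split concrete, here is a minimal sketch that groups nodes by the set of peers each one reports as online. The data is just transcribed by hand from what the GUI shows on each node; the dict layout and the function name are hypothetical, not anything Proxmox produces.

```python
from collections import defaultdict

# What each node's GUI reports as online, transcribed from the description above.
# This layout is an illustration only, not output from pvecm or pvesh.
majority = {"pve1", "pve2", "pve4", "pve6", "pve7", "pve8"}
gui_view = {node: set(majority) for node in majority}
for loner in ("pve3", "pve5", "pve9"):
    gui_view[loner] = {loner}


def group_by_view(views):
    """Group nodes that report the identical set of online peers."""
    groups = defaultdict(list)
    for node, online in views.items():
        groups[frozenset(online)].append(node)
    return sorted(sorted(members) for members in groups.values())


partitions = group_by_view(gui_view)
for members in partitions:
    print(", ".join(members))
```

So from the GUI's point of view there are four separate groups (one of six nodes and three of one node each), even though pvecm reports a single 9-node membership.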
All the VMs that are running continue to run normally, but now I can't migrate VMs, among other things.
So... firstly, what process controls whether the GUI thinks a member is in the cluster or not? Obviously pvecm and the GUI have two completely different views of reality.
Secondly, how do I cause the two views of reality to re-sync?
I'm reluctant to reboot the offending nodes, because usually rebooting a node is what *causes* this situation, and I then have to reboot every single node in the cluster to fully recover. Or sometimes it just recovers spontaneously on its own overnight.
Obviously this is a bug or defect somewhere - but is it in pvecm, the GUI, or what? Or is there some intermediate layer that I have no visibility into whatsoever?
I have completely disabled IGMP; I can even see the IGMP packets at my workstation - which, obviously, is not ideal, but that's what I had to do to make the cluster work reliably.
There is ONE anomaly I can see... "pvecm nodes", while showing all nodes as online and members, does not always agree on the "Inc" column. But I can't find any documentation telling me what that value means. (The value for the node where I'm running it is always dramatically lower than the others... but that's even on the subset of nodes that are still talking to each other properly.)
Help!
Thanks,
-Adam Thompson
athompso@athompso.net