Hi,
today my script that reconfigures 80 VMs with 80 concurrent
pvesh create
commands got stuck. At that time, one node was offline, or had been back online for only a few minutes, due to a little "ohh what kind of cable is this" issue of a teammate.
After using the script successfully several times, this time (all?) pvesh commands hung, as did the web GUI. I observed corosync at "100% CPU"; in strace, I see heaps of
epoll_wait(4, [], 12, 0) = 0
calls without delay (according to strace, one call roughly every 0.015 ms). Not having a better idea, I first restarted corosync on the node I happened to be logged in to anyway, which reduced the CPU load (at least for a moment; I didn't check afterwards).
I noticed that instead of the expected 12-node cluster I'm facing 2 x 6 node clusters... I thought it would be easy to break this deadlock by simply rebooting any node: the ring it was in would have one node fewer, all its nodes would join the now bigger ring, and everything would be fine. In contrast, I observed that after the node rebooted, it found its way back into its old ring, and again both rings contained 6 nodes. I repeated the reboot with the same effect. I power-cut a node to really force something, but after the reboot I again had 2 x 6 node rings.
So I rebooted every node of one of the rings.
I dare to say that now I'm facing four rings (1 with 1 node, 2 with 3 nodes, 1 with 5 nodes)...
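To make the numbers concrete, here is a minimal sketch of the majority arithmetic as I understand it. It assumes one vote per node and a plain "more than half" quorum rule with no special votequorum options set; the function names are made up for illustration and are not part of any Proxmox or corosync tool.

```python
def quorum_threshold(expected_votes: int) -> int:
    """Smallest vote count that is a strict majority of expected_votes."""
    return expected_votes // 2 + 1


def quorate_partitions(partition_sizes, expected_votes):
    """Return the sizes of those partitions that reach the majority threshold."""
    threshold = quorum_threshold(expected_votes)
    return [size for size in partition_sizes if size >= threshold]


if __name__ == "__main__":
    total_nodes = 12  # cluster size from this post, one vote per node assumed

    print(quorum_threshold(total_nodes))                  # 7
    print(quorate_partitions([6, 6], total_nodes))        # []
    print(quorate_partitions([1, 3, 3, 5], total_nodes))  # []
```

Under these assumptions, neither the 2 x 6 split nor the current 1/3/3/5 split contains a partition with the 7 votes a 12-node cluster would need.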
How do I break a split-brain? Or better: how do I rejoin the nodes into a single ring?
Once some jobs that are supposed to finish tonight are done, I will forcibly reboot every node at the same time in the hope that this works, but it feels wrong.
Where can I learn more about ring management?