I've been having some very odd problems lately. I thought they were a one-off, but now something is seriously wrong.
Current state: I have one node newly added to a 5-node cluster. If I start corosync and pve-cluster on that node, within 60 seconds it starts spamming TOTEM retransmit messages. Then all nodes reboot and go into a mad reboot loop where they rejoin, spam TOTEM messages, and crash. If I just power off the new node, the other 4 nodes go back to being happy.
The history, trying not to write a novel:
A few months ago, I upgraded all nodes to 8.1 and Ceph to Reef. Everything was happy. I then purchased a new server and replaced an existing node in the cluster with it. When I did the assisted join, all nodes in the cluster immediately rebooted, then rebooted again, and after about 20 minutes everything stabilized. I thought I had maybe just made a mistake somewhere.
Fast forward to yesterday. I went to replace another node in the cluster, this time much more carefully, because I assumed a mistake on my part had caused that chaos.
The first step was to shut the node down and delete it from the cluster. 60 seconds later, all nodes rebooted. I waited a bit and checked things, and everything looked OK. So I added the new node, and now, boom, everything is unstable and explody.
I've checked the links, and they're all good. When I bring up corosync, I can see it communicating on the primary and secondary links just fine, but it almost instantly starts freaking out, which causes all the other nodes to freak out, and they all crash. I've checked all the cables, pinged everything, validated the corosync.conf files, and so on, but I have no idea at this point. Currently I have corosync and pve-cluster disabled on the new node to stop it from crashing the cluster.
Any ideas?