All nodes crash when new node is brought online

garbled

Member
Feb 9, 2021
I've been having some super odd problems lately, which I thought were a one-off, but now there is clearly something seriously wrong.

Current state: I have one newly added node in a 5-node cluster. If I start corosync and pve-cluster on that node, within 60 seconds it starts spamming TOTEM retransmit messages, then all nodes reboot and go into a mad reboot loop where they rejoin, spam retransmit messages, and crash again. If I just power off the new node, the other 4 nodes go back to being happy.
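For what it's worth, this is roughly how I've been sanity-checking membership and quorum from one of the four surviving nodes (standard PVE/corosync tools, nothing exotic):

Code:
# cluster membership and quorum as Proxmox sees it
pvecm status
pvecm nodes

# corosync's own view of quorum and of the knet links
corosync-quorumtool -s
corosync-cfgtool -s

With the new node powered off these all look healthy; it's only once corosync starts on the new node that the retransmit spam begins.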

Trying not to write a novel, here's the history:

A few months ago I upgraded all nodes to 8.1 and Ceph to Reef. Everything was happy. I then purchased a new server and replaced an existing node in the cluster with it. When I did the assisted join, all nodes in the cluster immediately rebooted, then rebooted again, and after about 20 minutes everything stabilized. I figured I had just made a mistake somewhere.

Fast forward to yesterday. This time, much more carefully (since I assumed a mistake on my part had caused that chaos), I went to replace another node in the cluster.

The first step was to shut the old node down and delete it from the cluster. 60 seconds later, all nodes rebooted. I waited a bit and checked things, and everything looked OK. So I added the new node, and now, boom, everything is unstable and explody.
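To be precise, "delete it from the cluster" was the standard removal from one of the remaining nodes, with the old node already powered off (the node name below is a placeholder):

Code:
# run on a remaining cluster node, never on the node being removed
pvecm delnode <old-node-name>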

I've checked the links, all good. When I bring up corosync, I can see it communicating on the primary and secondary links just fine, but it instantly starts freaking out, causes all the other nodes to freak out, and they all crash. I've checked all the cables, pinged everything, validated the corosync.conf files, etc., but I have no idea at this point. For now I have corosync and pve-cluster disabled on the new node to stop it from crashing the cluster.
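"Validated the corosync.conf files" and "checked the links" roughly means the following on each node; the peer address and NIC name are placeholders, and the MTU poke is only there because the logs keep mentioning PMTUD:

Code:
# the cluster-wide copy and the locally deployed copy should match,
# and config_version / ring addresses should be identical on every node
diff /etc/pve/corosync.conf /etc/corosync/corosync.conf
grep -E 'config_version|ring0_addr|ring1_addr' /etc/corosync/corosync.conf

# reach each peer's ring0/ring1 address with a large, don't-fragment ping
# (the "Global data MTU changed to: 1397" lines look like the usual knet
#  value on a plain 1500-byte network)
ping -M do -s 1400 -c 3 <peer-ring0-address>
ip -d link show <corosync-nic>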

Any ideas?
 
Here are some logs from this morning, when the new node rebooted for some reason and rejoined the cluster, causing the expected chaos:



Code:
Feb 25 03:22:55 altair corosync[11414]:   [KNET  ] link: Resetting MTU for link 0 because host 1 joined
Feb 25 03:22:55 altair corosync[11414]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Feb 25 03:22:55 altair corosync[11414]:   [QUORUM] Sync members[5]: 1 2 3 4 5
Feb 25 03:22:55 altair corosync[11414]:   [QUORUM] Sync joined[1]: 1
Feb 25 03:22:55 altair corosync[11414]:   [TOTEM ] A new membership (1.7f3) was formed. Members joined: 1
Feb 25 03:22:55 altair corosync[11414]:   [KNET  ] pmtud: Global data MTU changed to: 1397
Feb 25 03:22:55 altair corosync[11414]:   [QUORUM] Members[5]: 1 2 3 4 5
Feb 25 03:22:55 altair corosync[11414]:   [MAIN  ] Completed service synchronization, ready to provide service.
Feb 25 03:22:56 altair corosync[11414]:   [KNET  ] rx: host: 1 link: 1 is up
Feb 25 03:22:56 altair corosync[11414]:   [KNET  ] link: Resetting MTU for link 1 because host 1 joined
Feb 25 03:22:56 altair corosync[11414]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Feb 25 03:22:56 altair corosync[11414]:   [KNET  ] pmtud: Global data MTU changed to: 1397
Feb 25 03:22:57 altair pmxcfs[11418]: [dcdb] notice: members: 1/37207, 2/1596, 3/1449, 4/2414, 5/11418
Feb 25 03:22:57 altair pmxcfs[11418]: [dcdb] notice: starting data syncronisation
Feb 25 03:22:57 altair pmxcfs[11418]: [dcdb] notice: received sync request (epoch 1/37207/00000001)
Feb 25 03:22:57 altair pmxcfs[11418]: [status] notice: members: 1/37207, 2/1596, 3/1449, 4/2414, 5/11418
Feb 25 03:22:57 altair pmxcfs[11418]: [status] notice: starting data syncronisation
Feb 25 03:22:59 altair pveproxy[2209899]: Clearing outdated entries from certificate cache
Feb 25 03:23:03 altair pmxcfs[11418]: [status] notice: received sync request (epoch 1/37207/00000001)
Feb 25 03:23:03 altair pmxcfs[11418]: [dcdb] notice: received all states
Feb 25 03:23:03 altair pmxcfs[11418]: [dcdb] notice: leader is 2/1596
Feb 25 03:23:03 altair pmxcfs[11418]: [dcdb] notice: synced members: 2/1596, 3/1449, 4/2414, 5/11418
Feb 25 03:23:03 altair pmxcfs[11418]: [dcdb] notice: all data is up to date
Feb 25 03:23:03 altair pmxcfs[11418]: [dcdb] notice: dfsm_deliver_queue: queue length 5
Feb 25 03:23:04 altair corosync[11414]:   [TOTEM ] Retransmit List: 30 31 32
Feb 25 03:23:06 altair corosync[11414]:   [TOTEM ] Retransmit List: 30 31 32
Feb 25 03:23:09 altair corosync[11414]:   [TOTEM ] Retransmit List: 30 31
Feb 25 03:23:11 altair corosync[11414]:   [TOTEM ] Retransmit List: 30 31
Feb 25 03:23:14 altair corosync[11414]:   [TOTEM ] Retransmit List: 30 31
Feb 25 03:23:16 altair corosync[11414]:   [TOTEM ] Retransmit List: 30 31 41
Feb 25 03:23:17 altair corosync[11414]:   [TOTEM ] Retransmit List: 30 31 47
(about 30-40 of these, and then)
Feb 25 03:23:54 altair watchdog-mux[955]: client watchdog expired - disable watchdog updates
Feb 25 03:23:55 altair corosync[11414]:   [TOTEM ] Retransmit List: 30 31
Feb 25 03:23:56 altair corosync[11414]:   [TOTEM ] Retransmit List: 30 31

(looks like it rebooted here)

Feb 25 03:28:49 altair systemd-modules-load[575]: Inserted module 'vhost_net'
Feb 25 03:28:49 altair systemd[1]: Starting systemd-journal-flush.service - Flush Journal to Persistent Storage...
Feb 25 03:28:49 altair dmeventd[601]: dmeventd ready for processing.
Feb 25 03:28:49 altair lvm[564]:   1 logical volume(s) in volume group "ceph-fdb65610-4f5c-4a63-88c5-9a13457f8107" monitored
(more stuff elided)

Feb 25 03:29:01 altair corosync[1419]:   [QUORUM] Sync members[5]: 1 2 3 4 5
Feb 25 03:29:01 altair corosync[1419]:   [QUORUM] Sync joined[4]: 1 2 3 4
Feb 25 03:29:01 altair corosync[1419]:   [TOTEM ] A new membership (1.808) was formed. Members joined: 1 2 3 4
Feb 25 03:29:01 altair corosync[1419]:   [QUORUM] This node is within the primary component and will provide service.
Feb 25 03:29:01 altair corosync[1419]:   [QUORUM] Members[5]: 1 2 3 4 5
Feb 25 03:29:01 altair corosync[1419]:   [MAIN  ] Completed service synchronization, ready to provide service.

Feb 25 03:29:02 altair pmxcfs[1317]: [status] notice: update cluster info (cluster name  Hydra, version = 23)
Feb 25 03:29:02 altair pmxcfs[1317]: [status] notice: node has quorum
Feb 25 03:29:02 altair pmxcfs[1317]: [dcdb] notice: members: 1/37207, 2/1544, 3/1439, 4/2425, 5/1317
Feb 25 03:29:02 altair pmxcfs[1317]: [dcdb] notice: starting data syncronisation
Feb 25 03:29:02 altair pmxcfs[1317]: [status] notice: members: 1/37207, 2/1544, 3/1439, 4/2425, 5/1317
Feb 25 03:29:02 altair pmxcfs[1317]: [status] notice: starting data syncronisation
Feb 25 03:29:02 altair pmxcfs[1317]: [dcdb] notice: received sync request (epoch 1/37207/00000006)
Feb 25 03:29:02 altair pmxcfs[1317]: [status] notice: received sync request (epoch 1/37207/00000006)

Feb 25 03:29:30 altair corosync[1419]:   [KNET  ] rx: host: 1 link: 0 is up
Feb 25 03:29:30 altair corosync[1419]:   [KNET  ] link: Resetting MTU for link 0 because host 1 joined
Feb 25 03:29:30 altair corosync[1419]:   [KNET  ] rx: host: 4 link: 0 is up
Feb 25 03:29:30 altair corosync[1419]:   [KNET  ] link: Resetting MTU for link 0 because host 4 joined
Feb 25 03:29:30 altair corosync[1419]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Feb 25 03:29:30 altair corosync[1419]:   [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
Feb 25 03:29:30 altair corosync[1419]:   [KNET  ] rx: host: 3 link: 0 is up
Feb 25 03:29:30 altair corosync[1419]:   [KNET  ] link: Resetting MTU for link 0 because host 3 joined
Feb 25 03:29:30 altair corosync[1419]:   [KNET  ] rx: host: 2 link: 0 is up
Feb 25 03:29:30 altair corosync[1419]:   [KNET  ] link: Resetting MTU for link 0 because host 2 joined
Feb 25 03:29:30 altair corosync[1419]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Feb 25 03:29:30 altair corosync[1419]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Feb 25 03:29:30 altair corosync[1419]:   [KNET  ] pmtud: PMTUD link change for host: 3 link: 0 from 469 to 1397
Feb 25 03:29:30 altair corosync[1419]:   [KNET  ] pmtud: PMTUD link change for host: 1 link: 0 from 469 to 1397
Feb 25 03:29:30 altair corosync[1419]:   [KNET  ] pmtud: PMTUD link change for host: 4 link: 0 from 469 to 1397
Feb 25 03:29:30 altair corosync[1419]:   [KNET  ] pmtud: PMTUD link change for host: 2 link: 0 from 469 to 1397
Feb 25 03:29:30 altair corosync[1419]:   [KNET  ] pmtud: Global data MTU changed to: 1397

(looks like it rebooted here again)
Feb 25 03:30:36 altair pve-ha-lrm[4882]: Task 'UPID:altair:00001316:00001549:65DB168B:vzstart:103:root@pam:' still active, waiting
Feb 25 03:35:28 altair systemd[1]: Finished systemd-remount-fs.service - Remount Root and Kernel File Systems.

And then the machine went into a reboot loop for the next few hours until I woke up and disabled corosync on the new box with systemctl.
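For anyone landing here later, "disabled corosync" concretely means the following on the new node, so it can stay booted without touching the cluster (standard PVE unit names):

Code:
# keep the new node from joining (and destabilizing) the cluster at boot
systemctl disable --now corosync pve-cluster

# to let it rejoin later, once the underlying problem is understood:
# systemctl enable --now corosync pve-cluster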
 
Did you ever resolve this?
 
