[SOLVED] Cluster with redundant Corosync networks reboots as soon as I join a new node

godzilla
Related to this other thread, tagging @fabian as requested.

I currently have a cluster with 13 nodes running. Everything is updated to the latest versions (except for the kernel, which is pinned to 5.13.19-6-pve on all nodes because of some issues with live migration between different CPUs). All the nodes are HPE DL360 Gen9 or DL360 Gen10.

Yesterday I added the 14th node to the cluster and as soon as I clicked "Join cluster" every other node rebooted, bringing down hundreds of VMs.

Of course I shut down the new node, and after the reboot all the nodes went back to working as if nothing had happened. Even Ceph realigned within seconds (phew!).

Today, as soon as I reconnected the management interface of the new node (vmbr0 on eno1), all the other nodes rebooted again.

I'm using dedicated network switches for the management interfaces and the average ping between nodes is < 1 ms. Also, corosync itself is configured on redundant links, so even if all the management interfaces went down, the nodes should still be able to see each other through the two other interfaces. Shouldn't they?
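For reference (the real config is in the attached corosync.conf.txt), the redundant links are declared per node in the nodelist, roughly like this; the addresses below are placeholders, not my actual ones:

Code:
nodelist {
  node {
    name: proxnode01
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.0.0.1     # management network (placeholder address)
    ring1_addr: 10.0.1.1     # second, independent link (placeholder)
    ring2_addr: 10.0.2.1     # third link (placeholder)
  }
  # further node {} entries omitted
}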

So I can't really figure out what's happening. To me it looks as if something completely broke the network stack on the nodes, but why?

Someone with the exact same problem suggested [reddit.com] removing the management IP from the default bridge and setting it directly on the NIC. I could do that, but does it really make sense?

Consider that:
  • nodes are numbered proxnode01 to proxnode11 and proxnode16 to proxnode18 (because they are physically installed in two distinct racks, each sized for 15 nodes). So corosync node #14 is actually proxnode18
  • I'm confident there are no hardware issues with the new node, because I repurposed it from another rack where it had been doing virtualization for two years, up until two days ago
  • the Ceph networks (172.27.0.x "ceph-public" and 172.28.0.x "ceph-cluster") are handled by two 10G switches using LACP and a dedicated 100G stacking link, so they can be considered one large redundant switch; for this reason I'm using VLANs on a LACP bond (a rough sketch follows this list)
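As a rough sketch, that part of /etc/network/interfaces looks like this (interface names and host addresses are placeholders; the real files are attached):

Code:
auto bond0
iface bond0 inet manual
    bond-slaves eno2 eno3        # the two 10G ports (placeholder names)
    bond-miimon 100
    bond-mode 802.3ad            # LACP
    bond-xmit-hash-policy layer3+4

auto bond0.27
iface bond0.27 inet static
    address 172.27.0.10/24       # "ceph-public" VLAN (placeholder host address)

auto bond0.28
iface bond0.28 inet static
    address 172.28.0.10/24       # "ceph-cluster" VLAN (placeholder host address)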
I'm attaching corosync.conf (the cluster name was redacted) and a few other config files/logs as requested by @fabian in the other thread. The node that breaks everything is called proxnode18 and is currently offline, which is why I'm attaching a screenshot from its iLO interface showing the pveversion output.
As you can see, all the nodes broke connectivity with the cluster at 09:19:45 and rebooted two minutes later.

Edit: please ignore the errors "500 Can't connect to 172.29.0.193:8007 (No route to host)" in the logs as I purposely detached the switch for the backup network in order to avoid any risk of network loops.

Thank you!
 

Attachments

  • interfaces_proxnode01.txt
  • network_proxnode03.png
  • uname_pveversion_proxnode18.png
  • uname_pveversion_all_nodes.txt
  • syslog_proxnode02.txt
  • syslog_proxnode01.txt
  • interfaces_proxnode17.txt
  • interfaces_proxnode02.txt
  • corosync.conf.txt
  • syslog_proxnode17.txt
I really need more of the logs (you can limit them to just the corosync and pve-cluster units via journalctl if you also indicate the times when each node got fenced!) starting a bit earlier, and logs of *all* the nodes. thanks!
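something along these lines, run on every node, should produce exactly that (adjust the time window to bracket the fencing event; the output file name is just a suggestion):

Code:
journalctl -u corosync -u pve-cluster --since "09:00" --until "09:30" > journal_$(hostname -s).txt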
 
Thanks @fabian ! You can find the corosync and pve-cluster logs for all nodes attached. Let me know if you need any further information.

@alexmolon yes, indeed! how strange o_O
 

Attachments

  • journal_2210_proxnode05.txt.gz
  • journal_2210_proxnode04.txt.gz
  • journal_2210_proxnode03.txt.gz
  • journal_2210_proxnode02.txt.gz
  • journal_2210_proxnode01.txt.gz
  • journal_2210_proxnode10.txt.gz
  • journal_2210_proxnode09.txt.gz
  • journal_2210_proxnode08.txt.gz
  • journal_2210_proxnode07.txt.gz
  • journal_2210_proxnode06.txt.gz
It looks like I can't attach more than 10 files to a single reply, so here are the remaining ones.
 

Attachments

  • journal_2210_proxnode11.txt.gz
  • journal_2210_proxnode16.txt.gz
  • journal_2210_proxnode17.txt.gz
  • journalctl_corosync_pve_proxnode18.txt.gz
Examining the logs, we noticed that from time to time some corosync links go down and come back up after a few seconds, for no apparent reason. The network interfaces themselves never actually go down.
It looks like this started happening yesterday, after the first incident.

I fear the problem is on the corosync side, but I have no clue.
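For the record, this is roughly how we spotted the flaps in the attached journals (the exact knet message wording may vary between corosync versions):

Code:
journalctl -u corosync | grep -Ei "link:.*(down|up)"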
 
corosync marking links as down means that either heartbeat packets were not answered in time, or sending data over that link ran into an error indicating it doesn't work. if it self-corrects after a few seconds it might indicate your network is overloaded (which is very much possible, since you share corosync, which is latency sensitive, with storage traffic).

in any case, your logs and setup are similar (bonds, links with different MTU), but also different enough (MTU is identical for each link across all nodes AFAICT, corosync sync happens and the token times out afterwards, ..) that I think it might be a similar, but different issue.
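a quick way to keep an eye on the link state on each node is something like this (the exact output format differs between corosync versions):

Code:
corosync-cfgtool -s     # local node id and status of each knet link
pvecm status            # quorum/membership as seen by this node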
 
hi,

ok, I understand, but how did it work for the first 13 nodes without any hassle? and how come I'm only seeing missed heartbeat packets in the last two days, with the exact same configuration I've had for the last two years?
 
each additional node increases the amount of traffic/load (each node has to talk to all other nodes, everything that needs to be acked has one more participant, ..), which might be a potential source. but I can't tell you for sure, there's nothing in the logs that jumps out except the MTU part which is a bit uncommon. you can try setting netmtu to 1397 in corosync.conf (see https://pve.proxmox.com/pve-docs/chapter-pvecm.html#pvecm_edit_corosync_conf ) to avoid corosync sending big packets.
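the documented procedure boils down to editing a copy of the file and moving it into place, roughly like this (see the linked chapter for the details):

Code:
cp /etc/pve/corosync.conf /etc/pve/corosync.conf.new
# edit corosync.conf.new: add "netmtu: 1397" to the totem section
# and increase config_version by one
mv /etc/pve/corosync.conf.new /etc/pve/corosync.conf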
 
yes, netmtu is not runtime reloadable unfortunately (corosync will warn about that though when you copy the new config in place).
 
Hi @fabian, sorry for the delay but I was out of office for a few weeks.

I'm ready to apply the new configuration, my /etc/pve/corosync.conf.new looks like this (config_version increased by 1):

Code:
totem {
  netmtu: 1397
  config_version: 15
  ...
}

So next steps are:
  1. cp /etc/pve/corosync.conf.new /etc/pve/corosync.conf
  2. systemctl restart corosync on the current node
  3. systemctl restart corosync on all the other nodes
Can you confirm it's correct? The cluster is supposed to stay up, isn't it?

Also, once I'm done I'll replace the configuration on the offline node, restart the service there, and try to re-join the cluster. Correct?

Thank you
 
I would add "disarm HA" to the steps (before copying the config file), like this:

First, stop the local resource manager "pve-ha-lrm" on every node. Only after all of them have been stopped, also stop the cluster resource manager "pve-ha-crm" on every node. Use the GUI (Node -> Services) or the CLI by running the following command on each node:

systemctl stop pve-ha-lrm

Only after the above has been done for all nodes (ha-manager status should report the LRM on all nodes to be in "restart mode" and the resources to be in "freeze" state, and the pve-ha-lrm service should have reported "watchdog closed (disabled)" if the watchdog was previously armed on that node), run the following on each node:

systemctl stop pve-ha-crm

once the cluster changes are done, start the services again in reverse order (first the CRM on all nodes, then the LRM on all nodes).
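put together, a rough sketch of the whole sequence (each step has to be completed on every node before moving on to the next one):

Code:
# 1. disarm HA: first the LRM on every node, then the CRM on every node
systemctl stop pve-ha-lrm    # wait for "restart mode"/"freeze" in ha-manager status
systemctl stop pve-ha-crm

# 2. copy the new config into place (once; /etc/pve is cluster-wide)
#    and restart corosync on every node, one after the other
cp /etc/pve/corosync.conf.new /etc/pve/corosync.conf
systemctl restart corosync

# 3. re-arm HA in reverse order
systemctl start pve-ha-crm
systemctl start pve-ha-lrm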
 
Hi @fabian,

ok, understood. Thank you. So, while HA is disarmed and corosync is restarted, everything is supposed to stay up, except that if a node goes down the VMs will not auto-migrate to another node, right?

Also, I was considering removing the new node (which is still offline) from the cluster and rebuilding it from scratch, since it has been off for a month now and I fear its clock has drifted, etc.

Is this an advisable option, or is it better (as I said in the previous post) to just check the system clock, align the corosync configuration, put it back online and let it join the cluster as if nothing had happened?
 
ok, understood. Thank you. So, while HA is disarmed and corosync is restarted, everything is supposed to stay up, except that if a node goes down the VMs will not auto-migrate to another node, right?
basically, with HA disarmed no watchdog is active so no fencing, and obviously, also no HA recovery/.. functionality.
Also, I was considering removing the new node (which is still offline) from the cluster and rebuilding it from scratch, since it has been off for a month now and I fear its clock has drifted, etc.

removing and reinstalling is of course also an option.

Is this an advisable option, or is it better (as I said in the previous post) to just check the system clock, align the corosync configuration, put it back online and let it join the cluster as if nothing had happened?
both are equally valid choices, the underlying issue is either still there or not in either case ;)
 
Hi @fabian,

I followed the procedure and successfully restarted corosync and the HA daemons on all nodes as stated. So far, so good.

Now I tried to edit /etc/pve/corosync.conf on the offline node but /etc/pve is in read-only mode.

I suppose that it will automatically update itself when I put the node back online, since the new corosync.conf has a higher config_version. Can you please confirm?

Thank you!



Edit: I think I should follow the instructions at https://pve.proxmox.com/pve-docs/chapter-pvecm.html#pvecm_separate_node_without_reinstall

That is, on the separated node (called "proxnode18"):
Code:
# stop the cluster filesystem and corosync
systemctl stop pve-cluster
systemctl stop corosync

# start pmxcfs in local mode so /etc/pve becomes writable
pmxcfs -l

# remove the corosync configuration from this node
rm /etc/pve/corosync.conf
rm -r /etc/corosync/*

# stop the local pmxcfs instance and start pve-cluster normally again
killall pmxcfs
systemctl start pve-cluster

And on another node in the cluster:
Code:
# remove the node from the cluster membership
pvecm delnode proxnode18
# clean up the leftover node directory in the cluster filesystem
rm -rf /etc/pve/nodes/proxnode18

Finally, clean up authorized_keys and add the node back into the cluster.

By any chance, do you have something to add?

Thanks!
 
no, that looks okay. but if you just want to update corosync.conf, the following should be enough as well (on node 18, with nothing more done on the rest of the cluster)

Code:
systemctl stop pve-cluster
systemctl stop corosync

pmxcfs -l

cp good-copy-of-corosync.conf /etc/corosync/corosync.conf
cp good-copy-of-corosync.conf /etc/pve/corosync.conf

killall pmxcfs
systemctl start pve-cluster corosync

where "good-copy-of-corosync.conf" is the version-bumped, fixed config file.
 
Hi @fabian , thank you so much. I was able to bring it back online, outside of the cluster.
Anyway, because of several misaligned configurations involving Ceph (among other things), I'd rather reinstall the node from scratch, to be 100% sure there are no leftovers.

I'll keep you updated.
Thanks again
 
Hi @fabian ,

just to be on the safe side, do you think I should disable the HA daemons on the existing nodes (or at least some of them) before trying to add the new node to the cluster, so that only some of them reboot in case things go wrong? What's your advice?

Thank you
 
