Hi everybody,
Yesterday we started installing updates on our 7-node cluster (PVE 7). We installed the updates on one node after the other. After the third node had finished installing, all 7 nodes restarted unexpectedly and without a clean shutdown. I suspect (but have not found evidence) that the watchdog and fencing did something unexpected here. I also noticed that new corosync packages were included in the list of updates.
When all servers were back online, the filesystems of every single virtual machine (about 100) were broken beyond repair. We had to restore all of them (Proxmox Backup Server was a huge help there). All the disk images are stored on our external Ceph cluster with the caching mode set to writeback.
Questions are:
1. Any ideas on why all cluster nodes were killed and restarted at once, or any hints for tracking this issue down? And even more important: how can this be prevented?
2. Should we stop pve-ha-lrm and pve-ha-crm (to close the watchdog) before upgrading corosync?
3. Should we disable writeback caching on the virtual disk images?
Thanks in advance for your help.
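For question 2, this is roughly the procedure I had in mind (an assumption on my part, not something I have confirmed is sufficient to avoid fencing):

```shell
# Sketch for question 2: stop the HA services so the watchdog is closed
# before corosync is touched (assumption: stopping both is enough to
# prevent the node from being fenced during the upgrade).
# Run on every node before installing the updates:
systemctl stop pve-ha-lrm
systemctl stop pve-ha-crm

# ...install the updates / upgrade corosync...

# Afterwards, bring the HA stack back up:
systemctl start pve-ha-crm
systemctl start pve-ha-lrm
```

And for question 3, I assume changing a single disk would be something like `qm set 100 --scsi0 ceph:vm-100-disk-0,cache=none` (VMID, storage, and disk names here are just placeholders for illustration).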
Version info:
proxmox-ve: 7.0-2 (running kernel: 5.11.22-3-pve)
pve-manager: 7.0-11 (running version: 7.0-11/63d82f4e)
pve-kernel-5.11: 7.0-6
pve-kernel-helper: 7.0-6
pve-kernel-5.4: 6.4-5
pve-kernel-5.11.22-3-pve: 5.11.22-7
pve-kernel-5.4.128-1-pve: 5.4.128-2
pve-kernel-5.4.65-1-pve: 5.4.65-1
ceph-fuse: 15.2.14-pve1
corosync: 3.1.2-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: 0.8.36
libjs-extjs: 7.0.0-1
libknet1: 1.21-pve1
libproxmox-acme-perl: 1.3.0
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.0-4
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.0-6
libpve-guest-common-perl: 4.0-2
libpve-http-server-perl: 4.0-2
libpve-storage-perl: 7.0-10
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.9-4
lxcfs: 4.0.8-pve2
novnc-pve: 1.2.0-3
proxmox-backup-client: 2.0.9-2
proxmox-backup-file-restore: 2.0.9-2
proxmox-mini-journalreader: 1.2-1
proxmox-widget-toolkit: 3.3-6
pve-cluster: 7.0-3
pve-container: 4.0-9
pve-docs: 7.0-5
pve-edk2-firmware: 3.20200531-1
pve-firewall: 4.2-2
pve-firmware: 3.2-4
pve-ha-manager: 3.3-1
pve-i18n: 2.4-1
pve-qemu-kvm: 6.0.0-3
pve-xtermjs: 4.12.0-1
qemu-server: 7.0-13
smartmontools: 7.2-pve2
spiceterm: 3.2-2
vncterm: 1.7-1
zfsutils-linux: 2.0.5-pve1