unexpected restart of all cluster nodes

Feb 18, 2016
Hi everybody,

Yesterday we started installing updates on our 7-node cluster (PVE 7), one node after the other. After the third node had finished installing, all 7 nodes restarted unexpectedly and without a clean shutdown. I think (but did not find evidence) that the watchdog and fencing did something unexpected here. I also saw that new corosync packages were included in the list of updates.

Once all servers were online again, the filesystems of every single virtual machine (about 100) were broken beyond repair. We had to restore all of them (Proxmox Backup Server was a huge help with that). All the disk images are stored on our external Ceph cluster with caching mode writeback.

Questions are:

1. Any ideas why all cluster nodes were killed and restarted at once, or any hints for tracking this issue down? And even more important: how can this be prevented?

2. Should we stop pve-ha-lrm and pve-ha-crm (to close the watchdog) before upgrading corosync?

3. Should we remove caching on the virtual disk images?

Thanks in advance for your help.


Version info:
proxmox-ve: 7.0-2 (running kernel: 5.11.22-3-pve)
pve-manager: 7.0-11 (running version: 7.0-11/63d82f4e)
pve-kernel-5.11: 7.0-6
pve-kernel-helper: 7.0-6
pve-kernel-5.4: 6.4-5
pve-kernel-5.11.22-3-pve: 5.11.22-7
pve-kernel-5.4.128-1-pve: 5.4.128-2
pve-kernel-5.4.65-1-pve: 5.4.65-1
ceph-fuse: 15.2.14-pve1
corosync: 3.1.2-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: 0.8.36
libjs-extjs: 7.0.0-1
libknet1: 1.21-pve1
libproxmox-acme-perl: 1.3.0
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.0-4
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.0-6
libpve-guest-common-perl: 4.0-2
libpve-http-server-perl: 4.0-2
libpve-storage-perl: 7.0-10
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.9-4
lxcfs: 4.0.8-pve2
novnc-pve: 1.2.0-3
proxmox-backup-client: 2.0.9-2
proxmox-backup-file-restore: 2.0.9-2
proxmox-mini-journalreader: 1.2-1
proxmox-widget-toolkit: 3.3-6
pve-cluster: 7.0-3
pve-container: 4.0-9
pve-docs: 7.0-5
pve-edk2-firmware: 3.20200531-1
pve-firewall: 4.2-2
pve-firmware: 3.2-4
pve-ha-manager: 3.3-1
pve-i18n: 2.4-1
pve-qemu-kvm: 6.0.0-3
pve-xtermjs: 4.12.0-1
qemu-server: 7.0-13
smartmontools: 7.2-pve2
spiceterm: 3.2-2
vncterm: 1.7-1
zfsutils-linux: 2.0.5-pve1
 

aaron

Proxmox Staff Member
Jun 3, 2019
This sounds indeed like there was something happening with the HA watchdog. If all nodes restarted at the same time, it sounds like the corosync communication between all nodes broke down, thus each node fenced itself.

What does your corosync configuration look like? Do you have more than one link? Does corosync have a dedicated physical network for itself, or do all of its links share a physical network with other services?

If the corosync communication breaks down cluster-wide, it could be because you have only one link and the switch used for it has problems. Another likely cause is corosync sharing the physical network with other services that can use up all the bandwidth: the latency of the corosync packets goes up and might stay too high for too long, which leads to fencing. Examples of such services are Ceph or other networked storage, as well as backups sent over the same network.
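
To answer the "more than one link?" question quickly, the number of link addresses per node can be read from the corosync configuration. The following is a minimal sketch, assuming the usual one-setting-per-line layout of corosync.conf with a `name:` entry in each `node` block; the function name `count_links` is made up for this example.

```shell
#!/bin/sh
# Print "<node name> <number of ringN_addr entries>" for each node block.
# Assumption: standard corosync.conf layout, one setting per line.
count_links() {
    awk '
        /^[[:space:]]*node[[:space:]]*[{]/ { in_node = 1; name = "?"; links = 0; next }
        in_node && /^[[:space:]]*name:/ { name = $2 }
        in_node && /^[[:space:]]*ring[0-9]+_addr:/ { links++ }
        in_node && /[}]/ { print name, links; in_node = 0 }
    ' "$1"
}

# On a cluster node (path as used by Proxmox VE):
#   count_links /etc/pve/corosync.conf
# Live status of each configured link:
#   corosync-cfgtool -s
```

A node that reports only one address has a single corosync link, so any outage on that network leaves it without cluster communication.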
 

elterminatore

Member
Jun 18, 2018
Hey guys,
Nearly exactly the same happened here. During the upgrade from PVE 6 to 7 of the first node in a three-node cluster, all nodes rebooted at the same time. I thought this was an upgrade issue. But two days ago, all three nodes rebooted again at exactly the same time, without a warning and without any logfile entry.

[Attached screenshot: 2021-10-28 07_38_21-Window.png]

I can't tell whether there were any network-related issues. The connected switches are up and running. This cluster ran for about two years without any problems, and after the upgrade to PVE 7 this happened twice in two weeks. The second time, definitely no backups or similar jobs that could cause a lot of network traffic were running. Corosync problems would be a good explanation, but why are there no logfile entries?

Now I've read the "Separate Cluster Network" wiki page and noticed that my corosync communication uses the same 1g network interface as all of my VMs. I've now configured corosync to use a VLAN on the 10g card that is used by the Ceph public network (the Ceph cluster network is on another 10g card). Yes, I have seen the hint "Storage communication should never be on the same network as corosync!", but this network adapter is never utilized as heavily as the 1g adapter mentioned above. And besides, this is a test for now.

My summary:
- this never happened with PVE 6, but twice in a short time with PVE 7
- why is there no log entry when a node fences itself?

best regards
Stefan
 

aaron

Proxmox Staff Member
Jun 3, 2019
elterminatore said:
"Nearly exactly the same happened here. During the upgrade from PVE 6 to 7 of the first node in a three-node cluster, all nodes rebooted at the same time. I thought this was an upgrade issue. But two days ago, all three nodes rebooted again at exactly the same time, without a warning and without any logfile entry."
Could be two different causes. During the update, did it happen right after you rebooted one node?

The latter sounds like the classical network problem, where for some reason the Corosync communication breaks down cluster wide (actual network loss or other services using up all the bandwidth).
Do you by any chance also run backups over that same network?

elterminatore said:
"I've now configured corosync to use a VLAN on the 10g card that is used by the Ceph public network (the Ceph cluster network is on another 10g card)"
As the only link, or as an additional link? ( https://pve.proxmox.com/pve-docs/pve-admin-guide.html#pvecm_redundancy )

Having more links will make Corosync more resilient, but ideally you will still have one physical network dedicated to corosync alone.

elterminatore said:
"- this never happened with PVE 6, but twice in a short time with PVE 7"
Might just be bad luck.

elterminatore said:
"- why is there no log entry when a node fences itself?"
Because the logs have not been written down to disk yet when the node fences itself. Having an external logging server can help to capture the last logs before the fencing.
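
One way to set up such an external log target is an rsyslog forwarding rule on every node. The following is only a sketch under assumptions: rsyslog is installed on the nodes, the receiving host `loghost.example.com` and port 514 are placeholders, and the helper name `write_forward_conf` is made up for this example. Even with forwarding, the very last messages before a hard reset may still be lost.

```shell
#!/bin/sh
# Write an rsyslog drop-in that forwards all messages to a remote host over TCP.
# "@@" selects TCP in rsyslog's forwarding syntax, a single "@" would be UDP.
write_forward_conf() {
    conf_dir=$1            # normally /etc/rsyslog.d
    log_host=$2            # placeholder: a log server outside the cluster
    log_port=${3:-514}
    printf '*.* @@%s:%s\n' "$log_host" "$log_port" > "$conf_dir/90-remote.conf"
}

# On each node, as root:
#   write_forward_conf /etc/rsyslog.d loghost.example.com
#   systemctl restart rsyslog
# If the journal is persistent, the boot before the reset can be inspected with:
#   journalctl -b -1 -u corosync -u pve-cluster
```

The receiving server should run on hardware outside the cluster, otherwise it goes down together with the nodes it is supposed to observe.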
 

elterminatore

Member
Jun 18, 2018
aaron said:
"Could be two different causes. During the update, did it happen right after you rebooted one node?"

No. It happened during the update of the first node. I think apt showed about 97 percent, and then everything was offline and all nodes rebooted.

aaron said:
"The latter sounds like the classical network problem, where for some reason the Corosync communication breaks down cluster wide (actual network loss or other services using up all the bandwidth).
Do you by any chance also run backups over that same network?"

Yes. File backups from several VMs are made through this physical interface (on another VLAN). But definitely not at the time this happened.

aaron said:
"As the only link, or as an additional link? ( https://pve.proxmox.com/pve-docs/pve-admin-guide.html#pvecm_redundancy )

Having more links will make Corosync more resilient, but ideally you will still have one physical network dedicated to corosync alone."

As the only link in corosync.conf

aaron said:
"Might just be bad luck."
:eek: This is my testing cluster. I don't want to have bad luck in the production environment.

aaron said:
"Because the logs have not been written down to disk yet when the node fences itself. Having an external logging server can help to capture the last logs before the fencing."

I don't know what really happens when a node fences itself. Is it like a cold reset?
Yes, I have a remote syslog server... but as a VM in the Proxmox cluster. I am now thinking about changing that. ;-)
 

aaron

Proxmox Staff Member
Jun 3, 2019
Overall, I would check the syslogs for any corosync and pmxcfs log lines. If the network is having trouble, such messages might show up more often than expected, even if each interruption is short enough not to cause problems.

elterminatore said:
"I don't know what really happens when a node fences itself. Is it like a cold reset?"
If a node loses the connection to the cluster (via corosync) for too long, 1 or 2 minutes (not sure right now), and has (or has had since the last boot) HA guests running on it, it will fence itself to make sure that those guests are definitely off before the remaining cluster starts them. Fencing in this case is the equivalent of pushing the reset button.
Now if, for some reason, this affects the whole cluster, all nodes will fence themselves, as they are not part of the quorum (majority) anymore.
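
The majority rule can be made concrete with a little arithmetic. This is a sketch assuming the default of one vote per node; the function names are made up for this example.

```shell
#!/bin/sh
# Votes needed for quorum: more than half of all votes (one vote per node assumed).
quorum_needed() {
    echo $(( $1 / 2 + 1 ))
}

# Number of nodes that may drop out before the remaining ones lose quorum.
tolerated_failures() {
    echo $(( $1 - ($1 / 2 + 1) ))
}

quorum_needed 7         # prints 4 (the 7-node cluster from the first post)
tolerated_failures 7    # prints 3
quorum_needed 3         # prints 2 (a 3-node cluster)
tolerated_failures 3    # prints 1
```

When corosync communication fails on every node at once, each node sees only its own single vote, which is below the threshold in both clusters. That is why all nodes with HA active fence themselves at the same time.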

The classic cause for this is corosync sharing the physical network with other services that might take up all the bandwidth. In such a case, the latency of the corosync packets goes up, and since corosync is very latency-sensitive, it is likely to consider a link with that latency unusable.

That is why the recommendation for corosync is to have at least one dedicated physical network for it, to avoid such a problem. Additionally, it is a good idea to configure multiple corosync links, so it can try to switch to another network in case of problems. That could be necessary because the dedicated physical network is down due to some other problem (e.g. someone tripping over cables).

If you do not have a dedicated physical network for corosync, having multiple links configured might save you from fencing, but it is no guarantee as the other networks might also be in a state that makes them unusable for corosync.
 
Jun 8, 2016
Johannesburg, South Africa
The disruption of fencing a node unnecessarily is massive, so we adjusted Corosync to simply be less sensitive and only fence when a node was really unavailable. We typically recommend 4 x 10G interfaces, comprising two LACP bonds, where one is used for VM traffic and the second is used for Ceph replication (front & back) as well as Corosync and VM migration traffic.

We do, however, have several clusters where we simply use 2 x 10G interfaces in a bond and then run everything over that via VLANs. The following post details the changes we made to Corosync and how to stop the cluster and local resource managers prior to making changes:

https://forum.proxmox.com/threads/pve-5-4-11-corosync-3-x-major-issues.56124/post-269235
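
Stopping the resource managers before corosync maintenance could look roughly like the sketch below. Assumptions: the node names `pve1` to `pve3` are placeholders, root SSH between the nodes works, and the order (LRM on all nodes first, then CRM) follows what is commonly recommended on this forum; check the linked post for the exact procedure. By default the script only prints the commands.

```shell
#!/bin/sh
# RUN=echo (default) only prints the commands; call with RUN= to really execute them.
RUN=${RUN-echo}

# Stop the local resource manager on every node first, then the cluster resource manager.
ha_pause() {
    for node in "$@"; do $RUN ssh "root@$node" systemctl stop pve-ha-lrm; done
    for node in "$@"; do $RUN ssh "root@$node" systemctl stop pve-ha-crm; done
}

# Start both services again once the maintenance is finished.
ha_resume() {
    for node in "$@"; do $RUN ssh "root@$node" systemctl start pve-ha-lrm pve-ha-crm; done
}

ha_pause pve1 pve2 pve3
```

While the HA services are stopped, no node will fence itself, but HA guests are also not recovered automatically, so the window should be kept short.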
 
Jun 8, 2016
Johannesburg, South Africa
Also consider getting Corosync to automatically restart should it ever crash:

Code:
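# Create the drop-in directory (-p: no error if it already exists)
mkdir -p /etc/systemd/system/corosync.service.d
# Have systemd restart corosync whenever it exits abnormally
printf '[Service]\nRestart=on-failure\n' > /etc/systemd/system/corosync.service.d/override.conf
# Load the drop-in, then restart corosync and check its link status
systemctl daemon-reload
systemctl restart corosync
corosync-cfgtool -s
# Restart pmxcfs so it reconnects to corosync
systemctl restart pve-cluster.service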
 

elterminatore

Member
Jun 18, 2018
aaron said:
"Overall, I would check the syslogs for any corosync and pmxcfs log lines. If the network is having trouble, such messages might show up more often than expected, even if each interruption is short enough not to cause problems."

I checked all my logs but found nothing relevant. There are only very rare entries at night during backup tasks, like this:

Oct 06 02:03:23 p1 corosync[4323]: [KNET ] link: host: 3 link: 0 is down
Oct 06 02:03:23 p1 corosync[4323]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Oct 06 02:03:23 p1 corosync[4323]: [KNET ] host: host: 3 has no active links
Oct 06 02:03:25 p1 corosync[4323]: [KNET ] rx: host: 3 link: 0 is up
Oct 06 02:03:25 p1 corosync[4323]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)

...but absolutely nothing before the nodes fenced.

hm.... ok...

i adjusted corosync to be less sensitive as described here:
https://forum.proxmox.com/threads/pve-5-4-11-corosync-3-x-major-issues.56124/post-269235

I've also added a redundant link to corosync as described here:
https://pve.proxmox.com/pve-docs/pve-admin-guide.html#pvecm_redundancy

hope this helps.

(and i am thinking about adding this "restart if crashed" feature described here:
https://forum.proxmox.com/threads/unexpected-restart-of-all-cluster-nodes.97021/post-426580
not sure if this is necessary)
 

aaron

Proxmox Staff Member
Jun 3, 2019
elterminatore said:
"very rare entries in the night during backup tasks like this:

Oct 06 02:03:23 p1 corosync[4323]: [KNET ] link: host: 3 link: 0 is down
Oct 06 02:03:23 p1 corosync[4323]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Oct 06 02:03:23 p1 corosync[4323]: [KNET ] host: host: 3 has no active links
Oct 06 02:03:25 p1 corosync[4323]: [KNET ] rx: host: 3 link: 0 is up
Oct 06 02:03:25 p1 corosync[4323]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)"
Well, that is exactly what I was talking about. The backups needed all the bandwidth, and Corosync did not have a working link for 2 seconds. Short enough to not be a problem in this case, but an indication that there are some issues that could become very problematic if they persist for longer.
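
To see how often such short outages happen, the "link down" messages can be counted per day and peer. A minimal sketch, assuming syslog-style lines in the format quoted above with one message per line; the function name `count_link_down` is made up for this example.

```shell
#!/bin/sh
# Read corosync log text on stdin and print "<count> <month> <day> host <id>"
# for every day and peer that had at least one KNET link-down event.
count_link_down() {
    grep -E '\[KNET *\] link: host: [0-9]+ link: [0-9]+ is down' \
        | awk '{ for (i = 1; i <= NF; i++) if ($i == "host:") { print $1, $2, "host", $(i + 1); break } }' \
        | sort | uniq -c
}

# On a node:
#   journalctl -u corosync --since "7 days ago" | count_link_down
```

If the counts cluster around the backup window, that points to bandwidth contention rather than a faulty switch or cable.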

Again, configuring more Corosync links on the other networks that you have available is one way to mitigate the problem. This way, Corosync can switch to another network that hopefully works fine. Lastly, if you do have some unused NICs, use them to configure a network dedicated to Corosync, so that it will always have a network that other services do not interfere with.

elterminatore said:
"...but absolutely nothing before the nodes are fencing"
Because those unfortunately did not make it to disk before the nodes fenced :-/
Therefore we can only speculate what actually happened there.
 

itNGO

Well-Known Member
Jun 12, 2020
Germany
it-ngo.com
Hi,
Maybe setting the bandwidth limit for backups will help. But a dedicated Corosync link is still the best way to prevent issues.
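
For backups made with vzdump, the limit is given in KiB/s through the `bwlimit` option. The sketch below only converts a value and prints the resulting setting; the 50 MiB/s figure and VMID 100 are arbitrary placeholders, and file-level backup agents running inside the guests have their own, separate settings.

```shell
#!/bin/sh
# Convert a limit in MiB/s to the KiB/s value expected by vzdump's bwlimit option.
mib_to_kib() {
    echo $(( $1 * 1024 ))
}

limit=$(mib_to_kib 50)     # 50 MiB/s is an arbitrary example value
echo "bwlimit: $limit"     # prints "bwlimit: 51200", a candidate line for /etc/vzdump.conf

# One-off backup with the same limit (VMID 100 is a placeholder):
#   vzdump 100 --bwlimit "$limit"
```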
 

elterminatore

Member
Jun 18, 2018
Today I checked the PVE 7.1 release notes. Is it possible that this fixes my problem described above?
  • Updated corosync to include bug-fixes for issues occurring during network recovery.
    This could have otherwise lead to loss of quorum on all cluster nodes, which in turn would cause a cluster-wide fencing event in case HA was enabled.
If yes, is it safe for the update scenario from PVE 6 to PVE 7?
 

elterminatore

Member
Jun 18, 2018
Ouuu... today I installed the regular updates on the "old", not yet upgraded PVE 6 cluster and started rebooting the hosts in order. Then, on the second of seven, I had "cluster-wide" fencing and all hosts rebooted. :-(
I don't know which corosync bugfixes are included in PVE 7... but maybe they exist for PVE 6 too?
 

fabian

Proxmox Staff Member
Jan 7, 2016
elterminatore said:
"Ouuu... today I installed the regular updates on the "old", not yet upgraded PVE 6 cluster and started rebooting the hosts in order. Then, on the second of seven, I had "cluster-wide" fencing and all hosts rebooted. :-(
I don't know which corosync bugfixes are included in PVE 7... but maybe they exist for PVE 6 too?"
The same fixes are on the way to being rolled out for PVE 6.4 as well. They are currently available in pvetest.
 

elterminatore

Member
Jun 18, 2018
fabian said:
"The same fixes are on the way to being rolled out for PVE 6.4 as well. They are currently available in pvetest."

hi fabian,

thank you for this information

aaron said:
"If you do not have a dedicated physical network for corosync, having multiple links configured might save you from fencing, but it is no guarantee as the other networks might also be in a state that makes them unusable for corosync."

... yes, I do not have dedicated physical networks, and I will configure another link in corosync as I have done in my test environment.

But... was this a bug? Or would this fencing also have happened if I had multiple links?

best regards
stefan
 

fabian

Proxmox Staff Member
Jan 7, 2016
Irrespective of this bug, a separate (ideally redundant) physical link for corosync is recommended. Sharing the link with other, latency-affecting traffic can cause fencing events, either for single nodes or for the whole cluster, or cause frequent failure of operations if nodes lose quorum, even if they are able to rejoin before being fenced. This bug was special because it could cause fencing of the whole cluster simultaneously with practically no chance of recovery: with a normal network issue, there is a chance of the backlog being cleared in time; with this bug, only a restart of corosync within a minute after triggering would prevent fencing.
 

fabian

Proxmox Staff Member
Jan 7, 2016
indeed it was ;)
 

mmenaz

Active Member
Jun 25, 2009
Northern east Italy
fabian said:
"Irrespective of this bug, a separate (ideally redundant) physical link for corosync is recommended. Sharing the link with other, latency-affecting traffic can cause fencing events, either for single nodes or for the whole cluster, or cause frequent failure of operations if nodes lose quorum, even if they are able to rejoin before being fenced. This bug was special because it could cause fencing of the whole cluster simultaneously with practically no chance of recovery: with a normal network issue, there is a chance of the backlog being cleared in time; with this bug, only a restart of corosync within a minute after triggering would prevent fencing."
Reading this thread, but not being experienced with clusters, I'm really worried about a couple of points:
a) Fencing should work differently, i.e. the Proxmox node finds itself isolated, understands that it has to "commit suicide", then stops/kills all KVM processes (or LXC or whatever), logs the fact, syncs the local storage (where the logs are located), and then does a clean "reboot" or, if you think that is risky, a "reset".
b) If corosync is separated from the other networks, it can happen that all the other networks (storage and VM) are working, but a problem on the corosync network alone can provoke a cluster suicide... that's bad.
c) Why not have an option for not really critical setups (i.e. max 10 nodes, which can work with the described setup) to consider a node OK as long as it can communicate with its shared cluster storage? Just reserve a "cluster_disk" in that storage with a FS that supports concurrent writes, and have each node rewrite a file with its node name. If a node can't write there, it has to "commit suicide" (but as in point a)); if it can write, it just has to read all other nodes' timestamps, and if it finds ones that are older than "n" minutes, it can conclude that those nodes are out of the cluster and, e.g., start their HA VMs. I'm in a hurry, and maybe something more sophisticated must be thought out, like node_vmid.txt or a sort of "cluster db" like Proxmox already has, or is this good enough? Corosync is really overcomplicated and for small setups introduces more problems than it solves, IMHO.
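
The heartbeat-file idea from point c) can be sketched as follows. This is purely an illustration of the proposal, not an existing Proxmox VE feature. Assumptions: the directory is on shared storage, each node owns one file named after itself, the file content is an epoch timestamp, and all function names are made up.

```shell
#!/bin/sh
# Record this node's heartbeat. A failure here is the "cannot write" case from the post.
# The temporary file is hidden so that stale_nodes never sees a half-written entry.
write_heartbeat() {
    hb_dir=$1; node=$2; now=$3
    echo "$now" > "$hb_dir/.$node.tmp" && mv "$hb_dir/.$node.tmp" "$hb_dir/$node"
}

# Print the names of all nodes whose last heartbeat is older than max_age seconds.
stale_nodes() {
    hb_dir=$1; max_age=$2; now=$3
    for f in "$hb_dir"/*; do
        [ -f "$f" ] || continue
        last=$(cat "$f")
        if [ $(( now - last )) -gt "$max_age" ]; then
            basename "$f"
        fi
    done
}

# Example use on a node (the mount point is a placeholder):
#   write_heartbeat /mnt/shared/heartbeat "$(hostname)" "$(date +%s)"
#   stale_nodes /mnt/shared/heartbeat 120 "$(date +%s)"
```

The sketch leaves out the hard part: a node that looks stale to the others must be guaranteed to have stopped its guests before they are started elsewhere, which is the job the watchdog-based fencing does today.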
 
