Unexpected fencing

bjsko

Active Member
Sep 25, 2019
I have a 28-node PVE cluster running PVE 7.4-16. All nodes are Dell R640/R650 servers with 1.5 or 2 TB of RAM and Intel Xeon Gold CPUs. They all have 2 x 1 GbE NICs and 4 x 25 GbE NICs.

We are connected to an external Ceph cluster.

Node network config:
1 x 1 GbE NIC for management/SSH/GUI access, connected to a 1 GbE top-of-rack switch
1 x 1 GbE NIC for cluster ring 2 with the highest priority, connected to a dedicated 1 GbE switch used only for this purpose and only by this cluster
2 x 25 GbE OVS bond for VM traffic, the migration network, and cluster ring 0
2 x 25 GbE OVS bond for ceph-public and cluster ring 1
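For reference, this three-ring layout roughly corresponds to a corosync.conf like the sketch below (addresses, node names, and exact priority values are illustrative placeholders, not copied from our attached config; ring 2 gets the highest knet link priority, which matches the "pri: 40" visible in the logs further down):

```
totem {
  version: 2
  cluster_name: pve-cluster
  # ring 0: 2 x 25 GbE OVS bond (VM/migration traffic)
  interface {
    linknumber: 0
    knet_link_priority: 20
  }
  # ring 1: 2 x 25 GbE OVS bond (ceph-public)
  interface {
    linknumber: 1
    knet_link_priority: 30
  }
  # ring 2: dedicated 1 GbE cluster switch, preferred link
  interface {
    linknumber: 2
    knet_link_priority: 40
  }
}

nodelist {
  node {
    name: pve191
    nodeid: 19
    ring0_addr: 10.0.0.191
    ring1_addr: 10.0.1.191
    ring2_addr: 10.0.2.191
  }
  # ... one entry per node
}
```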

4 nodes in this cluster had HA-enabled VMs running on them, and the quorum master was located on a fifth node (which had no HA-enabled VMs running).

During a planned DR test for a customer (who did not have any VMs running on any of the 5 nodes mentioned above), we pulled the power from two hypervisors containing the customer's VMs. (Yes, we know, it may not be great for the VMs/corruption/etc., but that's not the point here :) )

To our surprise, the 4 hypervisors containing HA-enabled VMs plus the then-quorum master rebooted very soon after.

The log below is from one of the mentioned 5 hypervisors that were fenced (all 5 have similar logs). It shows that host 17 and host 22 are unavailable (that's expected, as they had their power removed), and then
Code:
Nov 11 10:11:13 pve191 watchdog-mux[815]: client watchdog expired - disable watchdog updates

Code:
Nov 11 10:10:24 pve191 corosync[1405]:   [KNET  ] link: host: 17 link: 0 is down
Nov 11 10:10:24 pve191 corosync[1405]:   [KNET  ] link: host: 17 link: 1 is down
Nov 11 10:10:24 pve191 corosync[1405]:   [KNET  ] link: host: 17 link: 2 is down
Nov 11 10:10:24 pve191 corosync[1405]:   [KNET  ] host: host: 17 (passive) best link: 2 (pri: 40)
Nov 11 10:10:24 pve191 corosync[1405]:   [KNET  ] host: host: 17 has no active links
Nov 11 10:10:24 pve191 corosync[1405]:   [KNET  ] host: host: 17 (passive) best link: 2 (pri: 40)
Nov 11 10:10:24 pve191 corosync[1405]:   [KNET  ] host: host: 17 has no active links
Nov 11 10:10:24 pve191 corosync[1405]:   [KNET  ] host: host: 17 (passive) best link: 2 (pri: 40)
Nov 11 10:10:24 pve191 corosync[1405]:   [KNET  ] host: host: 17 has no active links
Nov 11 10:10:31 pve191 corosync[1405]:   [TOTEM ] Token has not been received in 14925 ms
Nov 11 10:10:49 pve191 corosync[1405]:   [KNET  ] link: host: 22 link: 0 is down
Nov 11 10:10:49 pve191 corosync[1405]:   [KNET  ] link: host: 22 link: 1 is down
Nov 11 10:10:49 pve191 corosync[1405]:   [KNET  ] link: host: 22 link: 2 is down
Nov 11 10:10:49 pve191 corosync[1405]:   [KNET  ] host: host: 22 (passive) best link: 2 (pri: 40)
Nov 11 10:10:49 pve191 corosync[1405]:   [KNET  ] host: host: 22 has no active links
Nov 11 10:10:49 pve191 corosync[1405]:   [KNET  ] host: host: 22 (passive) best link: 2 (pri: 40)
Nov 11 10:10:49 pve191 corosync[1405]:   [KNET  ] host: host: 22 has no active links
Nov 11 10:10:49 pve191 corosync[1405]:   [KNET  ] host: host: 22 (passive) best link: 2 (pri: 40)
Nov 11 10:10:49 pve191 corosync[1405]:   [KNET  ] host: host: 22 has no active links
Nov 11 10:11:13 pve191 watchdog-mux[815]: client watchdog expired - disable watchdog updates
-- Boot 003ce3afa06e4086bcf8d1bf8a73ebfe --

We have performed similar DR tests before; the new factor in the cluster is that HA is now enabled for some VMs.

Is it possible to figure out why the fencing happened? Can I provide more logs of some sort to help anyone determine whether there are issues in my configuration?

proxmox-ve: 7.4-1 (running kernel: 5.15.116-1-pve)
pve-manager: 7.4-16 (running version: 7.4-16/0f39f621)
pve-kernel-5.15: 7.4-6
pve-kernel-5.15.116-1-pve: 5.15.116-1
pve-kernel-5.15.107-2-pve: 5.15.107-2
pve-kernel-5.15.30-2-pve: 5.15.30-3
ceph: 17.2.6-pve1
ceph-fuse: 17.2.6-pve1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx4
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.4
libproxmox-backup-qemu0: 1.3.1-1
libproxmox-rs-perl: 0.2.1
libpve-access-control: 7.4.1
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.4-2
libpve-guest-common-perl: 4.2-4
libpve-http-server-perl: 4.2-3
libpve-rs-perl: 0.7.7
libpve-storage-perl: 7.4-3
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.2-2
lxcfs: 5.0.3-pve1
novnc-pve: 1.4.0-1
openvswitch-switch: 2.15.0+ds1-2+deb11u4
proxmox-backup-client: 2.4.3-1
proxmox-backup-file-restore: 2.4.3-1
proxmox-kernel-helper: 7.4-1
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.3-1
proxmox-offline-mirror-helper: 0.5.2
proxmox-widget-toolkit: 3.7.3
pve-cluster: 7.3-3
pve-container: 4.4-6
pve-docs: 7.4-2
pve-edk2-firmware: 3.20230228-4~bpo11+1
pve-firewall: 4.3-5
pve-firmware: 3.6-5
pve-ha-manager: 3.6.1
pve-i18n: 2.12-1
pve-qemu-kvm: 7.2.0-8
pve-xtermjs: 4.16.0-2
qemu-server: 7.4-4
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+3
vncterm: 1.7-1
zfsutils-linux: 2.1.11-pve1

Many thanks
Bjørn
 

Attachments

  • interfaces.txt
    2.4 KB
  • corosync.conf.txt
    4.6 KB
  • corosync-cfgtool.txt
    13.2 KB
did you pull the power at the same time? it seems your cluster takes a long time to (re)establish communication, if your timing was sub-optimal the following could have gone down:

- token round starts
- first node's power is pulled
- link down is noticed, token restarted
- token times out, another round started
- second node's power is pulled
- link down is noticed, token restarted

what I find a bit suspicious is that except for the single token timeout line, nothing about corosync is logged at all.. we might find more information if you post the full journal output covering 10:10-10:12 on that day for *all* nodes.

if you actually pulled the power at the same time for both nodes, then something is very wrong, since the gap between the link down events is 25s!
 
did you pull the power at the same time? it seems your cluster takes a long time to (re)establish communication, if your timing was sub-optimal the following could have gone down:
First of all, thanks for coming back to me.

The power was pulled in sequence: first PSU1 and PSU2 on the first node, and then the same on the second node. There were definitely some seconds between the two nodes going down, maybe 15-30 seconds, as the person pulling the cables had to double-check they were pulling the right power cords etc.

- token round starts
- first node's power is pulled
- link down is noticed, token restarted
- token times out, another round started
- second node's power is pulled
- link down is noticed, token restarted

If this assumption is correct, could that have caused the fencing?

what I find a bit suspicious is that except for the single token timeout line, nothing about corosync is logged at all.. we might find more information if you post the full journal output covering 10:10-10:12 on that day for *all* nodes.
Code:
journalctl --since "2023-11-11 10:10:00" --until "2023-11-11 10:14:00"
Output for all nodes in the cluster is attached.

if you actually pulled the power at the same time for both nodes, then something is very wrong, since the gap between the link down events is 25s!
As mentioned above, there were 15-30 seconds between the power being pulled on the two nodes.

I hope you are able to see something in the attached logs, highly appreciated!

BR
Bjørn
 

Attachments

  • pve.log.gz
    29.5 KB
this is really peculiar - do you have any kind of monitoring in place (for the nodes and the network)? anything interesting showing up there (e.g., massive load or traffic spikes during that time period)?

the logs don't really tell us much:
- host 17 is detected as down at 10:10:21-10:10:26
- token timeout logged at 10:10:31
- host 22 detected as down at 10:10:45-10:10:50
- silence
- host 20 detected as down at 10:11:23 (roughly a minute after the first node goes down, which lines up with the watchdog timeout)
- other nodes go down until 10:11:34 (which is still roughly a minute after, depending on when the last write still went through, and when the watchdog was last pulled up exactly)
- new membership with 6 nodes removed formed at 10:12:12

does this last timestamp (roughly) line up with one of the fenced nodes finishing their reboot and starting corosync? it seems that up until that point, corosync was either stuck, or repeatedly tried but failed to establish the cluster membership (without running into any more timeouts, so could be that some condition just caused the process to be aborted and started over, over and over again).
 
this is really peculiar - do you have any kind of monitoring in place (for the nodes and the network)? anything interesting showing up there (e.g., massive load or traffic spikes during that time period)?

We do have monitoring of switches and hypervisors. I will dig some more to see if I can see something out of the ordinary.
the logs don't really tell us much:
- host 17 is detected as down at 10:10:21-10:10:26
- token timeout logged at 10:10:31
- host 22 detected as down at 10:10:45-10:10:50
- silence
- host 20 detected as down at 10:11:23 (roughly a minute after the first node goes down, which lines up with the watchdog timeout)
- other nodes go down until 10:11:34 (which is still roughly a minute after, depending on when the last write still went through, and when the watchdog was last pulled up exactly)
- new membership with 6 nodes removed formed at 10:12:12

does this last timestamp (roughly) line up with one of the fenced nodes finishing their reboot and starting corosync? it seems that up until that point, corosync was either stuck, or repeatedly tried but failed to establish the cluster membership (without running into any more timeouts, so could be that some condition just caused the process to be aborted and started over, over and over again).
I attached a syslog from one hypervisor in the cluster that wasn't involved in fencing or planned power off. To me it seems like the first fenced node (host 20/pve190) joined the cluster at 10:13:49 (line 102 in the attached log file).

I guess it is not so easy to find out whether it was actually a stuck process or whether it kept restarting? I am puzzled as to what caused this.

I have removed HA from all VMs in our two large PVE clusters for now, just to be on the safe side until we can hopefully figure this out.

BR
Bjørn
 

Attachments

  • pve114_syslog.txt
    25.3 KB
could you check old logs and see how long it takes normally on this cluster to go from "host: NN has no active links" to "A new membership (...) was formed."? 40s seems awfully long (and that's going with the "nicest" interpretation, counting with the last host down, not the first!). when the first node comes back online, it takes around 4-5s from the link being detect as up to the sync being finished - this is more in line with expectations (your cluster is rather big, and the sync is an expensive operation).
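for example, something like this could be used to pull those gaps out of a journalctl dump (just a sketch: it assumes the default journalctl timestamp format and the exact corosync message wording quoted above, and the sample lines below are illustrative, not from your logs):

```python
import re
from datetime import datetime

# Gap between the last "has no active links" line and the following
# "A new membership ... was formed" line in journalctl output.
DOWN = re.compile(r"has no active links")
FORMED = re.compile(r"A new membership .* was formed")
TS = re.compile(r"^(\w{3}\s+\d+ \d{2}:\d{2}:\d{2})")

def membership_gaps(lines, year=2023):
    gaps, last_down = [], None
    for line in lines:
        m = TS.match(line)
        if not m:
            continue  # skip lines without a leading timestamp
        ts = datetime.strptime(f"{year} {m.group(1)}", "%Y %b %d %H:%M:%S")
        if DOWN.search(line):
            last_down = ts  # keep overwriting: we want the *last* host-down
        elif FORMED.search(line) and last_down is not None:
            gaps.append((ts - last_down).total_seconds())
            last_down = None
    return gaps

sample = [
    "Nov 11 10:10:49 pve191 corosync[1405]:   [KNET  ] host: host: 22 has no active links",
    "Nov 11 10:11:29 pve191 corosync[1405]:   [TOTEM ] A new membership (13.5a4) was formed. Members left: 17 22",
]
print(membership_gaps(sample))  # [40.0]
```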
 
could you check old logs and see how long it normally takes on this cluster to go from "host: NN has no active links" to "A new membership (...) was formed."? 40s seems awfully long (and that's going with the "nicest" interpretation, counting from the last host down, not the first!). when the first node comes back online, it takes around 4-5s from the link being detected as up to the sync being finished - this is more in line with expectations (your cluster is rather big, and the sync is an expensive operation).
I was away yesterday, so sorry for not getting back to you sooner.

I have looked through quite a few logs and if I have interpreted it correctly most of the time it takes around 4-5 seconds. I have seen once or twice that it reports around 15-17s.

Cluster size is a real worry. We have a lot of rather memory-hungry VMs and need many hypervisors. We have thought about splitting up the cluster into smaller ones, but as we are using external Ceph storage and have been advised not to connect several PVE clusters to the same storage, we are not really sure how to proceed.

Will fencing like we saw in this incident only occur when we have HA-enabled VMs/CTs running in the cluster?

BR
Bjørn
 
I have looked through quite a few logs and if I have interpreted it correctly most of the time it takes around 4-5 seconds. I have seen once or twice that it reports around 15-17s.
15 does already sound quite long to be honest..
Cluster size is a real worry. We have a lot of rather memory hungry VM's and need many hypervisors. We have thought about splitting up the cluster into smaller ones, but as we are using external CEPH storage and have been advised not to connect several PVE clusters to the same storage, we are not really sure about how to progress.
yes, your cluster is definitely on the upper end size-wise. splitting up should be doable: Ceph supports namespaces and/or different pools to separate your PVE clusters while still using a single (external) Ceph cluster, but actually doing that split will certainly be a bit of work..
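as a rough sketch of what that could look like on the PVE side, each cluster would get its own namespace (and ideally a CephX user restricted to it) in its /etc/pve/storage.cfg - placeholder names and addresses below, double-check the options against the storage documentation:

```
rbd: ceph-external
        monhost 192.168.10.1 192.168.10.2 192.168.10.3
        pool rbd-pve
        namespace cluster-a
        content images,rootdir
        username pve-cluster-a
```

the second PVE cluster would then use namespace cluster-b (or its own pool) with a different, restricted CephX user, so neither cluster can touch the other's images.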
Will fencing like we saw in this incident only occur when we have HA-enabled VMs/CTs running in the cluster?
yes. without HA active, the watchdogs are not armed, and no fencing will occur. /etc/pve might still become read-only during a loss of quorum, but that should not affect already running guests.
 
Thanks for confirming that no fencing will occur if no HA-enabled services are running. Would it be a good idea to make sure the quorum master is running on a node that has HA VMs/CTs, to prevent fencing of an "unrelated" node?

Regarding multiple PVE clusters connected to the same external Ceph cluster, you just made the penny drop. We are already using different pools for different customers, and have only recently started thinking about namespaces as an alternative. Making sure pools/namespaces can only be accessed from one PVE cluster makes it (hopefully) safe to do the split while keeping our current storage cluster. Not sure why I hadn't really got this point before...

Back to the fencing "incident": I think we have done things pretty much by the book when it comes to setting up the cluster rings etc. Based on the logs you have seen, and on your experience, would you say that we should dig deeper into our network infrastructure and that the issue is most likely somewhere in there, or do you think we will be able to find either a corner case or bug(s)? It's a silly question, I know, but I guess you have seen many cases over the years and have some data on whether it is normally a config/infrastructure issue or whether it might be something in the code :)

Any other ideas on troubleshooting the issue?

Many thanks
Bjørn
 
Thanks for confirming that no fencing will occur if no HA-enabled services are running. Would it be a good idea to make sure the quorum master is running on a node that has HA VMs/CTs, to prevent fencing of an "unrelated" node?

I don't think that's forceable..

Back to the fencing "incident": I think we have done things pretty much by the book when it comes to setting up the cluster rings etc. Based on the logs you have seen, and on your experience, would you say that we should dig deeper into our network infrastructure and that the issue is most likely somewhere in there, or do you think we will be able to find either a corner case or bug(s)? It's a silly question, I know, but I guess you have seen many cases over the years and have some data on whether it is normally a config/infrastructure issue or whether it might be something in the code :)

I don't really see anything "wrong" with your setup - the only thing that might help would be tuning certain parameters since it is rather big, but that is a very delicate matter.
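for context (not a suggestion to change anything): with knet, the effective token timeout grows with cluster size, and the numbers here line up with that. corosync warns at roughly 75% of the timeout, and for 28 nodes 0.75 x (3000 + 26 x 650) = 14925 ms, which is exactly the "Token has not been received in 14925 ms" line in your log. the relevant (delicate) knobs live in the totem section of corosync.conf - the values below are the apparent effective values for your cluster, shown for illustration, not tuning advice:

```
totem {
  # effective token timeout = token + (nodes - 2) * token_coefficient
  # here: 3000 + 26 * 650 = 19900 ms for 28 nodes
  token: 3000
  token_coefficient: 650
}
```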

Any other ideas on troubleshooting the issue?

you could try to reproduce the issue with debug logging active (warning, will generate a lot of logs!), we might be able to figure out (possibly together with upstream) why it takes so long to re-establish the membership. but I'd understand it if you don't want to do that and rather focus on implementing the split ;)
 
I don't think that's forceable..

I read this comment earlier, so that's why I was mentioning it :)

I don't really see anything "wrong" with your setup - the only thing that might help would be tuning certain parameters since it is rather big, but that is a very delicate matter.

Yes, I have been quite reluctant to mess around with corosync config options/tuning, as I expect it to be rather high-risk...

you could try to reproduce the issue with debug logging active (warning, will generate a lot of logs!), we might be able to figure out (possibly together with upstream) why it takes so long to re-establish the membership. but I'd understand it if you don't want to do that and rather focus on implementing the split ;)

Just for the record, is it enabling the debug option in corosync that would possibly tell us something? I might give it a go.

I think we will start preparing for a split in any case, even if I don't want to ;)

BR
Bjørn
 
Just for the record, is it enabling the debug option in corosync that would possibly tell us something? I might give it a go.
yes. IIRC it requires a restart of corosync to become effective. and like I said - it will produce **a lot** of logs, so it's best to just have it enabled for the duration of repeating the "hard power off two nodes" experiment, and not for an extended period of time. if you don't want to inflate your system logs, you could also disable logging to syslog and start corosync manually (dumping the log output to a file), but you probably would want to test that on a test system first.
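for reference, that's the debug option in the logging section (edited via /etc/pve/corosync.conf, bumping config_version so the change syncs; a sketch, see the corosync.conf(5) man page for the details):

```
logging {
  debug: on
  to_syslog: yes
}
```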
 
Thanks for confirming that no fencing will occur if no HA enabled services are running. Would it be a good idea to make sure the quorum master is running on a node that have HA VM's/CT's to prevent fencing of a "unrelated" node?

I don't think that's forceable..

I read this comment earlier, so that's why I was mentioning it :)

@fabian I think this got buried here. This is a genuine problem, also in light of the OP's findings; there is a good summary post on this [1]. The common (wrong) "wisdom" found all around the forum (and not explicitly addressed in the docs) is that if there are no HA resources running on a node, it can't fence because the watchdog is "disarmed". Never mind that the watchdog is actually never disarmed (@t.lamprecht is the only one I found who actually mentioned this) unless one gracefully stops watchdog-mux. Even if the client watchdog is inactive, from the admin's perspective one cannot really rely on this "wisdom", for these reasons:

1) there's still some CRM node, which may or may not be the same as one of the LRM nodes with resources - one would need to check which;
2) resources might migrate away and that LRM goes idle, or the CRM loses its lock and another node becomes master;
3) the admin has zero configuration ability to influence (2) above.

As such, at any given point, any node, LRM idle or not, could be or become the CRM, and thus could potentially fence. So with HA enabled, any node in the whole cluster could fence.

I do not think it's that easy to go "intelligent" (as mentioned in [1]) about picking the CRM, but the admin's expectation is realistic: when there are HA groups to limit resources to "HA nodes", there should also be a CRM group that limits which nodes could ever become HA master. That would be a very sensible add-on: backwards compatible, defaulting to all nodes, but very straightforward for the admin to limit and to predict.

Note, there's also the freshly filed bug [2] about an active watchdog being left behind on a CRM that never goes idle. But that's beyond the scope of this post; just to clarify, it's completely unrelated.

[1] https://forum.proxmox.com/threads/i...p-the-only-ones-to-fences.122428/#post-532470
[2] https://bugzilla.proxmox.com/show_bug.cgi?id=5243
 
I suppose this was TL;DR for most.

I will go ahead and create 3 Bugzilla issues then:
1) The docs should not imply there's some "disarmed" state for the watchdog;
2) The docs should mention there's a chance ANY node can fence, as any node can become the CRM;
3) There should be a crm_candidates_group or similar that's configurable, for the future.

Will update with links later on.
 
