PVE crashes if corosync is slow

jonasled · Jan 1, 2023

Hi,
I have a PVE cluster with 5 nodes. Tonight one of them had very slow disk I/O on its root disk. This slowed down corosync until the volume was no longer accessible on any node (the folder was still there, but, for example, an ls never returned a result). The problem then was that multiple tasks were waiting for responses from the cluster storage. This resulted in a lockup and then a full crash of all nodes in the cluster. Is there any way to prevent this from happening again in the future?

bbgeek17 · Jan 2, 2023

You have not stated it explicitly, so just to clarify: by "cluster storage", did you mean /etc/pve (pmxcfs) or Ceph?
What exactly did "full crash of all nodes" look like?

jonasled · Jan 2, 2023

I mean /etc/pve (pmxcfs)

Realtox · Apr 26, 2023

Hi @jonasled,

what version do you use (pveversion)?
And can you describe what a crash looks like?

jonasled · Apr 26, 2023

The host crashes without logging a kernel panic and then reboots. I've uploaded the syslog from the last minutes before the crash here: https://paste.jonasled.de/atakemitew

fabian · Apr 26, 2023

you had HA enabled, but issues with your network caused corosync and thus /etc/pve to have an outage and the node fenced itself:

Code:

Apr 25 21:25:24 pve-router-01 pmxcfs[5090]: [dcdb] notice: cpg_send_message retry 10
...
Apr 25 21:26:16 pve-router-01 watchdog-mux[4098]: client watchdog expired - disable watchdog updates

this means that the HA stack couldn't pull up the watchdog for 60s because /etc/pve was not writable. even before that, you can see the link to host 2 going down and coming up again and corosync struggling to keep up. at the point of the full outage it seems like corosync was not able to complete the sync after the membership change. it is hard to tell what is going on there without logs from the other nodes, and a full analysis would likely require debug logs.
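To spot this pattern in a longer syslog, a small script can look for the two markers quoted above and report how much time passed between them. This is only a sketch: the function name is made up for illustration, it assumes classic syslog timestamps without a year, and it only sees the retries that are actually in the input (the excerpt above starts at retry 10, so the real stall began earlier).

```python
from datetime import datetime

# Substrings taken from the two log lines quoted in this thread.
RETRY_MARKER = "cpg_send_message retry"
EXPIRED_MARKER = "client watchdog expired"


def parse_ts(line):
    # Classic syslog timestamps ("Apr 25 21:25:24") have no year, so the
    # parsed value is only meaningful for differences within one log.
    return datetime.strptime(line[:15], "%b %d %H:%M:%S")


def seconds_from_first_retry_to_expiry(lines):
    """Seconds between the first pmxcfs retry seen and the watchdog expiry.

    Returns None if the pattern does not occur in the given lines.
    """
    first_retry = None
    for line in lines:
        if RETRY_MARKER in line and first_retry is None:
            first_retry = parse_ts(line)
        elif EXPIRED_MARKER in line and first_retry is not None:
            return (parse_ts(line) - first_retry).total_seconds()
    return None


sample = [
    "Apr 25 21:25:24 pve-router-01 pmxcfs[5090]: [dcdb] notice: cpg_send_message retry 10",
    "Apr 25 21:26:16 pve-router-01 watchdog-mux[4098]: client watchdog expired - disable watchdog updates",
]
print(seconds_from_first_retry_to_expiry(sample))
```

To run it against a real file, pass `open("/var/log/syslog")` instead of `sample`.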

jonasled · Apr 26, 2023

Yes, host 2 (pve-router-05) had a problem with very slow I/O. I've uploaded the logs for the last few days from all nodes:
* corosync config: https://paste.jonasled.de/uremaxawek.yaml
* pve-router-01: https://transfer.jonasled.de/EDWlzC1oNT/syslog
* pve-router-02: https://transfer.jonasled.de/2g3NOepvyA/syslog
* pve-router-03: https://transfer.jonasled.de/SVeLY2QCAG/syslog
* pve-router-04: https://transfer.jonasled.de/cPozunHx99/syslog
* pve-router-05: https://transfer.jonasled.de/NkO2Q0hrtg/syslog

fabian · Apr 26, 2023

thanks for the logs!

how is your network configured? was the flapping link caused by the slow storage and a resulting overload of the host in question? or was the flapping link causing the storage outage? or was the flapping link unrelated?

in any case, it looks to me like the combination of host 2 (pve-router-05)
- having a flapping link that went down and up frequently made corosync unable to establish the cluster members (each change in topology makes the process start over)
- while at the same time not having HA active so it was not fenced even though it lost quorum
was the main factor making the situation as bad as it was.

this might have been avoided by making corosync mark a link as up after a longer time only (so that a node with all links flapping drops from the membership and the others might be able to establish a quorum before the flapping link(s) are considered up again), but that also means that recovery takes longer in case of a short outage. without knowing the root cause it's hard to recommend any changes.
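The trade-off described above can be illustrated with a toy model. It is a simplification and not corosync's actual link logic: a link is treated as up only after `pong_count` consecutive heartbeat replies and as down again on the first miss. The function names and the heartbeat sequence are invented for the example. In a real cluster the corresponding knobs would be the per-link knet settings documented in corosync.conf(5) (for example `knet_pong_count`); check the man page for your version before changing anything.

```python
def link_states(pongs, pong_count):
    """Return the modelled up/down state after each heartbeat.

    pongs: iterable of bools, True when a heartbeat reply arrived.
    pong_count: consecutive replies required before the link counts as up.
    """
    states = []
    streak = 0
    for ok in pongs:
        # A missed reply resets the streak, which takes the link down at once.
        streak = streak + 1 if ok else 0
        states.append(streak >= pong_count)
    return states


def transitions(states):
    # Every up/down change would trigger a new membership negotiation.
    return sum(1 for a, b in zip(states, states[1:]) if a != b)


# Hypothetical flapping link: two replies, one miss, repeated.
flapping = [True, True, False, True, True, False, True, True, False]

for threshold in (1, 4):
    states = link_states(flapping, threshold)
    print(threshold, transitions(states), any(states))
```

With a threshold of 1 the modelled link changes state on every flap, while with a threshold of 4 it never comes up in this sequence and so causes no membership changes. The cost is the one mentioned above: after a genuine short outage, a healthy link also takes longer to be trusted again.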

based on the boot up messages, it looks to me like 3 of your PVE hosts are virtual themselves? is it possible that the hypervisor had issues causing the VM of pve-router-05 to not be scheduled (frequently enough), and that caused the link flapping?

jonasled · Apr 26, 2023

Only pve-router-01 is a physical host; the rest are virtual. When router-01 is online, all VMs run on this host; otherwise they run on the other nodes. The network link is not the problem: they are all on the same switch, and the switch hasn't logged any disconnects.
During yesterday's crash there was a problem on the host of router-05 which resulted in a very high CPU load. This caused the flapping link for corosync, but the question is why the other hosts crash when one node has a problem with corosync.

fabian · Apr 26, 2023

because if all links of a node are flapping for a longer period of time, corosync might not be able to finish establishing which nodes are online, and if a node has HA enabled, not having a quorum for a longer period of time causes that node to fence itself.
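For reference, the quorum arithmetic for this cluster can be sketched as follows, assuming the default of one vote per node (the linked corosync config was not checked for custom vote counts). The function names are illustrative only.

```python
def quorum_threshold(expected_votes):
    # A strict majority of all expected votes is required.
    return expected_votes // 2 + 1


def is_quorate(expected_votes, votes_present):
    return votes_present >= quorum_threshold(expected_votes)


nodes = 5  # cluster size stated in the first post
print(quorum_threshold(nodes))
print(is_quorate(nodes, 4))  # one node missing
print(is_quorate(nodes, 2))  # three nodes missing
```

By these numbers, four healthy nodes out of five would be enough for a quorum once membership settles. The failure described in this thread is different: the membership never settled, so no node could confirm that it was part of a quorate partition, and the HA-enabled nodes fenced themselves.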
