PVE crashes if corosync is slow

jonasled

Hi,
I have a PVE cluster with 5 nodes. Last night one of them had very slow disk I/O on its root disk. This slowed corosync down until the volume was no longer accessible on any node (the folder was still there, but an ls, for example, never returned a result). Multiple tasks were then stuck waiting for responses from the cluster storage, which led to a lockup and then a full crash of all nodes in the cluster. Is there any way to prevent this from happening again in the future?
 
you had HA enabled, but issues with your network caused corosync and thus /etc/pve to have an outage and the node fenced itself:

Code:
Apr 25 21:25:24 pve-router-01 pmxcfs[5090]: [dcdb] notice: cpg_send_message retry 10
...
Apr 25 21:26:16 pve-router-01 watchdog-mux[4098]: client watchdog expired - disable watchdog updates

this means that the HA stack couldn't update the watchdog for 60s because /etc/pve was not writable. already before that you can see the link to host 2 going down and up again and corosync struggling to keep up. at the point of the full outage it seems like corosync was not able to complete the sync-up on membership change - hard to tell what is going on there without logs from the other nodes, and it would likely require debug logs to fully analyze.
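
as a side note, if you want to check whether HA is currently active on a node (and thus whether it would fence itself after losing quorum for too long), something along these lines should show it - the exact output differs between PVE versions:

Code:
# HA manager view of the cluster (quorum, current master, per-node LRM state)
ha-manager status
# state of the services involved in watchdog-based fencing on this node
systemctl status pve-ha-lrm pve-ha-crm watchdog-mux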
 
Yes, host 2 (pve-router-05) had a problem with very slow IO. I've uploaded the logs from the last few days for all nodes:
* corosync config: https://paste.jonasled.de/uremaxawek.yaml
* pve-router-01: https://transfer.jonasled.de/EDWlzC1oNT/syslog
* pve-router-02: https://transfer.jonasled.de/2g3NOepvyA/syslog
* pve-router-03: https://transfer.jonasled.de/SVeLY2QCAG/syslog
* pve-router-04: https://transfer.jonasled.de/cPozunHx99/syslog
* pve-router-05: https://transfer.jonasled.de/NkO2Q0hrtg/syslog
 
thanks for the logs!

how is your network configured? was the flapping link caused by the slow storage and a resulting overload of the host in question? or was the flapping link causing the storage outage? or was the flapping link unrelated?

in any case, it looks to me like the combination of host 2 (pve-router-05)
- having a flapping link that went down and up frequently, which made corosync unable to establish the cluster membership (each change in topology makes that process start over)
- while at the same time not having HA active, so it was not fenced even though it had lost quorum
was the main factor that made the situation as bad as it was.

this might have been avoided by making corosync mark a link as up only after a longer time (so that a node with all of its links flapping drops out of the membership, and the others can establish quorum before the flapping link(s) are considered up again), but that also means recovery takes longer in case of a short outage. without knowing the root cause it's hard to recommend any changes.
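
to illustrate what that kind of tuning could look like (just a sketch, not a tested recommendation - the values below are invented and would need to be adapted and tested for your network): the knet link health checks can be tuned in the interface subsection of the totem block in /etc/pve/corosync.conf, and a higher knet_pong_count makes corosync wait for more consecutive successful pings before it considers a link up again. config_version has to be increased whenever the file is edited:

Code:
totem {
  ...                         # keep your existing settings (cluster_name, token, ...)
  config_version: 16          # placeholder - must be higher than the current value
  interface {
    linknumber: 0
    knet_ping_interval: 200   # ms between link health-check pings (illustrative value)
    knet_ping_timeout: 2000   # ms without a reply before the link is marked down (illustrative)
    knet_pong_count: 5        # consecutive replies needed before the link is marked up again (illustrative)
  }
}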

based on the boot-up messages, it looks to me like 3 of your PVE hosts are virtual themselves? is it possible that the hypervisor had issues causing the VM of pve-router-5 to not be scheduled (frequently enough), and that this caused the link flapping?
 
Only PVE-router-01 is a physical host, the rest are virtual. When router-01 is online, all VMs run on this host; otherwise they run on the other nodes. The network link is not the problem: they are all on the same switch, and the switch hasn't logged any disconnects.
With the crash yesterday there was a problem on the hypervisor hosting router-05, which resulted in very high CPU load. That caused the flapping link for corosync, but the question is why the other hosts crash when one node has a problem with corosync.
 
because if all links of a node are flapping for a longer period of time, corosync might not be able to finish establishing which nodes are online, and if a node has HA enabled, not having a quorum for a longer period of time causes that node to fence itself.
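
if you want to watch this while it is happening, pvecm status shows membership and quorum and corosync-cfgtool -s shows the knet link state on each node - the exact output varies a bit between versions:

Code:
# membership and quorum information as this node sees it
pvecm status
# per-link knet status of the local node
corosync-cfgtool -s
# follow corosync's log messages about link and membership changes
journalctl -u corosync -f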
 
