Proxmox outage - single node became quorate on its own

Nov 24, 2022
Hi everyone,

Last Friday we experienced a major outage in our Proxmox cluster. While performing maintenance on one of the nodes, that node unexpectedly became quorate on its own, which triggered HA to restart VMs. As a result, multiple instances of the same VMs ended up running simultaneously on different nodes, leading to data corruption.

We’re currently trying to understand how and why Corosync determined that this single node had quorum.

I’ve attached the relevant Corosync logs from the incident.

What we know so far:

  • Corosync was trying to reach other cluster members, so we believe the nodes were correctly listed in the Corosync configuration.
  • The network/heartbeat interfaces were flapping – but only on the node that was under maintenance. All other nodes remained stable and could still reach each other.
  • At 12:51:28, Corosync logs state:

    "This node is within the primary component and will provide service."
    This appears to be the point where the trouble began – but how can this happen with only a single active node?
We would greatly appreciate any insights or guidance from the community on:

  • What could cause Corosync to falsely assume quorum on a single node?
  • How to prevent such a situation in the future?
Thanks in advance for your help!

Best regards,
Julian
 


Hi,

Thank you for the log you've provided!


Before I can comment on the actual cause of the issue, could you please provide the `/etc/pve/corosync.conf`, the network configuration `/etc/network/interfaces`, and the output of `ip a`? This should give us an overview of the network configuration and help us understand what exactly happened and how you can improve things or avoid this in the future.
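Something along these lines on the affected node should collect everything (the output file names are only examples):

```
# Gather the requested configuration from the affected node:
cat /etc/pve/corosync.conf    > corosync-conf.txt
cat /etc/network/interfaces   > interfaces.txt
ip a                          > ip-a.txt
```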
 
@Moayad @LnxBil
Please find the requested files attached here. The corosync.conf was last modified in October. I'm currently trying to piece together what exactly happened, since I wasn't there when the outage occurred. Apparently, the node was put into maintenance, updates were installed, and after a reboot there were some connectivity issues or something along those lines - I'm sorry I can't provide more details at the moment.
 


Hi,
please check your shell command history from around the time of the message, i.e. May 30 12:51:28
 
somebody ran "pvecm expected 1"..
 
Nothing in the history indicates that somebody did that.

We only just today noticed that a second node also exhibited the same behaviour. It too became quorate on its own, just four seconds earlier.
 
corosync doesn't change the quorum rules on its own, and neither does PVE. the only way this can happen is if somebody (or something) ran "pvecm expected 1" (or the corresponding "corosync-quorumtool" call it wraps), or the log files and config you provided are incomplete or are not the config that was actually running at that time.
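for reference, a rough sketch of the two equivalent calls and how the result shows up afterwards (exact output differs between versions):

```
# Either of these lowers the expected vote count so that a lone node
# considers itself quorate again:
pvecm expected 1
corosync-quorumtool -e 1

# The change is visible in the quorum state afterwards:
pvecm status
corosync-quorumtool -s
```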

maybe you can provide the full journal covering those ~15 minutes for all nodes?
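something like this on each node should do (the timestamps below are only placeholders - adjust them to the actual window):

```
# Export the relevant time window from the journal on each node:
journalctl --since "2022-05-30 12:45:00" --until "2022-05-30 13:00:00" \
    > "journal-$(hostname).txt"
```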
 
there's a root ssh login from 192.168.32.10 on both affected nodes (well, two on one, one on the other). those seem to be interactive shells. additionally there are two UI logins on other nodes. I'd start by asking the people who did those - the logs show no trace of anything other than what gets logged when somebody manually sets the expected vote counters on those two nodes.
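for reference, roughly where such logins show up (the auth.log file only exists if rsyslog is installed):

```
# SSH logins with source address, from wtmp:
last -i root

# The same information via the journal or the auth log:
journalctl _COMM=sshd | grep -i accepted
grep -i accepted /var/log/auth.log
```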
 
Unfortunately, we do not have a timestamped history. Any hints on what to look for?
 
But you have a history, so have you checked there for the word "expected"?
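e.g., assuming the default root bash history, something along these lines on every node:

```
# Search root's shell history for the relevant commands:
grep -nE 'pvecm expected|corosync-quorumtool' /root/.bash_history
```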

In general, we need better auditing for the whole system. I looked into this just last week and it is not as easy as it sounds, especially filtering out the good and concentrating only on the bad and the ugly.
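A very rough sketch of the direction I was looking at, using auditd (the rule key name is arbitrary):

```
# Log every command execution so that e.g. a manual "pvecm expected 1"
# at least leaves a trace:
apt install auditd
auditctl -a always,exit -F arch=b64 -S execve -k cmd-exec

# Later, search the audit log for anything touching the vote count:
ausearch -k cmd-exec -i | grep -E 'pvecm|corosync-quorumtool'
```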