Proxmox outage - single node became quorate on its own

Nov 24, 2022
Hi everyone,

Last Friday we experienced a major outage in our Proxmox cluster. While performing maintenance on one of the nodes, that node unexpectedly became quorate on its own, which triggered HA to restart VMs. As a result, multiple instances of the same VMs ended up running simultaneously on different nodes, leading to data corruption.

We’re currently trying to understand how and why Corosync determined that this single node had quorum.

I’ve attached the relevant Corosync logs from the incident.

What we know so far:

  • Corosync was trying to reach other cluster members, so we believe the nodes were correctly listed in the Corosync configuration.
  • The network/heartbeat interfaces were flapping – but only on the node that was under maintenance. All other nodes remained stable and could still reach each other.
  • At 12:51:28, Corosync logs state:

    "This node is within the primary component and will provide service."
    This appears to be the point where the trouble began – but how can this happen with only a single active node?
We would greatly appreciate any insights or guidance from the community on:

  • What could cause Corosync to falsely assume quorum on a single node?
  • How to prevent such a situation in the future?
Thanks in advance for your help!

Best regards,
Julian
 


Hi,

Thank you for the log you've provided!


Before I can comment on the actual cause of the issue, could you please provide the `/etc/pve/corosync.conf`, the network configuration `/etc/network/interfaces`, and the output of `ip a`? This should give us an overview of the network configuration and help us understand what exactly happened and how you can improve things or avoid this in the future.
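Something along these lines on the affected node should collect everything (the output file names are only examples):

```
# Gather the requested configuration from the affected node:
cat /etc/pve/corosync.conf    > corosync-conf.txt
cat /etc/network/interfaces   > interfaces.txt
ip a                          > ip-a.txt
```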
 
@Moayad @LnxBil
Please find the requested files attached here. The corosync.conf was last modified in October. I'm currently trying to piece together what exactly happened, since I wasn't there when the outage occurred. Apparently, the node was put into maintenance, updates were installed, and after a reboot there were some connectivity issues or something along those lines - I'm sorry I can't provide more details at the moment.
 


Hi,
please check your shell command history from around the time of the message, i.e. May 30 12:51:28
 
somebody ran "pvecm expected 1"..
 
Nothing in the history indicates that somebody did that.

We only just today noticed that a second node also exhibited the same behaviour. It too became quorate on its own, just four seconds earlier.
 
corosync doesn't change the quorum rules on its own, and neither does PVE. the only way this can happen is if somebody (or something) ran "pvecm expected 1" (or the corresponding "corosync-quorumtool" call it wraps), or the log files and config you provided are incomplete or are not the config that was actually running at that time.
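for reference, a rough sketch of the two equivalent calls and how the result shows up afterwards (exact output differs between versions):

```
# Either of these lowers the expected vote count so that a lone node
# considers itself quorate again:
pvecm expected 1
corosync-quorumtool -e 1

# The change is visible in the quorum state afterwards:
pvecm status
corosync-quorumtool -s
```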

maybe you can provide the full journal covering those ~15 minutes for all nodes?
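something like this on each node should do (the timestamps below are only placeholders - adjust them to the actual window):

```
# Export the relevant time window from the journal on each node:
journalctl --since "2022-05-30 12:45:00" --until "2022-05-30 13:00:00" \
    > "journal-$(hostname).txt"
```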
 
there's a root ssh login from 192.168.32.10 on both affected nodes (well, two on one, one on the other). those seem to be interactive shells. additionally there are two UI logins on other nodes. I'd start by asking the people who did those - the logs show no trace of anything other than what gets logged when somebody manually sets the expected vote counters on those two nodes.
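for reference, roughly where such logins show up (the auth.log file only exists if rsyslog is installed):

```
# SSH logins with source address, from wtmp:
last -i root

# The same information via the journal or the auth log:
journalctl _COMM=sshd | grep -i accepted
grep -i accepted /var/log/auth.log
```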
 
Unfortunately, we do not have a timestamped history. Any hints on what to look for?
 
But you have a history, so have you checked there for the word "expected"?
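e.g., assuming the default root bash history, something along these lines on every node:

```
# Search root's shell history for the relevant commands:
grep -nE 'pvecm expected|corosync-quorumtool' /root/.bash_history
```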

In general, we need better auditing for the whole system. I looked into this just last week and it is not as easy as it sounds, especially filtering out the good and concentrating only on the bad and the ugly.
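A very rough sketch of the direction I was looking at, using auditd (the rule key name is arbitrary):

```
# Log every command execution so that e.g. a manual "pvecm expected 1"
# at least leaves a trace:
apt install auditd
auditctl -a always,exit -F arch=b64 -S execve -k cmd-exec

# Later, search the audit log for anything touching the vote count:
ausearch -k cmd-exec -i | grep -E 'pvecm|corosync-quorumtool'
```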