Cluster had lost quorum overnight

dick_fiddler69 · New Member · May 1, 2026
Built a three-node cluster, which was all working correctly.

All networking was tested, including failover, using dual NICs across two switches.

Overnight the cluster lost quorum; the node status showed:

PVE-HYP-04P10 ⚠️ (warning)
PVE-HYP-05P10 ❌ (red/offline)
PVE-HYP-06P10 ❌ (red/offline)

pvecm status from PVE-HYP-04P10:

Nodes: 1
Quorate: No
Activity blocked
Members: 0x00000003 172.20.34.4 (local) only
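For context: with three nodes, expected votes is 3 and quorum is floor(3/2) + 1 = 2, so a lone node can never be quorate, which matches the "Activity blocked" flag. I understand the surviving node can be forced writable in an emergency with something like:

pvecm expected 1

but that risks split-brain, so I left it alone while the other two were unreachable.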

Ping results from PVE-HYP-04P10 over the corosync VLAN:

ping 172.20.34.5 → 100% packet loss
ping 172.20.34.6 → 0% packet loss

corosync-cfgtool -s:

nodeid: 1 disconnected
nodeid: 2 connected
nodeid: 3 localhost

This shows 04P10 could reach 06P10 but not 05P10 — and the cluster had lost quorum overnight.

What is even weirder is that the corosync network is a VLAN on the same trunk that carries three other VLANs, all of which were working and could ping each other.

But over the corosync VLAN, 04P10 and 05P10 could not reach each other.
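If it happens again I plan to watch the corosync traffic on the VLAN sub-interface directly; a rough sketch of what I have in mind (vmbr0.40 is just a placeholder for whatever the corosync VLAN interface is actually called here):

# watch corosync (knet) unicast traffic on the corosync VLAN
tcpdump -ni vmbr0.40 udp port 5405

# rule out an MTU mismatch on that VLAN (1472 = 1500 minus IP/ICMP headers)
ping -M do -s 1472 172.20.34.5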

Any ideas?
 
"the corosync network is a VLAN on the same trunk that carries three other VLANs"
That's bad. When one of "the other" networks does saturate this single physical wire... corosync will die.

QoS settings may help to prioritize the corosync VLAN, but the recommendation is a separate physical link. The VLAN approach will be fine as a fallback connection, as a second "ring".
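With corosync 3 (knet) a second ring is just an extra ringX_addr per node. A minimal sketch, assuming a dedicated 10.10.10.x wire as the primary link (made-up addresses) with your existing VLAN demoted to fallback:

nodelist {
  node {
    name: PVE-HYP-04P10
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.10.10.4    # dedicated corosync wire (assumed subnet)
    ring1_addr: 172.20.34.4   # existing VLAN as fallback
  }
  # ... same pattern for the other two nodes
}

On Proxmox, edit the clustered copy at /etc/pve/corosync.conf (not the local /etc/corosync/corosync.conf) and bump config_version so the change propagates.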
 
Switch problem?

You can set corosync to have multiple links. Doesn’t fix the problem but could get you operational.
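Once a second link is configured, corosync-cfgtool -s should report both; the output looks roughly like this (exact formatting varies by corosync version, addresses are illustrative):

LINK ID 0
        addr    = 10.10.10.4
        status:
                nodeid 1: connected
                nodeid 2: connected
                nodeid 3: localhost
LINK ID 1
        addr    = 172.20.34.4
        status:
                nodeid 1: connected
                nodeid 2: connected
                nodeid 3: localhost

So even if one link drops, knet keeps the ring up over the other.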
No switch problems detected. There is a single static trunk over two network interfaces carrying four other VLANs, including storage traffic, and none of those had any issues. The problem was specific to the single VLAN carrying corosync traffic, and it was odd that only two nodes could not communicate over that VLAN.

And again, all was working perfectly the day before; it failed overnight.

After running many checks over six hours, it just fixed itself for no apparent reason. Hence the question.
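For what it's worth, the corosync journal should at least pin down when knet saw the link drop and recover; I'll go back through it with something like:

# knet logs "link: host: N link: 0 is down" / "is up" events
journalctl -u corosync --since "2 days ago" | grep -i "link"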
 
That's bad. When one of "the other" networks does saturate this single physical wire... corosync will die.

QoS settings may help to prioritize the corosync VLAN, but the recommendation is a separate physical link. The VLAN approach will be fine as a fallback connection, as a second "ring".

Just to add, this was all working fine when it was ESXi and vSAN.

I understand the theory of 10GbE links saturating in a bond, but this is three nodes doing nothing.

This argument has been going on for years: physical NICs versus VLANs.

I don't believe there was enough traffic to saturate the link to the point where the VLAN died, and how would that explain that only two specific nodes could not communicate?
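If saturation really were the cause, I'd expect it to show up in the knet runtime stats; next time I'll check those too, something like this (the stats map and key names are from memory and may differ by corosync version):

corosync-cmapctl -m stats | grep -i latency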