Proxmox cluster quorum problem?

halt · Sep 11, 2024

We have 19 hosts in the cluster. The Proxmox cluster worked until yesterday, when a problem appeared: the cluster (web UI) became unavailable.
The pvecm nodes or pvecm list command runs, but it takes a very long time and does not show all servers.
corosync-quorumtool shows that the cluster does not gain quorum, while corosync-cfgtool -n shows ALL 19 servers as "enabled connected".

The /etc/pve file system is unavailable (it takes a very long time to respond); I cannot enter the directory.

We found out:
8 servers are in DC1
11 servers are in DC2
Ping between the DCs is 8-9 ms.

If we run only the 11 servers in DC2, everything works. If we add DC1 servers, /etc/pve becomes unavailable after 2-3 nodes.
The Proxmox website says the ping should be no more than 5 ms for pmxcfs to work:

Network Requirements
The Proxmox VE cluster stack requires a reliable network with latencies under 5 milliseconds (LAN performance) between all nodes to operate stably. While on setups with a small node count a network with higher latencies may work, this is not guaranteed and gets rather unlikely with more than three nodes and latencies above around 10 ms.

I think this is where our problem is.

Questions:
Why did the cluster work without problems for many months?
Is it possible to change the corosync/pmxcfs timeouts (if the problem is due to the 8-9 ms ping)?
Why does /etc/pve become unavailable?

sw-omit · Sep 11, 2024

One possible reason (other than "being lucky") that it worked before and doesn't now is that something changed about the link between the two datacenters. For example, because of maintenance or other reasons on the provider's side, traffic may be taking a different (slower or more congested) route. If you are also syncing data over that same inter-datacenter network, that could be contributing to the delay as well.
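
To put the numbers from the question next to the documented limits, here is a rough Python sketch. The thresholds (under 5 ms required, "rather unlikely" with more than three nodes above around 10 ms) are taken from the Network Requirements text quoted above. The function name and the sample round-trip times are made up for illustration, and this is not how corosync or pmxcfs evaluates latency internally.

```python
from statistics import median

# Limits taken from the quoted "Network Requirements" paragraph.
REQUIRED_BELOW_MS = 5.0        # "latencies under 5 milliseconds"
UNLIKELY_ABOVE_MS = 10.0       # "latencies above around 10 ms"
SMALL_CLUSTER_MAX_NODES = 3    # "more than three nodes"


def assess_latency(samples_ms, node_count):
    """Classify measured round-trip times against the documented limits.

    Hypothetical helper: it only compares the median sample with the
    thresholds above and returns a short description.
    """
    typical = median(samples_ms)
    if typical < REQUIRED_BELOW_MS:
        return "within the documented requirement"
    if node_count > SMALL_CLUSTER_MAX_NODES and typical > UNLIKELY_ABOVE_MS:
        return "unlikely to work"
    return "outside the documented requirement, not guaranteed to work"


# Hypothetical ping samples in the 8-9 ms range reported in the question.
samples = [8.0, 8.4, 9.0, 8.7]
print(assess_latency(samples, node_count=19))
```

With 8-9 ms you land in the "may work, but not guaranteed" zone, which fits a cluster that ran for months and then stopped once the link got a little worse.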

As for changing the timeouts, there is no officially supported way; the values are hard-coded. Someone did find something that they think worked for them, but it could be removed again by an update [1].

/etc/pve is a file system backed by a database that keeps all changes in sync between all nodes. I'm guessing that if some nodes are slower with their updates, the database is constantly busy handling those changes instead of serving them to users.

As you already found in the wiki, running a cluster with too high a delay is not supported. On top of that, if datacenter 2 were to go down, you wouldn't be able to run datacenter 1 either: quorum needs more than half of the nodes, and datacenter 1 holds only 8 of the 19, so it would fall out of quorum as well.
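
The quorum arithmetic for your layout can be sketched in a few lines of Python. This assumes every node carries exactly one vote and that quorum is a strict majority of all votes; the function names are only for illustration.

```python
def quorum_threshold(total_votes: int) -> int:
    """Smallest number of votes that is more than half of the total."""
    return total_votes // 2 + 1


def has_quorum(votes_present: int, total_votes: int) -> bool:
    """True if the surviving partition holds a strict majority."""
    return votes_present >= quorum_threshold(total_votes)


# Node counts from the question; one vote per node is an assumption.
SITES = {"DC1": 8, "DC2": 11}
TOTAL = sum(SITES.values())  # 19

print("votes needed:", quorum_threshold(TOTAL))
for site, nodes in SITES.items():
    # Each site is checked as if it were the only one left running.
    print(site, "alone has quorum:", has_quorum(nodes, TOTAL))
```

With 19 votes the threshold is 10, so DC2 (11 nodes) can keep quorum on its own, while DC1 (8 nodes) cannot.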

[1] https://forum.proxmox.com/threads/change-corosync-timeout.29465/#post-147755
