[SOLVED] high latency clusters

esi_y

Renowned Member
Nov 29, 2023
The docs [1] state that:

The Proxmox VE cluster stack requires a reliable network with latencies under 5 milliseconds (LAN performance) between all nodes to operate stably. While on setups with a small node count a network with higher latencies may work, this is not guaranteed and gets rather unlikely with more than three nodes and latencies above around 10 ms.

What exactly in the stack requires the low latencies? Anything other than HA?

[1] https://pve.proxmox.com/wiki/Cluster_Manager#_cluster_network
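To make the quoted thresholds concrete, here is a minimal sketch that sorts a set of measured round-trip times into the three situations the docs describe. The function name, the labels, and the choice to look at the worst sample are assumptions for illustration only; nothing in PVE performs this check.

```python
def classify_latency(rtt_samples_ms, node_count):
    """Map measured RTTs onto the wording of the quoted docs paragraph.

    Hypothetical helper: the thresholds (5 ms, ~10 ms, more than three
    nodes) come from the docs quote, the rest is an illustration.
    """
    worst = max(rtt_samples_ms)  # jitter matters, so look at the worst sample
    if worst < 5.0:
        return "within documented requirement"
    if node_count > 3 and worst > 10.0:
        return "rather unlikely to work"
    return "may work, not guaranteed"


print(classify_latency([0.4, 0.6, 3.9], node_count=5))
print(classify_latency([100.0], node_count=3))
```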
 
Corosync, i.e. the cluster networking, not HA directly.
The reason I asked is that corosync itself can happily go from the default values to much higher token timeouts; even the defaults have changed [1] over time. RHEL even lists a maximum of 300 seconds [2] as supported. Other values related to ping and pong can be changed too.

I can imagine that when HA is hardcoded to detect a failure within a minute and recover within 2 minutes in total, this would be a problem, but what else in the PVE stack actually requires the low latency?

EDIT: A valid example with RHEL9, for instance [3] under "3.6. MODIFYING THE COROSYNC.CONF FILE WITH THE PCS COMMAND":
The following example command updates the knet_pmtud_interval transport value and the token and join totem values.
Code:
# pcs cluster config update transport knet_pmtud_interval=35 totem token=10000 join=100

And the join otherwise defaults to 50ms.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1870449
[2] https://access.redhat.com/articles/3068821
[3] https://access.redhat.com/documenta...managing_high_availability_clusters-en-us.pdf
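For reference, the token value set above is not necessarily the timeout corosync ends up using. A minimal sketch, assuming the rule from corosync.conf(5): with a nodelist of at least three nodes, the runtime token timeout is token + (number_of_nodes - 2) * token_coefficient, with token_coefficient defaulting to 650 ms. The function name is hypothetical.

```python
def effective_token_timeout_ms(token_ms, nodes, token_coefficient_ms=650):
    """Runtime token timeout following the rule in corosync.conf(5).

    Assumption: a nodelist is configured; the coefficient is only
    applied when the cluster has at least three nodes.
    """
    if nodes < 3:
        return token_ms
    return token_ms + (nodes - 2) * token_coefficient_ms


# token=10000 as in the pcs example above, for a few cluster sizes
for nodes in (2, 3, 5, 16):
    print(nodes, effective_token_timeout_ms(10000, nodes))
```

So with token=10000, a three-node cluster would run with 10650 ms, and the value keeps growing with every node added.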
 
The more I test with the increased values, the more I think it's an arbitrary limit that comes from the HA component of the stack also being shipped with hardcoded values.

Anyone running their cluster across WAN long-term?

Note that I am not endorsing using it with HA. (Before I get a dose of responses telling me to read the docs, which I did: I am just looking for the reasons, which are undocumented as far as my further searches go.)
 
it's not just HA, but basically anything interacting with pmxcfs (/etc/pve , but also some direct interfaces offered by it) that would start breaking if the token processing would take longer.
 
Thanks a lot for this one as well, Fabian.

But could you be a little bit more specific, please? I understand that e.g. spinning up new VMs on separate nodes over a "laggy" corosync link could end up with duplicate IDs, with respect to pmxcfs. But is there anything else that could break in ways that cannot be foreseen? Is this about the IPC calls?
 
no, you wouldn't end up with duplicate IDs, just a lot of lag/blocking when writing to /etc/pve (best case), or so much back pressure that corosync can't keep up anymore and stuff starts timing out (including things like lock requests for cluster-wide config files, broadcasts, but also synchronous API requests which have a timeout of 30s for completion of request handling!)
 
no, you wouldn't end up with duplicate IDs, just a lot of lag/blocking when writing to /etc/pve (best case), or so much back pressure that corosync can't keep up anymore and stuff starts timing out

Fair enough. I was mostly waving that one off: without HA, and with one's own mechanism to avoid duplicate IDs, it would be of no concern.

(including things like lock requests for cluster-wide config files, broadcasts, but also synchronous API requests which have a timeout of 30s for completion of request handling!)

I know I am testing your patience here, but I really just want to know where the requirement is coming from:

a) if I increased the API timeout or stayed well under 30 seconds (realistically, corosync would be fine with a 10 s token even for a cluster with resources spread across the world); AND

b) never have any duplicate ID issue; AND

c) do not use HA

... am I still at risk?

The lock requests for cluster-wide configs must also have a time-out somewhere hardcoded, correct?

EDIT: Or do I risk a deadlock somewhere with such high latencies?

(NB The reason I am asking is to have an idea what will be breaking, I will not be endorsing it here or asking for support for such setup.)
 
the assumption that writes to /etc/pve finish within a reasonable amount of time is hard-coded everywhere. increasing the token timeout/latency breaks that assumption and will cause issues across the board. if each write suddenly blocks for 5s, unless your cluster is totally inactive, nothing will work anymore, since those blocking writes will quickly accumulate and effectively stall/timeout all changes.
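To illustrate how blocking writes pile up, here is a toy model, not how pmxcfs actually works. It assumes writes are handled strictly one after another, each taking a fixed time, and counts how many requests would exceed the 30 s API timeout mentioned earlier in the thread. All names and the arrival pattern are made up for the example.

```python
def simulate_serialized_writes(write_latency_s, arrival_interval_s,
                               n_writes, timeout_s=30.0):
    """Return (per-write wait times, number of writes over the timeout).

    Toy assumption: one write at a time; a write starts when it arrives
    or when the previous write has finished, whichever is later.
    """
    waits = []
    timed_out = 0
    free_at = 0.0  # time at which the previous write completes
    for i in range(n_writes):
        arrival = i * arrival_interval_s
        start = max(arrival, free_at)
        free_at = start + write_latency_s
        waited = free_at - arrival  # queueing delay plus the write itself
        waits.append(waited)
        if waited > timeout_s:
            timed_out += 1
    return waits, timed_out


# One write per second, each blocking for 5 s: the backlog grows steadily.
waits, timed_out = simulate_serialized_writes(5.0, 1.0, 20)
print(waits[:8], timed_out)
```

In this model even a modest one write per second means the eighth request already waits longer than 30 s, while at LAN-like write times the queue never builds up.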
 
This is more interesting not because of OP's original question but because of how it is related to the practical max cluster size.

pmxcfs works great as-is and the max cluster size, for 99% of people's use, is high enough.

Far into the future, what are everyone's ideas for how this could be pushed without sacrificing synchronousness? Or is there absolutely no way, and do we just need to explore managing multiple clusters and build an async inter-cluster interface?

It would be cool if there was a container above "Datacenter" in the pve resource tree that could contain many clusters in many sites.
 
Yes I'm aware of the 4 year old thread with 50 +1's on it.

In that amount of time, surely there have been multiple ideas kicked around internally, and not shared with us.
 
No worries, I just wanted to put it there for reference for anyone else. Then, go ahead and take over the thread. I got what I wanted, so you can expand it. ;)
 
I'm going to give an explicit scenario. I have a couple of servers at OVH Canada and one at OVH Germany. I want to add the one in Germany to the cluster just for replication purposes. I don't use HA; in fact, I only use the cluster to manage all the nodes from the same place. There's about 100 ms between both DCs. Once things are working, I barely touch them (no power on, power off, editing VM configs, etc.), I just add the node and replicate the VMs. Will I encounter any problems? Why is there a need for such low latency? I don't think 100 ms should be a problem. Thanks!
 
As this was my thread, I will take the liberty of (kind of) replying.

Without HA (and assuming there are no more bugs in the HA stack [1] that actually keep it active even when it is supposed to be disabled), you would be losing quorum intermittently. This would not cause any auto-reboots, but e.g. starting a VM (or other cluster-wide actions) might fail occasionally (without any reason apparent to you).

Additionally, with this geographic routing scenario, you might be hitting other seemingly weird symptoms [2].

With everything up and running, things may look fine for months, then during some period of elevated jitter, you will be posting on the forum about these mysterious timeouts in your logs.

I would not go on to say it is "unsupported", but if you have a subscription, this would be a response you would get from e.g. @fabian et al in such a scenario, correct? ;)

PS I still do not have any explanation why anything is hardcoded in PVE code. :D

[1] https://bugzilla.proxmox.com/show_bug.cgi?id=5243
[2] https://github.com/corosync/corosync/issues/659
 
Thanks, I'm just going to install PBS then.
 
A multi cluster GUI is in the works [1]. Hopefully it will integrate remote migrate!

[1] https://forum.proxmox.com/threads/f...ti-cluster-node-management.144416/post-649956

The issue is that, going by the other posts, that has been the case since ~2022. The most interesting thing would be to see how that plays with the current model, i.e. not having a separate control plane, but an SPA with any node relaying for any other node. That will generate support load. It would have been better to see a separately installable control plane, even outside of the cluster.
 
In the light of this thread, I now wonder [1]:
With the ZFS-based Proxmox VE storage replication framework, we have found a solution that meets our needs. Replication is configurable per VM and is fast, stable and flexible. In total, we now operate 3 Proxmox VE clusters, each of which having one node in Vienna and Salzburg. The sites are redundantly connected via the nic.at backbone with 1Gbit, where the VMs communicate via IPsec encrypted GRE tunnels.

Head of Operations, nic.at

They got <5 ms? Over a GRE tunnel? :)

[1] https://www.proxmox.com/en/about/stories/story/nic-at
 