Proxmox 6 Cluster in WAN and low latency

Mike Tkatchouk

Active Member
Jan 19, 2018
Hey. I previously ran version 5 with a cluster of 4 nodes, one of which was on another network with latencies of 30-80 ms. To make that work, I configured corosync to use udpu, and everything worked fine. When upgrading to version 6, I updated corosync and removed the udpu settings, as recommended.

Now the cluster periodically falls apart: quorum is lost and the logs fill up with errors. Perhaps this is due to network degradation.

Has anyone managed to run a cluster over a distributed WAN with unstable link quality, or is this idea now unrealistic?
 
if you have a single link, you can try setting the following (see 'man corosync.conf'; edit the config in /etc/pve/corosync.conf, and don't forget to bump the config_version):
  • knet_ping_timeout to 5000
  • knet_pong_count to 1
  • knet_ping_interval to 200 (milliseconds)
the default calculated values are not very good for single-link clusters with unreliable networks.
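Bumping config_version is easy to forget, so here is a minimal sketch of a helper that does it on a working copy of the file. The function name bump_config_version is hypothetical (it is not a Proxmox or corosync tool), and it assumes GNU sed and a "config_version: N" line as in a standard corosync.conf.

```shell
#!/bin/sh
# Hypothetical helper, not part of Proxmox or corosync: increments the
# config_version value in a corosync.conf-style file and prints the new value.
# Assumes GNU sed (for -i) and a line of the form "config_version: N".
bump_config_version() {
    file="$1"
    # Read the current value from the first config_version line.
    current=$(awk -F': *' '/^[[:space:]]*config_version:/ {print $2; exit}' "$file")
    case "$current" in
        ''|*[!0-9]*)
            echo "no numeric config_version found in $file" >&2
            return 1
            ;;
    esac
    next=$((current + 1))
    # Rewrite the number in place, keeping the original indentation.
    sed -i "s/^\([[:space:]]*config_version:[[:space:]]*\)[0-9][0-9]*/\1$next/" "$file"
    echo "$next"
}

# Possible usage on a working copy, so the live file is replaced in one step:
#   cp /etc/pve/corosync.conf /etc/pve/corosync.conf.new
#   (edit the knet_* values in corosync.conf.new)
#   bump_config_version /etc/pve/corosync.conf.new
#   mv /etc/pve/corosync.conf.new /etc/pve/corosync.conf
```

Editing a copy and moving it into place means a half-edited file is not picked up and distributed to the other nodes.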
 
that being said, 30-80ms is a lot!
 
Hi. I changed /etc/corosync/corosync.conf and restarted the daemon on every node:

totem {
  cluster_name: pve
  config_version: 28
  interface {
    ringnumber: 0
    knet_ping_timeout: 5000
    knet_pong_count: 1
    knet_ping_interval: 200
  }

And I get these messages:

root@pve-01:~# tail -f /var/log/daemon.log
Dec 24 11:22:41 pve-01 pmxcfs[30679]: [status] notice: cpg_send_message retry 10
Dec 24 11:22:41 pve-01 pmxcfs[30679]: [dcdb] notice: cpg_send_message retry 10
Dec 24 11:22:42 pve-01 pmxcfs[30679]: [status] notice: cpg_send_message retry 20
Dec 24 11:22:42 pve-01 pmxcfs[30679]: [dcdb] notice: cpg_send_message retry 20
Dec 24 11:22:43 pve-01 pmxcfs[30679]: [status] notice: cpg_send_message retry 30
Dec 24 11:22:43 pve-01 pmxcfs[30679]: [dcdb] notice: cpg_send_message retry 30
Dec 24 11:22:44 pve-01 pmxcfs[30679]: [status] notice: cpg_send_message retry 40
Dec 24 11:22:44 pve-01 pmxcfs[30679]: [dcdb] notice: cpg_send_message retry 40
Dec 24 11:22:45 pve-01 pmxcfs[30679]: [status] notice: cpg_send_message retry 50
Dec 24 11:22:45 pve-01 pmxcfs[30679]: [dcdb] notice: cpg_send_message retry 50

root@pve-03:~$ tail -f /var/log/daemon.log
Dec 24 11:22:18 pve-03 pmxcfs[22672]: [dcdb] notice: cpg_join retry 460
Dec 24 11:22:19 pve-03 pmxcfs[22672]: [dcdb] notice: cpg_join retry 470
Dec 24 11:22:20 pve-03 pmxcfs[22672]: [dcdb] notice: cpg_join retry 480
Dec 24 11:22:21 pve-03 pmxcfs[22672]: [dcdb] notice: cpg_join retry 490
Dec 24 11:22:22 pve-03 pmxcfs[22672]: [dcdb] notice: cpg_join retry 500
Dec 24 11:22:23 pve-03 pmxcfs[22672]: [dcdb] notice: cpg_join retry 510
Dec 24 11:22:24 pve-03 pmxcfs[22672]: [dcdb] notice: cpg_join retry 520
Dec 24 11:22:25 pve-03 pmxcfs[22672]: [dcdb] notice: cpg_join retry 530
Dec 24 11:22:26 pve-03 pmxcfs[22672]: [dcdb] notice: cpg_join retry 540
Dec 24 11:22:27 pve-03 pmxcfs[22672]: [dcdb] notice: cpg_join retry 550
Dec 24 11:22:28 pve-03 pmxcfs[22672]: [dcdb] notice: cpg_join retry 560

Any ideas?
 
please provide the full logs ("journalctl -u pve-cluster -u corosync") starting with a restart of pve-cluster and corosync on all nodes.
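One way to collect that on each node, as a rough sketch: the function names and the DRY_RUN switch are made up for illustration, and only the systemctl and journalctl invocations themselves are standard.

```shell
#!/bin/sh
# Run a command, or only print it when DRY_RUN=1 (to check what would happen).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Restart both services named in the request above.
restart_cluster_services() {
    run systemctl restart corosync pve-cluster
}

# Print the logs of both units from the given timestamp onwards.
# $1: timestamp in "YYYY-MM-DD HH:MM:SS" form, as accepted by journalctl --since
dump_cluster_logs() {
    run journalctl -u pve-cluster -u corosync --since "$1" --no-pager
}

# Possible usage on each node:
#   since=$(date '+%Y-%m-%d %H:%M:%S')
#   restart_cluster_services
#   (wait until the cluster has settled or the errors reappear)
#   dump_cluster_logs "$since" > "cluster-$(hostname).log"
```

Recording the timestamp before the restart keeps the output limited to the part that was asked for, rather than the whole journal.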