[SOLVED] PVE 6.0/corosync over WAN (high latency) - loses sync

Progratron

Active Member
Feb 27, 2019
I am running a PVE cluster over WAN (different datacenters across the globe). It has worked flawlessly the whole time and suits my needs best (of course no shared storage, live migration or HA, but still central management, easy offline migrations, etc.). Some time ago I upgraded to PVE 6.0 and was able to run corosync directly over the WAN unicast interfaces, with no need to build a VPN (which is unnecessary for some of my nodes). This simplified my setup and I was glad :)

But now I sporadically have some kind of corosync "sync" problem. When there are connection problems between nodes (even short ones!), which is understandable and unavoidable on WAN links, the cluster seems to break. When I notice this I simply run:

Code:
killall -9 corosync
systemctl restart pve-cluster

I run this on the disconnected nodes and it brings them back (it simply replays all the messages, as far as I can see), but it seems strange to me. I do understand that there was some connectivity issue (as said, unavoidable on WAN), but why doesn't it get re-synced automatically? It worked without problems before, when I used the old multicast corosync over VPN...
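As a side note, whether a node has properly rejoined after such a restart can be checked with the standard PVE/corosync tooling, for example:

Code:
# quorum and membership as seen by this node
pvecm status
# knet link state to the other nodes
corosync-cfgtool -s
# follow corosync and pmxcfs while the node re-syncs
journalctl -u corosync -u pve-cluster -f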

P.S. I have mentioned this problem here, but it seems that this topic from 2015 is dead :)
 
Can you share some syslog entries from around the time of the connectivity loss? Just for the record: a cluster over WAN is neither supported nor recommended, but I'm curious about it.
 

Thanks for your attention. Meanwhile, I've altered corosync.conf and increased the token value to 10000 as advised here. This change definitely didn't solve the problem completely (it still crashes), but subjectively it seems to happen less often now (I might be wrong about this).
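For reference, this is roughly what that change looks like in the totem section of /etc/pve/corosync.conf; a minimal sketch with placeholder values (and config_version has to be bumped on every edit):

Code:
totem {
  # placeholder cluster name/version; the interface/link definitions of the
  # PVE-generated file are left out of this sketch
  cluster_name: mycluster
  config_version: 5
  version: 2
  # raised from the default so short WAN hiccups don't immediately break membership
  token: 10000
}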

I tried to switch the transport to SCTP, but have failed so far (will research later, it might be my firewall setup).
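Related to that, the knet transport is selected in the same totem section, and SCTP is its own IP protocol (132), so a firewall that only passes UDP/TCP between the nodes will break it. A sketch of the relevant setting, assuming an otherwise default knet link setup:

Code:
totem {
  # default is udp; sctp must be allowed end-to-end by any firewall between the nodes
  knet_transport: sctp
}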

Log attached (too long for a message).

I am aware that a cluster over WAN is not officially supported, but a) it worked before and b) from what I see around, even on this forum, such setups are being used more and more often these days.
 

Attachments

  • cluster_wan_crash.log.txt
    37.5 KB
please share your pveversion as well.
 
please use:
pveversion --verbose
 

Sure. Sorry. See below.

By the way, seeing this package list reminded me that yesterday, while trying to solve the problem, I also updated the nodes with apt, in addition to the above-mentioned change in corosync.conf. To be sure I am not giving you a log where something is already fixed by the updates I installed yesterday, I am attaching another one from yesterday evening. This "crash" definitely happened after all updates were done, so the pveversion output is accurate for it.

Code:
proxmox-ve: 6.0-2 (running kernel: 5.0.15-1-pve)
pve-manager: 6.0-4 (running version: 6.0-4/2a719255)
pve-kernel-5.0: 6.0-5
pve-kernel-helper: 6.0-5
pve-kernel-5.0.15-1-pve: 5.0.15-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.2-pve2
criu: 3.11-3
glusterfs-client: 5.5-3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.10-pve1
libpve-access-control: 6.0-2
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-2
libpve-guest-common-perl: 3.0-1
libpve-http-server-perl: 3.0-2
libpve-storage-perl: 6.0-5
libqb0: 1.0.5-1
lvm2: 2.03.02-pve3
lxc-pve: 3.1.0-61
lxcfs: 3.0.3-pve60
novnc-pve: 1.0.0-60
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.0-5
pve-cluster: 6.0-4
pve-container: 3.0-3
pve-docs: 6.0-4
pve-edk2-firmware: 2.20190614-1
pve-firewall: 4.0-5
pve-firmware: 3.0-2
pve-ha-manager: 3.0-2
pve-i18n: 2.0-2
pve-qemu-kvm: 4.0.0-3
pve-xtermjs: 3.13.2-1
qemu-server: 6.0-5
smartmontools: 7.0-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.1-pve1
 

Attachments

  • cluster_wan_crash_nov5.txt
    51 KB
Ok, that's definitely not the most recent version. There were various fixes regarding libknet1 (your current version -> libknet1: 1.10-pve1), please update to the latest version and monitor your system; it looks like this is one of the bugs that has already been fixed.
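A sketch of the usual way to pull those fixes in on each node, assuming a working Proxmox package repository is already configured:

Code:
apt update
apt dist-upgrade
# confirm the new versions afterwards
pveversion -v | grep -E 'libknet1|corosync'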
 

Done. Running libknet1: 1.13-pve1 now. Will monitor and report back.
 
I can confirm it is working stably again now :) Thanks for your assistance.

And yes, a cluster over WAN is not recommended, but it works :)
 

Hey,

sorry, but how did you upgrade? We have the same version as you had before your upgrade, but apt-get dist-upgrade says there are no new packages...
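If apt-get dist-upgrade finds nothing, it is worth checking that a Proxmox package repository is enabled at all. A sketch for PVE 6.x on Debian Buster, using the no-subscription repository as an example (use the enterprise repository instead if you have a subscription):

Code:
# /etc/apt/sources.list.d/pve-no-subscription.list
deb http://download.proxmox.com/debian/pve buster pve-no-subscription

# then refresh and upgrade
apt-get update
apt-get dist-upgrade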
 
Hi,
your configuration interests me because I have to propose this type of architecture to a customer.
Can you boot your VMs on all your cluster nodes?
If yes, what mechanism do you use to transfer VM data from one datacenter to another?
I don't need to be able to do live migration, I just need to be able to restart the VMs with the least possible data loss.
 

What do you mean by "boot your VMs"? Of course I am able to.

VM transfer works just fine within the GUI.
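For completeness, the CLI equivalent of that GUI action is an offline migration with qm; a sketch, assuming VM ID 100 and a target node named node2:

Code:
# offline migration: the VM is stopped; its config (and local disks, where the
# storage setup allows it) is moved to the target node
qm migrate 100 node2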
 
Some people cluster their servers only for centralized management.
I mean, can you move your VMs from one server to another? Apparently yes, if I understand you correctly.
So the question is: how do you copy the data from one server to another? Shared storage? Replication? If you have documentation I'm very interested, because I'm not very comfortable with Proxmox/Linux storage.
 

Well, I guess you have to read the documentation, because the questions you ask do not make much sense to me right now... Especially if this is a service a customer will get...
 
Hmm, maybe a translation problem.
My question is: which storage mechanism do you use to move the VM data (virtual disks) across your different locations?
I've read about several different implementations for this, but I'm not sure all of them are WAN compatible.
 
Thanks for your attention. Meanwhile, I've altered corosync.conf and increased the token value to 10000 as advised here.
I see you increased the token to 10000, and your WAN latency was around 10 ms.

I need to achieve cluster with WAN closer to 100ms for one of the nodes.

Is this a linear scale, such that I could simply increase the token to 100000?
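Side note on how the token scales: per the corosync.conf man page, the runtime token timeout is roughly token + (number_of_nodes - 2) * token_coefficient (default coefficient 650 ms), so it is not just the configured value. What is actually in effect can be checked at runtime; a sketch, assuming corosync 3:

Code:
# effective token timeout (in milliseconds) used by the running cluster
corosync-cmapctl -g runtime.config.totem.token
# knet link state to the other nodes
corosync-cfgtool -s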
 
You already know that this is definitely not a good idea.

Some weeks ago there was a post from a staff member with deeper-than-usual insight into the why, but I cannot find it...

Please post your experience after your tests regarding failover and network outages, with or without HA; a lot of people will be interested :)
 
