[SOLVED] PVE 6.0/corosync over WAN (high latency) - loses sync

Progratron

Member
Feb 27, 2019
I am running a PVE cluster over WAN (different datacenters across the globe). It has always worked flawlessly and best suits my needs (of course no shared storage, live migration or HA, but still central management, easy offline migrations etc.). Some time ago I upgraded to PVE 6.0 and was able to run corosync directly over the WAN unicast interfaces, with no need to build a VPN, which isn't necessary for some of my nodes anyway. That simplified my setup and I was glad :)

But now I sporadically have some kind of corosync "sync" problem. When there are connection problems between nodes (even short ones!), which is understandable and unavoidable on WAN links, the cluster seems to break. When I notice this I simply run:

Code:
killall -9 corosync
systemctl restart pve-cluster

on the disconnected nodes, and it brings them back (it simply replays all the messages, as far as I can see), but it seems strange to me. I understand that there was some connectivity issue (as said, unavoidable on WAN), but why doesn't it get re-synced automatically? It worked without problems before, when I used the old multicast corosync over VPN...
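Before force-killing corosync, it may be worth checking what the node itself still sees; a quick diagnostic sequence (standard PVE/corosync commands, nothing cluster-specific assumed) could look like:

Code:
# check quorum and membership as seen by this node
pvecm status
# corosync's own view of its links
corosync-cfgtool -s
# and only if the node really is stuck, the workaround from above:
killall -9 corosync
systemctl restart pve-cluster

The link status output in particular can show whether a WAN link flapped and never recovered.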

P.S. I have mentioned this problem here, but it seems that topic from 2015 is dead :)
 

tim

Proxmox Staff Member
Oct 1, 2018
Can you share some syslog entries from around the time of the connectivity loss? Just for the record: a cluster over WAN is neither supported nor recommended, but I'm curious about it.
 

Progratron

Member
Feb 27, 2019
Can you share some syslog entries from around the time of the connectivity loss? Just for the record: a cluster over WAN is neither supported nor recommended, but I'm curious about it.

Thanks for your attention. Meanwhile, I've altered corosync.conf and increased the token value to 10000 as advised here. This change definitely didn't solve the problem completely (it still crashes), but subjectively it seems to happen less often now (I might be wrong about this).

I tried to switch the transport to SCTP, but have failed so far (will research later, it might be my firewall setup).

Log attached (too long for a message).

I am aware that a cluster over WAN is not officially supported, but a) it worked before, and b) from what I see around, even on this forum, such setups are being used more often nowadays.
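For reference, both tweaks mentioned above live in the totem section of /etc/corosync/corosync.conf; a rough sketch (the cluster name is a placeholder, and knet_transport: sctp assumes a knet build with SCTP support):

Code:
totem {
  version: 2
  cluster_name: mycluster   # placeholder
  token: 10000              # raised token timeout for high-latency links
  interface {
    linknumber: 0
    knet_transport: sctp    # default is udp; sctp must also pass the firewall
  }
}

Remember that corosync.conf changes on a PVE cluster should be made via /etc/pve/corosync.conf so they propagate, and that the config version number must be bumped.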
 

Attachments

  • cluster_wan_crash.log.txt
    37.5 KB · Views: 14

tim

Proxmox Staff Member
Oct 1, 2018
please share your pveversion as well.
 

tim

Proxmox Staff Member
Oct 1, 2018
please use:
pveversion --verbose
 

Progratron

Member
Feb 27, 2019
please use:
pveversion --verbose

Sure. Sorry. See below.

By the way, seeing this package list reminded me that yesterday, while trying to solve the problem, besides the above-mentioned change in corosync.conf I also updated the nodes with apt. To be sure I am not giving you a log where something is already fixed by the updates installed yesterday, I am attaching another one from yesterday evening. This "crash" definitely happened after all updates were done, so the pveversion output is accurate for it.

Code:
proxmox-ve: 6.0-2 (running kernel: 5.0.15-1-pve)
pve-manager: 6.0-4 (running version: 6.0-4/2a719255)
pve-kernel-5.0: 6.0-5
pve-kernel-helper: 6.0-5
pve-kernel-5.0.15-1-pve: 5.0.15-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.2-pve2
criu: 3.11-3
glusterfs-client: 5.5-3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.10-pve1
libpve-access-control: 6.0-2
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-2
libpve-guest-common-perl: 3.0-1
libpve-http-server-perl: 3.0-2
libpve-storage-perl: 6.0-5
libqb0: 1.0.5-1
lvm2: 2.03.02-pve3
lxc-pve: 3.1.0-61
lxcfs: 3.0.3-pve60
novnc-pve: 1.0.0-60
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.0-5
pve-cluster: 6.0-4
pve-container: 3.0-3
pve-docs: 6.0-4
pve-edk2-firmware: 2.20190614-1
pve-firewall: 4.0-5
pve-firmware: 3.0-2
pve-ha-manager: 3.0-2
pve-i18n: 2.0-2
pve-qemu-kvm: 4.0.0-3
pve-xtermjs: 3.13.2-1
qemu-server: 6.0-5
smartmontools: 7.0-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.1-pve1
 

Attachments

  • cluster_wan_crash_nov5.txt
    51 KB · Views: 4

tim

Proxmox Staff Member
Oct 1, 2018
Ok, that's definitely not the most recent version. There were various fixes regarding libknet1 (your current version: libknet1: 1.10-pve1). Please update to the latest version and monitor your system; it looks like this is one of the bugs that has already been fixed.
 

Progratron

Member
Feb 27, 2019
Ok, that's definitely not the most recent version. There were various fixes regarding libknet1 (your current version: libknet1: 1.10-pve1). Please update to the latest version and monitor your system; it looks like this is one of the bugs that has already been fixed.

Done. Running libknet1: 1.13-pve1 now. Will monitor and report back.
 

Progratron

Member
Feb 27, 2019
I can confirm it is working stable again now :) Thanks for your assistance.

And yes, a cluster over WAN is not recommended, but it works :)
 

argonius

Active Member
Jan 17, 2012
I can confirm it is working stable again now :) Thanks for your assistance.

And yes, a cluster over WAN is not recommended, but it works :)

Hey,

sorry, but how did you upgrade? We have the same version as you had before your upgrade, but apt-get dist-upgrade says there is no new package...
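If no new packages show up, a common cause is that only the enterprise repository is configured (which requires a subscription). A sketch of enabling the no-subscription repository, assuming PVE 6 on Debian Buster:

Code:
# /etc/apt/sources.list.d/pve-no-subscription.list
deb http://download.proxmox.com/debian/pve buster pve-no-subscription

# then refresh the package lists and upgrade
apt update
apt dist-upgrade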
 

wellbein

New Member
May 12, 2020
hi,
your configuration interests me because I have to propose this type of architecture to a customer.
Can you boot your VMs on all your cluster nodes?
If yes, what mechanism do you use to transfer VM data from one datacenter to another?
I don't need live migration, I just need to be able to restart the VMs with the least possible data loss.
 

Progratron

Member
Feb 27, 2019
hi,
your configuration interests me because I have to propose this type of architecture to a customer.
Can you boot your VMs on all your cluster nodes?
If yes, what mechanism do you use to transfer VM data from one datacenter to another?
I don't need live migration, I just need to be able to restart the VMs with the least possible data loss.

What do you mean by "boot your VMs"? Of course I am able to.

VM transfer works just fine via the GUI.
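The same offline move can also be done from the CLI; a minimal sketch (the VM ID 100 and node name "nodeB" are made up):

Code:
# stop the guest, then migrate it offline to another cluster node
qm shutdown 100
qm migrate 100 nodeB

With local storage this copies the disks over the cluster network, so over WAN it can take a while.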
 

wellbein

New Member
May 12, 2020
Some cluster their servers only for centralized management.
I mean, can you move your VMs from one server to another? Apparently yes, if I understand you correctly.
So the question is: how do you copy data from one server to another? Shared storage? Replication? If you have documentation I'm very interested, because I'm not very comfortable with Proxmox/Linux storage.
 

Progratron

Member
Feb 27, 2019
Some cluster their servers only for centralized management.
I mean, can you move your VMs from one server to another? Apparently yes, if I understand you correctly.
So the question is: how do you copy data from one server to another? Shared storage? Replication? If you have documentation I'm very interested, because I'm not very comfortable with Proxmox/Linux storage.

Well, I guess you have to read the documentation, because the questions you ask do not make much sense to me right now... Especially if this is a service a customer will get...
 

wellbein

New Member
May 12, 2020
Hmm, maybe a translation problem.
My question is: which storage mechanism do you use to move the VM data (virtual disks) across your different locations?
I've read about different implementations of this, but I'm not sure all of them are WAN-compatible.
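One common approach without shared storage is backup and restore: dump the VM on one node and restore it on another. A rough sketch (the storage name, path and VM ID are placeholders):

Code:
# on the source node: back up VM 100 to a storage both nodes can reach
vzdump 100 --storage backupstore --mode snapshot

# on the target node: restore the archive under the same (or a new) VM ID
qmrestore /mnt/backupstore/dump/vzdump-qemu-100-....vma.lzo 100

Built-in storage replication (pvesr) is ZFS-based and intended for local networks, so for WAN setups backup/restore or offline migration tends to be the safer route.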
 
