pvesr service fails on one server every week

I've been having a really weird issue where one of my Proxmox nodes fails a week after the cluster was last fixed, and I've been struggling to troubleshoot it, especially since it takes a week to see whether any attempted fix actually worked.

My cluster has been in a loop where every week one of the nodes' pvesr service fails with the following details in its logs:

Code:
Oct 13 20:57:00 tethealla systemd[1]: Starting Proxmox VE replication runner...
Oct 13 20:57:00 tethealla pvesr[2019]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 13 20:57:01 tethealla pvesr[2019]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 13 20:57:02 tethealla pvesr[2019]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 13 20:57:03 tethealla pvesr[2019]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 13 20:57:04 tethealla pvesr[2019]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 13 20:57:05 tethealla pvesr[2019]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 13 20:57:06 tethealla pvesr[2019]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 13 20:57:07 tethealla pvesr[2019]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 13 20:57:08 tethealla pvesr[2019]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 13 20:57:09 tethealla pvesr[2019]: error with cfs lock 'file-replication_cfg': no quorum!
Oct 13 20:57:09 tethealla systemd[1]: pvesr.service: Main process exited, code=exited, status=13/n/a

At first I didn't know what it could be, but after looking for patterns in how the servers failed, I observed the following:

  • Node A first failed on Saturday, September 28th.
  • I couldn't figure out how to get it working again, so I shut down the node, removed it from the cluster, and reinstalled/rejoined it.
  • Exactly one week later, on Saturday, October 5th, Node B failed instead.
  • This time I wasn't on-site on Saturday, so I reinstalled it on Sunday, October 6th instead.
  • Exactly one week after I fixed that server, on Sunday, October 13th, Node B failed again.
I have not reinstalled Node B yet, relying on my Qdevice to keep the cluster online, as I've been trying to find a more permanent solution. The last time Node B failed, I looked up the pvesr error and found information about IGMP and multicast. I tried enabling both IGMP snooping and an IGMP querier on my switch, but since the node failed again one week later, that does not seem to have helped. I also doubt it's network related, as this issue only started happening in the past month; the last firmware update on the switch was in July.
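
For reference, this is roughly what I run to check the quorum and replication state when a node is in this condition (just the commands; the actual membership output will of course differ):

Code:
# show cluster membership, expected votes and whether the Qdevice vote is counted
pvecm status
# check the replication runner service and its recent errors
systemctl status pvesr.service
journalctl -u pvesr.service --since "1 hour ago"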

The only thing I can imagine causing this is that I upgraded to Proxmox VE 6 in August, but that still leaves a one-month gap before these issues started happening. Corosync seems to establish a link for a few seconds, but then the server promptly gets "kicked out":

Code:
Oct 15 01:59:03 tethealla systemd[1]: Started Corosync Cluster Engine.
Oct 15 01:59:28 tethealla corosync[1045]:   [KNET  ] rx: host: 1 link: 0 is up
Oct 15 01:59:28 tethealla corosync[1045]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Oct 15 01:59:28 tethealla corosync[1045]:   [KNET  ] pmtud: PMTUD link change for host: 1 link: 0 from 469 to 1397
Oct 15 01:59:28 tethealla corosync[1045]:   [KNET  ] pmtud: Global data MTU changed to: 1397
Oct 15 12:36:41 tethealla corosync[1045]:   [KNET  ] link: host: 1 link: 0 is down
Oct 15 12:36:41 tethealla corosync[1045]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Oct 15 12:36:41 tethealla corosync[1045]:   [KNET  ] host: host: 1 has no active links
Oct 15 12:36:51 tethealla corosync[1045]:   [QUORUM] This node is within the primary component and will provide service.
Oct 15 12:36:51 tethealla corosync[1045]:   [QUORUM] Members[1]: 2
Oct 15 12:36:52 tethealla corosync[1045]:   [KNET  ] rx: host: 1 link: 0 is up
Oct 15 12:36:52 tethealla corosync[1045]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Oct 15 12:36:53 tethealla corosync[1045]:   [QUORUM] This node is within the non-primary component and will NOT provide any services.
Oct 15 12:36:53 tethealla corosync[1045]:   [QUORUM] Members[1]: 2
Oct 15 12:36:58 tethealla corosync[1045]:   [KNET  ] link: host: 1 link: 0 is down
Oct 15 12:36:58 tethealla corosync[1045]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Oct 15 12:37:00 tethealla corosync[1045]:   [KNET  ] rx: host: 1 link: 0 is up
Oct 15 12:37:00 tethealla corosync[1045]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)

This is the original source of my confusion: if I shut down the stray node, remove it from the cluster using the remaining node, reinstall it, rejoin it to the cluster, and then remove and re-add the quorum device, the cluster works normally for exactly one week. I didn't note down all the timestamps, but it seems to be almost to the minute - as in 168 hours, 0 minutes and 0 seconds after the cluster was last fixed.
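
For completeness, the recovery procedure I've been repeating is roughly the following (a rough sketch; the node names and Qdevice address are placeholders, and the actual reinstall from the ISO happens in between):

Code:
# on the remaining healthy node: remove the failed node from the cluster
pvecm delnode <failed-node>

# on the freshly reinstalled node: join it back into the cluster
pvecm add <healthy-node-ip>

# then remove and re-add the quorum device
pvecm qdevice remove
pvecm qdevice setup <qdevice-ip>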
 
Please post the output of `pveversion -v`.
There have been quite a few bugfixes for corosync since August.

I hope this helps!
 
Thanks for the quick reply.
Here's the output from Node A:
Code:
proxmox-ve: 6.0-2 (running kernel: 5.0.21-1-pve)
pve-manager: 6.0-7 (running version: 6.0-7/28984024)
pve-kernel-5.0: 6.0-8
pve-kernel-helper: 6.0-8
pve-kernel-4.15: 5.4-8
pve-kernel-5.0.21-2-pve: 5.0.21-6
pve-kernel-5.0.21-1-pve: 5.0.21-2
pve-kernel-4.15.18-20-pve: 4.15.18-46
pve-kernel-4.15.18-12-pve: 4.15.18-36
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.2-pve2
criu: 3.11-3
glusterfs-client: 5.5-3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.12-pve1
libpve-access-control: 6.0-2
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-5
libpve-guest-common-perl: 3.0-1
libpve-http-server-perl: 3.0-2
libpve-storage-perl: 6.0-9
libqb0: 1.0.5-1
lvm2: 2.03.02-pve3
lxc-pve: 3.1.0-65
lxcfs: 3.0.3-pve60
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.0-7
pve-cluster: 6.0-7
pve-container: 3.0-7
pve-docs: 6.0-4
pve-edk2-firmware: 2.20190614-1
pve-firewall: 4.0-7
pve-firmware: 3.0-2
pve-ha-manager: 3.0-2
pve-i18n: 2.0-3
pve-qemu-kvm: 4.0.0-5
pve-xtermjs: 3.13.2-1
qemu-server: 6.0-7
smartmontools: 7.0-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.1-pve2
And here's the output from Node B:
Code:
proxmox-ve: 6.0-2 (running kernel: 5.0.21-2-pve)
pve-manager: 6.0-7 (running version: 6.0-7/28984024)
pve-kernel-5.0: 6.0-8
pve-kernel-helper: 6.0-8
pve-kernel-5.0.21-2-pve: 5.0.21-6
pve-kernel-5.0.15-1-pve: 5.0.15-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.2-pve2
criu: 3.11-3
glusterfs-client: 5.5-3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.12-pve1
libpve-access-control: 6.0-2
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-5
libpve-guest-common-perl: 3.0-1
libpve-http-server-perl: 3.0-2
libpve-storage-perl: 6.0-9
libqb0: 1.0.5-1
lvm2: 2.03.02-pve3
lxc-pve: 3.1.0-65
lxcfs: 3.0.3-pve60
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.0-7
pve-cluster: 6.0-7
pve-container: 3.0-7
pve-docs: 6.0-4
pve-edk2-firmware: 2.20190614-1
pve-firewall: 4.0-7
pve-firmware: 3.0-2
pve-ha-manager: 3.0-2
pve-i18n: 2.0-3
pve-qemu-kvm: 4.0.0-5
pve-xtermjs: 3.13.2-1
qemu-server: 6.0-7
smartmontools: 7.0-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.1-pve2

Node A is on an older kernel (because I haven't been able to reboot it, since it's currently the only working cluster node), but otherwise the output seems to be identical. I update my servers regularly using Ansible.

Edit: Forgot to mention that they're both on the enterprise repo.
 
The versions look OK - although there is a new version of libknet available in the pvetest repository, addressing one further MTU issue - see https://bugzilla.proxmox.com/show_bug.cgi?id=2326
It might be worth trying an update to that version.
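
For example, something along these lines should work (adapt to your setup, and you may want to remove the pvetest entry again afterwards if you normally track the enterprise repository):

Code:
# add the pvetest repository (PVE 6.x is based on Debian Buster)
echo "deb http://download.proxmox.com/debian/pve buster pvetest" > /etc/apt/sources.list.d/pvetest.list
apt update
# only pull in the newer libknet package rather than a full upgrade
apt install libknet1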

I looked up the pvesr error and found information about IGMP and multicast.
IGMP and multicast were used and required with corosync version 2, which was used in PVE versions < 6.0.
PVE 6.0 uses corosync 3.x, which (for now) uses only unicast traffic.
So for PVE cluster communication, multicast (and thus IGMP snooping and multicast queriers) should not be necessary.
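
If you want to verify this, the totem section of the corosync configuration should not contain a multicast address (on PVE 6 the transport defaults to knet, i.e. unicast), e.g.:

Code:
# there should be no mcastaddr entry in the totem section
grep -A8 'totem {' /etc/pve/corosync.conf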


My cluster has been in a loop where every week one of the nodes' pvesr service fails with the following details in its logs:
But it seems to be almost on the minute. As in 168 hours, 0 minutes and 0 seconds after the cluster was last fixed.
That sounds too odd to be a coincidence...
My guess would be to check if the switch used might have a bug related to that or if there's a firmware update available.

Alternatively, you could set up a dedicated physical network for corosync communication (for two nodes, just use one network cable to connect them directly) and thus exclude the network equipment in between as a possible source of the problem.
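
As a rough sketch (interface names and addresses are only examples), that would mean giving an otherwise unused NIC on each node its own small subnet and pointing the corosync ring address at it:

Code:
# /etc/network/interfaces on one node (example values)
auto eth1
iface eth1 inet static
    address 10.10.10.1/24

# the other node gets e.g. 10.10.10.2/24, and the ring0_addr entries
# in corosync.conf are then changed to these addresses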

I hope this helps!
 
The versions look OK - although there is a new version of libknet available in the pvetest repository, addressing one further MTU issue - see https://bugzilla.proxmox.com/show_bug.cgi?id=2326
It might be worth trying an update to that version.

Thank you. I will try installing these new patches on the servers when I'm at their location later today.
By pvetest do you mean the non-enterprise repository? Or is there a third repo that I'm not aware of?

That sounds too odd to be a coincidence...
My guess would be to check if the switch used might have a bug related to that or if there's a firmware update available.

Alternatively, you could set up a dedicated physical network for corosync communication (for two nodes, just use one network cable to connect them directly) and thus exclude the network equipment in between as a possible source of the problem.

I thought so too, seems oddly specific.
My switch is a Ubiquiti EdgeSwitch-10XP. There is firmware from August that I have not applied yet, but the patch notes do not seem to claim that they fixed anything which could be related to this.

Regarding the separate corosync network: how would I go about using my Qdevice if I did this? My servers currently use the secondary network interface for iSCSI and corosync traffic. I could move storage traffic to the main interface and use a point-to-point link between the two secondary interfaces, but since I have a two-node cluster, I need a Qdevice to get quorum working. Is there any way to get that connected, given that I wouldn't be able to attach a third device to the corosync network without a switch?

Thanks for all the help so far
 
Regarding the separate corosync network: how would I go about using my Qdevice if I did this?
Sorry I did not consider the Qdevice...
You could use a simple and cheap 1G switch for the corosync traffic and connect all three machines to it via dedicated interfaces.
(Corosync does not need much bandwidth, but it is very sensitive to latency.)
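
Once the dedicated link is in place, a simple ping between the corosync addresses gives a rough idea of the latency the cluster will see (the address is a placeholder):

Code:
# latency should be consistently low (ideally a few milliseconds at most)
ping -c 100 -i 0.2 <other-node-corosync-ip>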

My switch is a Ubiquiti EdgeSwitch-10XP. There is firmware from August that I have not applied yet, but the patch notes do not seem to claim that they fixed anything which could be related to this.
Hm - maybe the switch provides some (debug) logs which could give a hint as to the source of the problem?

I hope this helps!
 
Thanks for all the help, Stoiko. I got my hands on a dedicated 1 Gbit "dumb" switch for corosync traffic and moved all corosync traffic to its own adapter over the weekend. Guess we'll see what happens in about a week.
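
In the meantime, this is roughly what I'm using to keep an eye on the new link:

Code:
# per-link status as corosync/knet sees it
corosync-cfgtool -s
# overall quorum and Qdevice state
pvecm status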
 
Sounds like a good plan! Please report back (in any case :)

Thanks!
 
