I've been having a really weird issue where one of my Proxmox nodes fails exactly one week after the cluster is fixed, and I've really been struggling to troubleshoot it, especially since it takes a week to see whether any attempted fix actually worked.
My cluster has been in a loop where every week one of the nodes' pvesr service fails with the following details in its logs:
Code:
Oct 13 20:57:00 tethealla systemd[1]: Starting Proxmox VE replication runner...
Oct 13 20:57:00 tethealla pvesr[2019]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 13 20:57:01 tethealla pvesr[2019]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 13 20:57:02 tethealla pvesr[2019]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 13 20:57:03 tethealla pvesr[2019]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 13 20:57:04 tethealla pvesr[2019]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 13 20:57:05 tethealla pvesr[2019]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 13 20:57:06 tethealla pvesr[2019]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 13 20:57:07 tethealla pvesr[2019]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 13 20:57:08 tethealla pvesr[2019]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 13 20:57:09 tethealla pvesr[2019]: error with cfs lock 'file-replication_cfg': no quorum!
Oct 13 20:57:09 tethealla systemd[1]: pvesr.service: Main process exited, code=exited, status=13/n/a
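For reference, these are roughly the checks I run on the affected node when pvesr starts failing like this. I'm not pasting the full output here, so treat it as a sketch of the diagnostics rather than a complete log (the journalctl timestamps are just illustrative):
Code:
# Proxmox's view of cluster membership and quorum
pvecm status

# corosync's own quorum view on the affected node
corosync-quorumtool -s

# recent corosync / pmxcfs messages around the failure
journalctl -u corosync -u pve-cluster --since "2019-10-13 20:50" --until "2019-10-13 21:10"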
At first I didn't know what it could be, but looking for patterns in how the servers failed, I observed the following details:
- Node A first failed on Saturday September 28th.
- I couldn't figure out how to get it working again, so I shut down the node, removed it from the cluster and reinstalled/rejoined it.
- Exactly one week later, on Saturday October 5th, Node B failed instead.
- This time I wasn't on-site on Saturday, so I reinstalled it on Sunday October 6th instead.
- Exactly one week after that fix, on Sunday October 13th, Node B failed again (a quick check of the interval is shown below).
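Just to sanity-check the "exactly one week" pattern, I did the rough arithmetic with date. The timestamps below are my best guesses (I only wrote down the dates), so treat the exact times as placeholders:
Code:
# assumed fix time on Sunday October 6th and failure time on Sunday October 13th
FIXED=$(date -d "2019-10-06 20:57:00" +%s)
FAILED=$(date -d "2019-10-13 20:57:00" +%s)

# difference in whole hours: prints 168
echo $(( (FAILED - FIXED) / 3600 ))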
The only thing I can imagine causing this is that I upgraded to Proxmox VE 6 in August, but that still leaves a one-month gap before these issues started happening. Corosync seems to establish a link for a few seconds, but then the server promptly gets "kicked out":
Code:
Oct 15 01:59:03 tethealla systemd[1]: Started Corosync Cluster Engine.
Oct 15 01:59:28 tethealla corosync[1045]: [KNET ] rx: host: 1 link: 0 is up
Oct 15 01:59:28 tethealla corosync[1045]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Oct 15 01:59:28 tethealla corosync[1045]: [KNET ] pmtud: PMTUD link change for host: 1 link: 0 from 469 to 1397
Oct 15 01:59:28 tethealla corosync[1045]: [KNET ] pmtud: Global data MTU changed to: 1397
Oct 15 12:36:41 tethealla corosync[1045]: [KNET ] link: host: 1 link: 0 is down
Oct 15 12:36:41 tethealla corosync[1045]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Oct 15 12:36:41 tethealla corosync[1045]: [KNET ] host: host: 1 has no active links
Oct 15 12:36:51 tethealla corosync[1045]: [QUORUM] This node is within the primary component and will provide service.
Oct 15 12:36:51 tethealla corosync[1045]: [QUORUM] Members[1]: 2
Oct 15 12:36:52 tethealla corosync[1045]: [KNET ] rx: host: 1 link: 0 is up
Oct 15 12:36:52 tethealla corosync[1045]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Oct 15 12:36:53 tethealla corosync[1045]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
Oct 15 12:36:53 tethealla corosync[1045]: [QUORUM] Members[1]: 2
Oct 15 12:36:58 tethealla corosync[1045]: [KNET ] link: host: 1 link: 0 is down
Oct 15 12:36:58 tethealla corosync[1045]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Oct 15 12:37:00 tethealla corosync[1045]: [KNET ] rx: host: 1 link: 0 is up
Oct 15 12:37:00 tethealla corosync[1045]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
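While the link is flapping like this I also check the knet link state and basic connectivity directly; again just a sketch of the checks I run, with a placeholder address for the other node:
Code:
# corosync's view of each link to the other nodes
corosync-cfgtool -s

# follow corosync messages live while the link flaps
journalctl -fu corosync

# basic reachability of the other node over the cluster network (placeholder IP)
ping -c 5 192.168.1.10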
This is the original source of my confusion: if I shut down the stray node, remove it from the cluster using the remaining node, reinstall it, rejoin it to the cluster, remove the quorum device and then re-add the quorum device, the cluster will work normally for exactly one week. I didn't note down all the timestamps, but it seems to be almost to the minute, as in 168 hours, 0 minutes and 0 seconds after the cluster was last fixed. The recovery sequence I go through each time is sketched below.
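For completeness, this is roughly that recovery sequence, written out as the pvecm commands I believe correspond to each step (the node name and IP addresses are placeholders, and the reinstall itself is done from the ISO in between):
Code:
# on the remaining healthy node, after shutting down the broken one
pvecm delnode nodeb

# on the freshly reinstalled node, join it back to the cluster
# (192.168.1.10 is a placeholder for the healthy node's IP)
pvecm add 192.168.1.10

# remove and re-add the QDevice from a cluster node
# (192.168.1.20 is a placeholder for the qdevice host)
pvecm qdevice remove
pvecm qdevice setup 192.168.1.20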