pvescheduler: jobs: cfs-lock 'file-jobs_cfg' error: got lock request timeout

ecolaizzi · May 5, 2023

Good morning,

I have a 3-node hyperconverged Proxmox VE cluster (Ceph).
Lately I have started getting some strange syslog errors about file lock timeouts and Ceph mon failures:

Code:

[2023-05-05 06:54:15.000] pve02 pvescheduler[232656]: replication: cfs-lock 'file-replication_cfg' error: got lock request timeout
[2023-05-05 06:32:12.000] pve03 pvescheduler[256736]: replication: cfs-lock 'file-replication_cfg' error: got lock request timeout

[2023-05-05 04:15:22.000] pve01 pvescheduler[200412]: jobs: cfs-lock 'file-jobs_cfg' error: got lock request timeout
[2023-05-05 04:15:15.000] pve03 pvescheduler[251472]: jobs: cfs-lock 'file-jobs_cfg' error: got lock request timeout
[2023-05-05 04:15:11.000] pve02 pvescheduler[227323]: jobs: cfs-lock 'file-jobs_cfg' error: got lock request timeout
[2023-05-05 04:00:13.000] pve03 pvescheduler[247824]: jobs: cfs-lock 'file-jobs_cfg' error: got lock request timeout

[2023-05-05 01:14:07.000] pve01 ceph-mgr[28824]: 2023-05-05T01:14:07.732+0200 7fc102be6500 -1 failed for service _ceph-mon._tcp
[2023-05-05 01:14:07.000] pve01 systemd[1]: ceph-mgr@pve-bgp-rm01.service: Failed with result 'exit-code'.

[2023-05-05 00:28:15.000] pve01 pvescheduler[145218]: jobs: cfs-lock 'file-jobs_cfg' error: got lock request timeout
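
To make the pattern easier to see, the lock timeouts above can be tallied per node and per lock name. This is only a minimal sketch in Python, assuming the exact line layout shown in the excerpt; the function name `count_lock_timeouts` and the regular expression are my own and are not part of any Proxmox tool.

```python
import re
from collections import Counter

# Assumed layout, taken from the excerpt above:
# [timestamp] host pvescheduler[pid]: area: cfs-lock 'name' error: got lock request timeout
LOCK_TIMEOUT = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+(?P<host>\S+)\s+pvescheduler\[\d+\]:\s+"
    r"\w+:\s+cfs-lock '(?P<lock>[^']+)' error: got lock request timeout"
)


def count_lock_timeouts(lines):
    """Count lock-timeout entries, keyed by (host, lock name)."""
    counts = Counter()
    for line in lines:
        match = LOCK_TIMEOUT.match(line.strip())
        if match:
            counts[(match["host"], match["lock"])] += 1
    return counts


if __name__ == "__main__":
    # A few lines copied from the syslog excerpt; unrelated lines are ignored.
    sample = [
        "[2023-05-05 06:54:15.000] pve02 pvescheduler[232656]: replication: cfs-lock 'file-replication_cfg' error: got lock request timeout",
        "[2023-05-05 04:15:22.000] pve01 pvescheduler[200412]: jobs: cfs-lock 'file-jobs_cfg' error: got lock request timeout",
        "[2023-05-05 01:14:07.000] pve01 systemd[1]: ceph-mgr@pve-bgp-rm01.service: Failed with result 'exit-code'.",
    ]
    for (host, lock), n in sorted(count_lock_timeouts(sample).items()):
        print(host, lock, n)
```

Feeding it the full syslog instead of the sample would show whether the timeouts cluster on one node or are spread evenly across all three.
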

Is this something to worry about? Can anyone tell me what's causing this?

To be clear, the servers are not short on resources: all three nodes have a constant load, using less than 10% of their available RAM, disk, and CPU. I also ran failover and migration tests, and everything completes successfully in less than 5 seconds.

At the network level, each node has a 20 Gbit (2x10 bond) uplink to the core switch and VLAN-based traffic segmentation (HA has a dedicated VLAN).
The Proxmox version is 7.4.3. We had the same issue on 7.3.6 and upgraded in case it was caused by a kernel bug or similar in that version.

Thank you in advance,
Edwin.

ecolaizzi · May 8, 2023

New syslog entries:

Code:

[2023-05-07 22:00:25.000] pve02 ceph-mon[1800]: 2023-05-08T00:00:34.696+0200 7f601cfff700 -1 Fail to read '/proc/1211621/cmdline' error = (3) No such process
[2023-05-07 22:00:25.000] pve02 ceph-osd[2247]: 2023-05-08T00:00:34.696+0200 7f8f20910700 -1 Fail to open '/proc/1211621/cmdline' error = (2) No such file or directory
[2023-05-08 05:46:18.000] pve01 ceph-mgr[1282602]: 2023-05-08T07:46:18.302+0200 7f75512a6500 -1 failed for service _ceph-mon._tcp
[2023-05-08 05:46:18.000] pve01 systemd[1]: ceph-mgr@pve-bgp-rm01.service: Failed with result 'exit-code'.
[2023-05-08 05:46:18.000] pve01 systemd[1]: ceph-mgr@pve-bgp-rm01.service: Main process exited, code=exited, status=1/FAILURE
[2023-05-08 05:46:18.000] pve01 ceph-mgr[1282602]: failed to fetch mon config (--no-mon-config to skip)

ecolaizzi · May 15, 2023

Has anyone encountered similar errors? I would like to understand whether this is normal behaviour or not.
Thank you.

ecolaizzi · May 26, 2023

Hi, has anyone ever experienced something like this?
