Hello 

For the past 24 hours I have been banging my head against a problem that appeared on its own yesterday around 12 pm, according to the logs.
Since yesterday, I can't get my cluster to work properly:
I have 3 nodes A, B, C.
If I turn on nodes A and B or B and C, everything works fine.
As soon as I turn on A and C together (so either A and C, or A, B and C), commands start hanging: the GUI bugs out (long freezes, I can't reach the C node interface from A, or A from C), the replication tabs of the VMs won't load, and cluster commands like pvesh get /nodes hang until I turn off one of the nodes.
Also, I have a Samba VM on node C, and if A and C are on at the same time, its read/write performance drops.
Looking at the nodes' I/O, they seem to be constantly exchanging a lot of data.
Weirdly, I have full quorum and not much trouble starting/stopping VMs.
The following data comes from a state where all three nodes were running.
The problem seems to come from pvescheduler (node C):
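For reference, here is roughly how I've been checking quorum and the corosync links while reproducing the problem (standard commands; the 10.0.0.x addresses below are placeholders for the actual corosync ring IPs of nodes A and C):

```shell
# Cluster membership and quorum state, from any node
pvecm status

# Corosync link status per node (connected state, link faults)
corosync-cfgtool -s

# Watch corosync live while powering on A and C together
journalctl -u corosync -f

# Probe latency/loss on the cluster network between A and C
# (run on both nodes at once; replace the placeholder IPs)
omping -c 600 -i 1 -q 10.0.0.1 10.0.0.3
```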
Code:
pvescheduler.service - Proxmox VE scheduler
Loaded: loaded (/lib/systemd/system/pvescheduler.service; enabled; preset: enabled)
Active: active (running) since Mon 2025-03-31 18:10:37 CEST; 3h 51min ago
Process: 10658 ExecStart=/usr/bin/pvescheduler start (code=exited, status=0/SUCCESS)
Main PID: 10833 (pvescheduler)
Tasks: 3 (limit: 38286)
Memory: 118.4M
CPU: 3.882s
CGroup: /system.slice/pvescheduler.service
├─ 10833 pvescheduler
├─109087 pvescheduler
└─109088 pvescheduler
Mar 31 18:28:09 pvestorage01 pvescheduler[24910]: jobs: cfs-lock 'file-jobs_cfg' error: no quorum!
Mar 31 18:28:09 pvestorage01 pvescheduler[24909]: replication: cfs-lock 'file-replication_cfg' error: no quorum!
Mar 31 21:10:35 pvestorage01 pvescheduler[94073]: jobs: cfs-lock 'file-jobs_cfg' error: got lock request timeout
Mar 31 21:10:35 pvestorage01 pvescheduler[94072]: replication: cfs-lock 'file-replication_cfg' error: no quorum!
Mar 31 21:31:44 pvestorage01 pvescheduler[98848]: replication: cfs-lock 'file-replication_cfg' error: got lock request timeout
Mar 31 21:31:44 pvestorage01 pvescheduler[98849]: jobs: cfs-lock 'file-jobs_cfg' error: got lock request timeout
Mar 31 21:39:32 pvestorage01 pvescheduler[100460]: replication: cfs-lock 'file-replication_cfg' error: got lock request timeout
Mar 31 21:39:32 pvestorage01 pvescheduler[100461]: jobs: cfs-lock 'file-jobs_cfg' error: got lock request timeout
Mar 31 22:00:21 pvestorage01 pvescheduler[104669]: replication: cfs-lock 'file-replication_cfg' error: got lock request timeout
Mar 31 22:00:21 pvestorage01 pvescheduler[102682]: jobs: cfs-lock 'file-jobs_cfg' error: got lock request timeout
pve-ha-lrm (Node C):
Code:
pve-ha-lrm.service - PVE Local HA Resource Manager Daemon
Loaded: loaded (/lib/systemd/system/pve-ha-lrm.service; enabled; preset: enabled)
Active: active (running) since Mon 2025-03-31 17:56:41 CEST; 4h 5min ago
Process: 3987 ExecStart=/usr/sbin/pve-ha-lrm start (code=exited, status=0/SUCCESS)
Main PID: 3999 (pve-ha-lrm)
Tasks: 1 (limit: 38286)
Memory: 111.0M
CPU: 1.774s
CGroup: /system.slice/pve-ha-lrm.service
└─3999 pve-ha-lrm
Mar 31 21:34:38 pvestorage01 pve-ha-lrm[3999]: loop take too long (172 seconds)
Mar 31 21:39:32 pvestorage01 pve-ha-lrm[3999]: unable to write lrm status file - close (rename) atomic file '/etc/pve/nodes/pvestorage01/lrm_status' failed: Permission denied
Mar 31 21:39:37 pvestorage01 pve-ha-lrm[3999]: loop take too long (299 seconds)
Mar 31 21:44:02 pvestorage01 pve-ha-lrm[3999]: unable to write lrm status file - unable to open file '/etc/pve/nodes/pvestorage01/lrm_status.tmp.3999' - Permission denied
Mar 31 21:44:07 pvestorage01 pve-ha-lrm[3999]: loop take too long (270 seconds)
Mar 31 21:52:51 pvestorage01 pve-ha-lrm[3999]: unable to write lrm status file - unable to write '/etc/pve/nodes/pvestorage01/lrm_status.tmp.3999' - Permission denied
Mar 31 21:52:56 pvestorage01 pve-ha-lrm[3999]: loop take too long (529 seconds)
Mar 31 21:52:56 pvestorage01 pve-ha-lrm[3999]: unable to write lrm status file - unable to delete old temp file: Device or resource busy
Mar 31 22:00:21 pvestorage01 pve-ha-lrm[3999]: unable to write lrm status file - unable to open file '/etc/pve/nodes/pvestorage01/lrm_status.tmp.3999' - Permission denied
Mar 31 22:00:26 pvestorage01 pve-ha-lrm[3999]: loop take too long (445 seconds)
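Side note: /etc/pve is the pmxcfs FUSE mount, and it goes read-only whenever quorum is lost, which would explain the "Permission denied" write errors above. That can be checked with:

```shell
# Show the /etc/pve mount and its options (rw vs ro)
findmnt /etc/pve

# State of the cluster filesystem daemon itself
systemctl status pve-cluster
```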
pve-ha-lrm (Node A):
Code:
pve-ha-lrm.service - PVE Local HA Resource Manager Daemon
Loaded: loaded (/lib/systemd/system/pve-ha-lrm.service; enabled; preset: enabled)
Active: active (running) since Mon 2025-03-31 21:09:34 CEST; 59min ago
Process: 2140 ExecStart=/usr/sbin/pve-ha-lrm start (code=exited, status=0/SUCCESS)
Main PID: 2144 (pve-ha-lrm)
Tasks: 1 (limit: 76898)
Memory: 111.0M
CPU: 593ms
CGroup: /system.slice/pve-ha-lrm.service
└─2144 pve-ha-lrm
Mar 31 21:17:27 R340-01 pve-ha-lrm[2144]: loop take too long (69 seconds)
Mar 31 21:20:27 R340-01 pve-ha-lrm[2144]: loop take too long (175 seconds)
Mar 31 21:22:37 R340-01 pve-ha-lrm[2144]: loop take too long (110 seconds)
Mar 31 21:24:15 R340-01 pve-ha-lrm[2144]: loop take too long (78 seconds)
Mar 31 21:26:08 R340-01 pve-ha-lrm[2144]: loop take too long (93 seconds)
Mar 31 21:28:00 R340-01 pve-ha-lrm[2144]: loop take too long (92 seconds)
Mar 31 21:29:47 R340-01 pve-ha-lrm[2144]: loop take too long (92 seconds)
Mar 31 21:31:36 R340-01 pve-ha-lrm[2144]: loop take too long (89 seconds)
Mar 31 21:34:38 R340-01 pve-ha-lrm[2144]: loop take too long (172 seconds)
Mar 31 21:52:57 R340-01 pve-ha-lrm[2144]: loop take too long (1094 seconds)
pvestatd service (Node C):
Code:
pvestatd.service - PVE Status Daemon
Loaded: loaded (/lib/systemd/system/pvestatd.service; enabled; preset: enabled)
Active: active (running) since Mon 2025-03-31 17:54:11 CEST; 4h 8min ago
Process: 2735 ExecStart=/usr/bin/pvestatd start (code=exited, status=0/SUCCESS)
Main PID: 2774 (pvestatd)
Tasks: 1 (limit: 38286)
Memory: 164.6M
CPU: 7min 6.894s
CGroup: /system.slice/pvestatd.service
└─2774 pvestatd
Mar 31 21:26:03 pvestorage01 pvestatd[2774]: authkey rotation error: cfs-lock 'authkey' error: got lock request timeout
Mar 31 21:26:03 pvestorage01 pvestatd[2774]: status update time (112.975 seconds)
Mar 31 21:31:44 pvestorage01 pvestatd[2774]: authkey rotation error: cfs-lock 'authkey' error: got lock request timeout
Mar 31 21:31:44 pvestorage01 pvestatd[2774]: status update time (340.638 seconds)
Mar 31 21:34:33 pvestorage01 pvestatd[2774]: authkey rotation error: cfs-lock 'authkey' error: got lock request timeout
Mar 31 21:34:33 pvestorage01 pvestatd[2774]: status update time (169.084 seconds)
Mar 31 21:39:32 pvestorage01 pvestatd[2774]: authkey rotation error: cfs-lock 'authkey' error: got lock request timeout
Mar 31 21:39:33 pvestorage01 pvestatd[2774]: status update time (299.533 seconds)
Mar 31 22:00:21 pvestorage01 pvestatd[2774]: authkey rotation error: cfs-lock 'authkey' error: got lock request timeout
Mar 31 22:00:22 pvestorage01 pvestatd[2774]: status update time (1249.205 seconds)
I updated all three nodes to the latest version.
pveversion -v (Node C):
Code:
proxmox-ve: 8.3.0 (running kernel: 6.8.12-9-pve)
pve-manager: 8.3.5 (running version: 8.3.5/dac3aa88bac3f300)
proxmox-kernel-helper: 8.1.1
proxmox-kernel-6.8: 6.8.12-9
proxmox-kernel-6.8.12-9-pve-signed: 6.8.12-9
proxmox-kernel-6.8.4-2-pve-signed: 6.8.4-2
ceph-fuse: 17.2.7-pve3
corosync: 3.1.7-pve3
criu: 3.17.1-2+deb12u1
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx11
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-5
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.6.0
libproxmox-backup-qemu0: 1.5.1
libproxmox-rs-perl: 0.3.5
libpve-access-control: 8.2.0
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.0.10
libpve-cluster-perl: 8.0.10
libpve-common-perl: 8.2.9
libpve-guest-common-perl: 5.1.6
libpve-http-server-perl: 5.2.0
libpve-network-perl: 0.10.1
libpve-rs-perl: 0.9.2
libpve-storage-perl: 8.3.3
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.5.0-1
proxmox-backup-client: 3.3.4-1
proxmox-backup-file-restore: 3.3.4-1
proxmox-firewall: 0.6.0
proxmox-kernel-helper: 8.1.1
proxmox-mail-forward: 0.3.1
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.7
proxmox-widget-toolkit: 4.3.7
pve-cluster: 8.0.10
pve-container: 5.2.4
pve-docs: 8.3.1
pve-edk2-firmware: 4.2023.08-4
pve-esxi-import-tools: 0.7.2
pve-firewall: 5.1.0
pve-firmware: 3.14-3
pve-ha-manager: 4.0.6
pve-i18n: 3.4.1
pve-qemu-kvm: 9.2.0-2
pve-xtermjs: 5.3.0-3
qemu-server: 8.3.8
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.7-pve2
Hope somebody can help. In any case, thanks for reading and for scratching your head along with me!
