Whole cluster reboot because of CPU throttling after a recent update

cairo

We have a cluster of 25 nodes; all machines are backed up to PBS. The last update to the latest version on the Proxmox nodes that host our VMs and LXCs was carried out on 24.11. However, on 3.12 in the morning a problem appeared on one node: its CPU load rose to 100% and it was impossible to work in the LXCs running on it. In the system log we saw messages about CPU throttling, preceded by problems communicating with the other nodes (critical messages in the log). Additionally, we could not log in to the cluster GUI on any of the nodes. We restarted that node at 09:13:37, and afterwards the throttling problem appeared on another machine. We had to restart the entire cluster to fix it.
I saw an unusually high load average before the incident. Such a high load has never appeared in the history of this server.
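For reference, the kernel messages around the throttling event and the realtime-throttling limits can be pulled roughly like this (the time window below is just the one around our incident; the sysctl line shows the current limits, for which the usual kernel defaults are 1000000/950000):

Code:
# kernel messages around the incident on the affected node
journalctl -k --since "2024-12-03 08:45" --until "2024-12-03 09:20"
# realtime-throttling limits (period / runtime in microseconds)
sysctl kernel.sched_rt_period_us kernel.sched_rt_runtime_us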



System:
Debian 12
Linux 6.8.12-4-pve (2024-11-06T15:04Z)
pve-manager/8.3.0/c1689ccb1065a83b

Package versions:
libpve-rs-perl/stable 0.9.0
proxmox-backup-client/stable 3.2.9-1
proxmox-backup-file-restore/stable 3.2.9-1
proxmox-widget-toolkit/stable 4.3.1
pve-i18n/stable 3.3.1



Syslog (when the problem appeared):
2024-12-03T09:02:59.927819+00:00 CloudProx5 pmxcfs[1408]: [status] crit: cpg_send_message failed: 6
2024-12-03T09:03:00.894253+00:00 CloudProx5 pmxcfs[1408]: [dcdb] notice: cpg_send_message retry 90
2024-12-03T09:03:00.931447+00:00 CloudProx5 pmxcfs[1408]: [status] notice: cpg_send_message retry 10
2024-12-03T09:03:01.895661+00:00 CloudProx5 pmxcfs[1408]: [dcdb] notice: cpg_send_message retry 100
2024-12-03T09:03:01.895882+00:00 CloudProx5 pmxcfs[1408]: [dcdb] notice: cpg_send_message retried 100 times
2024-12-03T09:03:01.895980+00:00 CloudProx5 pmxcfs[1408]: [dcdb] crit: failed to send SYNC_START message
2024-12-03T09:03:01.896043+00:00 CloudProx5 pmxcfs[1408]: [dcdb] crit: leaving CPG group
2024-12-03T09:03:01.932802+00:00 CloudProx5 pmxcfs[1408]: [status] notice: cpg_send_message retry 20
2024-12-03T09:03:02.933752+00:00 CloudProx5 pmxcfs[1408]: [status] notice: cpg_send_message retry 30
2024-12-03T09:03:03.676600+00:00 CloudProx5 kernel: [ 1702.060476] sched: RT throttling activated
2024-12-03T09:03:03.934744+00:00 CloudProx5 pmxcfs[1408]: [status] notice: cpg_send_message retry 40
2024-12-03T09:03:04.935681+00:00 CloudProx5 pmxcfs[1408]: [status] notice: cpg_send_message retry 50
2024-12-03T09:03:05.936649+00:00 CloudProx5 pmxcfs[1408]: [status] notice: cpg_send_message retry 60
2024-12-03T09:03:06.937594+00:00 CloudProx5 pmxcfs[1408]: [status] notice: cpg_send_message retry 70
2024-12-03T09:03:07.938539+00:00 CloudProx5 pmxcfs[1408]: [status] notice: cpg_send_message retry 80
2024-12-03T09:03:08.939442+00:00 CloudProx5 pmxcfs[1408]: [status] notice: cpg_send_message retry 90
2024-12-03T09:03:09.940523+00:00 CloudProx5 pmxcfs[1408]: [status] notice: cpg_send_message retry 100
2024-12-03T09:03:09.940632+00:00 CloudProx5 pmxcfs[1408]: [status] notice: cpg_send_message retried 100 times



/var/log/apt/history.log.1.gz:
Start-Date: 2024-11-21 13:14:22
Commandline: apt dist-upgrade -y
Install: proxmox-kernel-6.8.12-4-pve-signed:amd64 (6.8.12-4, automatic), libpve-network-api-perl:amd64 (0.10.0, automatic)
Upgrade: pve-docs:amd64 (8.2.3, 8.2.5), proxmox-widget-toolkit:amd64 (4.2.3, 4.3.1), libpve-rs-perl:amd64 (0.8.10, 0.9.0), pve-firmware:amd64 (3.13-2, 3.14-1), pve-qemu-kvm:amd64 (9.0.2-3, 9.0.2-4), libjs-extjs:amd64 (7.0.0-4, 7.0.0-5), proxmox-mail-forward:amd64 (0.2.3, 0.3.1), libpve-cluster-api-perl:amd64 (8.0.7, 8.0.10), pve-ha-manager:amd64 (4.0.5, 4.0.6), libpve-storage-perl:amd64 (8.2.5, 8.2.9), libpve-guest-common-perl:amd64 (5.1.4, 5.1.6), proxmox-kernel-6.8:amd64 (6.8.12-2, 6.8.12-4), pve-cluster:amd64 (8.0.7, 8.0.10), novnc-pve:amd64 (1.4.0-4, 1.5.0-1), proxmox-backup-file-restore:amd64 (3.2.7-1, 3.2.9-1), ifupdown2:amd64 (3.2.0-1+pmx9, 3.2.0-1+pmx11), qemu-server:amd64 (8.2.4, 8.2.7), libpve-access-control:amd64 (8.1.4, 8.2.0), pve-container:amd64 (5.2.0, 5.2.2), pve-i18n:amd64 (3.2.3, 3.3.0), proxmox-archive-keyring:amd64 (3.0, 3.1), proxmox-backup-client:amd64 (3.2.7-1, 3.2.9-1), libpve-http-server-perl:amd64 (5.1.1, 5.1.2), proxmox-firewall:amd64 (0.5.0, 0.6.0), pve-manager:amd64 (8.2.7, 8.2.10), libpve-common-perl:amd64 (8.2.3, 8.2.9), libpve-network-perl:amd64 (0.9.8, 0.10.0), libpve-notify-perl:amd64 (8.0.7, 8.0.10), pve-firewall:amd64 (5.0.7, 5.1.0), libpve-cluster-perl:amd64 (8.0.7, 8.0.10)
End-Date: 2024-11-21 13:17:11

Start-Date: 2024-11-22 06:54:02
Commandline: /usr/bin/unattended-upgrade
Remove: proxmox-kernel-6.8.12-1-pve-signed:amd64 (6.8.12-1)
End-Date: 2024-11-22 06:54:23

Start-Date: 2024-11-24 12:00:35
Commandline: apt dist-upgrade -y
Upgrade: pve-docs:amd64 (8.2.5, 8.3.1), proxmox-ve:amd64 (8.2.0, 8.3.0), qemu-server:amd64 (8.2.7, 8.3.0), pve-i18n:amd64 (3.3.0, 3.3.1), pve-manager:amd64 (8.2.10, 8.3.0)
End-Date: 2024-11-24 12:01:05

Start-Date: 2024-11-24 12:02:11
Commandline: apt autoremove
Remove: libperl5.32:amd64 (5.32.1-4+deb11u3), libvpx6:amd64 (1.9.0-1+deb11u3), libcodec2-0.9:amd64 (0.9.2-4), g++-10:amd64 (10.2.1-6), libidn11:amd64 (1.33-3), libleveldb1d:amd64 (1.23-4), libx264-160:amd64 (2:0.160.3011+gitcde9a93-2.1), libmpdec3:amd64 (2.5.1-1), libaom0:amd64 (1.0.0.errata1-3+deb11u1), libx265-192:amd64 (3.4-2), libfftw3-double3:amd64 (3.3.10-1), libdav1d4:amd64 (0.7.1-3+deb11u1), libtiff5:amd64 (4.2.0-1+deb11u5), libsigsegv2:amd64 (2.14-1), libllvm11:amd64 (1:11.0.1-2), libflac8:amd64 (1.3.3-2+deb11u2), libbpf0:amd64 (1:0.3-2), libldap-2.4-2:amd64 (2.4.57+dfsg-3+deb11u1), libpostproc55:amd64 (7:4.3.7-0+deb11u1), libisc-export1105:amd64 (1:9.11.19+dfsg-2.1), libpython3.9-stdlib:amd64 (3.9.2-1), libavcodec58:amd64 (7:4.3.7-0+deb11u1), libcbor0:amd64 (0.5.0+dfsg-2), libboost-coroutine1.74.0:amd64 (1.74.0+ds1-21), net-tools:amd64 (2.10-0.1), libpython3.9:amd64 (3.9.2-1), liburing1:amd64 (0.7-3), libavutil56:amd64 (7:4.3.7-0+deb11u1), libwebp6:amd64 (0.6.1-2.1+deb11u2), libswscale5:amd64 (7:4.3.7-0+deb11u1), libprocps8:amd64 (2:3.3.17-5), guile-2.2-libs:amd64 (2.2.7+1-9), libdns-export1110:amd64 (1:9.11.19+dfsg-2.1), libswresample3:amd64 (7:4.3.7-0+deb11u1), libprotobuf23:amd64 (3.12.4-1+deb11u1), libsrt1.4-gnutls:amd64 (1.4.2-1.3), libavformat58:amd64 (7:4.3.7-0+deb11u1), perl-modules-5.32:amd64 (5.32.1-4+deb11u3), libpython3.9-minimal:amd64 (3.9.2-1), libigdgmm11:amd64 (20.4.1+ds1-1), python3.9:amd64 (3.9.2-1), libstdc++-10-dev:amd64 (10.2.1-6), libicu67:amd64 (67.1-7), liburcu6:amd64 (0.12.2-1), python3.9-minimal:amd64 (3.9.2-1), libavfilter7:amd64 (7:4.3.7-0+deb11u1)
End-Date: 2024-11-24 12:02:40

Start-Date: 2024-11-25 06:54:48
Commandline: /usr/bin/unattended-upgrade
Upgrade: linux-libc-dev:amd64 (6.1.115-1, 6.1.119-1)
End-Date: 2024-11-25 06:54:52


The journalctl log from corosync is attached as a file.
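(Exported with something along these lines; the exact time range is approximate:)

Code:
journalctl -u corosync -u pve-cluster --since "2024-12-03 08:30" --until "2024-12-03 10:00" > corosync-journal.txt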


After this incident, we updated PVE to the newest version:

root@CloudProx5:~# pveversion -v
proxmox-ve: 8.3.0 (running kernel: 6.8.12-4-pve)
pve-manager: 8.3.0 (running version: 8.3.0/c1689ccb1065a83b)
proxmox-kernel-helper: 8.1.0
pve-kernel-5.15: 7.4-15
proxmox-kernel-6.8: 6.8.12-4
proxmox-kernel-6.8.12-4-pve-signed: 6.8.12-4
proxmox-kernel-6.8.12-2-pve-signed: 6.8.12-2
pve-kernel-5.15.158-2-pve: 5.15.158-2
ceph-fuse: 16.2.11+ds-2
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx11
libjs-extjs: 7.0.0-5
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.1
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.4
libpve-access-control: 8.2.0
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.0.10
libpve-cluster-perl: 8.0.10
libpve-common-perl: 8.2.9
libpve-guest-common-perl: 5.1.6
libpve-http-server-perl: 5.1.2
libpve-network-perl: 0.10.0
libpve-rs-perl: 0.9.1
libpve-storage-perl: 8.2.9
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.5.0-1
proxmox-backup-client: 3.3.0-1
proxmox-backup-file-restore: 3.3.0-1
proxmox-firewall: 0.6.0
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.3.1
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.7
proxmox-widget-toolkit: 4.3.3
pve-cluster: 8.0.10
pve-container: 5.2.2
pve-docs: 8.3.1
pve-edk2-firmware: 4.2023.08-4
pve-esxi-import-tools: 0.7.2
pve-firewall: 5.1.0
pve-firmware: 3.14-1
pve-ha-manager: 4.0.6
pve-i18n: 3.3.2
pve-qemu-kvm: 9.0.2-4
pve-xtermjs: 5.3.0-3
qemu-server: 8.3.0
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.6-pve1




The issue is similar to the bug described in this report:
https://bugzilla.proxmox.com/show_bug.cgi?id=5868


We can provide more logs if you need them.
 

Well, you'd need to find out what's causing the load. If your system is severely overloaded, that can affect corosync (which in turn, if HA is enabled, can cause nodes to get fenced).

I already told you in the linked bug report that this issue is in no way related.
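A rough starting point for that (adjust to your setup; the /proc/pressure files are only there if pressure stall information is enabled in the kernel):

Code:
uptime
top -b -n 1 -o %CPU | head -n 25                                  # biggest CPU consumers right now
cat /proc/pressure/cpu /proc/pressure/io /proc/pressure/memory    # CPU/IO/memory pressure, if PSI is enabled
journalctl -p warning --since "-2h"                               # recent warnings and errors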
 
The problem has appeared again; we can't log in to the GUI from any node. Somehow we managed to log in on node 1, but only from one PC, and all nodes are greyed out. We can see the summary status of each node and of the VMs and LXCs, which seem to be working fine, but without names, just IDs. When logged in to node 1 via SSH, we can see that /etc/pve has a timestamp from the year 1970, and when we try to enter that folder the SSH session hangs and we have to open another SSH window; this happens on all nodes. pvecm status seems OK.
Code:
root@node1:~# pvecm status
Cluster information
-------------------
Name:             CloudKlaster
Config Version:   27
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Mon Jan 27 14:40:04 2025
Quorum provider:  corosync_votequorum
Nodes:            27
Node ID:          0x00000001
Ring ID:          1.26b4
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   27
Highest expected: 27
Total votes:      27
Quorum:           14 
Flags:            Quorate
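Next time this happens we will probably also check the cluster filesystem (pmxcfs) itself, something like this (just a sketch; the timeout is there because a plain ls on a stuck /etc/pve hangs the shell):

Code:
systemctl status pve-cluster corosync
journalctl -u pve-cluster -u corosync --since "-2h"
timeout 5 ls -la /etc/pve    # if pmxcfs is stuck this would hang, so wrap it in a timeout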
 
What about the status of "pvestatd"? It's the service responsible for broadcasting the node and guest status in the cluster.
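For example (purely illustrative):

Code:
systemctl status pvestatd
journalctl -u pvestatd --since "-1d"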
 
What about the status of "pvestatd"? It's the service responsible for broadcasting the node and guest status in the cluster.
We only checked pvecm, and it showed full functionality. We had to reboot every node to re-create the cluster, and it is working now.
The logs show that pvestatd was working just fine until 12:32:02 and then probably stopped. It was working again after the reboot:

sty 27 09:09:35 CloudProx1 pvestatd[1956]: status update time (70.256 seconds)
sty 27 09:15:45 CloudProx1 pvestatd[1956]: status update time (130.506 seconds)
sty 27 09:17:56 CloudProx1 pvestatd[1956]: status update time (130.440 seconds)
sty 27 09:20:06 CloudProx1 pvestatd[1956]: status update time (130.384 seconds)
sty 27 09:22:17 CloudProx1 pvestatd[1956]: status update time (130.408 seconds)
sty 27 09:24:27 CloudProx1 pvestatd[1956]: status update time (130.443 seconds)
sty 27 09:26:48 CloudProx1 pvestatd[1956]: status update time (140.462 seconds)
sty 27 09:28:58 CloudProx1 pvestatd[1956]: status update time (130.367 seconds)
sty 27 09:31:08 CloudProx1 pvestatd[1956]: status update time (130.374 seconds)
sty 27 09:33:19 CloudProx1 pvestatd[1956]: status update time (130.466 seconds)
sty 27 09:35:29 CloudProx1 pvestatd[1956]: status update time (130.466 seconds)
sty 27 09:37:40 CloudProx1 pvestatd[1956]: status update time (130.380 seconds)
sty 27 09:39:50 CloudProx1 pvestatd[1956]: status update time (130.435 seconds)
sty 27 09:42:10 CloudProx1 pvestatd[1956]: status update time (140.196 seconds)
sty 27 09:44:21 CloudProx1 pvestatd[1956]: status update time (130.429 seconds)
sty 27 09:46:31 CloudProx1 pvestatd[1956]: status update time (130.413 seconds)
sty 27 09:48:42 CloudProx1 pvestatd[1956]: status update time (130.470 seconds)
sty 27 09:50:52 CloudProx1 pvestatd[1956]: status update time (130.473 seconds)
sty 27 09:53:02 CloudProx1 pvestatd[1956]: status update time (130.394 seconds)
sty 27 09:55:13 CloudProx1 pvestatd[1956]: status update time (130.402 seconds)
sty 27 09:57:33 CloudProx1 pvestatd[1956]: status update time (140.181 seconds)
sty 27 09:59:43 CloudProx1 pvestatd[1956]: status update time (130.466 seconds)
sty 27 10:01:54 CloudProx1 pvestatd[1956]: status update time (130.397 seconds)
sty 27 10:04:04 CloudProx1 pvestatd[1956]: status update time (130.373 seconds)
sty 27 10:06:15 CloudProx1 pvestatd[1956]: status update time (130.508 seconds)
sty 27 10:08:25 CloudProx1 pvestatd[1956]: status update time (130.411 seconds)
sty 27 10:10:36 CloudProx1 pvestatd[1956]: status update time (130.385 seconds)
sty 27 10:13:06 CloudProx1 pvestatd[1956]: status update time (150.184 seconds)
sty 27 10:15:16 CloudProx1 pvestatd[1956]: status update time (130.401 seconds)
sty 27 10:17:27 CloudProx1 pvestatd[1956]: status update time (130.424 seconds)
sty 27 10:19:37 CloudProx1 pvestatd[1956]: status update time (130.390 seconds)
sty 27 10:21:48 CloudProx1 pvestatd[1956]: status update time (130.537 seconds)
sty 27 10:23:58 CloudProx1 pvestatd[1956]: status update time (130.532 seconds)
sty 27 10:26:08 CloudProx1 pvestatd[1956]: status update time (130.394 seconds)
sty 27 10:28:29 CloudProx1 pvestatd[1956]: status update time (140.381 seconds)
sty 27 10:30:39 CloudProx1 pvestatd[1956]: status update time (130.386 seconds)
sty 27 10:32:50 CloudProx1 pvestatd[1956]: status update time (130.412 seconds)
sty 27 10:35:00 CloudProx1 pvestatd[1956]: status update time (130.387 seconds)
sty 27 10:37:10 CloudProx1 pvestatd[1956]: status update time (130.375 seconds)
sty 27 10:39:21 CloudProx1 pvestatd[1956]: status update time (130.376 seconds)
sty 27 10:41:31 CloudProx1 pvestatd[1956]: status update time (130.370 seconds)
sty 27 10:43:52 CloudProx1 pvestatd[1956]: status update time (140.505 seconds)
sty 27 10:46:02 CloudProx1 pvestatd[1956]: status update time (130.435 seconds)
sty 27 10:48:12 CloudProx1 pvestatd[1956]: status update time (130.429 seconds)
sty 27 10:49:22 CloudProx1 pvestatd[1956]: status update time (69.120 seconds)
sty 27 10:52:12 CloudProx1 pvestatd[1956]: status update time (30.283 seconds)
sty 27 10:52:27 CloudProx1 pvestatd[1956]: status update time (14.998 seconds)
sty 27 10:53:37 CloudProx1 pvestatd[1956]: status update time (9.662 seconds)
sty 27 10:57:43 CloudProx1 pvestatd[1956]: status update time (6.202 seconds)
sty 27 11:35:46 CloudProx1 pvestatd[1956]: status update time (18.539 seconds)
sty 27 11:48:24 CloudProx1 pvestatd[1956]: status update time (37.995 seconds)
sty 27 11:51:14 CloudProx1 pvestatd[1956]: status update time (9.070 seconds)
sty 27 11:53:22 CloudProx1 pvestatd[1956]: status update time (7.541 seconds)
sty 27 11:55:38 CloudProx1 pvestatd[1956]: status update time (24.077 seconds)
sty 27 11:59:42 CloudProx1 pvestatd[1956]: status update time (23.791 seconds)
sty 27 12:24:23 CloudProx1 pvestatd[1956]: status update time (120.428 seconds)
sty 27 12:26:33 CloudProx1 pvestatd[1956]: status update time (130.474 seconds)
sty 27 12:28:43 CloudProx1 pvestatd[1956]: status update time (130.478 seconds)
sty 27 12:30:54 CloudProx1 pvestatd[1956]: status update time (130.511 seconds)
sty 27 12:32:02 CloudProx1 pvestatd[1956]: status update time (67.810 seconds)
-- Boot 82c53094db794c09909309e5b2d7095c --
sty 28 19:33:21 CloudProx1 systemd[1]: Starting pvestatd.service - PVE Status Daemon...
sty 28 19:33:25 CloudProx1 pvestatd[2049]: starting server
sty 28 19:33:25 CloudProx1 systemd[1]: Started pvestatd.service - PVE Status Daemon.
 
Those update times are pretty high though. Could you check the system logs around them to see what's going on? The logs at/after 12:32 would probably also be interesting.
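Something along these lines should do, with the timestamps taken from your log above (adjust the window as needed):

Code:
journalctl --since "2025-01-27 09:00" --until "2025-01-27 12:45" -p warning
journalctl --since "2025-01-27 12:25" --until "2025-01-27 12:45"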
 
Hello, this looks like a classic resource-starvation case. To avoid it, you should follow the best practices:
- the admin network, the data network (iSCSI, NFS, backups) and the VM network should be physically separated
- the OS disk should not be the same as the VM disks
- you should never reach the point of needing swap
- in production, buy a subscription; this brings you official support and helps the project last and stay of good quality ;)
If you need professional assistance, I would be pleased to help you.
 
I've had similar behavior when using NFS storage, either for VMs/CTs or for backups, when NFS didn't work properly on some node(s): pvestatd had to wait a long time for the storage to reply or time out, making the web UI show question marks. I would check that all storages, especially NFS, are working fine on all nodes and that no stale mount is hanging around.
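A quick way to spot a hung mount (harmless to run even if it turns out you have no network storage; <storage> is just a placeholder for one of your storage IDs):

Code:
findmnt -t nfs,nfs4,cifs
timeout 5 ls /mnt/pve/<storage>    # a hang or timeout here points at a stale mount
pvesm status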
 
Longer log:

sty 14 21:14:35 CloudProx1 pvestatd[1956]: status update time (5.269 seconds)
sty 14 21:43:07 CloudProx1 pvestatd[1956]: status update time (7.490 seconds)
sty 15 13:16:53 CloudProx1 pvestatd[1956]: auth key new enough, skipping rotation
sty 15 21:02:08 CloudProx1 pvestatd[1956]: status update time (6.606 seconds)
sty 15 21:07:36 CloudProx1 pvestatd[1956]: status update time (5.124 seconds)
sty 15 21:08:16 CloudProx1 pvestatd[1956]: status update time (5.776 seconds)
sty 15 21:11:38 CloudProx1 pvestatd[1956]: status update time (6.741 seconds)
sty 16 06:46:36 CloudProx1 pvestatd[1956]: status update time (5.296 seconds)
sty 16 13:16:55 CloudProx1 pvestatd[1956]: auth key new enough, skipping rotation
sty 17 04:58:47 CloudProx1 pvestatd[1956]: status update time (5.368 seconds)
sty 17 07:00:38 CloudProx1 pvestatd[1956]: status update time (6.288 seconds)
sty 18 21:00:17 CloudProx1 pvestatd[1956]: status update time (6.066 seconds)
sty 18 21:14:29 CloudProx1 pvestatd[1956]: status update time (7.848 seconds)
sty 19 06:31:07 CloudProx1 pvestatd[1956]: status update time (5.437 seconds)
sty 19 06:41:48 CloudProx1 pvestatd[1956]: status update time (5.320 seconds)
sty 19 13:16:56 CloudProx1 pvestatd[1956]: auth key new enough, skipping rotation
sty 19 21:07:50 CloudProx1 pvestatd[1956]: CairoProxBackup2: error fetching datastores - 500 Can't connect to cloudproxbackup2.***.**:8007 (Connection timed out)
sty 19 21:07:57 CloudProx1 pvestatd[1956]: status update time (13.635 seconds)
sty 20 04:58:12 CloudProx1 pvestatd[1956]: status update time (5.494 seconds)
sty 20 21:04:55 CloudProx1 pvestatd[1956]: status update time (7.986 seconds)
sty 20 21:06:42 CloudProx1 pvestatd[1956]: status update time (5.263 seconds)
sty 20 21:09:44 CloudProx1 pvestatd[1956]: status update time (7.306 seconds)
sty 21 05:34:12 CloudProx1 pvestatd[1956]: status update time (5.371 seconds)
sty 21 13:17:00 CloudProx1 pvestatd[1956]: auth key new enough, skipping rotation
sty 22 06:40:45 CloudProx1 pvestatd[1956]: status update time (5.311 seconds)
sty 22 21:04:50 CloudProx1 pvestatd[1956]: status update time (8.972 seconds)
sty 22 21:12:01 CloudProx1 pvestatd[1956]: status update time (9.857 seconds)
sty 23 06:41:17 CloudProx1 pvestatd[1956]: status update time (6.355 seconds)
sty 23 11:17:49 CloudProx1 pvestatd[1956]: CloudProxBackup2: error fetching datastores - 500 Can't connect to cloudproxbackup2.***.**:8007 (Connection timed out)
sty 23 11:17:49 CloudProx1 pvestatd[1956]: status update time (7.194 seconds)
sty 23 11:17:59 CloudProx1 pvestatd[1956]: CloudProxBackup2: error fetching datastores - 500 Can't connect to cloudproxbackup2.***.**:8007 (Connection timed out)
sty 23 11:17:59 CloudProx1 pvestatd[1956]: status update time (7.191 seconds)
sty 23 11:18:22 CloudProx1 pvestatd[1956]: proxmox-backup-client failed: error connecting to https://cloudproxbackup2.***.**:8007/ - tcp connect error: deadline has elapsed
sty 23 11:18:22 CloudProx1 pvestatd[1956]: status update time (10.273 seconds)
sty 23 11:19:11 CloudProx1 pvestatd[1956]: status update time (19.053 seconds)
sty 23 11:20:12 CloudProx1 pvestatd[1956]: status update time (41.284 seconds)
sty 23 11:26:53 CloudProx1 pvestatd[1956]: status update time (111.033 seconds)
sty 23 11:32:54 CloudProx1 pvestatd[1956]: status update time (130.395 seconds)
sty 23 11:34:22 CloudProx1 pvestatd[1956]: status update time (88.753 seconds)
sty 23 11:37:42 CloudProx1 pvestatd[1956]: status update time (40.347 seconds)
sty 23 11:39:53 CloudProx1 pvestatd[1956]: status update time (130.383 seconds)
sty 23 11:42:03 CloudProx1 pvestatd[1956]: status update time (130.400 seconds)
sty 23 11:44:14 CloudProx1 pvestatd[1956]: status update time (130.462 seconds)
sty 23 11:46:24 CloudProx1 pvestatd[1956]: status update time (130.365 seconds)
sty 23 11:48:34 CloudProx1 pvestatd[1956]: status update time (130.460 seconds)
sty 23 11:49:13 CloudProx1 pvestatd[1956]: status update time (38.633 seconds)
sty 23 11:53:12 CloudProx1 pvestatd[1956]: status update time (38.681 seconds)
sty 23 11:55:24 CloudProx1 pvestatd[1956]: status update time (12.598 seconds)
sty 23 11:59:33 CloudProx1 pvestatd[1956]: status update time (38.812 seconds)
sty 23 12:01:01 CloudProx1 pvestatd[1956]: status update time (8.595 seconds)
sty 23 12:02:14 CloudProx1 pvestatd[1956]: status update time (40.910 seconds)
sty 23 15:13:04 CloudProx1 pvestatd[1956]: proxmox-backup-client failed: error connecting to https://cloudproxbackup2.***.**:8007/ - tcp connect error: deadline has elapsed
sty 23 15:13:05 CloudProx1 pvestatd[1956]: status update time (11.299 seconds)
sty 23 15:13:25 CloudProx1 pvestatd[1956]: proxmox-backup-client failed: error connecting to https://cloudproxbackup2.***.**:8007/ - tcp connect error: deadline has elapsed
sty 23 15:13:25 CloudProx1 pvestatd[1956]: status update time (10.256 seconds)
sty 23 21:12:16 CloudProx1 pvestatd[1956]: status update time (10.719 seconds)
sty 23 21:15:11 CloudProx1 pvestatd[1956]: status update time (5.371 seconds)
sty 24 02:15:22 CloudProx1 pvestatd[1956]: status update time (15.900 seconds)
sty 24 21:00:44 CloudProx1 pvestatd[1956]: status update time (10.303 seconds)
sty 24 21:09:30 CloudProx1 pvestatd[1956]: status update time (5.675 seconds)
sty 25 13:17:10 CloudProx1 pvestatd[1956]: auth key new enough, skipping rotation
sty 25 13:17:10 CloudProx1 pvestatd[1956]: status update time (5.370 seconds)
sty 25 21:00:59 CloudProx1 pvestatd[1956]: status update time (5.389 seconds)
sty 25 21:10:00 CloudProx1 pvestatd[1956]: status update time (5.438 seconds)
sty 25 22:39:15 CloudProx1 pvestatd[1956]: status update time (10.932 seconds)
sty 26 13:17:05 CloudProx1 pvestatd[1956]: auth key pair too old, rotating..
sty 26 21:00:32 CloudProx1 pvestatd[1956]: status update time (7.108 seconds)
sty 26 21:08:30 CloudProx1 pvestatd[1956]: status update time (5.200 seconds)
sty 26 21:11:30 CloudProx1 pvestatd[1956]: status update time (5.335 seconds)
sty 26 21:12:21 CloudProx1 pvestatd[1956]: status update time (5.861 seconds)
sty 27 09:09:35 CloudProx1 pvestatd[1956]: status update time (70.256 seconds)
sty 27 09:15:45 CloudProx1 pvestatd[1956]: status update time (130.506 seconds)
sty 27 09:17:56 CloudProx1 pvestatd[1956]: status update time (130.440 seconds)
sty 27 09:20:06 CloudProx1 pvestatd[1956]: status update time (130.384 seconds)
sty 27 09:22:17 CloudProx1 pvestatd[1956]: status update time (130.408 seconds)
sty 27 09:24:27 CloudProx1 pvestatd[1956]: status update time (130.443 seconds)
sty 27 09:26:48 CloudProx1 pvestatd[1956]: status update time (140.462 seconds)
sty 27 09:28:58 CloudProx1 pvestatd[1956]: status update time (130.367 seconds)
sty 27 09:31:08 CloudProx1 pvestatd[1956]: status update time (130.374 seconds)
sty 27 09:33:19 CloudProx1 pvestatd[1956]: status update time (130.466 seconds)
sty 27 09:35:29 CloudProx1 pvestatd[1956]: status update time (130.466 seconds)
sty 27 09:37:40 CloudProx1 pvestatd[1956]: status update time (130.380 seconds)
sty 27 09:39:50 CloudProx1 pvestatd[1956]: status update time (130.435 seconds)
sty 27 09:42:10 CloudProx1 pvestatd[1956]: status update time (140.196 seconds)
sty 27 09:44:21 CloudProx1 pvestatd[1956]: status update time (130.429 seconds)
sty 27 09:46:31 CloudProx1 pvestatd[1956]: status update time (130.413 seconds)
sty 27 09:48:42 CloudProx1 pvestatd[1956]: status update time (130.470 seconds)
sty 27 09:50:52 CloudProx1 pvestatd[1956]: status update time (130.473 seconds)
sty 27 09:53:02 CloudProx1 pvestatd[1956]: status update time (130.394 seconds)
sty 27 09:55:13 CloudProx1 pvestatd[1956]: status update time (130.402 seconds)
sty 27 09:57:33 CloudProx1 pvestatd[1956]: status update time (140.181 seconds)
sty 27 09:59:43 CloudProx1 pvestatd[1956]: status update time (130.466 seconds)
sty 27 10:01:54 CloudProx1 pvestatd[1956]: status update time (130.397 seconds)
sty 27 10:04:04 CloudProx1 pvestatd[1956]: status update time (130.373 seconds)
sty 27 10:06:15 CloudProx1 pvestatd[1956]: status update time (130.508 seconds)
sty 27 10:08:25 CloudProx1 pvestatd[1956]: status update time (130.411 seconds)
sty 27 10:10:36 CloudProx1 pvestatd[1956]: status update time (130.385 seconds)
sty 27 10:13:06 CloudProx1 pvestatd[1956]: status update time (150.184 seconds)
sty 27 10:15:16 CloudProx1 pvestatd[1956]: status update time (130.401 seconds)
sty 27 10:17:27 CloudProx1 pvestatd[1956]: status update time (130.424 seconds)
sty 27 10:19:37 CloudProx1 pvestatd[1956]: status update time (130.390 seconds)
sty 27 10:21:48 CloudProx1 pvestatd[1956]: status update time (130.537 seconds)
sty 27 10:23:58 CloudProx1 pvestatd[1956]: status update time (130.532 seconds)
sty 27 10:26:08 CloudProx1 pvestatd[1956]: status update time (130.394 seconds)
sty 27 10:28:29 CloudProx1 pvestatd[1956]: status update time (140.381 seconds)
sty 27 10:30:39 CloudProx1 pvestatd[1956]: status update time (130.386 seconds)
sty 27 10:32:50 CloudProx1 pvestatd[1956]: status update time (130.412 seconds)
sty 27 10:35:00 CloudProx1 pvestatd[1956]: status update time (130.387 seconds)
sty 27 10:37:10 CloudProx1 pvestatd[1956]: status update time (130.375 seconds)
sty 27 10:39:21 CloudProx1 pvestatd[1956]: status update time (130.376 seconds)
sty 27 10:41:31 CloudProx1 pvestatd[1956]: status update time (130.370 seconds)
sty 27 10:43:52 CloudProx1 pvestatd[1956]: status update time (140.505 seconds)
sty 27 10:46:02 CloudProx1 pvestatd[1956]: status update time (130.435 seconds)
sty 27 10:48:12 CloudProx1 pvestatd[1956]: status update time (130.429 seconds)
sty 27 10:49:22 CloudProx1 pvestatd[1956]: status update time (69.120 seconds)
sty 27 10:52:12 CloudProx1 pvestatd[1956]: status update time (30.283 seconds)
sty 27 10:52:27 CloudProx1 pvestatd[1956]: status update time (14.998 seconds)
sty 27 10:53:37 CloudProx1 pvestatd[1956]: status update time (9.662 seconds)
sty 27 10:57:43 CloudProx1 pvestatd[1956]: status update time (6.202 seconds)
sty 27 11:35:46 CloudProx1 pvestatd[1956]: status update time (18.539 seconds)
sty 27 11:48:24 CloudProx1 pvestatd[1956]: status update time (37.995 seconds)
sty 27 11:51:14 CloudProx1 pvestatd[1956]: status update time (9.070 seconds)
sty 27 11:53:22 CloudProx1 pvestatd[1956]: status update time (7.541 seconds)
sty 27 11:55:38 CloudProx1 pvestatd[1956]: status update time (24.077 seconds)
sty 27 11:59:42 CloudProx1 pvestatd[1956]: status update time (23.791 seconds)
sty 27 12:24:23 CloudProx1 pvestatd[1956]: status update time (120.428 seconds)
sty 27 12:26:33 CloudProx1 pvestatd[1956]: status update time (130.474 seconds)
sty 27 12:28:43 CloudProx1 pvestatd[1956]: status update time (130.478 seconds)
sty 27 12:30:54 CloudProx1 pvestatd[1956]: status update time (130.511 seconds)
sty 27 12:32:02 CloudProx1 pvestatd[1956]: status update time (67.810 seconds)
-- Boot 82c53094db794c09909309e5b2d7095c --
sty 28 19:33:21 CloudProx1 systemd[1]: Starting pvestatd.service - PVE Status Daemon...
sty 28 19:33:25 CloudProx1 pvestatd[2049]: starting server
sty 28 19:33:25 CloudProx1 systemd[1]: Started pvestatd.service - PVE Status Daemon.
sty 28 19:49:43 CloudProx1 pvestatd[2049]: authkey rotation error: cfs-lock 'authkey' error: no quorum!
sty 28 19:49:43 CloudProx1 pvestatd[2049]: status update time (958.153 seconds)
sty 28 19:54:08 CloudProx1 systemd[1]: Stopping pvestatd.service - PVE Status Daemon...
sty 28 19:54:08 CloudProx1 pvestatd[2049]: received signal TERM
sty 28 19:54:08 CloudProx1 pvestatd[2049]: server closing
sty 28 19:54:08 CloudProx1 pvestatd[2049]: server stopped
sty 28 19:54:09 CloudProx1 systemd[1]: pvestatd.service: Deactivated successfully.
sty 28 19:54:09 CloudProx1 systemd[1]: Stopped pvestatd.service - PVE Status Daemon.
sty 28 19:54:09 CloudProx1 systemd[1]: pvestatd.service: Consumed 6.487s CPU time.
-- Boot e104b46b118b4f6cabc84de6c4956ca3 --
sty 28 19:55:13 CloudProx1 systemd[1]: Starting pvestatd.service - PVE Status Daemon...
sty 28 19:55:17 CloudProx1 pvestatd[2032]: starting server
sty 28 19:55:17 CloudProx1 systemd[1]: Started pvestatd.service - PVE Status Daemon.
sty 28 22:00:24 CloudProx1 pvestatd[2032]: status update time (7.015 seconds)
sty 28 22:07:13 CloudProx1 pvestatd[2032]: status update time (5.463 seconds)
sty 28 22:26:12 CloudProx1 pvestatd[2032]: status update time (5.216 seconds)
sty 28 22:37:53 CloudProx1 pvestatd[2032]: status update time (5.695 seconds)
sty 28 22:41:13 CloudProx1 pvestatd[2032]: status update time (5.918 seconds)
sty 28 22:48:52 CloudProx1 pvestatd[2032]: status update time (5.001 seconds)
sty 28 22:51:56 CloudProx1 pvestatd[2032]: status update time (8.762 seconds)
sty 28 22:54:33 CloudProx1 pvestatd[2032]: status update time (5.484 seconds)
sty 28 23:07:33 CloudProx1 pvestatd[2032]: status update time (5.765 seconds)
sty 28 23:11:23 CloudProx1 pvestatd[2032]: status update time (5.922 seconds)
sty 28 23:11:34 CloudProx1 pvestatd[2032]: status update time (6.714 seconds)
sty 28 23:19:03 CloudProx1 pvestatd[2032]: status update time (6.088 seconds)
sty 28 23:25:42 CloudProx1 pvestatd[2032]: status update time (5.736 seconds)
sty 28 23:26:03 CloudProx1 pvestatd[2032]: status update time (6.059 seconds)
sty 28 23:26:55 CloudProx1 pvestatd[2032]: status update time (8.586 seconds)
sty 28 23:27:56 CloudProx1 pvestatd[2032]: status update time (9.417 seconds)
sty 28 23:41:33 CloudProx1 pvestatd[2032]: status update time (5.873 seconds)
sty 28 23:44:33 CloudProx1 pvestatd[2032]: status update time (6.000 seconds)
sty 28 23:45:33 CloudProx1 pvestatd[2032]: status update time (5.541 seconds)
sty 28 23:48:33 CloudProx1 pvestatd[2032]: status update time (5.985 seconds)
sty 29 00:01:04 CloudProx1 pvestatd[2032]: status update time (7.485 seconds)
sty 29 00:01:23 CloudProx1 pvestatd[2032]: status update time (5.472 seconds)
sty 29 00:07:19 CloudProx1 pvestatd[2032]: status update time (12.661 seconds)
sty 29 00:08:25 CloudProx1 pvestatd[2032]: status update time (5.623 seconds)
sty 29 00:11:27 CloudProx1 pvestatd[2032]: status update time (8.013 seconds)
sty 29 00:15:04 CloudProx1 pvestatd[2032]: status update time (5.327 seconds)
sty 29 04:02:01 CloudProx1 pvestatd[2032]: status update time (11.919 seconds)
sty 29 21:00:54 CloudProx1 pvestatd[2032]: status update time (30.848 seconds)
sty 29 21:01:01 CloudProx1 pvestatd[2032]: status update time (6.929 seconds)
sty 29 21:02:05 CloudProx1 pvestatd[2032]: status update time (9.087 seconds)
sty 29 21:05:22 CloudProx1 pvestatd[2032]: CloudProxBackup2: error fetching datastores - 500 Can't connect to cloudproxbackup2.***.**:8007
sty 29 21:05:29 CloudProx1 pvestatd[2032]: CairoProxBackup2: error fetching datastores - 500 Can't connect to cloudproxbackup2.***.**:8007
sty 29 21:05:29 CloudProx1 pvestatd[2032]: status update time (14.115 seconds)
sty 29 21:05:35 CloudProx1 pvestatd[2032]: status update time (5.451 seconds)
sty 29 21:06:16 CloudProx1 pvestatd[2032]: CloudProxBackup2: error fetching datastores - 500 Can't connect to cloudproxbackup2.***.**:8007
sty 29 21:06:19 CloudProx1 pvestatd[2032]: status update time (10.377 seconds)
sty 29 21:15:46 CloudProx1 pvestatd[2032]: status update time (7.430 seconds)
sty 29 21:51:59 CloudProx1 pvestatd[2032]: auth key new enough, skipping rotation
sty 29 21:51:59 CloudProx1 pvestatd[2032]: status update time (9.196 seconds)
sty 30 21:08:16 CloudProx1 pvestatd[2032]: status update time (5.191 seconds)
 
Hello, this looks like a classic resource-starvation case. To avoid it, you should follow the best practices:
- the admin network, the data network (iSCSI, NFS, backups) and the VM network should be physically separated
- the OS disk should not be the same as the VM disks
- you should never reach the point of needing swap
- in production, buy a subscription; this brings you official support and helps the project last and stay of good quality ;)
If you need professional assistance, I would be pleased to help you.
- The machines are on separate bare-metal servers with different public IP addresses.
- The OS is on a different partition than the LXCs. Backups are on another machine too (PBS).
- Why so? In that situation, when RAM is exhausted, the OOM killer will start killing processes.
- How is a subscription going to help me in similar situations? Isn't it only access to the paid repositories? Would I have another support channel besides this forum?
 
I've had similar behavior when using NFS storage, either for VMs/CTs or for backups, when NFS didn't work properly on some node(s): pvestatd had to wait a long time for the storage to reply or time out, making the web UI show question marks. I would check that all storages, especially NFS, are working fine on all nodes and that no stale mount is hanging around.
We don't use NFS on this cluster. Only local filesystems:

root@CloudProx1:~# pvesm status
Name                Type      Status      Total          Used           Available      %
B2B-A-PBS-local     pbs       disabled    0              0              0              N/A
CairoProxBackup2    pbs       active      8879052672     2786884992     6092167680     31.39%
CloudProxBackup2    pbs       active      13323297280    7231129600     6092167680     54.27%
local               dir       active      1898368768     1013925912     787937208      53.41%
thin                lvmthin   disabled    0              0              0              N/A
thin-hdd            lvmthin   disabled    0              0              0              N/A
thin-ssd            lvmthin   disabled    0              0              0              N/A
thin2-ssd           lvmthin   disabled    0              0              0              N/A
 
- The machines are on separate bare-metal servers with different public IP addresses.
- The OS is on a different partition than the LXCs. Backups are on another machine too (PBS).
- Why so? In that situation, when RAM is exhausted, the OOM killer will start killing processes.
- How is a subscription going to help me in similar situations? Isn't it only access to the paid repositories? Would I have another support channel besides this forum?
Hello Cairo.
- so all networks go through only one physical link?
- so the OS is on the same disk as the LXCs
- because swapping is slow and slows down the whole system (I set swappiness to 0 or 1 and monitor swap usage in big clusters; a minimal sketch follows below)
- look at the details here, there are several levels of subscription, some with official assistance: https://proxmox.com/en/products/proxmox-virtual-environment/pricing
If you need professional service on your infrastructure or subscriptions, I am an official reseller (Liberasys).
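For the swappiness point, a minimal sketch of what I mean (assuming you want the same value on every node; distribute it with whatever config management you use):

Code:
echo 'vm.swappiness = 1' > /etc/sysctl.d/99-swappiness.conf
sysctl -p /etc/sysctl.d/99-swappiness.conf
cat /proc/sys/vm/swappiness    # verify the running value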
 
Hello Cairo.
- so all networks go through only one physical link?
- so the OS is on the same disk as the LXCs
- because swapping is slow and slows down the whole system (I set swappiness to 0 or 1 and monitor swap usage in big clusters)
- look at the details here, there are several levels of subscription, some with official assistance: https://proxmox.com/en/products/proxmox-virtual-environment/pricing
If you need professional service on your infrastructure or subscriptions, I am an official reseller (Liberasys).
- Yes, we have only one physical interface available on each server.
- Yes, it is on the same disk.
- Will try.
- The Standard option is 530 EUR per CPU socket per year. That would be 14,310 EUR per year for the whole cluster. That's a lot of money for only 10 support tickets per year.
 
- Yes, we have only one physical interface available on each server.
- Yes, it is on the same disk.
- Will try.
- The Standard option is 530 EUR per CPU socket per year. That would be 14,310 EUR per year for the whole cluster. That's a lot of money for only 10 support tickets per year.
- so not following good practices
- so not following good practices
-
- I mostly sell the Basic subscription. You do not need a lot of tickets; Proxmox Virtual Environment is very stable, but you do need good Linux admin skills.
 
- so not following good practices
- so not following good practices
-
- I mostly sell the Basic subscription. You do not need a lot of tickets; Proxmox Virtual Environment is very stable, but you do need good Linux admin skills.
So what should we do now? Creating 27 new servers and migrating the VMs and LXCs to them would take too much time.
And can you assure us that separating the system onto other disks and using two external links would fix our issue?
 
So what should we do now? Creating 27 new servers and migrating the VMs and LXCs to them would take too much time.
And can you assure us that separating the system onto other disks and using two external links would fix our issue?
I do not have enough information to help you make a decision, and we are in a community forum where I spend my time for free to help people. I won't guarantee anything, and I think you can understand that.
I've only checked the recommendations (see https://www.proxmox.com/en/products/proxmox-virtual-environment/requirements). This may sound harsh, but if you don't meet the prerequisites on a production cluster, you should expect to have problems...
Did you look at the server metrics, like network bandwidth and CPU IO delay, during the incidents?
You mention https://bugzilla.proxmox.com/show_bug.cgi?id=5868, but that bug is about PBS VMs, not the whole cluster. Why do you think it is correlated?
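For example, if sysstat is installed and collecting, something like this shows CPU, load and network around the incident (saDD is the per-day history file, a placeholder here):

Code:
sar -u -q -f /var/log/sysstat/saDD     # CPU usage and run-queue/load for that day
sar -n DEV -f /var/log/sysstat/saDD    # per-interface network throughput
iostat -x 5 3                          # current per-device utilisation and IO wait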