Super slow, timeout, and VM stuck while backing up, after updated to PVE 9.1.1 and PBS 4.0.20

Heracleos

Hello everyone,
Since yesterday, after all the updates to PVE 9.1.1 and PBS 4.0.20 (all with no-subscription repository), the backup has become super slow.
Obviously, after the updates, I restarted both nodes and the VMs.

I found my daily backup job still running after hours.
Some VMs were even frozen. I had to kill all backup processes, both on the PVE nodes and on the PBS.
Once killed, I also had to unlock the VMs that were being backed up and force them to stop.
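For reference, the per-VM cleanup described above can be sketched roughly like this (103 is an example VMID; this assumes the standard `qm` CLI and that the stuck worker is a `vzdump` process):

```shell
# Find the stuck vzdump worker (or abort the task from the GUI task list)
ps aux | grep [v]zdump

# Kill it, then clear the leftover backup lock and force the frozen VM off
kill <PID-of-stuck-vzdump>
qm unlock 103
qm stop 103 --skiplock
```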

Then, once normal operation was restored, I launched a manual backup, but for some VMs the backup is still very slow, and in some cases it is even aborted due to a timeout.
For example:

Code:
INFO: starting new backup job: vzdump 103 --storage BackupRepository --notes-template '{{guestname}}' --node pve03 --mode snapshot --notification-mode notification-system --remove 0
INFO: Starting Backup of VM 103 (qemu)
INFO: Backup started at 2025-11-21 07:09:48
INFO: status = running
INFO: VM Name: HomeSrvWin01
INFO: include disk 'scsi0' 'CephRBD:vm-103-disk-0' 90G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating Proxmox Backup Server archive 'vm/103/2025-11-21T13:09:48Z'
INFO: issuing guest-agent 'fs-freeze' command
INFO: issuing guest-agent 'fs-thaw' command
INFO: started backup task '798f2410-2ddf-4552-81a9-df71b384288f'
INFO: resuming VM again
INFO: scsi0: dirty-bitmap status: created new
INFO:   0% (700.0 MiB of 90.0 GiB) in 3s, read: 233.3 MiB/s, write: 141.3 MiB/s
INFO:   1% (1.0 GiB of 90.0 GiB) in 6s, read: 112.0 MiB/s, write: 30.7 MiB/s
INFO:   2% (1.8 GiB of 90.0 GiB) in 13s, read: 115.4 MiB/s, write: 29.1 MiB/s
INFO:   3% (2.8 GiB of 90.0 GiB) in 20s, read: 144.0 MiB/s, write: 17.1 MiB/s
INFO:   4% (3.6 GiB of 90.0 GiB) in 54s, read: 25.4 MiB/s, write: 9.5 MiB/s
ERROR: VM 103 qmp command 'query-backup' failed - got timeout
INFO: aborting backup job
ERROR: VM 103 qmp command 'backup-cancel' failed - unable to connect to VM 103 qmp socket - timeout after 5973 retries
INFO: resuming VM again
ERROR: Backup of VM 103 failed - VM 103 qmp command 'cont' failed - unable to connect to VM 103 qmp socket - timeout after 449 retries
INFO: Failed at 2025-11-21 07:35:47
INFO: Backup job finished with errors
INFO: notified via target `HeracleosMailServer`
TASK ERROR: job errors

When this happens, the VM is completely frozen.
The VM console is inaccessible, and from the summary panel even the QEMU guest agent does not appear to be active.
The only way to unblock it is to force a stop.

All VM vdisks are on a Ceph RBD pool spanning the 3 PVE nodes.
PBS runs as a VM, but its disks (including the backup repository) are qcow2 files on an NFS share on a Synology NAS.

This is the first time this has happened since I installed the cluster months ago.
What steps can I take to diagnose the problem?
Thank you for your suggestions.
 
Please try to generate and post a backtrace the next time this happens by following the steps described here: https://forum.proxmox.com/threads/proxmox-virtual-environment-9-1-available.176255/post-818210

Also, make sure the datastore's backing storage is online and reachable while this happens. Most likely, though, the VM is IO-starved because the backup to the PBS cannot make progress (which you can avoid by using a fleecing image). Please also check and post the corresponding backup task log, which can be found on the PBS.
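As a sketch, fleecing can be enabled either per invocation or globally; the fleecing storage name below is an example and should point to fast local storage:

```shell
# One-off manual backup with a fleecing image on local storage
vzdump 103 --storage BackupRepository --mode snapshot \
    --fleecing enabled=1,storage=local-lvm

# Or enable it for all jobs via /etc/vzdump.conf:
# fleecing: enabled=1,storage=local-lvm
```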
 
I can confirm this behavior. The problem started for us on November 6. We are using a Ceph 3/2 cluster on enterprise hardware, and PBS has been running continuously for two years without any problems. The problem does not always occur on the same VM or host, but jumps around sporadically. Neither the network nor the NVMes are at full load, as there is a 40Gb link per host and several Kioxias. PBS is always reachable; there is no network timeout.

Attached are the logs, I hope it helps.
 

Attachments

Hi,
The problem started for us on November 6.
could you check if/what upgrades were made just before that, both on PVE and PBS side? You can check /var/log/apt/history.log and its rotations like /var/log/apt/history.log.1.gz.
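For example, something like this should list the recent transactions from both the current and the rotated log in one go:

```shell
# Show start date and command line of recent apt transactions
# (zgrep handles both the plain log and the .gz rotation)
zgrep -h -A1 '^Start-Date:' /var/log/apt/history.log /var/log/apt/history.log.1.gz
```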

Is the VM stuck at full CPU for you too? If yes, what do you get when you run timeout 10 strace -c -p $(cat /run/qemu-server/123.pid), replacing 123 with the correct ID, while the backup is slow or while it is stuck (both might be interesting)?
 
Same issue on my test cluster and on another quick one I threw together to confirm. The issue persists.


Note: there are zero hardware issues and no bandwidth limits between nodes. We had no issues prior to 9.1, so whatever it is appeared in this version.

A few more observations:

All PVE nodes are affected; it does not matter which one. 20Gb LAGG between everything, different NICs, VLANs, 9000 MTU, X710 NICs. I/O delay hovers around 0.05% to 0.10% during load.

With fleecing enabled, the VMs no longer crash and hard-lock, which is good, but the backups just stop and never complete. I had one stuck at over 31% for 8+ hours: no movement, no increase in network traffic (it showed pretty much idle). Rerunning the backup completed fine (again, random).

Without fleecing, the CPU and RAM on the VM would peak (sometimes both, sometimes only memory, and sometimes only CPU), which would require a SIGTERM to kill it and bring it back.

On the PBS side, CPU usage remains low; however, system memory will spike to over 80%, which is new as of the 4.x version. The system has 256GB of RAM, so there is typically plenty of runway. In htop, disk writes are well within tolerance, and when the VM stops backing up, the traffic/IO drops accordingly.
 
Hello,

we are experiencing the same problems using LVM over iSCSI storage and PBS backups. AFAICT this happened after updating on 2025-11-18:

Code:
Start-Date: 2025-11-18  10:02:05
Commandline: /usr/bin/apt-get dist-upgrade
Install: netavark:amd64 (1.14.0-2, automatic), containernetworking-plugins:amd64 (1.1.1+ds1-3+b17, automatic), golang-github-containers-image:amd64 (5.34.2-1, automatic), proxmox-kernel-6.17.2-1-pve-signed:amd64 (6.17.2-1, automatic), python3-cffi-backend:amd64 (1.17.1-3, automatic), python3-importlib-resources:amd64 (6.5.2-1, automatic), python3-pefile:amd64 (2024.8.26-2.1, automatic), proxmox-kernel-6.17:amd64 (6.17.2-1, automatic), aardvark-dns:amd64 (1.14.0-3, automatic), skopeo:amd64 (1.18.0+ds1-1+b5, automatic), python3-bcrypt:amd64 (4.2.0-2.1+b1, automatic), python3-virt-firmware:amd64 (24.11-2, automatic), golang-github-containers-common:amd64 (0.62.2+ds1-2, automatic), python3-cryptography:amd64 (43.0.0-3, automatic)
Upgrade: pve-docs:amd64 (9.0.8, 9.0.9), pve-edk2-firmware-ovmf:amd64 (4.2025.02-4, 4.2025.05-2), console-setup:amd64 (1.240, 1.242~deb13u1), udev:amd64 (257.8-1~deb13u2, 257.9-1~deb13u1), proxmox-default-kernel:amd64 (2.0.0, 2.0.1), proxmox-widget-toolkit:amd64 (5.0.6, 5.1.2), libpve-rs-perl:amd64 (0.10.10, 0.11.3), rrdcached:amd64 (1.7.2-4.2+pve3, 1.7.2-4.2+pve4), libldb2:amd64 (2:2.11.0+samba4.22.4+dfsg-1~deb13u1, 2:2.11.0+samba4.22.6+dfsg-0+deb13u1), libpam-systemd:amd64 (257.8-1~deb13u2, 257.9-1~deb13u1), pve-qemu-kvm:amd64 (10.1.2-1, 10.1.2-3), libpve-cluster-api-perl:amd64 (9.0.6, 9.0.7), pve-edk2-firmware-legacy:amd64 (4.2025.02-4, 4.2025.05-2), pve-ha-manager:amd64 (5.0.5, 5.0.8), libpve-apiclient-perl:amd64 (3.4.0, 3.4.2), swtpm-libs:amd64 (0.8.0+pve2, 0.8.0+pve3), swtpm-tools:amd64 (0.8.0+pve2, 0.8.0+pve3), librrds-perl:amd64 (1.7.2-4.2+pve3, 1.7.2-4.2+pve4), libpve-storage-perl:amd64 (9.0.14, 9.0.18), openssl-provider-legacy:amd64 (3.5.1-1+deb13u1, 3.5.4-1~deb13u1), libsystemd0:amd64 (257.8-1~deb13u2, 257.9-1~deb13u1), libnss-systemd:amd64 (257.8-1~deb13u2, 257.9-1~deb13u1), libwbclient0:amd64 (2:4.22.4+dfsg-1~deb13u1, 2:4.22.6+dfsg-0+deb13u1), swtpm:amd64 (0.8.0+pve2, 0.8.0+pve3), libxml2:amd64 (2.12.7+dfsg+really2.9.14-2.1+deb13u1, 2.12.7+dfsg+really2.9.14-2.1+deb13u2), pve-yew-mobile-i18n:amd64 (3.6.1, 3.6.2), pve-cluster:amd64 (9.0.6, 9.0.7), systemd:amd64 (257.8-1~deb13u2, 257.9-1~deb13u1), libudev1:amd64 (257.8-1~deb13u2, 257.9-1~deb13u1), console-setup-linux:amd64 (1.240, 1.242~deb13u1), libcurl3t64-gnutls:amd64 (8.14.1-2, 8.14.1-2+deb13u2), lxc-pve:amd64 (6.0.5-1, 6.0.5-3), systemd-cryptsetup:amd64 (257.8-1~deb13u2, 257.9-1~deb13u1), systemd-boot-efi:amd64 (257.8-1~deb13u2, 257.9-1~deb13u1), proxmox-backup-file-restore:amd64 (4.0.16-1, 4.0.20-1), systemd-boot-tools:amd64 (257.8-1~deb13u2, 257.9-1~deb13u1), virtiofsd:amd64 (1.13.2-1, 1.13.2-1+deb13u1), pve-xtermjs:amd64 (5.5.0-2, 5.5.0-3), qemu-server:amd64 (9.0.23, 9.0.29), 
libpve-access-control:amd64 (9.0.3, 9.0.4), libsmbclient0:amd64 (2:4.22.4+dfsg-1~deb13u1, 2:4.22.6+dfsg-0+deb13u1), pve-container:amd64 (6.0.13, 6.0.17), pve-i18n:amd64 (3.6.1, 3.6.2), base-files:amd64 (13.8+deb13u1, 13.8+deb13u2), libtdb1:amd64 (2:1.4.13+samba4.22.4+dfsg-1~deb13u1, 2:1.4.13+samba4.22.6+dfsg-0+deb13u1), monitoring-plugins-basic:amd64 (2.4.0-3, 2.4.0-3+deb13u1), libcurl4t64:amd64 (8.14.1-2, 8.14.1-2+deb13u2), proxmox-backup-client:amd64 (4.0.19-1, 4.0.20-1), libpve-network-api-perl:amd64 (1.1.8, 1.2.1), distro-info-data:amd64 (0.66, 0.66+deb13u1), libtevent0t64:amd64 (2:0.16.2+samba4.22.4+dfsg-1~deb13u1, 2:0.16.2+samba4.22.6+dfsg-0+deb13u1), smbclient:amd64 (2:4.22.4+dfsg-1~deb13u1, 2:4.22.6+dfsg-0+deb13u1), libwtmpdb0:amd64 (0.73.0-3, 0.73.0-3+deb13u1), proxmox-firewall:amd64 (1.2.0, 1.2.1), pve-manager:amd64 (9.0.11, 9.0.17), libpve-common-perl:amd64 (9.0.13, 9.0.15), libpve-network-perl:amd64 (1.1.8, 1.2.1), samba-libs:amd64 (2:4.22.4+dfsg-1~deb13u1, 2:4.22.6+dfsg-0+deb13u1), libsystemd-shared:amd64 (257.8-1~deb13u2, 257.9-1~deb13u1), libpve-notify-perl:amd64 (9.0.6, 9.0.7), monitoring-plugins-common:amd64 (2.4.0-3, 2.4.0-3+deb13u1), keyboard-configuration:amd64 (1.240, 1.242~deb13u1), libssl3t64:amd64 (3.5.1-1+deb13u1, 3.5.4-1~deb13u1), systemd-sysv:amd64 (257.8-1~deb13u2, 257.9-1~deb13u1), samba-common:amd64 (2:4.22.4+dfsg-1~deb13u1, 2:4.22.6+dfsg-0+deb13u1), librrd8t64:amd64 (1.7.2-4.2+pve3, 1.7.2-4.2+pve4), pve-yew-mobile-gui:amd64 (0.6.2, 0.6.3), curl:amd64 (8.14.1-2, 8.14.1-2+deb13u2), pve-firewall:amd64 (6.0.3, 6.0.4), proxmox-backup-docs:amd64 (4.0.19-1, 4.0.20-1), libtalloc2:amd64 (2:2.4.3+samba4.22.4+dfsg-1~deb13u1, 2:2.4.3+samba4.22.6+dfsg-0+deb13u1), proxmox-backup-server:amd64 (4.0.19-1, 4.0.20-1), pve-edk2-firmware:amd64 (4.2025.02-4, 4.2025.05-2), postfix:amd64 (3.10.4-1~deb13u1, 3.10.5-1~deb13u1), openssl:amd64 (3.5.1-1+deb13u1, 3.5.4-1~deb13u1), proxmox-termproxy:amd64 (2.0.2, 2.0.3), libpve-cluster-perl:amd64 (9.0.6, 9.0.7)
End-Date: 2025-11-18  10:04:10

The VM hangs/crashes can be mitigated by enabling fleecing, but as JCNED wrote, backups still hang randomly most of the time. In our tests it doesn't matter whether snapshot or stop mode is used. At some point, iSCSI read speeds drop to around 100 KB/s (this includes "normal" iSCSI read load from multiple VMs, so the backup's share is probably much lower than that). It will still make progress, but even a 1 GB bitmap takes ages to back up.

Here's a backup job from last night that I killed this morning:

Code:
INFO: Backup started at 2025-11-24 21:00:05
INFO: status = running
INFO: VM Name: redacted
INFO: include disk 'scsi0' 'iscsi-lvm:vm-100-disk-0' 20G
INFO: include disk 'scsi1' 'iscsi-lvm:vm-100-disk-1' 300G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating Proxmox Backup Server archive 'vm/100/2025-11-24T20:00:05Z'
INFO: drive-scsi0: attaching fleecing image local-zfs-pool:vm-100-fleece-0 to QEMU
INFO: drive-scsi1: attaching fleecing image local-zfs-pool:vm-100-fleece-1 to QEMU
INFO: issuing guest-agent 'fs-freeze' command
INFO: issuing guest-agent 'fs-thaw' command
INFO: started backup task '382ce8cd-3bd1-4a0a-8af3-4e99a79a9def'
INFO: resuming VM again
INFO: scsi0: dirty-bitmap status: OK (3.0 GiB of 20.0 GiB dirty)
INFO: scsi1: dirty-bitmap status: OK (14.8 GiB of 300.0 GiB dirty)
INFO: using fast incremental mode (dirty-bitmap), 17.8 GiB dirty of 320.0 GiB total
INFO:   4% (748.0 MiB of 17.8 GiB) in 3s, read: 249.3 MiB/s, write: 249.3 MiB/s
INFO:   7% (1.4 GiB of 17.8 GiB) in 6s, read: 220.0 MiB/s, write: 220.0 MiB/s
INFO:  11% (2.0 GiB of 17.8 GiB) in 9s, read: 225.3 MiB/s, write: 225.3 MiB/s
INFO:  15% (2.7 GiB of 17.8 GiB) in 12s, read: 217.3 MiB/s, write: 217.3 MiB/s
INFO:  18% (3.3 GiB of 17.8 GiB) in 15s, read: 222.7 MiB/s, write: 222.7 MiB/s
INFO:  22% (4.0 GiB of 17.8 GiB) in 18s, read: 245.3 MiB/s, write: 245.3 MiB/s
INFO:  26% (4.7 GiB of 17.8 GiB) in 21s, read: 233.3 MiB/s, write: 233.3 MiB/s
INFO:  27% (4.9 GiB of 17.8 GiB) in 3h 35m 30s, read: 13.3 KiB/s, write: 13.3 KiB/s
INFO:  28% (5.0 GiB of 17.8 GiB) in 4h 25m 54s, read: 43.3 KiB/s, write: 43.3 KiB/s
INFO:  29% (5.2 GiB of 17.8 GiB) in 4h 25m 57s, read: 72.0 MiB/s, write: 72.0 MiB/s
INFO:  30% (5.3 GiB of 17.8 GiB) in 7h 3m 45s, read: 12.1 KiB/s, write: 12.1 KiB/s
INFO:  31% (5.6 GiB of 17.8 GiB) in 7h 28m 52s, read: 184.8 KiB/s, write: 184.8 KiB/s
INFO:  32% (5.7 GiB of 17.8 GiB) in 7h 28m 55s, read: 44.0 MiB/s, write: 44.0 MiB/s
INFO:  33% (5.9 GiB of 17.8 GiB) in 8h 6m 47s, read: 73.9 KiB/s, write: 73.9 KiB/s
INFO:  34% (6.2 GiB of 17.8 GiB) in 8h 35m 27s, read: 161.9 KiB/s, write: 161.9 KiB/s
INFO:  35% (6.4 GiB of 17.8 GiB) in 10h 2m 13s, read: 46.4 KiB/s, write: 46.4 KiB/s
INFO:  36% (6.6 GiB of 17.8 GiB) in 10h 59m 40s, read: 55.8 KiB/s, write: 55.8 KiB/s
INFO:  38% (6.8 GiB of 17.8 GiB) in 10h 59m 43s, read: 74.7 MiB/s, write: 74.7 MiB/s
ERROR: interrupted by signal
INFO: aborting backup job
INFO: resuming VM again
INFO: removing (old) fleecing image 'local-zfs-pool:vm-100-fleece-0'
INFO: removing (old) fleecing image 'local-zfs-pool:vm-100-fleece-1'
ERROR: Backup of VM 100 failed - interrupted by signal
INFO: Failed at 2025-11-25 08:15:46
 
At one point, iSCSI read speeds will drop to around 100 KB/s (this includes "normal" iSCSI read load from multiple VMs, so it's probably much lower than that).
Are other VMs bottlenecked at the same time or would they still read at higher speeds during this time?


For all, please run apt install pve-qemu-kvm-dbgsym gdb libproxmox-backup-qemu0-dbgsym to install the debug symbols for the backup library and then, while the backup is slowed down:
Code:
qm config 123
qm status 123 --verbose
gdb --batch --ex 't a a bt' -p $(cat /run/qemu-server/123.pid)
timeout 10 strace -c -p $(cat /run/qemu-server/123.pid)
gdb --batch --ex 't a a bt' -p $(cat /run/qemu-server/123.pid)
replacing 123 with the ID of a slowed-down VM. The last command is expected to take 10 seconds. The second GDB command is not a mistake; I'd like to see how it changes. We only got one such log until now (from @JCNED, and that was after the backup was already aborted, IIUC).

Are there any messages in the system logs of Proxmox VE or PBS around the time of the issues?
 
Are other VMs bottlenecked at the same time or would they still read at higher speeds during this time?
[...]
Sounds good. I'll kick it off.
For all, please run apt install pve-qemu-kvm-dbgsym gdb libproxmox-backup-qemu0-dbgsym to install the debug symbols for the backup library and then, while the backup is slowed down: [...]
Hi Fiona,

See attached for 3x VMs. Nothing odd in the system logs for either PVE or PBS.
 

Attachments

Hello everyone,
here we have two PVE 8.4.14 clusters with 4 nodes and Ceph. After the PBS 4.0.22 upgrade and reboot, backups randomly fail, freezing the VM! With the latest PBS 3.x everything works fine, so I think the problem lies in the latest PBS upgrade.

Thanks in advance for the upcoming fix.
 
here we have two PVE 8.4.14 clusters with 4 nodes and ceph... After PBS 4.0.22 upgrade and reboot, random backup fails freezing the VM!!! [...]
Please share the backup task log from the PBS host of the failed backup tasks, together with the VM config for these.
 
Are other VMs bottlenecked at the same time or would they still read at higher speeds during this time?
No bottlenecks. Running dd if=/dev/sdXX of=/dev/null bs=1M on other VMs (same node), as well as on the VM whose backup currently hangs, produces >500 MB/s read speed.
Are there any messages in the system logs of Proxmox VE or PBS around the time of the issues?
No.
For all, please run apt install pve-qemu-kvm-dbgsym gdb libproxmox-backup-qemu0-dbgsym to install the debug symbols for the backup library and then, while the backup is slowed down: [...]
See attached.
 

Attachments

Please share the backup task log from the PBS host of the failed backup tasks, together with the VM config for these.
Hi,
Below is the backup log. I've attached both the VM that completed successfully and the one that failed. In this case, VM 106 (backup completed successfully) is a Windows machine, while VM 107 (backup failed, VM crashed, and resulting filesystem corruption) is a Debian machine. Other tests, however, have resulted in errors on Windows 11 and Windows Server 2019 VMs. Therefore, there doesn't appear to be any correlation between the guest OS and the problem. I didn't encounter any errors with PBS 4.0.21.
 

Attachments

Please also share the corresponding PBS task log. Was there much load (CPU or IO) on the PBS while the backup job was running, or any other task running at the same time? In particular between these 2 datapoints:
107: 2025-11-25 20:35:48 INFO: 31% (2.7 GiB of 8.7 GiB) in 27s, read: 98.7 MiB/s, write: 98.7 MiB/s
107: 2025-11-25 20:37:23 INFO: 32% (2.8 GiB of 8.7 GiB) in 2m 2s, read: 1.6 MiB/s, write: 1.6 MiB/s
 
Are other VMs bottlenecked at the same time or would they still read at higher speeds during this time?
Not this time; just one VM, on node 2 (of 3).

For all, please run apt install pve-qemu-kvm-dbgsym gdb libproxmox-backup-qemu0-dbgsym to install the debug symbols for the backup library and then, while the backup is slowed down:
See attached log files

Are there any messages in the system logs of Proxmox VE or PBS around the time of the issues?
I don't seem to see anything unusual.
 

Attachments

Please also share the corresponding PBS task log. Was there much load (cpu or IO) on the PBS while the backup job was running, any other task running at the same time? In particular between these 2 datapoints:

Hi Chris,
The CPU load during backups averages 70% on the first PBS (the one whose logs I've attached), while on the second (much more powerful hardware) it rarely reaches 50%. Both PBSs have internal ZFS pools with SSD read caches.

No tasks other than backups were running on either PBS (although backups are parallel, being executed simultaneously by the four Cluster nodes).

When the backup of VM 107 dropped in performance, the backup job was running on PMX-APW-001 and PMX-APW-003 simultaneously. I've attached the complete backup job logs from node 1 (PMX-APW-001, backup failed) and node 3 (PMX-APW-003, backup completed successfully).

Thanks in advance for the support.
 

Attachments

Hi,
I'd like to add a few more details. On another PBS (PBS-GOM-001), I have a backup job that runs every 3 hours (from 6:00 AM to 6:00 PM) and backs up a Windows Server 2019 VM. The backups at 6:00 AM, 9:00 AM, and 12:00 PM worked fine as usual (each completed in nearly 3 minutes).

At 14:52, I applied the updates (4.0.20-1 -> 4.0.22-1) but didn't reboot:

Start-Date: 2025-11-25 14:52:30
Commandline: apt upgrade --yes
Install: proxmox-kernel-6.17.2-1-pve-signed:amd64 (6.17.2-1, automatic), proxmox-kernel-6.17:amd64 (6.17.2-1, automatic)
Upgrade: proxmox-default-kernel:amd64 (2.0.0, 2.0.1), proxmox-widget-toolkit:amd64 (5.1.1, 5.1.2), pbs-i18n:amd64 (3.6.1, 3.6.3), proxmox-backup-docs:amd64 (4.0.20-1, 4.0.22-1), proxmox-backup-server:amd64 (4.0.20-1, 4.0.22-1)
End-Date: 2025-11-25 14:53:01

The 3:00 PM backup ran without any problems:

15:00 - 15:02 Backup Completed Successfully

At 15:08, I rebooted the PBS to activate the new kernel (6.14.11-4-pve -> 6.17.2-1-pve):

reboot system boot 6.17.2-1-pve Tue Nov 25 15:08 - 18:28 (03:20)
reboot system boot 6.14.11-4-pve Tue Oct 14 15:37 - 15:05 (42+00:27)

The next scheduled backup (at 6:00 PM) failed, and from that point the PBS stopped working.

Could there be a correlation with the kernel version?
 
I updated as well, but I have always used fleecing. I haven't noticed anything strange.


Although... these are only tiny VMs. I do know for a fact that with huge VMs, even before this update, I had trouble with backup tasks randomly stopping a few hours in. There we're talking 2-3 TiB VMs.

Not to mention the horrible things that happen when the server is running cPanel (it doesn't matter whether fleecing is on or not): it nukes some files inside the VM, so /etc/passwd, /etc/shadow, and /etc/group are just empty after the forced reboot following the VM freeze. I reproduced this at least twice before I disabled the backup task and relied on snapshots instead.

>the CPU load during backups averages 70%

This is one thing I've always noticed. It's a 128-CPU server; when a big VM gets backed up, the CPU load on the server sometimes goes from 30% to 90% (what in the world would use 50 vCPUs for a backup task?). I assume the VM somehow gets IO-starved and goes into a panic, but it has fleecing. It only happens if the VM itself has a big CPU allocation.
 
Failed in what fashion? With the same slowdown behavior as you reported for the other backup jobs?

You could boot the older kernel and check if the issue persists.
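Assuming the node uses proxmox-boot-tool, the previous kernel can be selected for a single test boot like this (the version string is taken from the post above and may differ on your system):

```shell
# List the available kernels, then pin the old one for the next boot only
proxmox-boot-tool kernel list
proxmox-boot-tool kernel pin 6.14.11-4-pve --next-boot
reboot

# once testing is done, remove the pin again:
# proxmox-boot-tool kernel unpin
```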
Hi,
I've attached the logs of the failure. The VM is 103 and the PBS is the powerful one (PVE-GOM-001). The problem is different, but the result is the same: VM frozen and backup failed. Now I'll try to reboot the PBS with the old kernel...

PS: I've never used fleecing, but I've also never had a failed backup in over 5 years...
 

Attachments